Replay & retry

Goal: recover from ingest failures and re-run objects after pipeline changes — without dropping the rest of your ingest stream.

Why this matters: ingest pipelines run async over big object volumes. Some jobs fail transiently (network blip, model timeout, transient backend slowness). Some fail permanently (corrupt PDF, malformed schema). K3 retries the first kind automatically; the second kind needs your action. This recipe walks the full diagnose-and-recover loop.

Shape:

list FAILED jobs ──► read .error / .error_details
                               │
                   decide: bulk or surgical?
      ┌────────────────────────┼────────────────────────┐
      ▼                        ▼                        ▼
all failures in          just one object          full re-scan of
one source / rule        (TriggerIngest)          source (full_sync)
(TriggerIngestion                                 then re-dispatch
 retry_failed=true)

Prerequisites

  • A bucket with rules + pipelines already wired (see PDF → Vector or Documents → Warehouse)
  • Some ingest history — at least a handful of jobs across COMPLETED / FAILED / PARTIAL (you can force a failure by uploading a corrupt PDF or pointing a rule at an unsupported MIME)

We’ll use kb-prod as the bucket throughout.

1. List FAILED jobs in the bucket

# CLI: lists everything; filter client-side
dodil k3 ingest jobs --bucket kb-prod -o json \
  | jq '.jobs[] | select(.status == "INGEST_STATUS_FAILED") | {jobId, object: .object.key, pipelineName, error, updatedAt}'

# API: status filter is server-side (more efficient for large buckets)
curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?status_filter=INGEST_STATUS_FAILED&page_size=100" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '.jobs[] | {jobId, object: .object.key, pipelineName, ruleId, error}'

Want the PARTIAL bucket (some output landed, some didn’t)?

curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?status_filter=INGEST_STATUS_PARTIAL" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '.jobs[] | {jobId, object: .object.key, embeddingsCreated, embeddingsWritten, error}'

PARTIAL is common in vector pipelines: chunks were embedded, but embeddingsWritten < embeddingsCreated because the collection write phase didn’t land everything.

2. Read the failure detail

JOB_ID="<a failed job_id from step 1>"

curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs/$JOB_ID" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '{status, error, errorDetails, threadId, batchesReceived, embeddingsCreated, embeddingsWritten, rowsWritten}'

Three fields matter most for diagnosis:

| Field | Meaning |
| --- | --- |
| error | One-line summary. For retried failures, includes "attempt N/M: <last error>". |
| errorDetails | Long-form context if available (stack trace, model response, partial output). |
| threadId | The Scriptum thread that ran; useful for cross-referencing in Scriptum’s own observability. |

Common failure shapes you’ll see — first-pass triage:

| error snippet | Usually means | Action |
| --- | --- | --- |
| attempt 3/3: timeout | Model / extraction step timed out repeatedly | Permanent: retry after fix, or skip |
| corrupt PDF / unsupported encoding | Object can’t be parsed | Permanent: fix the source file or exclude |
| unsupported content_type: ... | MIME doesn’t match what the template accepts | Add to rule’s exclude_mime_types, or use a different template |
| validation: missing required option ... | Pipeline options don’t satisfy the template’s ScriptContract | Update the pipeline’s options |
| destination write failed | Vector collection or warehouse table is misconfigured | Fix the destination, then replay |
| attempt 1/N: <transient> and status is RETRYING | Currently retrying; don’t touch | Wait it out; it’ll move to PROCESSING then a terminal state |
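
The triage table can be turned into a first-pass classifier for scripts that walk a list of failed jobs. The sketch below is a hypothetical helper, not part of K3 or the dodil CLI: the function name and the action labels are made up for illustration, and the patterns only cover the snippets listed in the table.

```shell
# Hypothetical helper: map a job's error string (and optional status) to a
# first-pass action. Patterns mirror the triage table; labels are illustrative.
triage_action() {
  local error="$1" status="${2:-}"
  # A job that is still RETRYING should be left alone, whatever the error says.
  if [ "$status" = "INGEST_STATUS_RETRYING" ]; then
    echo "wait"
    return 0
  fi
  case "$error" in
    *"corrupt PDF"*|*"unsupported encoding"*) echo "fix-source" ;;
    *"unsupported content_type"*)             echo "exclude-mime-or-change-template" ;;
    *"missing required option"*)              echo "update-pipeline-options" ;;
    *"destination write failed"*)             echo "fix-destination-then-replay" ;;
    *"timeout"*)                              echo "retry-after-fix-or-skip" ;;
    *)                                        echo "inspect-error-details" ;;
  esac
}

triage_action "attempt 3/3: timeout" "INGEST_STATUS_FAILED"
```

Anything that falls through to the default branch still needs a human to read errorDetails; the classifier only saves time on the common shapes.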

3. How RETRYING works

If you see jobs in INGEST_STATUS_RETRYING, K3 has already detected a transient failure and is retrying automatically. You don’t need to do anything — wait for the terminal state. Status flow:

PENDING ─► PROCESSING ─► COMPLETED
               ▼ transient error
           RETRYING ─► PROCESSING ─► COMPLETED   (success path)
               ▼ same / different transient error
           RETRYING ─► PROCESSING ─► COMPLETED   (eventually succeeds)
               ╲ max attempts reached
                ╲► FAILED

The error field on a RETRYING job reads attempt N/M: <last error> — useful for watching progress without polling internals. After max attempts, RETRYING → FAILED with the last error preserved.
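
To watch retry progress from a script, the attempt counter can be pulled out of the error string. This is a minimal sketch that assumes the error text follows the "attempt N/M: <last error>" shape described above; the helper name is hypothetical and the match is deliberately loose.

```shell
# Hypothetical helper: split "attempt N/M: <last error>" into its parts
# using plain parameter expansion (no external tools).
attempt_progress() {
  local error="$1" rest n m msg
  case "$error" in
    *"attempt "[0-9]*/[0-9]*": "*) ;;                      # looks like a retry error
    *) echo "no attempt counter in error"; return 1 ;;
  esac
  rest="${error#*attempt }"   # e.g. "2/3: model timeout"
  n="${rest%%/*}"             # e.g. "2"
  rest="${rest#*/}"           # e.g. "3: model timeout"
  m="${rest%%:*}"             # e.g. "3"
  msg="${rest#*: }"           # e.g. "model timeout"
  echo "attempt $n of $m, last error: $msg"
}

attempt_progress "attempt 2/3: model timeout"
```

The input would typically come from the error field of the job-detail response in step 2.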

4. Replay strategies — pick the right scope

K3 gives you three replay scopes. Pick by how much you want to re-dispatch:

A. Single object — surgical

For “I just want this one job to run again” (after fixing the source file, or to test a pipeline change against a known case):

# Via TriggerIngest: re-runs the matching rule's pipeline
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" }
  }'

# Force a specific pipeline (override rule resolution)
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_a1b2..."
  }'

# Per-event option overlay (one-off parameter sweep without modifying the pipeline)
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_a1b2...",
    "options": { "chunk_size": "2000" }
  }'

The CLI doesn’t expose TriggerIngest directly today — use the API. See Jobs — API Reference.

B. All failed objects in a source — bulk replay

For “I fixed the pipeline, now re-run everything that failed”:

# Replay every FAILED / PARTIAL object on this source through their matching rules
dodil k3 ingest trigger --bucket kb-prod \
  --source "$SOURCE_ID" \
  --retry-failed

# Or scope to one rule's failures
dodil k3 ingest trigger --bucket kb-prod \
  --source "$SOURCE_ID" \
  --rule "$RULE_ID" \
  --retry-failed

The response includes a dispatched count — how many objects K3 queued for re-ingest:

{
  "accepted": true,
  "message": "Replay started for 42 failed objects",
  "dispatched": 42
}

C. Full re-discover (after rules / pipeline options change)

For “the rule’s match set changed (you tightened or loosened the include pattern), or the pipeline now produces different output and you want to re-run everything that would match today”:

# Discover from scratch: ignores the source's etag checkpoint
dodil k3 ingest trigger-discovery --bucket kb-prod \
  --source "$SOURCE_ID" \
  --full-sync

# Then re-dispatch any pending objects (you can combine with --retry-failed)
dodil k3 ingest trigger --bucket kb-prod --source "$SOURCE_ID" --retry-failed

For external (Preview) sources this is the standard backfill pattern; for the internal S3 source it’s effectively a “scan and re-fire rules on everything in the bucket.”

5. Verify the replay landed

After kicking off a replay (B or C), watch the same ListIngestJobs query you used in step 1 — the failed jobs should move to PROCESSING then COMPLETED:

# Tail the rule's jobs every 5s until none are pending/processing/retrying
while true; do
  pending=$(curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?rule_id=$RULE_ID" \
    -H "Authorization: Bearer $DODIL_TOKEN" \
    | jq '[.jobs[] | select(.status | IN("INGEST_STATUS_PENDING", "INGEST_STATUS_PROCESSING", "INGEST_STATUS_RETRYING"))] | length')
  echo "in-flight: $pending"
  [ "$pending" = "0" ] && break
  sleep 5
done
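
The loop above runs forever if a job never leaves an in-flight state. For unattended use (a CI step, a cron job) a bounded variant may be safer. The sketch below is one possible shape, not a K3 feature: poll_until_idle is a hypothetical helper that takes a poll limit, a sleep interval, and any command that prints the in-flight count.

```shell
# Hypothetical helper: poll a command that prints the in-flight job count
# until it prints 0, giving up after a maximum number of polls.
#   $1 = max polls, $2 = seconds between polls, $3... = command to run
# Returns 0 when idle, 1 when still busy after max polls, 2 if the command fails.
poll_until_idle() {
  local max_polls="$1" interval="$2" i=1 count
  shift 2
  while [ "$i" -le "$max_polls" ]; do
    count="$("$@")" || return 2
    echo "poll $i: in-flight=$count"
    if [ "$count" = "0" ]; then
      return 0
    fi
    if [ "$i" -lt "$max_polls" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example wiring (commented out because it calls the live API). It wraps the
# same rule-scoped query used in the loop above:
#   in_flight() {
#     curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?rule_id=$RULE_ID" \
#       -H "Authorization: Bearer $DODIL_TOKEN" \
#       | jq '[.jobs[] | select(.status | IN("INGEST_STATUS_PENDING", "INGEST_STATUS_PROCESSING", "INGEST_STATUS_RETRYING"))] | length'
#   }
#   poll_until_idle 120 5 in_flight || echo "replay did not settle in time"
```

Passing the count command as arguments keeps the loop testable with a stub and lets the same helper watch a rule, a source, or a whole bucket.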

Then summarize the new terminal-state distribution:

curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?rule_id=$RULE_ID&page_size=500" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '[.jobs[].status] | group_by(.) | map({status: .[0], count: length})'

You’re aiming for INGEST_STATUS_COMPLETED to dominate. Anything stuck in FAILED after a replay needs another round of diagnosis.
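
To put a number on "dominate", the status list can be reduced to a completion percentage. This is a small sketch under assumptions: completed_pct is a hypothetical helper, it expects one status per line on stdin, and what counts as a good enough percentage is left to you.

```shell
# Hypothetical helper: read one status per line on stdin and print the share
# of INGEST_STATUS_COMPLETED as a whole percent (truncated, not rounded).
completed_pct() {
  awk '
    NF { total++ }                                   # count non-empty lines
    $0 == "INGEST_STATUS_COMPLETED" { done_count++ }
    END {
      if (total == 0) { print "no jobs"; exit 1 }
      printf "%d\n", (done_count * 100) / total
    }'
}

# Stand-in for the live feed; with the API, pipe the rule-scoped job list
# through  jq -r ".jobs[].status"  instead.
printf '%s\n' INGEST_STATUS_COMPLETED INGEST_STATUS_COMPLETED \
  INGEST_STATUS_COMPLETED INGEST_STATUS_FAILED | completed_pct
```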

When to not retry

A few classes of failure where retry just burns more CPU:

  • Object permanently corrupt / unsupported — fix or remove the source object, then re-upload. Retrying without changing the input gets the same error.
  • Template mismatch — the template requires options you didn’t provide (or the wrong type). Fix pipeline.options first, then replay.
  • Destination is gone — your vector collection or warehouse table was deleted. Recreate / re-bind the pipeline before replaying.
  • Quota exhausted — replay will hit the same quota wall. Resolve quotas in the Storage admin first.

A useful gate: if error contains “attempt N/N” (i.e. K3 already retried up to max), assume the failure is permanent until you change something. Just re-firing the same job will land in FAILED again.
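
The gate can be expressed as a small predicate to run before re-firing a job. This is a sketch, assuming the error string carries the "attempt N/M" counter described in step 3; retries_exhausted is a hypothetical name, and an error without a counter is treated as "not exhausted" so the caller decides what to do with it.

```shell
# Hypothetical helper: succeed (exit 0) when the error shows that K3 already
# used all of its attempts, i.e. "attempt N/M" with N >= M.
retries_exhausted() {
  local error="$1" rest n m
  case "$error" in
    *"attempt "[0-9]*/[0-9]*) ;;
    *) return 1 ;;                     # no counter: cannot tell, report "no"
  esac
  rest="${error#*attempt }"            # e.g. "3/3: timeout"
  n="${rest%%/*}"                      # e.g. "3"
  rest="${rest#*/}"                    # e.g. "3: timeout"
  m="${rest%%[!0-9]*}"                 # leading digits of M
  case "$n" in ''|*[!0-9]*) return 1 ;; esac
  [ -n "$m" ] && [ "$n" -ge "$m" ]
}

if retries_exhausted "attempt 3/3: timeout"; then
  echo "treat as permanent: change something before replaying"
fi
```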

Surgical re-run with a pipeline override

When you’re debugging “did my new pipeline options fix this?” without touching the production pipeline:

# Test the same object against a different (test) pipeline
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_test_a1b2..."
  }'

# Or test new options without changing the pipeline at all
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_a1b2...",
    "options": { "chunk_size": "2000", "chunk_overlap": "400" }
  }'

The options field on the request is a per-event overlay — it doesn’t modify the pipeline. Spawn N jobs with N different option sets to A/B test before committing changes via pipeline update.
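
A sweep like that can be scripted by generating one request body per option set. The sketch below only builds the payloads; build_payload and sweep_payloads are hypothetical helpers, the chunk_size:chunk_overlap pairs are illustrative values, and $PIPELINE_ID stands in for your real pipeline ID. No JSON escaping is done, so it assumes keys and values contain no quotes or backslashes.

```shell
# Hypothetical helper: emit one TriggerIngest request body with an option overlay.
# Option values are written as JSON strings, matching the requests shown above.
build_payload() {
  local bucket="$1" key="$2" pipeline_id="$3" chunk_size="$4" chunk_overlap="$5"
  printf '{"object":{"bucket":"%s","key":"%s"},"pipelineId":"%s","options":{"chunk_size":"%s","chunk_overlap":"%s"}}\n' \
    "$bucket" "$key" "$pipeline_id" "$chunk_size" "$chunk_overlap"
}

# Emit one payload per "chunk_size:chunk_overlap" pair given after the first
# three arguments.
sweep_payloads() {
  local bucket="$1" key="$2" pipeline_id="$3" pair
  shift 3
  for pair in "$@"; do
    build_payload "$bucket" "$key" "$pipeline_id" "${pair%%:*}" "${pair##*:}"
  done
}

# Example wiring (commented out because it calls the live API):
#   sweep_payloads kb-prod intake/contracts/acme-2026.pdf "$PIPELINE_ID" 1000:200 2000:400 \
#     | while IFS= read -r body; do
#         curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
#           -H "Authorization: Bearer $DODIL_TOKEN" \
#           -H "Content-Type: application/json" \
#           -d "$body"
#       done
```

Each payload should produce its own job, so the results can be compared with the job-detail query from step 2 before committing the winning options to the pipeline.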

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| --retry-failed returns dispatched: 0 | No FAILED/PARTIAL jobs match the source/rule filter | Run the ListIngestJobs query in step 1 first to confirm there’s work to retry |
| Replay completes but the new jobs immediately FAIL with the same error | You didn’t fix the underlying issue (template options, source data, destination) | Diagnose first via error_details, then retry (see “When to not retry”) |
| Some jobs go RETRYING → FAILED, others RETRYING → COMPLETED, randomly | A flaky upstream dependency (e.g. embedding model) | K3 retried for you; if the flake rate is high, lower your batch size or contact the platform team. Don’t paper over it with infinite retries |
| Old object suddenly re-fires after a rule change | trigger-discovery --full-sync ignores the etag checkpoint and re-discovers everything matching | That’s the intended semantics; use --full-sync only when you want this |
| TriggerIngest returns 404 for the object | Object key is wrong or already deleted | Re-check object.bucket and object.key; they are path-style and case-sensitive |

See also