Replay & retry
Goal: recover from ingest failures and re-run objects after pipeline changes — without dropping the rest of your ingest stream.
Why this matters: ingest pipelines run async over big object volumes. Some jobs fail transiently (network blip, model timeout, transient backend slowness). Some fail permanently (corrupt PDF, malformed schema). K3 retries the first kind automatically; the second kind needs your action. This recipe walks the full diagnose-and-recover loop.
Shape:
list FAILED jobs ──► read .error / .error_details
                              │
                              ▼
                  decide: bulk or surgical?
                              │
     ┌────────────────────────┼────────────────────────┐
     ▼                        ▼                        ▼
all failures in         just one object         full re-scan of
one source / rule       (TriggerIngest)         source (full_sync)
(TriggerIngestion                               then re-dispatch
 retry_failed=true)
Prerequisites
- A bucket with rules + pipelines already wired (see PDF → Vector or Documents → Warehouse)
- Some ingest history: at least a handful of jobs across COMPLETED/FAILED/PARTIAL (you can force a failure by uploading a corrupt PDF or pointing a rule at an unsupported MIME type)
We’ll use kb-prod as the bucket throughout.
1. List FAILED jobs in the bucket
# CLI: lists everything; filter client-side
dodil k3 ingest jobs --bucket kb-prod -o json \
| jq '.jobs[] | select(.status == "INGEST_STATUS_FAILED") | {jobId, object: .object.key, pipelineName, error, updatedAt}'
# API: status filter is server-side (more efficient for large buckets)
curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?status_filter=INGEST_STATUS_FAILED&page_size=100" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq '.jobs[] | {jobId, object: .object.key, pipelineName, ruleId, error}'
Want the PARTIAL bucket (some output landed, some didn’t)?
curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?status_filter=INGEST_STATUS_PARTIAL" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq '.jobs[] | {jobId, object: .object.key, embeddingsCreated, embeddingsWritten, error}'
PARTIAL is common in vector pipelines: some chunks embedded successfully, but embeddingsWritten < embeddingsCreated because the collection write phase didn’t land everything.
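To see how far each PARTIAL job fell short, you can subtract the two counters. This is a minimal sketch, assuming the list response has the shape used in the queries above (a jobs array with jobId, embeddingsCreated, and embeddingsWritten). write_shortfall is a hypothetical helper name, and the tonumber calls assume the counters may arrive as either JSON numbers or numeric strings:

```shell
# write_shortfall: reads a ListIngestJobs-style JSON document on stdin and
# prints "<jobId> <missing>" for every job whose written count trails created.
# Assumption: missing counters are treated as 0.
write_shortfall() {
  jq -r '
    .jobs[]
    | ((.embeddingsCreated // 0) | tonumber) as $created
    | ((.embeddingsWritten // 0) | tonumber) as $written
    | select($written < $created)
    | "\(.jobId) \($created - $written)"
  '
}
```

Pipe the PARTIAL query's curl output into write_shortfall instead of the jq filter to get one line per job that still has embeddings to land.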
2. Read the failure detail
JOB_ID="<a failed job_id from step 1>"
curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs/$JOB_ID" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq '{status, error, errorDetails, threadId, batchesReceived, embeddingsCreated, embeddingsWritten, rowsWritten}'
Three fields matter most for diagnosis:
| Field | Meaning |
|---|---|
| error | One-line summary. For retried failures, includes "attempt N/M: <last error>". |
| errorDetails | Long-form context if available (stack trace, model response, partial output). |
| threadId | The Scriptum thread that ran, useful for cross-referencing in Scriptum’s own observability. |
Common failure shapes you’ll see — first-pass triage:
| error snippet | Usually means | Action |
|---|---|---|
| attempt 3/3: timeout | Model / extraction step timed out repeatedly | Permanent: retry after a fix, or skip |
| corrupt PDF / unsupported encoding | Object can’t be parsed | Permanent: fix the source file or exclude it |
| unsupported content_type: ... | MIME type doesn’t match what the template accepts | Add it to the rule’s exclude_mime_types, or use a different template |
| validation: missing required option ... | Pipeline options don’t satisfy the template’s ScriptContract | Update the pipeline’s options |
| destination write failed | Vector collection or warehouse table is misconfigured | Fix the destination, then replay |
| attempt 1/N: <transient> and status is RETRYING | Currently retrying; don’t touch | Wait it out; it’ll move to PROCESSING, then a terminal state |
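If you have dozens of failed jobs, a first pass over the error strings can be scripted. The sketch below encodes only the four table rows that don't depend on the attempt counter. The function name triage and the action labels it prints are made up for illustration, and the match patterns assume the error text contains the snippets exactly as shown in the table:

```shell
# triage ERROR_STRING
# Maps an error snippet to a shorthand action label (labels are illustrative).
triage() {
  case "$1" in
    *"corrupt PDF"*|*"unsupported encoding"*)
      echo "fix-source" ;;
    *"unsupported content_type"*)
      echo "exclude-mime-or-change-template" ;;
    *"validation: missing required option"*)
      echo "update-pipeline-options" ;;
    *"destination write failed"*)
      echo "fix-destination-then-replay" ;;
    *)
      # Anything unrecognized still needs a human look at errorDetails.
      echo "inspect-manually" ;;
  esac
}
```

Feeding each job's error field through triage and counting the labels gives a quick picture of whether you're facing one systemic problem or several unrelated ones.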
3. How RETRYING works
If you see jobs in INGEST_STATUS_RETRYING, K3 has already detected a transient failure and is retrying automatically. You don’t need to do anything — wait for the terminal state. Status flow:
PENDING ─► PROCESSING ─► COMPLETED
               │
               ▼ transient error
           RETRYING ─► PROCESSING ─► COMPLETED   (success path)
                           │
                           ▼ same / different transient error
                       RETRYING ─► PROCESSING ─► COMPLETED   (eventually succeeds)
                           ╲
                            ╲ max attempts reached
                             ╲► FAILED
The error field on a RETRYING job reads attempt N/M: <last error>, useful for watching progress without polling internals. After max attempts, RETRYING → FAILED with the last error preserved.
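When scripting against this flow, it helps to separate statuses that will still change from those that won't. A minimal sketch, assuming PENDING, PROCESSING, and RETRYING are the only in-flight states and everything else (COMPLETED, FAILED, PARTIAL) is terminal; the full enum may contain values not covered in this recipe, and is_in_flight is a hypothetical helper name:

```shell
# is_in_flight STATUS
# Exit status 0 when the job may still change state, 1 when it is terminal.
is_in_flight() {
  case "$1" in
    INGEST_STATUS_PENDING|INGEST_STATUS_PROCESSING|INGEST_STATUS_RETRYING)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

For example, `is_in_flight "$status" || echo "terminal: $status"` only reports jobs that have settled.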
4. Replay strategies — pick the right scope
K3 gives you three replay scopes. Pick by how much you want to re-dispatch:
A. Single object — surgical
For “I just want this one job to run again” (after fixing the source file, or to test a pipeline change against a known case):
# Via TriggerIngest — re-runs the matching rule's pipeline
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" }
}'
# Force a specific pipeline (override rule resolution)
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
"pipelineId": "pipe_a1b2..."
}'
# Per-event option overlay (one-off parameter sweep without modifying the pipeline)
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
"pipelineId": "pipe_a1b2...",
"options": { "chunk_size": "2000" }
}'
The CLI doesn’t expose TriggerIngest directly today, so use the API. See Jobs — API Reference.
B. All failed objects in a source — bulk replay
For “I fixed the pipeline, now re-run everything that failed”:
# Replay every FAILED / PARTIAL object on this source through their matching rules
dodil k3 ingest trigger --bucket kb-prod \
--source "$SOURCE_ID" \
--retry-failed
# Or scope to one rule's failures
dodil k3 ingest trigger --bucket kb-prod \
--source "$SOURCE_ID" \
--rule "$RULE_ID" \
--retry-failed
The response includes a dispatched count (how many objects K3 queued for re-ingest):
{
"accepted": true,
"message": "Replay started for 42 failed objects",
"dispatched": 42
}
C. Full re-discover (after rules / pipeline options change)
For “the rule’s match set changed (you tightened or loosened the include pattern), or the pipeline now produces different output and you want to re-run everything that would match today”:
# Discover from scratch — ignores the source's etag checkpoint
dodil k3 ingest trigger-discovery --bucket kb-prod \
--source "$SOURCE_ID" \
--full-sync
# Then re-dispatch any pending objects (you can combine with --retry-failed)
dodil k3 ingest trigger --bucket kb-prod --source "$SOURCE_ID" --retry-failed
For external (Preview) sources this is the standard backfill pattern; for the internal S3 source it’s effectively a “scan and re-fire rules on everything in the bucket.”
5. Verify the replay landed
After kicking off a replay (B or C), watch the same ListIngestJobs query you used in step 1 — the failed jobs should move to PROCESSING then COMPLETED:
# Tail the rule's jobs every 5s until none are pending/processing/retrying
while true; do
pending=$(curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?rule_id=$RULE_ID" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq '[.jobs[] | select(.status | IN("INGEST_STATUS_PENDING", "INGEST_STATUS_PROCESSING", "INGEST_STATUS_RETRYING"))] | length')
echo "in-flight: $pending"
[ "$pending" = "0" ] && break
sleep 5
done
Then summarize the new terminal-state distribution:
curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?rule_id=$RULE_ID&page_size=500" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq '[.jobs[].status] | group_by(.) | map({status: .[0], count: length})'
You’re aiming for INGEST_STATUS_COMPLETED to dominate. Anything stuck in FAILED after a replay needs another round of diagnosis.
When to not retry
A few classes of failure where retry just burns more CPU:
- Object permanently corrupt / unsupported: fix or remove the source object, then re-upload. Retrying without changing the input gets the same error.
- Template mismatch: the template requires options you didn’t provide (or you provided the wrong type). Fix pipeline.options first, then replay.
- Destination is gone: your vector collection or warehouse table was deleted. Recreate / re-bind the pipeline before replaying.
- Quota exhausted: replay will hit the same quota wall. Resolve quotas in the Storage admin first.
A useful gate: if error contains “attempt N/N” (i.e. K3 already retried up to max), assume the failure is permanent until you change something. Just re-firing the same job will land in FAILED again.
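That gate is easy to automate. The sketch below parses the attempt counter out of an error string, assuming the "attempt N/M: <last error>" format described earlier holds. classify_attempts and its three output labels are hypothetical names chosen for this example:

```shell
# classify_attempts ERROR_STRING
# Prints "exhausted" when the counter reads N/N (retries used up),
# "retrying" when N < M, and "unknown" when no counter is present.
classify_attempts() {
  # Extract "N M" from the first line that carries "attempt N/M".
  counter=$(printf '%s\n' "$1" \
    | sed -n 's/.*attempt \([0-9][0-9]*\)\/\([0-9][0-9]*\).*/\1 \2/p' \
    | head -n 1)
  if [ -z "$counter" ]; then
    echo "unknown"
    return 0
  fi
  n=${counter% *}   # text before the space: attempts made
  m=${counter#* }   # text after the space: max attempts
  if [ "$n" -ge "$m" ]; then
    echo "exhausted"
  else
    echo "retrying"
  fi
}
```

Treat "exhausted" as "change something before replaying", and "unknown" as a prompt to read errorDetails rather than as a green light to retry.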
Surgical re-run with a pipeline override
When you’re debugging “did my new pipeline options fix this?” without touching the production pipeline:
# Test the same object against a different (test) pipeline
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
"pipelineId": "pipe_test_a1b2..."
}'
# Or test new options without changing the pipeline at all
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
"pipelineId": "pipe_a1b2...",
"options": { "chunk_size": "2000", "chunk_overlap": "400" }
}'
The options field on the request is a per-event overlay; it doesn’t modify the pipeline. Spawn N jobs with N different option sets to A/B test before committing changes via pipeline update.
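One way to run such a sweep is to generate the request bodies in a loop and POST each one. A rough sketch under a few assumptions: the size:overlap pairs are illustrative values only, build_payload and sweep are hypothetical helper names, $PIPELINE_ID stands in for your real pipeline ID, and the arguments contain nothing that needs JSON escaping:

```shell
# build_payload KEY PIPELINE_ID CHUNK_SIZE CHUNK_OVERLAP
# Prints one TriggerIngest request body carrying a per-event options overlay.
# Assumption: arguments contain no characters that need JSON escaping.
build_payload() {
  printf '{"object":{"bucket":"kb-prod","key":"%s"},"pipelineId":"%s","options":{"chunk_size":"%s","chunk_overlap":"%s"}}\n' \
    "$1" "$2" "$3" "$4"
}

# sweep KEY PIPELINE_ID
# Emits one body per "size:overlap" pair; the pairs are illustrative values.
sweep() {
  for pair in 1000:200 2000:400 4000:800; do
    build_payload "$1" "$2" "${pair%%:*}" "${pair##*:}"
  done
}

# Usage (sends one request per option set):
#   sweep "intake/contracts/acme-2026.pdf" "$PIPELINE_ID" | while IFS= read -r body; do
#     curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
#       -H "Authorization: Bearer $DODIL_TOKEN" \
#       -H "Content-Type: application/json" \
#       -d "$body"
#   done
```

Running sweep on its own prints the bodies without sending anything, which is a cheap way to check the payloads before spending ingest capacity on them.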
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| --retry-failed returns dispatched: 0 | No FAILED/PARTIAL jobs match the source/rule filter | Run the ListIngestJobs query in step 1 first to confirm there’s work to retry |
| Replay completes but the new jobs immediately FAIL with the same error | You didn’t fix the underlying issue (template options, source data, destination) | Diagnose first via error_details, then retry (see “When to not retry”) |
| Some jobs go RETRYING → FAILED, others RETRYING → COMPLETED, randomly | A flaky upstream dependency (e.g. embedding model) | K3 retried for you; if the flake rate is high, lower your batch size or contact the platform team. Don’t paper over it with infinite retries |
| Old object suddenly re-fires after a rule change | trigger-discovery --full-sync ignores the etag checkpoint and re-discovers everything matching | That’s the intended semantics; use --full-sync only when you want this |
| TriggerIngest returns 404 for the object | Object key is wrong or already deleted | Re-check object.bucket and object.key; they’re path-style and case-sensitive |
See also
- PDF → Vector: the canonical RAG ingest recipe you might be retrying from
- Documents → Warehouse: the warehouse-side recipe
- Sync — API Reference: TriggerDiscovery, TriggerIngestion, GetSyncStatus
- Jobs — API Reference: TriggerIngest, GetIngestStatus, ListIngestJobs
- Core Concepts → IngestJob: the full status enum + counters