Replay & retry

Goal: recover from ingest failures and re-run objects after pipeline changes — without dropping the rest of your ingest stream.

Why this matters: ingest pipelines run async over big object volumes. Some jobs fail transiently (network blip, model timeout, transient backend slowness). Some fail permanently (corrupt PDF, malformed schema). K3 retries the first kind automatically; the second kind needs your action. This recipe walks the full diagnose-and-recover loop.

Shape:


   list FAILED jobs   ──►   read .error / .error_details
                                       │
                                       ▼
                              decide: bulk or surgical?
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
       all failures in       just one object              full re-scan of
       one source / rule    (TriggerIngest)               source (full_sync)
       (TriggerIngestion                                   then re-dispatch
        retry_failed=true)

Prerequisites

A bucket with rules + pipelines already wired (see PDF → Vector or Documents → Warehouse)
Some ingest history: at least a handful of jobs across COMPLETED / FAILED / PARTIAL (you can force a failure by uploading a corrupt PDF or pointing a rule at an unsupported MIME type)

We’ll use kb-prod as the bucket throughout.

1. List FAILED jobs in the bucket


# CLI: lists everything; filter client-side
dodil k3 ingest jobs --bucket kb-prod -o json \
  | jq '.jobs[] | select(.status == "INGEST_STATUS_FAILED") | {jobId, object: .object.key, pipelineName, error, updatedAt}'
 
# API: status filter is server-side (more efficient for large buckets)
curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?status_filter=INGEST_STATUS_FAILED&page_size=100" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '.jobs[] | {jobId, object: .object.key, pipelineName, ruleId, error}'

To list PARTIAL jobs instead (some output landed, some didn’t):


curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?status_filter=INGEST_STATUS_PARTIAL" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '.jobs[] | {jobId, object: .object.key, embeddingsCreated, embeddingsWritten, error}'

PARTIAL is common in vector pipelines: the chunks embedded successfully, but embeddingsWritten < embeddingsCreated, meaning the collection write phase didn’t land everything.

2. Read the failure detail


JOB_ID="<a failed job_id from step 1>"
 
curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs/$JOB_ID" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '{status, error, errorDetails, threadId, batchesReceived, embeddingsCreated, embeddingsWritten, rowsWritten}'

Three fields matter most for diagnosis:

Field	Meaning
`error`	One-line summary. For retried failures, includes `"attempt N/M: <last error>"`.
`errorDetails`	Long-form context if available (stack trace, model response, partial output).
`threadId`	The Scriptum thread that ran; useful for cross-referencing in Scriptum’s own observability.

Common failure shapes you’ll see — first-pass triage:

`error` snippet	Usually means	Action
`attempt 3/3: timeout`	Model / extraction step timed out repeatedly	Permanent — retry-after-fix or skip
`corrupt PDF` / `unsupported encoding`	Object can’t be parsed	Permanent — fix the source file or exclude
`unsupported content_type: ...`	MIME doesn’t match what the template accepts	Add to rule’s `exclude_mime_types`, or use a different template
`validation: missing required option ...`	Pipeline options don’t satisfy the template’s `ScriptContract`	Update the pipeline’s options
`destination write failed`	Vector collection or warehouse table is misconfigured	Fix the destination, then replay
`attempt 1/N: <transient>` and status is `RETRYING`	Currently retrying — don’t touch	Wait it out; it’ll move to `PROCESSING` then a terminal state
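
The triage table above can be folded into a small shell helper for scripted first-pass decisions. This is only a sketch: the function name `triage_error` and the action labels it prints are hypothetical, and the patterns cover just the `error` snippets listed in the table, not everything K3 may return.

```shell
# Hypothetical helper: map a job's status ($1) and `error` string ($2) to a
# first-pass action. Patterns mirror the triage table; labels are illustrative.
triage_error() {
  # A job that is still retrying should be left alone.
  if [ "$1" = "INGEST_STATUS_RETRYING" ]; then
    echo "wait"
    return 0
  fi
  case "$2" in
    *"corrupt PDF"*|*"unsupported encoding"*) echo "fix-source" ;;
    *"unsupported content_type"*)             echo "exclude-mime-or-change-template" ;;
    *"validation: missing required option"*)  echo "update-pipeline-options" ;;
    *"destination write failed"*)             echo "fix-destination-then-replay" ;;
    *"timeout"*)                              echo "retry-after-fix-or-skip" ;;
    *)                                        echo "inspect-errorDetails" ;;
  esac
}

# Offline demo using error strings from the table.
triage_error "INGEST_STATUS_FAILED" "attempt 3/3: timeout"
triage_error "INGEST_STATUS_FAILED" "destination write failed"
```

Anything that falls through to the default branch is a cue to read `errorDetails` by hand rather than automate a decision.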

3. How `RETRYING` works

If you see jobs in INGEST_STATUS_RETRYING, K3 has already detected a transient failure and is retrying automatically. You don’t need to do anything — wait for the terminal state. Status flow:


PENDING ─► PROCESSING ─► COMPLETED
                │
                ▼  transient error
            RETRYING ─► PROCESSING ─► COMPLETED   (success path)
                │
                ▼  same / different transient error
            RETRYING ─► PROCESSING ─► COMPLETED   (eventually succeeds)
                                  ╲
                                   ╲  max attempts reached
                                    ╲►  FAILED

The error field on a RETRYING job reads attempt N/M: <last error>, which lets you watch progress without polling internals. Once max attempts are reached, the job moves to FAILED with the last error preserved.
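
When scripting against this status flow, it helps to have one place that says which states are still moving. The sketch below treats PENDING, PROCESSING, and RETRYING as in flight and everything else as terminal. The function name `is_in_flight` is hypothetical, and treating PARTIAL as terminal is an assumption inferred from how this recipe uses it.

```shell
# Hypothetical helper: exit 0 when a job status is still in flight, 1 otherwise.
# Assumption: any status other than these three (COMPLETED, FAILED, PARTIAL) is terminal.
is_in_flight() {
  case "$1" in
    INGEST_STATUS_PENDING|INGEST_STATUS_PROCESSING|INGEST_STATUS_RETRYING) return 0 ;;
    *) return 1 ;;
  esac
}

# Offline demo: decide whether to keep waiting on a job.
for s in INGEST_STATUS_RETRYING INGEST_STATUS_FAILED; do
  if is_in_flight "$s"; then
    echo "$s: wait"
  else
    echo "$s: terminal"
  fi
done
```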

4. Replay strategies — pick the right scope

K3 gives you three replay scopes. Pick by how much you want to re-dispatch:

A. Single object — surgical

For “I just want this one job to run again” (after fixing the source file, or to test a pipeline change against a known case):


# Via TriggerIngest — re-runs the matching rule's pipeline
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" }
  }'
 
# Force a specific pipeline (override rule resolution)
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_a1b2..."
  }'
 
# Per-event option overlay (one-off parameter sweep without modifying the pipeline)
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_a1b2...",
    "options": { "chunk_size": "2000" }
  }'

The CLI doesn’t expose TriggerIngest directly today — use the API. See Jobs — API Reference.

B. All failed objects in a source — bulk replay

For “I fixed the pipeline, now re-run everything that failed”:


# Replay every FAILED / PARTIAL object on this source through their matching rules
dodil k3 ingest trigger --bucket kb-prod \
  --source "$SOURCE_ID" \
  --retry-failed
 
# Or scope to one rule's failures
dodil k3 ingest trigger --bucket kb-prod \
  --source "$SOURCE_ID" \
  --rule "$RULE_ID" \
  --retry-failed

The response includes a dispatched count — how many objects K3 queued for re-ingest:


{
  "accepted": true,
  "message": "Replay started for 42 failed objects",
  "dispatched": 42
}

C. Full re-discover (after rules / pipeline options change)

Use this when the rule’s match set changed (you tightened or loosened the include pattern), or when the pipeline now produces different output and you want to re-run everything that would match today:


# Discover from scratch — ignores the source's etag checkpoint
dodil k3 ingest trigger-discovery --bucket kb-prod \
  --source "$SOURCE_ID" \
  --full-sync
 
# Then re-dispatch any pending objects (you can combine with --retry-failed)
dodil k3 ingest trigger --bucket kb-prod --source "$SOURCE_ID" --retry-failed

For external (Preview) sources this is the standard backfill pattern; for the internal S3 source it’s effectively a “scan and re-fire rules on everything in the bucket.”

5. Verify the replay landed

After kicking off a replay (B or C), watch the same ListIngestJobs query you used in step 1 — the failed jobs should move to PROCESSING then COMPLETED:


# Poll the rule's jobs every 5s until none are pending/processing/retrying
# (counts only the first page of results; add page_size for busy rules)
while true; do
  pending=$(curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?rule_id=$RULE_ID" \
    -H "Authorization: Bearer $DODIL_TOKEN" \
    | jq '[.jobs[] | select(.status | IN("INGEST_STATUS_PENDING", "INGEST_STATUS_PROCESSING", "INGEST_STATUS_RETRYING"))] | length')
  echo "in-flight: $pending"
  [ "$pending" = "0" ] && break
  sleep 5
done

Then summarize the new terminal-state distribution:


curl -sS "https://k3.dev.dodil.io/kb-prod/ingest/jobs?rule_id=$RULE_ID&page_size=500" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '[.jobs[].status] | group_by(.) | map({status: .[0], count: length})'

You’re aiming for INGEST_STATUS_COMPLETED to dominate. Anything stuck in FAILED after a replay needs another round of diagnosis.
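
The polling loop in this step has no upper bound: if a job wedges or the request itself fails, it keeps looping. A bounded variant is sketched below. The function name `wait_until_drained` and its argument order are hypothetical; the counter command you pass in is assumed to print the in-flight job count, for example the curl and jq pipeline from this step wrapped in a function.

```shell
# Hypothetical helper: poll a counter command until it prints 0 or polls run out.
# $1 = max polls, $2 = seconds between polls, remaining args = command that
# prints the number of in-flight jobs.
# Returns 0 when drained, 1 when still in flight after max polls,
# 2 when the counter command itself fails.
wait_until_drained() {
  local max="$1" delay="$2" i=0 pending
  shift 2
  while [ "$i" -lt "$max" ]; do
    pending=$("$@") || return 2
    echo "in-flight: $pending" >&2
    [ "$pending" = "0" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Offline demo: "echo 0" stands in for a real counter command.
wait_until_drained 3 0 echo 0 && echo "drained"
```

With a real counter, something like `wait_until_drained 120 5 count_in_flight` would give up after roughly ten minutes, where `count_in_flight` is a function you define around the query above.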

When to not retry

A few classes of failure where retry just burns more CPU:

Object permanently corrupt / unsupported — fix or remove the source object, then re-upload. Retrying without changing the input gets the same error.
Template mismatch — the template requires options you didn’t provide (or the wrong type). Fix pipeline.options first, then replay.
Destination is gone — your vector collection or warehouse table was deleted. Recreate / re-bind the pipeline before replaying.
Quota exhausted — replay will hit the same quota wall. Resolve quotas in the Storage admin first.

A useful gate: if error contains “attempt N/N” (i.e. K3 already retried up to max), assume the failure is permanent until you change something. Just re-firing the same job will land in FAILED again.
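
The gate above can be expressed as a small shell check on the `error` string. This is a sketch under the assumption that the prefix always has the form `attempt N/M:` with numeric N and M; the function name `retries_exhausted` is hypothetical.

```shell
# Hypothetical helper: exit 0 when `error` ($1) carries an "attempt N/M:" prefix
# with N equal to M, meaning every automatic attempt has been used.
retries_exhausted() {
  local rest frac n m
  case "$1" in
    *"attempt "*) ;;
    *) return 1 ;;
  esac
  rest=${1#*"attempt "}   # text after the first "attempt "
  frac=${rest%%:*}        # expected to be "N/M"
  case "$frac" in
    */*) ;;
    *) return 1 ;;
  esac
  n=${frac%%/*}
  m=${frac#*/}
  # Reject anything that is not purely numeric.
  case "$n" in ""|*[!0-9]*) return 1 ;; esac
  case "$m" in ""|*[!0-9]*) return 1 ;; esac
  [ "$n" -eq "$m" ]
}

# Offline demo on sample error strings.
for e in "attempt 3/3: timeout" "attempt 1/3: timeout" "corrupt PDF"; do
  if retries_exhausted "$e"; then
    echo "exhausted: $e"
  else
    echo "not exhausted: $e"
  fi
done
```

A failure that passes this check is a candidate for diagnosis before replay, not for another blind retry.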

Surgical re-run with a pipeline override

When you’re debugging “did my new pipeline options fix this?” without touching the production pipeline:


# Test the same object against a different (test) pipeline
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_test_a1b2..."
  }'
 
# Or test new options without changing the pipeline at all
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/ingest" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "object": { "bucket": "kb-prod", "key": "intake/contracts/acme-2026.pdf" },
    "pipelineId": "pipe_a1b2...",
    "options": { "chunk_size": "2000", "chunk_overlap": "400" }
  }'

The options field on the request is a per-event overlay — it doesn’t modify the pipeline. Spawn N jobs with N different option sets to A/B test before committing changes via pipeline update.
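
An option sweep of this kind can be scripted by generating one request body per option set. The sketch below only prints the payloads (a dry run) so you can inspect them before sending anything. The function name `build_payload`, the placeholder pipeline id `pipe_example`, and the size:overlap pairs are all hypothetical.

```shell
# Hypothetical helper: print a TriggerIngest request body with a per-event
# options overlay. Option values are sent as strings, as in the requests above.
# $1 = object key, $2 = pipeline id, $3 = chunk_size, $4 = chunk_overlap
build_payload() {
  printf '{"object":{"bucket":"kb-prod","key":"%s"},"pipelineId":"%s","options":{"chunk_size":"%s","chunk_overlap":"%s"}}\n' \
    "$1" "$2" "$3" "$4"
}

# Dry run: one payload per size:overlap pair, printed instead of sent.
for pair in 1000:200 2000:400 4000:800; do
  build_payload "intake/contracts/acme-2026.pdf" "pipe_example" \
    "${pair%%:*}" "${pair##*:}"
done
```

To send a payload, pipe it to the same endpoint used above with `-d @-`, which makes curl read the body from stdin. The helper does no JSON escaping, so it only suits keys and ids that contain no quotes or backslashes.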

Common gotchas

Symptom	Cause	Fix
`--retry-failed` returns `dispatched: 0`	No FAILED/PARTIAL jobs match the source/rule filter	Run the `ListIngestJobs` query in step 1 first to confirm there’s work to retry
Replay completes but the new jobs immediately FAIL with the same error	You didn’t fix the underlying issue (template options, source data, destination)	Diagnose first via `errorDetails`, then retry (see “When to not retry”)
Some jobs go `RETRYING → FAILED`, others `RETRYING → COMPLETED`, randomly	A flaky upstream dependency (e.g. embedding model)	K3 retried for you; if the flake rate is high, lower your batch size or contact the platform team instead of papering over it with infinite retries
Old object suddenly re-fires after a rule change	`trigger-discovery --full-sync` ignores the etag checkpoint and re-discovers everything matching	That’s the intended semantics; use `--full-sync` only when you want this
`TriggerIngest` returns 404 for the object	Object key is wrong or already deleted	Re-check `object.bucket` and `object.key` — they’re path-style, case-sensitive