Workflow 5: Incident Debugging Playbook
Last validated: 2026-05-14
Goal
Diagnose failed or stuck threads quickly and collect enough evidence for deterministic fixes.
Trigger Signals
- thread run shows failure/cancelled status
- output is missing or truncated unexpectedly
- thread remains pending/running without progress
A) CLI Fast Triage
# 1) Thread summary
dodil scriptum thread get thr_fail_01
# 2) Full result payloads to disk
dodil scriptum thread result thr_fail_01 --save ./debug/thr_fail_01
# 3) Step-level detail stream
dodil scriptum thread steps thr_fail_01 --detail
# 4) Check service-level health
dodil scriptum health
dodil scriptum doctor
What these commands provide:
- status/error and current step metadata
- output and summary artifacts for offline analysis
- per-step diagnostics (including execution IDs and logs URL when available)
- baseline API connectivity/auth validation
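The four triage steps above can be bundled into one helper so that every incident produces the same set of files. This is a minimal sketch, not part of the dodil tooling: the function names `run_step` and `triage_thread`, the file layout, and the `exit_codes.txt` log are assumptions made for illustration. It also assumes `--save` accepts the path shown (the playbook only demonstrates `./debug/thr_fail_01`).

```shell
# Sketch only: helper names and file layout are hypothetical.

# run_step OUT_DIR NAME CMD...: run CMD, capture its output, record its exit code.
run_step() {
  local out_dir="$1" name="$2" rc=0
  shift 2
  "$@" >"$out_dir/$name.txt" 2>&1 || rc=$?
  echo "$name exit=$rc" >>"$out_dir/exit_codes.txt"
}

# triage_thread THREAD_ID [OUT_DIR]: run the section A commands in order.
# A failing step does not stop the later ones, so partial evidence is still kept.
triage_thread() {
  local thread_id="$1"
  local out_dir="${2:-./debug/${thread_id}}"
  mkdir -p "$out_dir" || return 1

  # 1) Thread summary
  run_step "$out_dir" summary dodil scriptum thread get "$thread_id"
  # 2) Full result payloads to disk (save path under OUT_DIR is an assumption)
  run_step "$out_dir" result dodil scriptum thread result "$thread_id" --save "$out_dir/payloads"
  # 3) Step-level detail
  run_step "$out_dir" steps dodil scriptum thread steps "$thread_id" --detail
  # 4) Service-level health
  run_step "$out_dir" health dodil scriptum health
  run_step "$out_dir" doctor dodil scriptum doctor

  # Print where the evidence landed so callers can attach it to the incident.
  echo "$out_dir"
}
```

Calling `triage_thread thr_fail_01` would write the captured output under `./debug/thr_fail_01/`, and `exit_codes.txt` shows at a glance which command failed.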
B) gRPC Deep-Dive Path
# Thread status
grpcurl \
-H "authorization: Bearer $DODIL_TOKEN" \
-H "x-organization-id: $SCRIPTUM_ORG_ID" \
-d '{"thread_id":"thr_fail_01"}' \
rpc.dev.dodil.io:443 dodil.scriptum.v1.ScriptumService/GetThread
# One-step focus
grpcurl \
-H "authorization: Bearer $DODIL_TOKEN" \
-H "x-organization-id: $SCRIPTUM_ORG_ID" \
-d '{"thread_id":"thr_fail_01","step_path":"1.3.0"}' \
rpc.dev.dodil.io:443 dodil.scriptum.v1.ScriptumService/GetThreadResult
# Stream all step chunks for full evidence
grpcurl \
-H "authorization: Bearer $DODIL_TOKEN" \
-H "x-organization-id: $SCRIPTUM_ORG_ID" \
-d '{"thread_id":"thr_fail_01","max_chunk_bytes":1048576}' \
rpc.dev.dodil.io:443 dodil.scriptum.v1.ScriptumService/StreamStepResult
C) HTTP Operational Checks
curl -sS https://api.dev.dodil.io:443/health
curl -sS https://api.dev.dodil.io:443/ready
curl -sS https://api.dev.dodil.io:443/metrics | head -n 40
If a gateway is deployed, compare API-level behavior:
curl -sS "https://api.dev.dodil.io:443/v1/scriptum/threads/thr_fail_01" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "x-organization-id: $SCRIPTUM_ORG_ID"
D) Decision Tree
- If thread is stuck pending:
  - Verify worker health and queue consumption.
- If thread failed quickly:
  - Inspect step diagnostics and env resolution.
- If output is truncated or incomplete:
  - Switch from unary result to stream endpoints.
- If pause cannot continue:
  - Validate resume payload type/options/retry/expiry.
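The decision tree can also be kept as a small lookup function so that on-call notes and scripts use the same wording. This is an illustrative sketch: the function name `next_check` and the symptom keywords are hypothetical labels, and only the recommended actions come from the tree above.

```shell
# next_check SYMPTOM: print the first check for a symptom.
# Symptom keywords are hypothetical labels chosen for this sketch.
next_check() {
  case "$1" in
    stuck-pending)    echo "Verify worker health and queue consumption." ;;
    failed-fast)      echo "Inspect step diagnostics and env resolution." ;;
    truncated-output) echo "Switch from unary result to stream endpoints." ;;
    pause-blocked)    echo "Validate resume payload type/options/retry/expiry." ;;
    *)
      # Unknown symptoms are an error, so callers do not silently skip a check.
      echo "Unknown symptom: $1" >&2
      return 2
      ;;
  esac
}
```

For example, `next_check truncated-output` prints the stream-endpoint advice, while an unrecognized keyword returns exit code 2.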
E) Evidence to Collect Before Fix
- Thread ID and script version
- Input payload used
- Result summary and complete output artifacts
- Step-level diagnostics and failing step path
- API/worker health status at incident time
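Before starting a fix, it can help to confirm mechanically that every item on the evidence list has been captured. The sketch below assumes one non-empty file per item in an incident directory. The function name `check_evidence` and all file names are hypothetical, so adapt them to however your team stores artifacts.

```shell
# check_evidence DIR: report which evidence items are present in DIR.
# Returns 0 only when every item exists and is non-empty.
# File names are hypothetical placeholders for the five items listed above.
check_evidence() {
  local dir="$1" missing=0 item
  for item in thread_and_version.txt input_payload.json result_summary.txt \
              step_diagnostics.txt health_status.txt; do
    if [ -s "$dir/$item" ]; then
      echo "ok       $item"
    else
      echo "MISSING  $item"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Running `check_evidence ./debug/thr_fail_01` lists each item as `ok` or `MISSING`, and the non-zero exit code can be used to block the fix from starting until the gaps are filled.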
F) Stabilization Loop
- Patch draft.
- Compile and draft-test.
- Publish fixed version.
- Re-run representative failing input.
- Close incident with attached artifacts and version IDs.