Skip to Content
We are live but in Staging 🎉
WorkflowsWorkflow 5: Incident Debugging Playbook

Workflow 5: Incident Debugging Playbook

Last validated: 2026-05-14

Goal

Diagnose failed or stuck threads quickly and collect enough evidence for deterministic fixes.

Trigger Signals

  • thread run shows failure/cancelled status
  • output is missing or truncated unexpectedly
  • thread remains pending/running without progress

A) CLI Fast Triage

# 1) Thread summary dodil scriptum thread get thr_fail_01 # 2) Full result payloads to disk dodil scriptum thread result thr_fail_01 --save ./debug/thr_fail_01 # 3) Step-level detail stream dodil scriptum thread steps thr_fail_01 --detail # 4) Check service-level health dodil scriptum health dodil scriptum doctor

What these commands provide:

  • status/error and current step metadata
  • output and summary artifacts for offline analysis
  • per-step diagnostics (including execution IDs and logs URL when available)
  • baseline API connectivity/auth validation

B) gRPC Deep-Dive Path

# Thread status grpcurl \ -H "authorization: Bearer $DODIL_TOKEN" \ -H "x-organization-id: $SCRIPTUM_ORG_ID" \ -d '{"thread_id":"thr_fail_01"}' \ rpc.dev.dodil.io:443 dodil.scriptum.v1.ScriptumService/GetThread # One-step focus grpcurl \ -H "authorization: Bearer $DODIL_TOKEN" \ -H "x-organization-id: $SCRIPTUM_ORG_ID" \ -d '{"thread_id":"thr_fail_01","step_path":"1.3.0"}' \ rpc.dev.dodil.io:443 dodil.scriptum.v1.ScriptumService/GetThreadResult # Stream all step chunks for full evidence grpcurl \ -H "authorization: Bearer $DODIL_TOKEN" \ -H "x-organization-id: $SCRIPTUM_ORG_ID" \ -d '{"thread_id":"thr_fail_01","max_chunk_bytes":1048576}' \ rpc.dev.dodil.io:443 dodil.scriptum.v1.ScriptumService/StreamStepResult

C) HTTP Operational Checks

curl -sS https://api.dev.dodil.io:443/health curl -sS https://api.dev.dodil.io:443/ready curl -sS https://api.dev.dodil.io:443/metrics | head -n 40

If gateway is deployed, compare API-level behavior:

curl -sS "https://api.dev.dodil.io:443/v1/scriptum/threads/thr_fail_01" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "x-organization-id: $SCRIPTUM_ORG_ID"

D) Decision Tree

  1. If thread is stuck pending:
    • Verify worker health and queue consumption.
  2. If thread failed quickly:
    • Inspect step diagnostics and env resolution.
  3. If output is truncated or incomplete:
    • Switch from unary result to stream endpoints.
  4. If pause cannot continue:
    • Validate resume payload type/options/retry/expiry.

E) Evidence to Collect Before Fix

  • Thread ID and script version
  • Input payload used
  • Result summary and complete output artifacts
  • Step-level diagnostics and failing step path
  • API/worker health status at incident time

F) Stabilization Loop

  1. Patch draft.
  2. Compile and draft-test.
  3. Publish fixed version.
  4. Re-run representative failing input.
  5. Close incident with attached artifacts and version IDs.