Operations
Day-2 operations and incident triage. The shortest path from “something feels off” to “I know which surface to inspect.”
1. Service health + readiness
curl -sS "https://k3.dev.dodil.io/health"
curl -sS "https://k3.dev.dodil.io/healthz"
curl -sS "https://k3.dev.dodil.io/ready"
curl -sS "https://k3.dev.dodil.io/readyz"
curl -sS "https://k3.dev.dodil.io/metrics"
All return 200 + a short body on a healthy K3. `/metrics` is Prometheus-shaped; scrape it for SLO dashboards. These five paths are the only ones that bypass auth; see Auth & Access → Unauthenticated endpoints.
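To run the five probes in one pass and get a non-zero exit code when any of them is unhealthy, a small wrapper can help. This is a minimal sketch, not part of the K3 tooling: `probe_health` and `K3_BASE_URL` are illustrative names, and the 5-second timeout is an assumption.

```shell
# Hypothetical helper: probe the five unauthenticated paths and report each status code.
# K3_BASE_URL is an assumed variable name; it defaults to the host used above.
K3_BASE_URL="${K3_BASE_URL:-https://k3.dev.dodil.io}"

probe_health() (
  base="$1"
  failed=0
  for path in /health /healthz /ready /readyz /metrics; do
    # -o /dev/null discards the body; -w prints only the HTTP status code.
    code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "${base}${path}") || code="000"
    printf '%-10s %s\n' "$path" "$code"
    [ "$code" = "200" ] || failed=1
  done
  # The function succeeds only if every path returned 200.
  [ "$failed" -eq 0 ]
)

# Usage:
#   probe_health "$K3_BASE_URL" || echo "K3 is not fully healthy" >&2
```

The function body runs in a subshell, so its variables do not leak into the calling shell.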
2. Bucket + engine baseline
# Storage — list your org's buckets
dodil k3 bucket list
# Tables — confirm the tables engine is ACTIVE on the bucket
dodil k3 engine get --bucket "$BUCKET" -o json | jq '.status'
# expect: ENGINE_STATUS_ACTIVE (auto-enabled at bucket creation)
# Vector — confirm the vector engine is ACTIVE (you create this one explicitly)
dodil k3 vector store get --bucket "$BUCKET" -o json | jq '.status'
# expect: ENGINE_STATUS_ACTIVE (after `vector store create -m auto` finishes)
If the Vector engine is missing entirely, `vector store get` returns an error, which means you haven’t configured the bucket for vector workloads yet. Run `dodil k3 vector store create -b $BUCKET -m auto`.
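For scripted checks, the two status lookups above can be turned into pass/fail gates. The sketch below assumes only what the commands above show (a top-level `status` field in the JSON output); `require_active` is a hypothetical helper name.

```shell
# Hypothetical helper: read engine JSON on stdin, succeed only if .status is ENGINE_STATUS_ACTIVE.
require_active() (
  label="$1"
  # jq -r strips the JSON quotes; a missing field is reported as MISSING.
  status=$(jq -r '.status // "MISSING"' 2>/dev/null) || status="UNREADABLE"
  # Empty input (for example, the CLI printed nothing) is also treated as unreadable.
  status="${status:-UNREADABLE}"
  printf '%s: %s\n' "$label" "$status"
  [ "$status" = "ENGINE_STATUS_ACTIVE" ]
)

# Usage (same commands as above, piped through the helper):
#   dodil k3 engine get --bucket "$BUCKET" -o json | require_active tables
#   dodil k3 vector store get --bucket "$BUCKET" -o json | require_active vector
```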
3. Pipelines health
# All rules in the bucket
dodil k3 ingest list --bucket "$BUCKET" -o json | jq '.rules[] | {name, enabled, pipelineId}'
# All jobs in the bucket — focus on FAILED + RETRYING
dodil k3 ingest jobs --bucket "$BUCKET" -o json \
  | jq '[.jobs[] | select(.status | IN("INGEST_STATUS_FAILED", "INGEST_STATUS_RETRYING"))] |
        group_by(.status) |
        map({status: .[0].status, count: length})'
# Stuck-pending count (ingest backlog)
dodil k3 ingest jobs --bucket "$BUCKET" -o json \
  | jq '[.jobs[] | select(.status == "INGEST_STATUS_PENDING")] | length'
Things to look for:
- Repeated `FAILED` with the same `errorDetails` across many objects → a template / pipeline misconfiguration. Replay after fixing; see Pipelines → Replay & Retry.
- Stuck `PENDING` for > 1 minute → check whether the worker is healthy via `/metrics`.
- Persistent `RETRYING` with `attempt N/M` reaching N=M and flipping to `FAILED` → transient failure exhausted retries; likely a real problem (upstream model timeout, missing credential, etc.).
- Mismatched `pipelineId` between a rule and its pipeline → orphaned rule (the pipeline was deleted). Re-bind: `dodil k3 ingest update <rule_id> -b $BUCKET --pipeline <new_pipeline_id>`.
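To spot the first pattern (many `FAILED` jobs sharing one cause), the failed jobs can be grouped by error text. This is a sketch under one assumption: that each job object in the `ingest jobs` JSON exposes `errorDetails` as a plain string field. `top_failure_causes` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ingest jobs` JSON on stdin and print "<count><TAB><errorDetails>",
# most frequent cause first. Assumes errorDetails is a string field on each job.
top_failure_causes() {
  jq -r '[.jobs[] | select(.status == "INGEST_STATUS_FAILED")]
         | group_by(.errorDetails)
         | map({error: .[0].errorDetails, count: length})
         | sort_by(-.count)
         | .[] | "\(.count)\t\(.error)"'
}

# Usage:
#   dodil k3 ingest jobs --bucket "$BUCKET" -o json | top_failure_causes
```

A single cause dominating the list points at a template or pipeline misconfiguration; a long tail of distinct causes suggests per-object problems instead.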
4. Search smoke check
dodil k3 search "health probe query" --bucket "$BUCKET" --collection "$COLLECTION" --top-k 5
If you get zero results:
- Confirm the collection exists: `dodil k3 vector collection get $COLLECTION -b $BUCKET`
- Confirm ingest completed for some objects: `dodil k3 ingest jobs -b $BUCKET -p $PIPELINE_ID -o json | jq '.jobs[] | select(.status == "INGEST_STATUS_COMPLETED")' | head`
- Lower `--min-score` to `0.0` and retest; your model may produce scores in a tighter range than the default threshold.
- Try hybrid + rerank via the API; see Vector → Hybrid + Rerank. The CLI’s `search` is text-only with no rerank.
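The threshold retest can be wrapped so the smoke check always runs without score filtering. A minimal sketch: `search_unfiltered` is a hypothetical wrapper, and it uses only the flags already named in this section.

```shell
# Hypothetical wrapper: rerun the smoke query with the score threshold lowered to 0.0.
# $1 is the query text; BUCKET and COLLECTION are read from the environment.
search_unfiltered() {
  dodil k3 search "$1" --bucket "$BUCKET" --collection "$COLLECTION" --top-k 5 --min-score 0.0
}

# Usage:
#   search_unfiltered "health probe query"
```

If this returns hits where the default call returned none, the data is indexed and the default threshold is the limiting factor.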
5. Tables drain + maintenance
For tables under active write load:
# Inspect the WAL backlog + drain lag SLO
dodil k3 table describe "$TABLE" --bucket "$BUCKET" -o json \
  | jq '{
      version,
      walObjectCount, walTotalBytes, drainLagSecs,
      lastDrainTargetVersion,
      lastDrainInsertedRows, lastDrainUpdatedRows, lastDrainDeletedRows
    }'
Healthy state: `walObjectCount` and `drainLagSecs` close to 0. Persistent backlog means the compactor isn’t keeping up; force a drain:
# Drain until empty
while dodil k3 table compact "$TABLE" --bucket "$BUCKET" -o json | jq -e '.truncated == true' > /dev/null; do
echo " more entries to drain..."
done
# Then optionally bin-pack small files
dodil k3 table optimize "$TABLE" --bucket "$BUCKET" -o json | jq '{filesRemoved, filesAdded, bytesRemoved, bytesAdded}'
For full per-RPC maintenance details, see Tables → Maintenance and the Tables Maintenance CLI guide.
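The drain loop above runs until `.truncated` stops being true, and it also stops silently if the CLI call itself fails. For unattended use, a bounded variant may be safer. The sketch below is one possible shape: `drain_table` and its default cap of 20 rounds are assumptions, not documented behaviour.

```shell
# Hypothetical helper: drain a table's write log, but give up after a fixed number of rounds.
# $1 = table name, $2 = max rounds (defaults to 20, an arbitrary cap). BUCKET comes from the environment.
drain_table() (
  table="$1"
  max_rounds="${2:-20}"
  round=0
  while [ "$round" -lt "$max_rounds" ]; do
    round=$((round + 1))
    # Capture the output first so a failing CLI call is not mistaken for "fully drained".
    out=$(dodil k3 table compact "$table" --bucket "$BUCKET" -o json) || {
      echo "compact call failed on round $round" >&2
      exit 2
    }
    # jq -e exits 0 only when .truncated is true, i.e. more entries remain.
    if ! printf '%s' "$out" | jq -e '.truncated == true' > /dev/null; then
      echo "drained after $round round(s)"
      exit 0
    fi
  done
  echo "still truncated after $max_rounds rounds; check compactor health" >&2
  exit 1
)

# Usage:
#   drain_table "$TABLE" 20 && dodil k3 table optimize "$TABLE" --bucket "$BUCKET" -o json
```

The `exit` calls end only the function's subshell, so they act as the function's return status.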
6. Common failure patterns
What the auth-layer status codes (401 / 402 / 403) mean is defined in Auth & Access; the table below maps operational symptoms to concrete fixes.
| Symptom | Likely cause | Fix |
|---|---|---|
| 401 Unauthorized on every request | Token expired | Re-run dodil login; verify ~/.config/dodil/config.yaml has a fresh token |
| 403 AccessDenied from S3 byte-plane PUT | Storage quota exhausted | Check org quota with admin; bucket-level policy may deny |
| Object upload fails with no clear error | Missing https:// in --api-endpoint for object create | The upload uses HTTP, not gRPC — endpoint must be HTTPS |
| Ingest jobs stay idle after upload | Auto-rule globs don’t match path / disabled / missing | dodil k3 ingest list -b $BUCKET — confirm enabled: true and includePatterns cover the upload path |
| Search returns FAILED_PRECONDITION on text queries | Collection has no embed_model set (EXTERNAL collection without it) | Pre-embed query-side and use vector query shape; or set --embed-model at create time |
| Table query errors with “engine not enabled” | The bucket’s Tables engine got disabled or never enabled | dodil k3 engine get -b $BUCKET — if disabled, dodil k3 engine enable -b $BUCKET |
| Vector search returns 0 results across multi-collection | Compatibility group mismatch — query model differs from collections | Inspect collection_statuses[] for failReason: "incompatible embed_model" — see Vector → Multi-collection Search |
| Strong-freshness Tables query truncated | Write-log scan hit per-call cap | Reduce limit or force a Compact first — see Tables → Data → Query |
| MERGE response shows rowsWritten but describe.lastDrain* is zero | Compactor hasn’t run on the new entries yet | Run dodil k3 table compact $TABLE -b $BUCKET; re-describe |
7. Where to dig deeper
| If you’re investigating… | Start at |
|---|---|
| Ingest failures / replay | Pipelines → Replay & Retry |
| Search quality / latency | Vector → Hybrid + Rerank |
| Table backlog / drain lag | Tables → Maintenance + Tables → CLI maintenance |
| Time-travel / restore | Tables → Time Travel & Restore |
| Multi-collection observability | Vector → Multi-collection Search |
| S3 protocol compatibility | Storage → S3 Compatibility |
See also
- Conventions — auth headers + error envelope
- Feature Status — what’s live vs roadmap
- Auth and Access — auth modes detail