Operations

Day-2 operations and incident triage. The shortest path from “something feels off” to “I know which surface to inspect.”

1. Service health + readiness

    curl -sS "https://k3.dev.dodil.io/health"
    curl -sS "https://k3.dev.dodil.io/healthz"
    curl -sS "https://k3.dev.dodil.io/ready"
    curl -sS "https://k3.dev.dodil.io/readyz"
    curl -sS "https://k3.dev.dodil.io/metrics"

All five return 200 with a short body on a healthy K3. /metrics is in Prometheus exposition format; scrape it for SLO dashboards. These five paths are the only ones that bypass auth (see Auth & Access → Unauthenticated endpoints).
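For a cron job or a CI gate, the five probes can be wrapped in one function that fails if any endpoint does not answer 200. This is a minimal sketch, assuming a POSIX-style shell and curl; the function name check_k3_health and the 5-second timeout are illustrative choices, not part of the K3 tooling.

```shell
# Sketch: probe the five unauthenticated endpoints and flag any non-200.
# check_k3_health is a hypothetical helper; the 5 s timeout is an assumption.
check_k3_health() {
  local base="${1:-https://k3.dev.dodil.io}"
  local path code failed=0
  for path in health healthz ready readyz metrics; do
    # -o /dev/null drops the body; -w prints only the HTTP status code.
    code="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "$base/$path" 2>/dev/null)" || code="000"
    if [ "$code" = "200" ]; then
      echo "ok    /$path"
    else
      echo "FAIL  /$path (HTTP $code)"
      failed=1
    fi
  done
  return "$failed"
}
```

Call it as check_k3_health (default base URL) or check_k3_health "https://your-host"; a non-zero exit status means at least one probe failed.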

2. Bucket + engine baseline

    # Storage: list your org's buckets
    dodil k3 bucket list

    # Tables: confirm the tables engine is ACTIVE on the bucket
    dodil k3 engine get --bucket "$BUCKET" -o json | jq '.status'
    # expect: ENGINE_STATUS_ACTIVE (auto-enabled at bucket creation)

    # Vector: confirm the vector engine is ACTIVE (you create this one explicitly)
    dodil k3 vector store get --bucket "$BUCKET" -o json | jq '.status'
    # expect: ENGINE_STATUS_ACTIVE (after `vector store create -m auto` finishes)

If the vector engine is missing entirely, vector store get returns an error, which means the bucket has not been configured for vector workloads yet. Create the store with dodil k3 vector store create -b $BUCKET -m auto.
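The "after vector store create -m auto finishes" comment suggests the store takes some time to reach ACTIVE, so a provisioning script may want to block until it does. The following is a sketch of a polling helper, assuming the .status field shown above and that jq is installed; the name wait_for_vector_active, the 30 attempts, and the 5-second delay are hypothetical defaults.

```shell
# Sketch: poll the vector store until it reports ENGINE_STATUS_ACTIVE.
# wait_for_vector_active, the attempt count and the delay are assumptions.
wait_for_vector_active() {
  local bucket="$1" attempts="${2:-30}" delay="${3:-5}"
  local i=1 status=""
  while [ "$i" -le "$attempts" ]; do
    # A missing store makes `get` fail; treat that as "not ready yet".
    status="$(dodil k3 vector store get --bucket "$bucket" -o json 2>/dev/null \
      | jq -r '.status // empty' 2>/dev/null)" || status=""
    if [ "$status" = "ENGINE_STATUS_ACTIVE" ]; then
      echo "vector engine ACTIVE after $i check(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "vector engine not ACTIVE after $attempts checks (last status: ${status:-none})" >&2
  return 1
}
```

Usage: wait_for_vector_active "$BUCKET" right after the create command; the function returns 1 if the status never becomes ACTIVE within the allotted checks.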

3. Pipelines health

    # All rules in the bucket
    dodil k3 ingest list --bucket "$BUCKET" -o json | jq '.rules[] | {name, enabled, pipelineId}'

    # All jobs in the bucket; focus on FAILED + RETRYING
    dodil k3 ingest jobs --bucket "$BUCKET" -o json \
      | jq '[.jobs[] | select(.status | IN("INGEST_STATUS_FAILED", "INGEST_STATUS_RETRYING"))] | group_by(.status) | map({status: .[0].status, count: length})'

    # Stuck-pending count (ingest backlog)
    dodil k3 ingest jobs --bucket "$BUCKET" -o json \
      | jq '[.jobs[] | select(.status == "INGEST_STATUS_PENDING")] | length'

Things to look for:

  • Repeated FAILED with the same errorDetails across many objects → a template or pipeline misconfiguration. Fix it, then replay (see Pipelines → Replay & Retry).
  • Stuck PENDING for > 1 minute → check whether the worker is healthy via /metrics.
  • RETRYING jobs whose attempt counter N/M reaches N=M and then flip to FAILED → retries are exhausted, so the cause is probably persistent rather than transient (upstream model timeout, missing credential, etc.).
  • A rule whose pipelineId no longer matches an existing pipeline → orphaned rule (the pipeline was deleted). Re-bind: dodil k3 ingest update <rule_id> -b $BUCKET --pipeline <new_pipeline_id>.
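To spot the first pattern quickly (the same error repeated across many objects), the FAILED jobs can be grouped by their error text. This is a sketch, assuming each job object carries the errorDetails field named above next to status; the function name top_ingest_errors is hypothetical.

```shell
# Sketch: read `dodil k3 ingest jobs ... -o json` on stdin and print
# "<count><TAB><errorDetails>" for FAILED jobs, most frequent first.
# Assumes jobs expose .status and .errorDetails as described in the text.
top_ingest_errors() {
  jq -r '
    [(.jobs // [])[] | select(.status == "INGEST_STATUS_FAILED")]
    | group_by(.errorDetails)
    | map({count: length, error: (.[0].errorDetails // "(no errorDetails)")})
    | sort_by(-.count)
    | .[]
    | "\(.count)\t\(.error)"
  '
}
```

Usage: dodil k3 ingest jobs --bucket "$BUCKET" -o json | top_ingest_errors. One line with a high count points at a misconfiguration; many distinct lines with a count of 1 point at per-object problems.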

4. Search smoke check

dodil k3 search "health probe query" --bucket "$BUCKET" --collection "$COLLECTION" --top-k 5

If you get zero results:

  1. Confirm the collection exists: dodil k3 vector collection get $COLLECTION -b $BUCKET
  2. Confirm ingest completed for some objects: dodil k3 ingest jobs -b $BUCKET -p $PIPELINE_ID -o json | jq '.jobs[] | select(.status == "INGEST_STATUS_COMPLETED")' | head
  3. Lower --min-score to 0.0 and retest — your model may produce scores in a tighter range than the default threshold
  4. Try hybrid + rerank via the API — see Vector → Hybrid + Rerank. The CLI’s search is text-only with no rerank.
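Steps 1 to 3 above can be chained so that the first failing check explains the empty result. The following is a rough sketch, assuming jq is installed and that --min-score takes its value as a separate argument; the function name triage_empty_search and its return codes are hypothetical.

```shell
# Sketch: walk through the zero-result checklist (steps 1-3).
# triage_empty_search and its exit codes are illustrative, not a K3 feature.
triage_empty_search() {
  local bucket="$1" collection="$2" pipeline_id="$3" query="${4:-health probe query}"
  local completed

  # Step 1: the collection must exist.
  if ! dodil k3 vector collection get "$collection" -b "$bucket" >/dev/null 2>&1; then
    echo "collection '$collection' not found in bucket '$bucket'"
    return 1
  fi

  # Step 2: at least one ingest job must have completed.
  completed="$(dodil k3 ingest jobs -b "$bucket" -p "$pipeline_id" -o json \
    | jq '[(.jobs // [])[] | select(.status == "INGEST_STATUS_COMPLETED")] | length')"
  if [ "${completed:-0}" -eq 0 ]; then
    echo "no completed ingest jobs for pipeline '$pipeline_id'"
    return 2
  fi

  # Step 3: retry with the score threshold lowered to 0.0.
  echo "$completed completed job(s); retrying with the score threshold at 0.0"
  dodil k3 search "$query" --bucket "$bucket" --collection "$collection" --top-k 5 --min-score 0.0
}
```

Usage: triage_empty_search "$BUCKET" "$COLLECTION" "$PIPELINE_ID". If all three checks pass and the search is still empty, step 4 (hybrid + rerank via the API) is the next thing to try.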

5. Tables drain + maintenance

For tables under active write load:

    # Inspect the WAL backlog + drain lag SLO
    dodil k3 table describe "$TABLE" --bucket "$BUCKET" -o json \
      | jq '{
          version,
          walObjectCount, walTotalBytes, drainLagSecs,
          lastDrainTargetVersion,
          lastDrainInsertedRows, lastDrainUpdatedRows, lastDrainDeletedRows
        }'

Healthy state: walObjectCount and drainLagSecs close to 0. Persistent backlog means the compactor isn’t keeping up — force a drain:

    # Drain until empty
    while dodil k3 table compact "$TABLE" --bucket "$BUCKET" -o json | jq -e '.truncated == true' > /dev/null; do
      echo "  more entries to drain..."
    done

    # Then optionally bin-pack small files
    dodil k3 table optimize "$TABLE" --bucket "$BUCKET" -o json | jq '{filesRemoved, filesAdded, bytesRemoved, bytesAdded}'
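The drain loop above has no upper bound, so under write load that outpaces the compactor it could keep running. This is a sketch of a bounded variant, assuming the same .truncated field in the compact response; drain_table and the default cap of 20 passes are hypothetical.

```shell
# Sketch: drain the write log with a cap on the number of compact passes.
# drain_table and the default of 20 passes are assumptions for illustration.
drain_table() {
  local table="$1" bucket="$2" max_passes="${3:-20}"
  local pass=1 out truncated
  while [ "$pass" -le "$max_passes" ]; do
    # Capture the response first so a failed compact is not mistaken for "done".
    out="$(dodil k3 table compact "$table" --bucket "$bucket" -o json)" || {
      echo "compact failed on pass $pass" >&2
      return 2
    }
    truncated="$(printf '%s' "$out" | jq -r '.truncated // false')"
    if [ "$truncated" != "true" ]; then
      echo "drained after $pass pass(es)"
      return 0
    fi
    echo "  pass $pass: more entries to drain..."
    pass=$((pass + 1))
  done
  echo "still truncated after $max_passes passes; check compactor health" >&2
  return 1
}
```

Usage: drain_table "$TABLE" "$BUCKET" 10. Exit status 1 means the backlog outlived the cap, which is a signal to look at write volume or compactor health rather than to keep looping.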

For full per-RPC maintenance details, see Tables → Maintenance and the Tables Maintenance CLI guide.

6. Common failure patterns

What the auth-layer status codes (401 / 402 / 403) mean is defined in Auth & Access; the table below maps operational symptoms to concrete fixes.

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 401 Unauthorized on every request | Token expired | Re-run dodil login; verify ~/.config/dodil/config.yaml has a fresh token |
| 403 AccessDenied from S3 byte-plane PUT | Storage quota exhausted | Check org quota with admin; bucket-level policy may deny |
| Object upload fails with no clear error | Missing https:// in --api-endpoint for object create | The upload uses HTTP, not gRPC; endpoint must be HTTPS |
| Ingest jobs stay idle after upload | Auto-rule globs don’t match path / disabled / missing | dodil k3 ingest list -b $BUCKET; confirm enabled: true and includePatterns cover the upload path |
| Search returns FAILED_PRECONDITION on text queries | Collection has no embed_model set (EXTERNAL collection without it) | Pre-embed query-side and use vector query shape; or set --embed-model at create time |
| Table query errors with “engine not enabled” | The bucket’s Tables engine got disabled or never enabled | dodil k3 engine get -b $BUCKET; if disabled, dodil k3 engine enable -b $BUCKET |
| Vector search returns 0 results across multi-collection | Compatibility group mismatch: query model differs from collections | Inspect collection_statuses[] for failReason: "incompatible embed_model"; see Vector → Multi-collection Search |
| Strong-freshness Tables query truncated | Write-log scan hit per-call cap | Reduce limit or force a Compact first; see Tables → Data → Query |
| MERGE response shows rowsWritten but describe.lastDrain* is zero | Compactor hasn’t run on the new entries yet | Run dodil k3 table compact $TABLE -b $BUCKET; re-describe |

7. Where to dig deeper

| If you’re investigating… | Start at |
| --- | --- |
| Ingest failures / replay | Pipelines → Replay & Retry |
| Search quality / latency | Vector → Hybrid + Rerank |
| Table backlog / drain lag | Tables → Maintenance + Tables → CLI maintenance |
| Time-travel / restore | Tables → Time Travel & Restore |
| Multi-collection observability | Vector → Multi-collection Search |
| S3 protocol compatibility | Storage → S3 Compatibility |

See also