Document Intake — Triage + Semantic Search

Goal: every uploaded document gets simultaneously triaged for routing (classify → priority → entity extraction → assigned team) AND indexed for retrieval (chunked → embedded → searchable). One ingest source, two parallel pipelines.

Why this matters: an enterprise document-intake system needs both. Triage answers “what is this and where does it go?” (a SQL-shaped question — counts, dashboards, routing rules). Retrieval answers “show me similar documents to this one” (a semantic question — RAG, Q&A, related-cases recall).

Primitives used: Storage + Pipelines (two pipelines, same internal-S3 source) + Tables + Vector.

Shape:

    document upload ──► Storage bucket
                              ▼
            internal-S3 source matches both rules
                    ┌─────────┴─────────┐
                    │                   │
                    ▼                   ▼
            document_triage      text_embedding_index
            Scriptum pipeline    Scriptum pipeline
                    │                   │
                    ▼                   ▼
            Tables: intake       Vector: intake_vec
            (one row per doc:    (N chunks per doc,
             classify, route,     semantic recall)
             entities, urgency)
                    │                   │
                    ▼                   ▼
            Routing layer        "Similar cases"
            (assigns to team)    (RAG / agent grounding)

Compared to Reviews Dashboard: same fan-out shape, different templates. Reviews focuses on per-record analytics; this recipe focuses on per-document routing + retrieval.

Prerequisites

  • dodil CLI + dodil login
  • A bucket — kb-intake:
    dodil k3 bucket create kb-intake -d "Document intake pipeline"
  • Vector engine configured (Tables engine is auto-on):
    dodil k3 vector store create -b kb-intake -m auto # Wait for ACTIVE

1. Browse the two templates

The two templates are document_triage (warehouse-compatible, emits structured rows) and text_embedding_index (vector-compatible). They accept overlapping modalities:

    # Tables-side
    dodil k3 table templates --search triage -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'

    # Vector-side
    dodil k3 vector templates --search text -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'

Both accept text + PDF + docx + HTML + email. Inspect each template’s contract to understand what it produces:

    dodil k3 template get document_triage -o json | jq '.contract'
    dodil k3 template get text_embedding_index -o json | jq '.contract'

2. Create the two destinations

A. Pipeline-bound table for triage

    dodil k3 table create intake -b kb-intake \
      --description "Document intake: triage classification + entities" \
      --source pipeline_generated \
      --pipeline-template-id document_triage

document_triage typically emits per-document fields such as: source_key, document_type, topic, language, priority, entities (a JSON array), routing_team, sentiment, urgency_score, extracted_at.

B. Pipeline-mode vector collection for retrieval

    dodil k3 vector collection add intake_vec -b kb-intake \
      --description "Document intake: semantic chunks for retrieval" \
      --template text_embedding_index

Capture identifiers:

    export TABLE_PIPELINE_ID=$(dodil k3 table get intake -b kb-intake -o json | jq -r '.pipelineId')
    export VECTOR_PIPELINE_ID=$(dodil k3 vector collection get intake_vec -b kb-intake -o json | jq -r '.embedPipelineId')

    # Two rules now exist on this bucket
    dodil k3 ingest list -b kb-intake -o json \
      | jq '.rules[] | {name, pipelineId, includePatterns, includeMimeTypes, enabled}'

3. Upload an intake document

Stage a sample document representing an inbound customer support PDF:

    # -f makes curl exit non-zero on an HTTP error, so the fallback below actually runs
    curl -fsSL https://example.com/customer-letter.pdf -o customer-letter.pdf 2>/dev/null || \
    cat > customer-letter.txt <<'EOF'
    Subject: Service interruption, extended outage on May 23

    Dear Support Team,

    Our company experienced a complete service outage from May 23 09:00 to May 23 14:30 UTC, affecting our production billing systems. This is the third such incident in two months.

    We require:
    1. A formal incident root-cause analysis (RCA) within 5 business days
    2. Service credits per our SLA (5.5h downtime exceeds 99.9% monthly threshold)
    3. A direct contact for escalation: Jane Smith (jane@example.com, +1-555-0142)

    Please escalate to your account executive. Our contract: ACME-CONTRACT-2024-INT-091.

    Sincerely,
    Operations Team, Example Corp
    EOF

    # If the fallback text file was written, upload it under the .pdf name
    mv customer-letter.txt customer-letter.pdf 2>/dev/null || true

    dodil k3 object create ./customer-letter.pdf -b kb-intake -k inbox/2026-05-27/customer-letter.pdf

Two pipelines fire in parallel.

4. Watch both pipelines complete

    # Tables pipeline: produces ONE row per document
    dodil k3 ingest jobs -b kb-intake -p "$TABLE_PIPELINE_ID" -o json \
      | jq '.jobs[] | {pipeline: "triage", object: .object.key, status, rowsWritten}'

    # Vector pipeline: produces MANY chunks per document
    dodil k3 ingest jobs -b kb-intake -p "$VECTOR_PIPELINE_ID" -o json \
      | jq '.jobs[] | {pipeline: "embed", object: .object.key, status, chunksCreated, embeddingsWritten}'

Different “shapes” per pillar:

| Pipeline | Output per document | Typical latency |
|---|---|---|
| document_triage | 1 row (classification + entities + routing decision) | ~1–5 s |
| text_embedding_index | N chunks (token-based chunking) → N embeddings | ~3–15 s |

The triage pipeline usually finishes first, so your routing layer can act on triage results before the document is fully indexed for retrieval.
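To make the "N chunks per document" shape concrete, here is a minimal sketch of token-based chunking with overlap. It is only an illustration: the chunker that text_embedding_index actually uses is not described in this recipe, and the whitespace tokenizer, the `max_tokens` and `overlap` values, and the function name `chunk_tokens` are all assumptions.

```python
def chunk_tokens(text: str, max_tokens: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of whitespace-delimited tokens.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# A synthetic 120-token document yields several chunks, each of which
# would become one embedding in the vector collection.
document = " ".join(f"word{i}" for i in range(120))
chunks = chunk_tokens(document, max_tokens=50, overlap=10)
print(len(chunks), "chunks")
```

Longer documents produce more chunks, which is one reason the embedding side tends to take longer than the single-row triage side.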

5. Query the triage table — routing logic

After the Tables pipeline drains, the document’s triage row is queryable:

    # What triage said about this specific document
    dodil k3 table query "
      SELECT * FROM intake
      WHERE source_key = 'inbox/2026-05-27/customer-letter.pdf'
    " -b kb-intake -o json | jq '.results'

Sample row (template-specific fields):

    {
      "source_key": "inbox/2026-05-27/customer-letter.pdf",
      "document_type": "complaint",
      "topic": "service_outage",
      "priority": "high",
      "urgency_score": 0.87,
      "entities": ["ACME-CONTRACT-2024-INT-091", "Jane Smith", "jane@example.com", "Example Corp"],
      "routing_team": "tier-2-support",
      "language": "en",
      "extracted_at": 1716840000000000
    }
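If the routing layer is written in Python, it can help to load a row like the sample above into a typed structure before acting on it. The sketch below assumes the field names shown in the sample and assumes that `extracted_at` is epoch microseconds (the sample value has that magnitude, but the template contract is the authority). `TriageRow` and `parse_triage_row` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TriageRow:
    source_key: str
    document_type: str
    topic: str
    priority: str
    urgency_score: float
    entities: list
    routing_team: Optional[str]  # may be null when triage cannot classify
    language: str
    extracted_at: datetime


def parse_triage_row(raw: dict) -> TriageRow:
    # Assumption: extracted_at is epoch microseconds.
    extracted = datetime.fromtimestamp(raw["extracted_at"] / 1_000_000, tz=timezone.utc)
    return TriageRow(
        source_key=raw["source_key"],
        document_type=raw["document_type"],
        topic=raw["topic"],
        priority=raw["priority"],
        urgency_score=float(raw["urgency_score"]),
        entities=list(raw.get("entities") or []),
        routing_team=raw.get("routing_team"),
        language=raw["language"],
        extracted_at=extracted,
    )


sample = json.loads("""
{
  "source_key": "inbox/2026-05-27/customer-letter.pdf",
  "document_type": "complaint",
  "topic": "service_outage",
  "priority": "high",
  "urgency_score": 0.87,
  "entities": ["ACME-CONTRACT-2024-INT-091", "Jane Smith", "jane@example.com", "Example Corp"],
  "routing_team": "tier-2-support",
  "language": "en",
  "extracted_at": 1716840000000000
}
""")
row = parse_triage_row(sample)
print(row.routing_team, row.priority, row.extracted_at.isoformat())
```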

Operational dashboard queries

    # What's incoming today? 'high' is listed first; priority is a string, so a
    # plain ORDER BY priority DESC would sort it alphabetically rather than by severity.
    dodil k3 table query "
      SELECT document_type, priority, COUNT(*) AS n
      FROM intake
      WHERE extracted_at >= DATE_TRUNC('day', NOW())
      GROUP BY document_type, priority
      ORDER BY CASE WHEN priority = 'high' THEN 0 ELSE 1 END, n DESC
    " -b kb-intake

    # Top topics for tier-2 routing queue
    dodil k3 table query "
      SELECT topic, COUNT(*) AS n
      FROM intake
      WHERE routing_team = 'tier-2-support' AND urgency_score > 0.5
      GROUP BY topic
      ORDER BY n DESC
      LIMIT 10
    " -b kb-intake

    # Outliers needing immediate review
    dodil k3 table query "
      SELECT source_key, document_type, urgency_score
      FROM intake
      WHERE urgency_score > 0.8
        AND extracted_at >= NOW() - INTERVAL 1 HOUR
      ORDER BY urgency_score DESC
    " -b kb-intake

These power your team-routing layer — e.g. a worker process polls the table every minute and dispatches each new high-urgency document to a queue / Slack / Jira based on routing_team.
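One pass of such a polling worker might look like the sketch below. It covers only the grouping logic: fetching rows from the table and posting to a queue, Slack, or Jira are left out. The names `dispatch_new` and `seen_keys`, the `"unrouted"` fallback queue, and the 0.8 threshold (borrowed from the outlier query above) are assumptions for illustration.

```python
from collections import defaultdict


def dispatch_new(rows, seen_keys, urgency_threshold=0.8):
    """Group not-yet-dispatched, high-urgency rows by routing_team.

    rows:      dicts with source_key, urgency_score, routing_team
    seen_keys: set of source_key values already dispatched (mutated in place)
    """
    queues = defaultdict(list)
    for row in rows:
        key = row["source_key"]
        if key in seen_keys or row["urgency_score"] <= urgency_threshold:
            continue
        # Hypothetical fallback queue for rows where routing_team is null.
        team = row.get("routing_team") or "unrouted"
        queues[team].append(key)
        seen_keys.add(key)
    return dict(queues)


# Hypothetical rows, shaped like the triage sample row.
polled = [
    {"source_key": "inbox/a.pdf", "urgency_score": 0.87, "routing_team": "tier-2-support"},
    {"source_key": "inbox/b.pdf", "urgency_score": 0.40, "routing_team": "tier-1-support"},
    {"source_key": "inbox/c.pdf", "urgency_score": 0.95, "routing_team": None},
]
seen = set()
first_pass = dispatch_new(polled, seen)
second_pass = dispatch_new(polled, seen)  # nothing new on the next poll
print(first_pass)
print(second_pass)
```

Keeping the set of dispatched keys outside the function means the same document is not sent twice when consecutive polls overlap. A production worker would persist that set rather than hold it in memory.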

6. Query the vector collection — “similar documents”

When a triage decision is made (or an analyst is reading a specific document), the next question is usually “have we seen this before?” Vector handles it:

    curl -sS -X POST "https://k3.dev.dodil.io/kb-intake/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-intake",
        "collectionName": "intake_vec",
        "s3Key": "inbox/2026-05-27/customer-letter.pdf",
        "topK": 10,
        "searchMode": "SEARCH_MODE_AUTO",
        "rerank": true,
        "includeContent": true
      }' | jq '.results[] | {
        score,
        object: .object.key,
        chunkIndex,
        content: (.content | .[0:200])
      }'

This searches the entire intake_vec collection for chunks semantically similar to chunks of the query document — including the document itself (filter that out client-side using .object.key != "inbox/2026-05-27/customer-letter.pdf").
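The client-side filtering can be sketched in Python as follows. Because results are chunks, several hits may come from the same document, so this sketch also keeps only the best-scoring chunk per document. It assumes the result shape used in the jq filter above (`score`, `object.key`, `chunkIndex`) and assumes that a higher score means a closer match. The function name and the example keys are hypothetical.

```python
def similar_documents(results, query_key, limit=5):
    """Drop the query document and keep the best-scoring chunk per document."""
    best = {}
    for hit in results:
        key = hit["object"]["key"]
        if key == query_key:
            continue  # the query document always matches itself
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Assumption: higher score means more similar.
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:limit]


# Hypothetical search results for illustration.
hits = [
    {"score": 0.93, "object": {"key": "inbox/2026-05-27/customer-letter.pdf"}, "chunkIndex": 0},
    {"score": 0.81, "object": {"key": "inbox/example-a.pdf"}, "chunkIndex": 2},
    {"score": 0.77, "object": {"key": "inbox/example-b.pdf"}, "chunkIndex": 0},
    {"score": 0.84, "object": {"key": "inbox/example-a.pdf"}, "chunkIndex": 5},
]
top = similar_documents(hits, "inbox/2026-05-27/customer-letter.pdf")
for hit in top:
    print(hit["object"]["key"], hit["score"], hit["chunkIndex"])
```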

Returns previous customer letters with similar themes (outages, RCA requests, escalation patterns) — useful context for the agent / analyst handling the new document.

7. The killer pattern — routing decision + similar-case recall in one workflow

A realistic agent workflow combining both pillars:

    import os

    import requests

    K3 = "https://k3.dev.dodil.io"
    HEADERS = {
        "Authorization": f"Bearer {os.environ['DODIL_TOKEN']}",
        "Content-Type": "application/json",
    }


    def handle_intake(object_key: str):
        """
        1. Get triage decision from Tables.
        2. Find semantically similar past documents from Vector.
        3. Hand decision + context to the routing layer.
        """
        # 1. Pull the triage row (read-your-writes: STRONG freshness ensures we see
        #    the row even if it just landed via the triage pipeline).
        #    Parameter substitution depends on K3's SQL flavor, so this uses a
        #    literal with single quotes doubled. Columns are named explicitly so
        #    the positional indexes below do not depend on the table's column order.
        key_literal = object_key.replace("'", "''")
        triage = requests.post(
            f"{K3}/kb-intake/tables/_query",
            headers=HEADERS,
            json={
                "bucket": "kb-intake",
                "tableName": "intake",
                "sql": (
                    "SELECT document_type, topic, priority, entities, routing_team "
                    f"FROM intake WHERE source_key = '{key_literal}'"
                ),
                "freshness": "FRESHNESS_STRONG",
            },
        ).json()

        if not triage.get("rows"):
            # Triage pipeline hasn't drained yet; retry shortly
            return {"status": "pending"}

        row = triage["rows"][0]
        decision = {
            "type": row[0],      # document_type
            "topic": row[1],
            "priority": row[2],
            "entities": row[3],
            "team": row[4],      # routing_team
        }

        # 2. Find similar past documents
        similar = requests.post(
            f"{K3}/kb-intake/vector/search",
            headers=HEADERS,
            json={
                "bucket": "kb-intake",
                "collectionName": "intake_vec",
                "s3Key": object_key,
                "topK": 6,  # one spare slot in case the query document itself comes back
                "searchMode": "SEARCH_MODE_AUTO",
                "rerank": True,
                "includeContent": True,
                "preFilter": {
                    # exclude the query document itself from the results
                    "op": "LOGICAL_OP_AND",
                    "filters": [
                        {"field": "source_key", "op": "FILTER_OP_NEQ", "value": object_key},
                    ],
                },
            },
        ).json()

        # Filter again client-side as a safeguard, then keep the top 5
        context = [
            {"key": r["object"]["key"], "score": r["score"], "preview": r["content"][:200]}
            for r in similar["results"]
            if r["object"]["key"] != object_key
        ][:5]

        return {
            "status": "ready",
            "decision": decision,
            "similar_cases": context,
        }


    # Use it
    result = handle_intake("inbox/2026-05-27/customer-letter.pdf")
    print(result)

This is the pattern that makes K3 different from a stack with separate document AI + vector DB products: both signals land from one upload, queryable side-by-side.
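Because the workflow above returns `{"status": "pending"}` until the triage pipeline has drained, a caller usually wants a bounded retry around it. The following is a minimal sketch with exponential backoff. `wait_until_ready`, its parameters, and the injected `sleep` argument (there so the example runs without actually waiting) are assumptions, not part of K3.

```python
import time


def wait_until_ready(fetch, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call fetch() until its status is no longer 'pending' or attempts run out."""
    result = {"status": "pending"}
    for attempt in range(attempts):
        result = fetch()
        if result.get("status") != "pending":
            return result
        if attempt < attempts - 1:
            sleep(delay_s * (2 ** attempt))  # exponential backoff between tries
    return result  # still pending: the caller decides what to do next


# Simulated responses: pending twice, then ready.
responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "ready", "decision": {"team": "tier-2-support"}},
])
delays = []
outcome = wait_until_ready(lambda: next(responses), delay_s=0.5, sleep=delays.append)
print(outcome["status"], delays)
```

In the real workflow, `fetch` would be something like `lambda: handle_intake(object_key)`.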

8. Operational patterns

Filter vector search to “same team’s queue”

When the routing layer dispatches a document to a team, you want similar-case retrieval scoped to that team’s history, not the whole org:

    # Step 1: SQL to get keys of past tier-2 support docs
    TIER2_KEYS=$(dodil k3 table query "
      SELECT source_key FROM intake WHERE routing_team = 'tier-2-support'
    " -b kb-intake -o json | jq -r '.results[][0]' | tr '\n' ',' | sed 's/,$//')

    # Step 2: Vector search pre-filtered to those keys
    curl -sS -X POST "https://k3.dev.dodil.io/kb-intake/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d "{
        \"bucket\": \"kb-intake\",
        \"collectionName\": \"intake_vec\",
        \"text\": \"recurring service outage SLA breach\",
        \"topK\": 5,
        \"preFilter\": {
          \"op\": \"LOGICAL_OP_AND\",
          \"filters\": [
            { \"field\": \"source_key\", \"op\": \"FILTER_OP_IN\", \"value\": \"$TIER2_KEYS\" }
          ]
        }
      }"
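Interpolating keys into a JSON string from the shell breaks as soon as a key contains a quote or backslash. If the routing layer is in Python, building the request body as a dict and serializing it avoids that. The sketch below mirrors the payload above, including the comma-joined `value` for `FILTER_OP_IN`. Whether K3 expects a comma-joined string or a JSON array there is not confirmed by this recipe, so treat that as an assumption. `team_scoped_search` is a hypothetical helper name.

```python
import json


def team_scoped_search(text, source_keys, top_k=5):
    """Build a vector-search body restricted to the given source keys."""
    return {
        "bucket": "kb-intake",
        "collectionName": "intake_vec",
        "text": text,
        "topK": top_k,
        "preFilter": {
            "op": "LOGICAL_OP_AND",
            "filters": [
                {
                    "field": "source_key",
                    "op": "FILTER_OP_IN",
                    # Assumption: comma-joined, as in the shell example above.
                    "value": ",".join(source_keys),
                },
            ],
        },
    }


# Hypothetical keys; the second one would break naive shell interpolation.
keys = ["inbox/example-a.pdf", 'inbox/"quoted" name.pdf']
body = json.dumps(team_scoped_search("recurring service outage SLA breach", keys))
print(body)
```

The serialized `body` can then be sent with any HTTP client using the same URL and headers as the curl call.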

Replay one pipeline after template update

If document_triage ships a new version (or you re-tune its prompt) but text_embedding_index is unchanged, replay only the Tables side:

    TABLE_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$TABLE_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
    SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-intake/sources" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      | jq -r '.sources[] | select(.name == "internal") | .sourceId')

    # Re-dispatch every object through ONLY the Tables pipeline
    dodil k3 ingest trigger -b kb-intake -s "$SOURCE_ID" --rule "$TABLE_RULE_ID"

The Vector collection is untouched.

Pause incoming docs without losing history

    TABLE_RULE_ID=...
    VECTOR_RULE_ID=...

    dodil k3 ingest update "$TABLE_RULE_ID" -b kb-intake --enabled=false
    dodil k3 ingest update "$VECTOR_RULE_ID" -b kb-intake --enabled=false

    # New uploads accumulate in Storage; existing data in Tables + Vector is unchanged.
    # Re-enable when ready, then `trigger-discovery --full-sync` to catch up.

Common gotchas

| Symptom | Cause | Fix |
|---|---|---|
| Triage row exists but intake_vec has no chunks for the document | Vector pipeline is slower; check job status | dodil k3 ingest jobs -b kb-intake -p $VECTOR_PIPELINE_ID -o json; wait for the matching object’s job to be COMPLETED |
| routing_team is null on a row | document_triage couldn’t classify (too short / non-English / empty) | Inspect errorDetails on the job; consider pre-filtering at the upload layer for known-bad inputs |
| Pre-filter by source_key IN (long list) is slow | The IN operator with thousands of values stresses Milvus | For large narrowing sets, materialize a tag in the metadata (e.g. routing_team) at ingest time via custom template inputs, then filter on it directly |
| Re-uploading the same document accumulates rows in intake | Pipeline-mode tables don’t auto-dedup on PK | Either set up the table’s primary key in the template’s contract (some templates support it) or post-process duplicates via MERGE; use source_key + extracted_at as a composite for natural ordering |
| Tables sees the document but routing decision feels wrong | document_triage’s classification confidence isn’t surfaced as a field | Custom-tune the triage by switching to a different template, or hook a downstream Scriptum step that rewrites the decision |
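For the duplicate-rows gotcha, a reader that cannot change the table can still deduplicate on read by keeping the newest row per source_key, using extracted_at for ordering as suggested above. This is a client-side sketch only, not a substitute for a primary key or MERGE in the table. `latest_per_document` and the example rows are hypothetical.

```python
def latest_per_document(rows):
    """Keep only the most recently extracted row for each source_key."""
    latest = {}
    for row in rows:
        key = row["source_key"]
        if key not in latest or row["extracted_at"] > latest[key]["extracted_at"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows: the same document was uploaded twice.
rows = [
    {"source_key": "inbox/a.pdf", "routing_team": "tier-1-support", "extracted_at": 100},
    {"source_key": "inbox/b.pdf", "routing_team": "tier-2-support", "extracted_at": 150},
    {"source_key": "inbox/a.pdf", "routing_team": "tier-2-support", "extracted_at": 200},
]
deduped = latest_per_document(rows)
print(deduped)
```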

Cleanup

    TABLE_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$TABLE_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
    VECTOR_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$VECTOR_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')

    dodil k3 ingest update "$TABLE_RULE_ID" -b kb-intake --enabled=false
    dodil k3 ingest update "$VECTOR_RULE_ID" -b kb-intake --enabled=false
    dodil k3 ingest delete "$TABLE_RULE_ID" -b kb-intake
    dodil k3 ingest delete "$VECTOR_RULE_ID" -b kb-intake

    dodil k3 pipeline delete "$TABLE_PIPELINE_ID" -b kb-intake
    dodil k3 pipeline delete "$VECTOR_PIPELINE_ID" -b kb-intake

    dodil k3 table delete intake -b kb-intake

    VEC_COL=$(dodil k3 vector collection get intake_vec -b kb-intake -o json | jq -r '.collectionId')
    dodil k3 vector collection delete "$VEC_COL" -b kb-intake
    dodil k3 vector store delete -b kb-intake

    dodil k3 bucket delete kb-intake
