Document Intake — Triage + Semantic Search
Goal: every uploaded document gets simultaneously triaged for routing (classify → priority → entity extraction → assigned team) AND indexed for retrieval (chunked → embedded → searchable). One ingest source, two parallel pipelines.
Why this matters: an enterprise document-intake system needs both. Triage answers “what is this and where does it go?” (a SQL-shaped question — counts, dashboards, routing rules). Retrieval answers “show me similar documents to this one” (a semantic question — RAG, Q&A, related-cases recall).
Primitives used: Storage + Pipelines (two pipelines, same internal-S3 source) + Tables + Vector.
Shape:
document upload ──► Storage bucket
                         │
                         ▼  internal-S3 source matches both rules
               ┌─────────┴─────────┐
               │                   │
               ▼                   ▼
        document_triage     text_embedding_index
        Scriptum pipeline   Scriptum pipeline
               │                   │
               ▼                   ▼
        Tables: intake      Vector: intake_vec
        (one row per doc:   (N chunks per doc,
         classify, route,    semantic recall)
         entities, urgency)
               │                   │
               ▼                   ▼
        Routing layer       "Similar cases"
        (assigns to team)   (RAG / agent grounding)
Compared to Reviews Dashboard: same fan-out shape, different templates. Reviews focuses on per-record analytics; this recipe focuses on per-document routing + retrieval.
Prerequisites
- dodil CLI + dodil login
- A bucket, kb-intake:
dodil k3 bucket create kb-intake -d "Document intake pipeline"
- Vector engine configured (Tables engine is auto-on):
dodil k3 vector store create -b kb-intake -m auto # Wait for ACTIVE
1. Browse the two templates
This recipe uses two templates: document_triage (warehouse-compatible, emits structured rows) and text_embedding_index (vector-compatible). They accept overlapping modalities:
# Tables-side
dodil k3 table templates --search triage -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'
# Vector-side
dodil k3 vector templates --search text -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'
Both accept text + PDF + docx + HTML + email. Inspect each template’s contract to understand what it produces:
dodil k3 template get document_triage -o json | jq '.contract'
dodil k3 template get text_embedding_index -o json | jq '.contract'
2. Create the two destinations
A. Pipeline-bound table for triage
dodil k3 table create intake -b kb-intake \
--description "Document intake — triage classification + entities" \
--source pipeline_generated \
--pipeline-template-id document_triage
document_triage typically emits per-document fields like: source_key, document_type, topic, language, priority, entities (a JSON array), routing_team, sentiment, urgency_score, extracted_at.
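To make the row shape concrete, here is a minimal sketch of a typed view over one triage row. `TriageRow` and `parse_triage_row` are hypothetical names, not part of K3. The field names follow the list above, and the real set depends on the template's contract. The sketch also assumes `entities` may arrive either as a list or as a JSON-encoded string.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TriageRow:
    """Hypothetical typed view of one triage row (field names from the list above)."""
    source_key: str
    document_type: Optional[str] = None
    topic: Optional[str] = None
    priority: Optional[str] = None
    routing_team: Optional[str] = None
    urgency_score: Optional[float] = None
    entities: list = field(default_factory=list)


def parse_triage_row(raw: dict) -> TriageRow:
    """Build a TriageRow from a dict, tolerating missing optional fields."""
    entities = raw.get("entities") or []
    if isinstance(entities, str):
        # Assumption: a string value is a JSON-encoded array
        entities = json.loads(entities)
    score = raw.get("urgency_score")
    return TriageRow(
        source_key=raw["source_key"],
        document_type=raw.get("document_type"),
        topic=raw.get("topic"),
        priority=raw.get("priority"),
        routing_team=raw.get("routing_team"),
        urgency_score=float(score) if score is not None else None,
        entities=list(entities),
    )
```

A row with a missing `routing_team` parses to `None` for that field rather than raising, which keeps unclassified documents visible to downstream code.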
B. Pipeline-mode vector collection for retrieval
dodil k3 vector collection add intake_vec -b kb-intake \
--description "Document intake — semantic chunks for retrieval" \
--template text_embedding_index
Capture identifiers:
export TABLE_PIPELINE_ID=$(dodil k3 table get intake -b kb-intake -o json | jq -r '.pipelineId')
export VECTOR_PIPELINE_ID=$(dodil k3 vector collection get intake_vec -b kb-intake -o json | jq -r '.embedPipelineId')
# Two rules now exist on this bucket
dodil k3 ingest list -b kb-intake -o json \
| jq '.rules[] | {name, pipelineId, includePatterns, includeMimeTypes, enabled}'
3. Upload an intake document
Stage a sample document representing an inbound customer support PDF:
# -f makes curl exit non-zero on HTTP errors, so the text fallback below actually runs
curl -fsSL https://example.com/customer-letter.pdf -o customer-letter.pdf 2>/dev/null || \
cat > customer-letter.txt <<'EOF'
Subject: Service interruption — extended outage on May 23
Dear Support Team,
Our company experienced a complete service outage from May 23 09:00 to May 23 14:30 UTC,
affecting our production billing systems. This is the third such incident in two months.
We require:
1. A formal incident root-cause analysis (RCA) within 5 business days
2. Service credits per our SLA (4.5h downtime exceeds 99.9% monthly threshold)
3. A direct contact for escalation: Jane Smith (jane@example.com, +1-555-0142)
Please escalate to your account executive. Our contract: ACME-CONTRACT-2024-INT-091.
Sincerely,
Operations Team, Example Corp
EOF
mv customer-letter.txt customer-letter.pdf 2>/dev/null || true
dodil k3 object create ./customer-letter.pdf -b kb-intake -k inbox/2026-05-27/customer-letter.pdf
Two pipelines fire in parallel.
4. Watch both pipelines complete
# Tables pipeline — produces ONE row per document
dodil k3 ingest jobs -b kb-intake -p "$TABLE_PIPELINE_ID" -o json \
| jq '.jobs[] | {pipeline: "triage", object: .object.key, status, rowsWritten}'
# Vector pipeline — produces MANY chunks per document
dodil k3 ingest jobs -b kb-intake -p "$VECTOR_PIPELINE_ID" -o json \
| jq '.jobs[] | {pipeline: "embed", object: .object.key, status, chunksCreated, embeddingsWritten}'
Different “shapes” per pillar:
| Pipeline | Output per document | Typical latency |
|---|---|---|
| document_triage | 1 row (classification + entities + routing decision) | ~1–5 s |
| text_embedding_index | N chunks (token-based chunking) → N embeddings | ~3–15 s |
The triage pipeline usually finishes first, so your routing layer can act on triage results before retrieval is fully indexed.
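Because the two sides finish at different times, a routing layer may want to know which side is ready for a given object. The sketch below assumes job records shaped like the `jq` projections above (`object.key`, `status`) and a `COMPLETED` status value. `side_ready` and `intake_state` are hypothetical helpers, not K3 APIs, and fetching the job lists is left out.

```python
def side_ready(jobs: list, object_key: str) -> bool:
    """True if any job for object_key has reached COMPLETED (assumed status value)."""
    return any(
        job.get("object", {}).get("key") == object_key
        and job.get("status") == "COMPLETED"
        for job in jobs
    )


def intake_state(triage_jobs: list, embed_jobs: list, object_key: str) -> str:
    """Summarize how far a document has progressed through the two pipelines."""
    triaged = side_ready(triage_jobs, object_key)
    indexed = side_ready(embed_jobs, object_key)
    if triaged and indexed:
        return "routable+searchable"
    if triaged:
        return "routable"  # routing can proceed; similar-case recall comes later
    if indexed:
        return "searchable"
    return "pending"
```

Treating `routable` as its own state lets the routing layer dispatch immediately and attach similar cases once the vector side catches up.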
5. Query the triage table — routing logic
After the Tables pipeline drains, the document’s triage row is queryable:
# What triage said about this specific document
dodil k3 table query "
SELECT *
FROM intake
WHERE source_key = 'inbox/2026-05-27/customer-letter.pdf'
" -b kb-intake -o json | jq '.results'
Sample row (template-specific fields):
{
"source_key": "inbox/2026-05-27/customer-letter.pdf",
"document_type": "complaint",
"topic": "service_outage",
"priority": "high",
"urgency_score": 0.87,
"entities": ["ACME-CONTRACT-2024-INT-091", "Jane Smith", "jane@example.com", "Example Corp"],
"routing_team": "tier-2-support",
"language": "en",
"extracted_at": 1716840000000000
}
Operational dashboard queries
# What's incoming today?
dodil k3 table query "
SELECT document_type, priority, COUNT(*) AS n
FROM intake
WHERE extracted_at >= DATE_TRUNC('day', NOW())
GROUP BY document_type, priority
ORDER BY priority DESC, n DESC
" -b kb-intake
# Top topics for tier-2 routing queue
dodil k3 table query "
SELECT topic, COUNT(*) AS n
FROM intake
WHERE routing_team = 'tier-2-support'
AND urgency_score > 0.5
GROUP BY topic
ORDER BY n DESC
LIMIT 10
" -b kb-intake
# Outliers needing immediate review
dodil k3 table query "
SELECT source_key, document_type, urgency_score
FROM intake
WHERE urgency_score > 0.8
AND extracted_at >= NOW() - INTERVAL 1 HOUR
ORDER BY urgency_score DESC
" -b kb-intake
These queries power your team-routing layer: for example, a worker process polls the table every minute and dispatches each new high-urgency document to a queue, Slack, or Jira based on routing_team.
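The dispatch step of such a worker could be sketched as follows. This is an illustration under assumptions: rows are taken to be dicts with `source_key`, `routing_team`, and `urgency_score`; `dispatch_new` and the `send` callback are hypothetical names; and the polling query and the actual queue, Slack, or Jira integration are left out.

```python
from typing import Callable, Iterable


def dispatch_new(
    rows: Iterable[dict],
    seen: set,
    send: Callable[[str, dict], None],
    min_urgency: float = 0.8,
) -> int:
    """Send each unseen high-urgency row to its team; return how many were sent.

    `seen` holds source_keys already dispatched, so repeated polls do not
    re-send the same document. Rows without a routing_team are skipped
    (and not marked seen, so they are reconsidered on the next poll).
    """
    sent = 0
    for row in rows:
        key = row["source_key"]
        team = row.get("routing_team")
        score = row.get("urgency_score") or 0.0
        if key in seen or team is None or score <= min_urgency:
            continue
        send(team, row)
        seen.add(key)
        sent += 1
    return sent
```

Keeping `seen` outside the function means the caller decides where that state lives (in memory for a demo, in a durable store for anything real).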
6. Query the vector collection — “similar documents”
When a triage decision is made (or an analyst is reading a specific document), the next question is usually “have we seen this before?” Vector handles it:
curl -sS -X POST "https://k3.dev.dodil.io/kb-intake/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-intake",
"collectionName": "intake_vec",
"s3Key": "inbox/2026-05-27/customer-letter.pdf",
"topK": 10,
"searchMode": "SEARCH_MODE_AUTO",
"rerank": true,
"includeContent": true
}' | jq '.results[] | {
score,
object: .object.key,
chunkIndex,
content: (.content | .[0:200])
}'
This searches the entire intake_vec collection for chunks semantically similar to chunks of the query document, including the document itself (filter that out client-side using .object.key != "inbox/2026-05-27/customer-letter.pdf").
The search returns previous customer letters with similar themes (outages, RCA requests, escalation patterns), which is useful context for the agent or analyst handling the new document.
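The client-side self-filter mentioned above, combined with collapsing chunk hits to one entry per document, might look like this. The sketch assumes each result carries `score` and `object.key` as in the `jq` projection, and that a higher score means a closer match (an assumption about the scoring convention). `similar_documents` is a hypothetical helper.

```python
def similar_documents(results: list, query_key: str, limit: int = 5) -> list:
    """Drop the query document's own chunks and keep the best-scoring chunk per document."""
    best = {}
    for hit in results:
        key = hit["object"]["key"]
        if key == query_key:
            continue  # the query document always matches itself
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Assumption: higher score = more similar
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:limit]
```

Grouping by document avoids showing an analyst several chunks of the same past letter as if they were separate cases.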
7. The killer pattern — routing decision + similar-case recall in one workflow
A realistic agent workflow combining both pillars:
import os

import requests

K3 = "https://k3.dev.dodil.io"
HEADERS = {
    "Authorization": f"Bearer {os.environ['DODIL_TOKEN']}",
    "Content-Type": "application/json",
}


def handle_intake(object_key: str):
    """
    1. Get triage decision from Tables.
    2. Find semantically similar past documents from Vector.
    3. Hand decision + context to the routing layer.
    """
    # 1. Pull the triage row (read-your-writes: STRONG freshness ensures we see
    #    the row even if it just landed via the triage pipeline).
    # Parameter substitution depends on K3's SQL flavor, so the key is inlined
    # as a literal here for illustration (single quotes doubled to keep it valid).
    # Columns are listed explicitly so the positional indexes below are stable.
    key_literal = object_key.replace("'", "''")
    sql = (
        "SELECT source_key, document_type, topic, priority, entities, routing_team "
        "FROM {table} WHERE source_key = '" + key_literal + "'"
    )
    triage = requests.post(
        f"{K3}/kb-intake/tables/_query",
        headers=HEADERS,
        json={
            "bucket": "kb-intake",
            "tableName": "intake",
            "sql": sql,  # "{table}" is sent verbatim (assumed to resolve to tableName server-side)
            "freshness": "FRESHNESS_STRONG",
        },
    ).json()
    if not triage.get("rows"):
        # Triage pipeline hasn't drained yet; retry shortly
        return {"status": "pending"}
    row = triage["rows"][0]
    decision = {
        "type": row[1],      # document_type
        "topic": row[2],
        "priority": row[3],
        "entities": row[4],
        "team": row[5],      # routing_team
    }
    # 2. Find similar past documents
    similar = requests.post(
        f"{K3}/kb-intake/vector/search",
        headers=HEADERS,
        json={
            "bucket": "kb-intake",
            "collectionName": "intake_vec",
            "s3Key": object_key,
            "topK": 6,  # one spare in case the query document itself is returned
            "searchMode": "SEARCH_MODE_AUTO",
            "rerank": True,
            "includeContent": True,
            "preFilter": {  # exclude the query document's own chunks
                "op": "LOGICAL_OP_AND",
                "filters": [
                    {"field": "source_key", "op": "FILTER_OP_NEQ", "value": object_key},
                ],
            },
        },
    ).json()
    context = [
        {"key": r["object"]["key"], "score": r["score"], "preview": r["content"][:200]}
        for r in similar["results"][:5]
    ]
    return {
        "status": "ready",
        "decision": decision,
        "similar_cases": context,
    }


# Use it
result = handle_intake("inbox/2026-05-27/customer-letter.pdf")
print(result)
This is the pattern that makes K3 different from a stack with separate document AI + vector DB products: both signals land from one upload, queryable side-by-side.
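Because `handle_intake` returns `{"status": "pending"}` until the triage row lands, a caller would typically wrap it in a bounded retry. The following is a sketch, not a K3 feature: `wait_for_intake` is a hypothetical helper, and the attempt count and delays are arbitrary starting points. The `sleep` parameter exists only so the delay can be stubbed out.

```python
import time
from typing import Callable


def wait_for_intake(
    handler: Callable[[str], dict],
    object_key: str,
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call handler until it stops returning status 'pending' or attempts run out.

    The delay doubles after each pending result (base_delay, 2x, 4x, ...).
    Returns the last result, which is still 'pending' if every attempt was.
    """
    result = {"status": "pending"}
    for attempt in range(attempts):
        result = handler(object_key)
        if result.get("status") != "pending":
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return result
```

Usage would be `wait_for_intake(handle_intake, "inbox/2026-05-27/customer-letter.pdf")`; a caller that still gets `pending` back can re-queue the document rather than block.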
8. Operational patterns
Filter vector search to “same team’s queue”
When the routing layer dispatches a document to a team, you want similar-case retrieval scoped to that team’s history, not the whole org:
# Step 1: SQL — get keys of past tier-2 support docs
TIER2_KEYS=$(dodil k3 table query "
SELECT source_key FROM intake
WHERE routing_team = 'tier-2-support'
" -b kb-intake -o json | jq -r '.results[][0]' | tr '\n' ',' | sed 's/,$//')
# Step 2: Vector search pre-filtered to those keys
curl -sS -X POST "https://k3.dev.dodil.io/kb-intake/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d "{
\"bucket\": \"kb-intake\",
\"collectionName\": \"intake_vec\",
\"text\": \"recurring service outage SLA breach\",
\"topK\": 5,
\"preFilter\": {
\"op\": \"LOGICAL_OP_AND\",
\"filters\": [
{ \"field\": \"source_key\", \"op\": \"FILTER_OP_IN\", \"value\": \"$TIER2_KEYS\" }
]
}
}"
Replay one pipeline after template update
If document_triage ships a new version (or you re-tune its prompt) but text_embedding_index is unchanged, replay only the Tables side:
TABLE_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$TABLE_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-intake/sources" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq -r '.sources[] | select(.name == "internal") | .sourceId')
# Re-dispatch every object through ONLY the Tables pipeline
dodil k3 ingest trigger -b kb-intake -s "$SOURCE_ID" --rule "$TABLE_RULE_ID"
The Vector collection is untouched.
Pause incoming docs without losing history
TABLE_RULE_ID=...
VECTOR_RULE_ID=...
dodil k3 ingest update "$TABLE_RULE_ID" -b kb-intake --enabled=false
dodil k3 ingest update "$VECTOR_RULE_ID" -b kb-intake --enabled=false
# New uploads accumulate in Storage; existing data in Tables + Vector is unchanged.
# Re-enable when ready, then `trigger-discovery --full-sync` to catch up.
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| Triage row exists but intake_vec has no chunks for the document | Vector pipeline is slower; check job status | dodil k3 ingest jobs -b kb-intake -p $VECTOR_PIPELINE_ID -o json; wait for the matching object’s job to be COMPLETED |
| routing_team is null on a row | document_triage couldn’t classify (too short / non-English / empty) | Inspect errorDetails on the job; consider pre-filtering at the upload layer for known-bad inputs |
| Pre-filter by source_key IN (long list) is slow | The IN operator with thousands of values stresses Milvus | For large narrowing sets, materialize a tag in the metadata (e.g. routing_team) at ingest time via custom template inputs, then filter on it directly |
| Re-uploading the same document accumulates rows in intake | Pipeline-mode tables don’t auto-dedup on PK | Either set up the table’s primary key in the template’s contract (some templates support it) or post-process duplicates via MERGE; use source_key + extracted_at as a composite for natural ordering |
| Tables sees the document but routing decision feels wrong | document_triage’s classification confidence isn’t surfaced as a field | Custom-tune the triage by switching to a different template, or hook a downstream Scriptum step that rewrites the decision |
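For the duplicate-row case in the table above, one client-side workaround is to keep only the newest row per `source_key` when reading. The sketch assumes rows are dicts carrying `source_key` and a comparable `extracted_at`, as in the sample row earlier. `latest_per_key` is a hypothetical helper and does not replace a server-side primary key or MERGE.

```python
def latest_per_key(rows: list) -> list:
    """Keep the row with the largest extracted_at for each source_key."""
    latest = {}
    for row in rows:
        key = row["source_key"]
        if key not in latest or row["extracted_at"] > latest[key]["extracted_at"]:
            latest[key] = row
    # Order follows the first appearance of each source_key in the input
    return list(latest.values())
```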
Cleanup
TABLE_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$TABLE_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
VECTOR_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$VECTOR_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
dodil k3 ingest update "$TABLE_RULE_ID" -b kb-intake --enabled=false
dodil k3 ingest update "$VECTOR_RULE_ID" -b kb-intake --enabled=false
dodil k3 ingest delete "$TABLE_RULE_ID" -b kb-intake
dodil k3 ingest delete "$VECTOR_RULE_ID" -b kb-intake
dodil k3 pipeline delete "$TABLE_PIPELINE_ID" -b kb-intake
dodil k3 pipeline delete "$VECTOR_PIPELINE_ID" -b kb-intake
dodil k3 table delete intake -b kb-intake
VEC_COL=$(dodil k3 vector collection get intake_vec -b kb-intake -o json | jq -r '.collectionId')
dodil k3 vector collection delete "$VEC_COL" -b kb-intake
dodil k3 vector store delete -b kb-intake
dodil k3 bucket delete kb-intake
See also
- Reviews Dashboard — same fan-out shape with review_analysis instead of document_triage; focused on per-record analytics vs per-record routing
- Pipelines → Documents → Warehouse — deeper on the Tables-bound pipeline alone (with entity_pii_extraction template)
- Pipelines → PDF → Vector — deeper on the Vector-bound pipeline alone
- Tables → Pipeline-bound Table — every detail of pipeline-mode tables
- Vector → Hybrid + Rerank — improve precision of the “similar documents” lookup
- Pipelines → Replay & Retry — when you need to re-run one of the two pipelines independently