Document Intake — Triage + Semantic Search

Goal: every uploaded document gets simultaneously triaged for routing (classify → priority → entity extraction → assigned team) AND indexed for retrieval (chunked → embedded → searchable). One ingest source, two parallel pipelines.

Why this matters: an enterprise document-intake system needs both. Triage answers “what is this and where does it go?” (a SQL-shaped question: counts, dashboards, routing rules). Retrieval answers “show me documents similar to this one” (a semantic question: RAG, Q&A, related-case recall).

Primitives used: Storage + Pipelines (two pipelines, same internal-S3 source) + Tables + Vector.

Shape:


   document upload  ──►  Storage bucket
                              │
                              ▼  internal-S3 source matches both rules
                    ┌─────────┴─────────┐
                    │                   │
                    ▼                   ▼
            document_triage    text_embedding_index
            Scriptum pipeline   Scriptum pipeline
                    │                   │
                    ▼                   ▼
            Tables: intake      Vector: intake_vec
            (one row per doc:   (N chunks per doc,
             classify, route,    semantic recall)
             entities, urgency)
                    │                   │
                    ▼                   ▼
            Routing layer       "Similar cases"
            (assigns to team)   (RAG / agent grounding)

Compared to Reviews Dashboard: same fan-out shape, different templates. Reviews focuses on per-record analytics; this recipe focuses on per-document routing + retrieval.
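
The fan-out can be pictured as one object-created event evaluated against every enabled ingest rule. The sketch below is illustrative only: `IngestRule`, `match_rules`, and the glob patterns are hypothetical names chosen for this example, and K3's real matching (which also looks at MIME types) may work differently.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class IngestRule:
    """Hypothetical stand-in for an ingest rule; not K3's actual data model."""
    name: str
    pipeline: str
    include_patterns: list[str] = field(default_factory=list)
    enabled: bool = True


def match_rules(object_key: str, rules: list[IngestRule]) -> list[str]:
    """Return the pipelines whose enabled rules match the object key."""
    return [
        rule.pipeline
        for rule in rules
        if rule.enabled and any(fnmatch(object_key, p) for p in rule.include_patterns)
    ]


# Assumed patterns, for illustration only
rules = [
    IngestRule("triage", "document_triage", ["*.pdf", "*.txt"]),
    IngestRule("embed", "text_embedding_index", ["*.pdf", "*.txt", "*.html"]),
]

# One upload, two pipelines: both rules match the same key.
print(match_rules("inbox/2026-05-27/customer-letter.pdf", rules))
```

Because each rule is evaluated independently, disabling or deleting one rule leaves the other pipeline's behavior unchanged.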

Prerequisites

dodil CLI + dodil login

A bucket — kb-intake:


dodil k3 bucket create kb-intake -d "Document intake pipeline"

Vector engine configured (Tables engine is auto-on):


dodil k3 vector store create -b kb-intake -m auto
# Wait for ACTIVE

1. Browse the two templates

The two templates are document_triage (warehouse-compatible, emits structured rows) and text_embedding_index (vector-compatible). They accept overlapping modalities:


# Tables-side
dodil k3 table templates --search triage -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'
 
# Vector-side
dodil k3 vector templates --search text -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'

Both accept text + PDF + docx + HTML + email. Inspect each template’s contract to understand what it produces:


dodil k3 template get document_triage -o json | jq '.contract'
dodil k3 template get text_embedding_index -o json | jq '.contract'

2. Create the two destinations

A. Pipeline-bound table for triage


dodil k3 table create intake -b kb-intake \
  --description "Document intake — triage classification + entities" \
  --source pipeline_generated \
  --pipeline-template-id document_triage

document_triage typically emits per-document fields like: source_key, document_type, topic, language, priority, entities (a JSON array), routing_team, sentiment, urgency_score, extracted_at.

B. Pipeline-mode vector collection for retrieval


dodil k3 vector collection add intake_vec -b kb-intake \
  --description "Document intake — semantic chunks for retrieval" \
  --template text_embedding_index

Capture identifiers:


export TABLE_PIPELINE_ID=$(dodil k3 table get intake -b kb-intake -o json | jq -r '.pipelineId')
export VECTOR_PIPELINE_ID=$(dodil k3 vector collection get intake_vec -b kb-intake -o json | jq -r '.embedPipelineId')
 
# Two rules now exist on this bucket
dodil k3 ingest list -b kb-intake -o json \
  | jq '.rules[] | {name, pipelineId, includePatterns, includeMimeTypes, enabled}'
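
If you prefer to assert this in a script rather than read the listing by eye, a small check like the following may help. It assumes the JSON shape implied by the jq filter above (`rules[]` entries carrying `pipelineId` and `enabled`); `missing_pipelines` is a hypothetical helper and the IDs in the sample are made up.

```python
import json


def missing_pipelines(ingest_list_json: str, expected_pipeline_ids: set[str]) -> set[str]:
    """Return expected pipeline IDs that have no enabled ingest rule."""
    rules = json.loads(ingest_list_json).get("rules", [])
    active = {r.get("pipelineId") for r in rules if r.get("enabled")}
    return expected_pipeline_ids - active


# Sample payload shaped like the jq projection; names and IDs are invented.
sample = json.dumps({
    "rules": [
        {"name": "intake-triage", "pipelineId": "pl-table", "enabled": True},
        {"name": "intake-embed", "pipelineId": "pl-vector", "enabled": False},
    ]
})

# Reports the pipeline whose rule is disabled
print(missing_pipelines(sample, {"pl-table", "pl-vector"}))
```

In practice you would feed it the output of the `ingest list` command and the two exported pipeline IDs; an empty set means both rules are in place.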

3. Upload an intake document

Stage a sample document representing an inbound customer support PDF:


# -f makes curl exit non-zero on an HTTP error, so the plain-text fallback below actually runs
curl -fsSL https://example.com/customer-letter.pdf -o customer-letter.pdf 2>/dev/null || \
cat > customer-letter.txt <<'EOF'
Subject: Service interruption — extended outage on May 23
 
Dear Support Team,
 
Our company experienced a complete service outage from May 23 09:00 to May 23 14:30 UTC,
affecting our production billing systems. This is the third such incident in two months.
 
We require:
1. A formal incident root-cause analysis (RCA) within 5 business days
2. Service credits per our SLA (5.5h downtime exceeds 99.9% monthly threshold)
3. A direct contact for escalation: Jane Smith (jane@example.com, +1-555-0142)
 
Please escalate to your account executive. Our contract: ACME-CONTRACT-2024-INT-091.
 
Sincerely,
Operations Team, Example Corp
EOF
mv customer-letter.txt customer-letter.pdf 2>/dev/null || true
dodil k3 object create ./customer-letter.pdf -b kb-intake -k inbox/2026-05-27/customer-letter.pdf

Two pipelines fire in parallel.

4. Watch both pipelines complete


# Tables pipeline — produces ONE row per document
dodil k3 ingest jobs -b kb-intake -p "$TABLE_PIPELINE_ID" -o json \
  | jq '.jobs[] | {pipeline: "triage", object: .object.key, status, rowsWritten}'
 
# Vector pipeline — produces MANY chunks per document
dodil k3 ingest jobs -b kb-intake -p "$VECTOR_PIPELINE_ID" -o json \
  | jq '.jobs[] | {pipeline: "embed", object: .object.key, status, chunksCreated, embeddingsWritten}'

Different “shapes” per pillar:

Pipeline	Output per document	Typical latency
`document_triage`	1 row (classification + entities + routing decision)	~1–5 s
`text_embedding_index`	N chunks (token-based chunking) → N embeddings	~3–15 s
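
How many chunks a document yields depends on the chunk size and overlap. The following is a minimal sketch of sliding-window chunking over whitespace-separated tokens; it is not the template's actual chunker, and the tokenizer, window size (200), and overlap (20) are assumptions made for illustration.

```python
def chunk_tokens(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of whitespace-separated tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# A synthetic 450-token document
doc = " ".join(f"w{i}" for i in range(450))
chunks = chunk_tokens(doc)
print(len(chunks), [len(c.split()) for c in chunks])
```

Overlap keeps sentences that straddle a boundary retrievable from either side, at the cost of a few extra embeddings per document.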

The triage pipeline usually finishes first, so your routing layer can act on triage results before retrieval is fully indexed.
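
One way to express that in code is to derive a per-document readiness summary from the two job listings. This sketch assumes the job shape shown in the jq filters above (`object.key`, `status`) and treats `COMPLETED` as the terminal success status; `readiness` is a hypothetical helper and `RUNNING` is an assumed status name.

```python
def readiness(object_key: str, triage_jobs: list[dict], embed_jobs: list[dict]) -> dict:
    """Summarize whether each pipeline has finished for one object."""

    def done(jobs: list[dict]) -> bool:
        return any(
            j.get("object", {}).get("key") == object_key and j.get("status") == "COMPLETED"
            for j in jobs
        )

    triaged, indexed = done(triage_jobs), done(embed_jobs)
    return {
        "can_route": triaged,        # triage row is available
        "can_recall": indexed,       # chunks are searchable
        "fully_ready": triaged and indexed,
    }


key = "inbox/2026-05-27/customer-letter.pdf"
triage_jobs = [{"object": {"key": key}, "status": "COMPLETED"}]
embed_jobs = [{"object": {"key": key}, "status": "RUNNING"}]  # assumed status name
print(readiness(key, triage_jobs, embed_jobs))
```

A routing worker can proceed as soon as `can_route` is true and attach similar cases later, once `can_recall` flips.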

5. Query the triage table — routing logic

After the Tables pipeline drains, the document’s triage row is queryable:


# What triage said about this specific document
dodil k3 table query "
  SELECT *
  FROM intake
  WHERE source_key = 'inbox/2026-05-27/customer-letter.pdf'
" -b kb-intake -o json | jq '.results'

Sample row (template-specific fields):


{
  "source_key": "inbox/2026-05-27/customer-letter.pdf",
  "document_type": "complaint",
  "topic": "service_outage",
  "priority": "high",
  "urgency_score": 0.87,
  "entities": ["ACME-CONTRACT-2024-INT-091", "Jane Smith", "jane@example.com", "Example Corp"],
  "routing_team": "tier-2-support",
  "language": "en",
  "extracted_at": 1779912000000000
}

Operational dashboard queries


# What's incoming today?
dodil k3 table query "
  SELECT document_type, priority, COUNT(*) AS n
  FROM intake
  WHERE extracted_at >= DATE_TRUNC('day', NOW())
  GROUP BY document_type, priority
  ORDER BY priority DESC, n DESC
" -b kb-intake
 
# Top topics for tier-2 routing queue
dodil k3 table query "
  SELECT topic, COUNT(*) AS n
  FROM intake
  WHERE routing_team = 'tier-2-support'
    AND urgency_score > 0.5
  GROUP BY topic
  ORDER BY n DESC
  LIMIT 10
" -b kb-intake
 
# Outliers needing immediate review
dodil k3 table query "
  SELECT source_key, document_type, urgency_score
  FROM intake
  WHERE urgency_score > 0.8
    AND extracted_at >= NOW() - INTERVAL 1 HOUR
  ORDER BY urgency_score DESC
" -b kb-intake

These power your team-routing layer — e.g. a worker process polls the table every minute and dispatches each new high-urgency document to a queue / Slack / Jira based on routing_team.
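
A minimal sketch of such a worker's dispatch step is shown here. It assumes query rows have already been mapped to dicts keyed by column name; `dispatch_new`, the handler registry, and the 0.8 threshold (mirroring the outlier query) are illustrative choices, not part of K3.

```python
from typing import Callable

Handler = Callable[[dict], None]


def dispatch_new(rows: list[dict], seen: set[str], handlers: dict[str, Handler],
                 min_urgency: float = 0.8) -> list[str]:
    """Send each unseen high-urgency row to its team's handler; return dispatched keys."""
    dispatched = []
    for row in rows:
        key = row["source_key"]
        score = row.get("urgency_score") or 0.0
        if key in seen or score <= min_urgency:
            continue
        handler = handlers.get(row.get("routing_team"))
        if handler is None:
            continue  # unknown or null team: leave for manual review
        handler(row)
        seen.add(key)
        dispatched.append(key)
    return dispatched


sent = []                                     # stands in for a queue / Slack / Jira client
handlers = {"tier-2-support": sent.append}
rows = [
    {"source_key": "inbox/a.pdf", "routing_team": "tier-2-support", "urgency_score": 0.87},
    {"source_key": "inbox/b.pdf", "routing_team": "tier-2-support", "urgency_score": 0.40},
    {"source_key": "inbox/c.pdf", "routing_team": None, "urgency_score": 0.95},
]
seen: set[str] = set()
print(dispatch_new(rows, seen, handlers))     # first poll
print(dispatch_new(rows, seen, handlers))     # second poll: nothing new
```

Tracking `seen` keys in memory is enough for a sketch; a real worker would likely persist them so a restart does not re-dispatch old documents.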

6. Query the vector collection — “similar documents”

When a triage decision is made (or an analyst is reading a specific document), the next question is usually “have we seen this before?” Vector handles it:


curl -sS -X POST "https://k3.dev.dodil.io/kb-intake/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-intake",
    "collectionName": "intake_vec",
    "s3Key": "inbox/2026-05-27/customer-letter.pdf",
    "topK": 10,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true,
    "includeContent": true
  }' | jq '.results[] | {
       score,
       object: .object.key,
       chunkIndex,
       content: (.content | .[0:200])
     }'

This searches the entire intake_vec collection for chunks semantically similar to chunks of the query document — including the document itself (filter that out client-side using .object.key != "inbox/2026-05-27/customer-letter.pdf").

The search returns previous customer letters with similar themes (outages, RCA requests, escalation patterns), which is useful context for the agent or analyst handling the new document.
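
Because hits come back per chunk, a client usually wants to drop self-matches and collapse the remainder to one entry per document. The sketch below assumes the result shape used in the jq projection above (`score`, `object.key`, `chunkIndex`) and that a higher score means a closer match; `top_documents` and the sample keys are hypothetical.

```python
def top_documents(results: list[dict], query_key: str, limit: int = 5) -> list[dict]:
    """Collapse chunk-level hits into one entry per document, excluding the query document."""
    best: dict[str, dict] = {}
    for hit in results:
        key = hit["object"]["key"]
        if key == query_key:
            continue  # self-match
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = {"key": key, "score": hit["score"], "chunkIndex": hit.get("chunkIndex")}
    return sorted(best.values(), key=lambda d: d["score"], reverse=True)[:limit]


query_key = "inbox/2026-05-27/customer-letter.pdf"
results = [  # invented hits for illustration
    {"score": 0.99, "object": {"key": query_key}, "chunkIndex": 0},
    {"score": 0.81, "object": {"key": "inbox/old/outage-letter.pdf"}, "chunkIndex": 2},
    {"score": 0.77, "object": {"key": "inbox/old/outage-letter.pdf"}, "chunkIndex": 0},
    {"score": 0.64, "object": {"key": "inbox/old/sla-claim.pdf"}, "chunkIndex": 1},
]
for doc in top_documents(results, query_key):
    print(doc)
```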

7. The killer pattern — routing decision + similar-case recall in one workflow

A realistic agent workflow combining both pillars:


import os

import requests

K3 = "https://k3.dev.dodil.io"
HEADERS = {
    "Authorization": f"Bearer {os.environ['DODIL_TOKEN']}",
    "Content-Type": "application/json",
}
# Explicit column list, so positional rows do not depend on the
# template's column order (as SELECT * would).
TRIAGE_COLUMNS = ["document_type", "topic", "priority", "entities", "routing_team"]


def handle_intake(object_key: str) -> dict:
    """
    1. Get triage decision from Tables.
    2. Find semantically similar past documents from Vector.
    3. Hand decision + context to the routing layer.
    """

    # 1. Pull the triage row (read-your-writes: STRONG freshness ensures we see
    #    the row even if it just landed via the triage pipeline).
    #    Parameter substitution depends on K3's SQL flavor, so the key is inlined
    #    as a literal with single quotes doubled.
    key_literal = object_key.replace("'", "''")
    sql = (
        f"SELECT {', '.join(TRIAGE_COLUMNS)} "
        "FROM {table} "  # sent verbatim; swap in the table name if the placeholder is not resolved
        f"WHERE source_key = '{key_literal}'"
    )
    triage_resp = requests.post(
        f"{K3}/kb-intake/tables/_query",
        headers=HEADERS,
        json={
            "bucket": "kb-intake",
            "tableName": "intake",
            "sql": sql,
            "freshness": "FRESHNESS_STRONG",
        },
        timeout=30,
    )
    triage_resp.raise_for_status()
    triage = triage_resp.json()

    if not triage.get("rows"):
        # Triage pipeline hasn't drained yet; retry shortly
        return {"status": "pending"}

    # Rows are positional; pair the first one with the columns requested above
    row = dict(zip(TRIAGE_COLUMNS, triage["rows"][0]))
    decision = {
        "type": row["document_type"],
        "topic": row["topic"],
        "priority": row["priority"],
        "team": row["routing_team"],
        "entities": row["entities"],
    }

    # 2. Find similar past documents
    similar_resp = requests.post(
        f"{K3}/kb-intake/vector/search",
        headers=HEADERS,
        json={
            "bucket": "kb-intake",
            "collectionName": "intake_vec",
            "s3Key": object_key,
            "topK": 6,           # one spare in case a self-match slips through
            "searchMode": "SEARCH_MODE_AUTO",
            "rerank": True,
            "includeContent": True,
            "preFilter": {       # exclude the query document's own chunks
                "op": "LOGICAL_OP_AND",
                "filters": [
                    {"field": "source_key", "op": "FILTER_OP_NEQ", "value": object_key},
                ],
            },
        },
        timeout=30,
    )
    similar_resp.raise_for_status()
    similar = similar_resp.json()

    # Drop any remaining self-matches, then keep the top five
    hits = [r for r in similar.get("results", []) if r["object"]["key"] != object_key]
    context = [
        {"key": r["object"]["key"], "score": r["score"], "preview": r["content"][:200]}
        for r in hits[:5]
    ]

    return {
        "status": "ready",
        "decision": decision,
        "similar_cases": context,
    }


# Use it
result = handle_intake("inbox/2026-05-27/customer-letter.pdf")
print(result)

This is the pattern that makes K3 different from a stack with separate document AI + vector DB products: both signals land from one upload, queryable side-by-side.
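
Since a workflow like this can report a pending status while the triage pipeline is still draining, callers typically wrap it in a bounded retry. The following is a sketch of exponential backoff under assumed defaults (five attempts, 1 s base delay); `wait_until_ready` is a hypothetical helper, and the `sleep` parameter exists only so the example runs without real delays.

```python
import time
from typing import Callable


def wait_until_ready(fetch: Callable[[], dict], attempts: int = 5,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until it stops reporting 'pending', doubling the delay each time."""
    result = fetch()
    for attempt in range(attempts - 1):
        if result.get("status") != "pending":
            break
        sleep(base_delay * 2 ** attempt)
        result = fetch()
    return result


# Simulated workflow: pending twice, then ready
responses = iter([{"status": "pending"}, {"status": "pending"}, {"status": "ready"}])
delays: list[float] = []
outcome = wait_until_ready(lambda: next(responses), sleep=delays.append)
print(outcome, delays)
```

In real use, `fetch` would be a closure over the intake handler for one object key, and the final result may still be pending if the attempt budget runs out.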

8. Operational patterns

Filter vector search to “same team’s queue”

When the routing layer dispatches a document to a team, you want similar-case retrieval scoped to that team’s history, not the whole org:


# Step 1: SQL — get keys of past tier-2 support docs
TIER2_KEYS=$(dodil k3 table query "
  SELECT source_key FROM intake
  WHERE routing_team = 'tier-2-support'
" -b kb-intake -o json | jq -r '.results[][0]' | tr '\n' ',' | sed 's/,$//')
 
# Step 2: Vector search pre-filtered to those keys
curl -sS -X POST "https://k3.dev.dodil.io/kb-intake/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"bucket\": \"kb-intake\",
    \"collectionName\": \"intake_vec\",
    \"text\": \"recurring service outage SLA breach\",
    \"topK\": 5,
    \"preFilter\": {
      \"op\": \"LOGICAL_OP_AND\",
      \"filters\": [
        { \"field\": \"source_key\", \"op\": \"FILTER_OP_IN\", \"value\": \"$TIER2_KEYS\" }
      ]
    }
  }"
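
Interpolating keys into a JSON string from the shell breaks as soon as a key contains a quote or backslash. A hedged alternative is to build the request body in Python and let `json.dumps` handle the escaping. The field names are copied from the request above; `build_team_search` is a hypothetical helper, and whether `FILTER_OP_IN` expects a comma-joined string or a JSON array is not confirmed by this recipe.

```python
import json


def build_team_search(keys: list[str], query_text: str, top_k: int = 5) -> str:
    """Build the vector-search body with an IN pre-filter, letting json handle all quoting."""
    body = {
        "bucket": "kb-intake",
        "collectionName": "intake_vec",
        "text": query_text,
        "topK": top_k,
        "preFilter": {
            "op": "LOGICAL_OP_AND",
            "filters": [
                # Comma-joined string mirrors the shell example; switch to a list
                # if the API turns out to expect an array (assumption either way).
                {"field": "source_key", "op": "FILTER_OP_IN", "value": ",".join(keys)},
            ],
        },
    }
    return json.dumps(body)


payload = build_team_search(
    ['inbox/a "quoted".pdf', "inbox/b.pdf"],   # invented keys, one with awkward characters
    "recurring service outage SLA breach",
)
print(payload)
```

The resulting string can be passed to an HTTP client as the request body unchanged.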

Replay one pipeline after template update

If document_triage ships a new version (or you re-tune its prompt) but text_embedding_index is unchanged, replay only the Tables side:


TABLE_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$TABLE_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
 
SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-intake/sources" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq -r '.sources[] | select(.name == "internal") | .sourceId')
 
# Re-dispatch every object through ONLY the Tables pipeline
dodil k3 ingest trigger -b kb-intake -s "$SOURCE_ID" --rule "$TABLE_RULE_ID"

The Vector collection is untouched.

Pause incoming docs without losing history


TABLE_RULE_ID=...
VECTOR_RULE_ID=...
 
dodil k3 ingest update "$TABLE_RULE_ID" -b kb-intake --enabled=false
dodil k3 ingest update "$VECTOR_RULE_ID" -b kb-intake --enabled=false
# New uploads accumulate in Storage; existing data in Tables + Vector is unchanged.
# Re-enable when ready, then `trigger-discovery --full-sync` to catch up.

Common gotchas

Symptom	Cause	Fix
Triage row exists but `intake_vec` has no chunks for the document	Vector pipeline is slower; check job status	`dodil k3 ingest jobs -b kb-intake -p $VECTOR_PIPELINE_ID -o json` — wait for the matching object’s job to be `COMPLETED`
`routing_team` is null on a row	`document_triage` couldn’t classify (too short / non-English / empty)	Inspect `errorDetails` on the job; consider pre-filtering at the upload layer for known-bad inputs
Pre-filter by `source_key` IN (long list) is slow	The `IN` operator with thousands of values stresses Milvus	For large narrowing sets, materialize a tag in the metadata (e.g. `routing_team`) at ingest time via custom template inputs, then filter on it directly
Re-uploading the same document accumulates rows in `intake`	Pipeline-mode tables don’t auto-dedup on PK	Either set up the table’s primary key in the template’s contract (some templates support it) or post-process duplicates via MERGE; use `source_key + extracted_at` as a composite for natural ordering
Tables sees the document but routing decision feels wrong	`document_triage`’s classification confidence isn’t surfaced as a field	Custom-tune the triage by switching to a different template, or hook a downstream Scriptum step that rewrites the decision
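
For the duplicate-rows case, a client-side workaround is to keep only the newest row per document when reading. This sketch assumes rows are dicts with `source_key` and a numeric `extracted_at`; `latest_per_document` is a hypothetical helper and does not modify the table itself.

```python
def latest_per_document(rows: list[dict]) -> list[dict]:
    """Keep only the most recent row per source_key, ordered by extracted_at."""
    latest: dict[str, dict] = {}
    for row in rows:
        key = row["source_key"]
        if key not in latest or row["extracted_at"] > latest[key]["extracted_at"]:
            latest[key] = row
    return list(latest.values())


rows = [  # invented rows: the same document was uploaded twice
    {"source_key": "inbox/a.pdf", "priority": "low", "extracted_at": 100},
    {"source_key": "inbox/b.pdf", "priority": "high", "extracted_at": 150},
    {"source_key": "inbox/a.pdf", "priority": "high", "extracted_at": 200},
]
for row in latest_per_document(rows):
    print(row)
```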

Cleanup


TABLE_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$TABLE_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
VECTOR_RULE_ID=$(dodil k3 ingest list -b kb-intake -p "$VECTOR_PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
 
dodil k3 ingest update "$TABLE_RULE_ID" -b kb-intake --enabled=false
dodil k3 ingest update "$VECTOR_RULE_ID" -b kb-intake --enabled=false
 
dodil k3 ingest delete "$TABLE_RULE_ID" -b kb-intake
dodil k3 ingest delete "$VECTOR_RULE_ID" -b kb-intake
dodil k3 pipeline delete "$TABLE_PIPELINE_ID" -b kb-intake
dodil k3 pipeline delete "$VECTOR_PIPELINE_ID" -b kb-intake
dodil k3 table delete intake -b kb-intake
VEC_COL=$(dodil k3 vector collection get intake_vec -b kb-intake -o json | jq -r '.collectionId')
dodil k3 vector collection delete "$VEC_COL" -b kb-intake
dodil k3 vector store delete -b kb-intake
dodil k3 bucket delete kb-intake