Documents → Warehouse table
Goal: every document (PDF / text / JSON-as-text) uploaded to a bucket gets entities + PII extracted into a structured warehouse table. Query the table directly, build dashboards, run SQL — no separate ETL.
Template: entity_pii_extraction — K3’s production template for structured entity + PII detection from text/PDF documents. Warehouse-compatible (emits rows).
Shape:
Document upload ──► bucket ──► rule matches ──► entity_pii_extraction
│
▼
Warehouse table
(entities + PII rows)
│
▼
SQL / queriesPrerequisites
dodilCLI installed anddodil logindone — CLI Basics- A bucket — we’ll use
kb-prod:dodil k3 bucket create kb-prod -d "Document intake" - A warehouse engine + table to write rows into. See Tables — Quickstart. We’ll assume:
- warehouse engine:
wh-kb-prod - table:
entities - the table’s
store_entity_id(capture it in step 1)
- warehouse engine:
1. Capture the destination’s store_entity_id
The pipeline binds to your warehouse table through its store_entity_id:
dodil k3 table list --bucket kb-prod -o json \
| jq -r '.tables[] | select(.name == "entities") | .storeEntityId'
export TABLE_ID="<the store_entity_id from above>"2. Inspect the template
entity_pii_extraction extracts structured entities (people, organizations, locations, dates, identifiers) and detects PII (emails, phones, SSNs, credit cards, …). Output schema is fixed by the template’s ScriptContract:
dodil k3 template get entity_pii_extraction -o json | jq '{name, description, labels}'
dodil k3 template get entity_pii_extraction -o json | jq '.contract'Labels you’ll see: category: analysis, type: entity_extraction, modality: text, pdf, warehouse_compatible: true, vertical: compliance, healthcare, finance, legal, government.
3. Create the pipeline
dodil k3 pipeline create extract-entities \
--bucket kb-prod \
--scriptum entity_pii_extraction \
-o jsonCapture pipeline_id:
export PIPELINE_ID="<pipeline_id from response>"Bind to the warehouse table:
dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
--store-entity-id "$TABLE_ID"Verify — storeEntityKind should be "warehouse":
dodil k3 pipeline get "$PIPELINE_ID" --bucket kb-prod -o json \
| jq '{name, scriptumTemplate, storeEntityKind, storeEntityName}'4. Look up the bucket’s internal source
export SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-prod/sources" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq -r '.sources[] | select(.name == "internal") | .sourceId')
echo "internal source = $SOURCE_ID"5. Create the rule
Cover the document types you want extracted. entity_pii_extraction accepts text + PDF — adjust the patterns to your shape:
# All PDFs under intake/ and all .txt / .json under intake/
dodil k3 ingest add intake-entities \
--bucket kb-prod \
--source "$SOURCE_ID" \
--collection "$PIPELINE_ID" \
--include "intake/**/*.pdf" \
--include "intake/**/*.txt" \
--include "intake/**/*.json"Capture the rule ID for later filtering:
export RULE_ID="<rule_id from the add response>"To restrict by MIME (e.g. only specific content types despite the path), use ingest update — the CLI now covers all rule mutations:
dodil k3 ingest update "$RULE_ID" -b kb-prod \
--include-mime application/pdf \
--include-mime text/plain \
--include-mime application/json \
--max-size 104857606. Upload test documents
# A PDF
curl -sSL https://example.com/sample-contract.pdf -o contract.pdf
dodil k3 object create ./contract.pdf -b kb-prod -k intake/contracts/acme-2026.pdf
# A JSON ticket (treated as text)
cat > ticket.json <<'EOF'
{
"id": "TKT-1042",
"customer": { "name": "Jane Doe", "email": "jane.doe@example.com" },
"subject": "Login issues",
"body": "Hi, I can't sign in to my account. My phone is +1-555-0142."
}
EOF
dodil k3 object create ./ticket.json -b kb-prod -k intake/tickets/TKT-1042.json7. Watch the ingest jobs
# Jobs for this pipeline
dodil k3 ingest jobs --bucket kb-prod --pipeline "$PIPELINE_ID" -o json \
| jq '.jobs[] | {object: .object.key, status, rowsWritten, batchesReceived, updatedAt}'When jobs complete, rowsWritten is the count of structured rows written to the warehouse table:
[
{
"object": "intake/contracts/acme-2026.pdf",
"status": "INGEST_STATUS_COMPLETED",
"rowsWritten": 23,
"batchesReceived": 3
},
{
"object": "intake/tickets/TKT-1042.json",
"status": "INGEST_STATUS_COMPLETED",
"rowsWritten": 4,
"batchesReceived": 1
}
]For warehouse pipelines the meaningful counters are rowsWritten + batchesReceived — chunksCreated and embeddingsCreated stay at zero (vector-specific).
8. Query the warehouse
The extracted rows are now in your entities table. Query via the Tables primitive — the template’s ScriptContract documents the output schema (typically columns like source_key, entity_type, entity_value, confidence, start_offset, end_offset, is_pii).
Example:
# All PII detected in the ticket
dodil k3 table query --bucket kb-prod --table entities \
--sql "SELECT entity_type, entity_value FROM entities \
WHERE source_key = 'intake/tickets/TKT-1042.json' \
AND is_pii = true"
# Top organizations across all extracted contracts
dodil k3 table query --bucket kb-prod --table entities \
--sql "SELECT entity_value, COUNT(*) AS n FROM entities \
WHERE entity_type = 'ORGANIZATION' \
AND source_key LIKE 'intake/contracts/%' \
GROUP BY 1 ORDER BY 2 DESC LIMIT 20"(SQL surface and exact column names are governed by the Tables primitive — confirm the schema there.)
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
Job COMPLETED but rowsWritten == 0 | Document had no extractable entities (e.g. blank scan) | Spot-check with a richer doc; if widespread, inspect the template contract via template get |
FAILED on PDFs but text files work | OCR / parsing failure on the PDF | Pre-process with ocr_extraction and chain — or fix the source PDF |
pipeline_kind shows "" instead of "warehouse" | Pipeline isn’t bound to a store_entity | pipeline update --store-entity-id $TABLE_ID |
| Wrong destination kind (“vector” not “warehouse”) | The store_entity_id you bound is a vector collection, not a table | Capture the right store_entity_id from dodil k3 table list, not vector list |
| Rows appear duplicated after re-upload | K3 re-runs the pipeline on every successful PUT (including overwrites) | Idempotency / dedup is the warehouse table’s job — use a primary key in the table schema |
| Rate of failed jobs spikes after a template upgrade | New template version has stricter inputs | Run a small replay batch with TriggerIngestion to confirm before mass replay |
Variations
Same recipe shape, different warehouse-compatible template:
| Goal | Template | What rows look like |
|---|---|---|
| Document type / topic / language tags | classification | One row per document with classification labels |
| Summaries + keywords | summarization | Summary text + extracted keywords per document |
| Sentiment + intent | sentiment_intent_analysis | Sentiment scores, intent, topics, urgency, toxicity |
| Triage (classify + extract) | document_triage | Combined output for intake routing |
| OCR text-only | ocr_extraction | Raw text rows per page / region |
| Translation | translation | Translated text rows |
| Audio → transcript | audio_transcription | Speaker-diarized transcript segments |
| Object detection | object_detection | Bounding boxes + class labels per detection |
| Image understanding | image_understanding | Combined OCR + detection + LLM reasoning |
| Code symbols + deps | code_intelligence | Symbol-level rows + dependency edges |
| Product catalog enrichment | product_catalog_enrichment | Enriched attribute rows per SKU |
| Customer reviews | review_analysis | Sentiment + toxicity + keyword rows per review |
| Video surveillance | video_surveillance | Tracked objects + activity classifications |
All follow this exact pattern — change the template name and the destination table schema.
See also
- PDF → Vector — the vector-collection variant
- Replay & Retry — recover from failures
- Templates → The catalog — full descriptions + modalities
- Tables — the destination primitive (HTAP on Delta Lake)