Documents → Warehouse table

Goal: every document (PDF / text / JSON-as-text) uploaded to a bucket gets entities + PII extracted into a structured warehouse table. Query the table directly, build dashboards, run SQL — no separate ETL.

Template: entity_pii_extraction — K3’s production template for structured entity + PII detection from text/PDF documents. Warehouse-compatible (emits rows).

Shape:


  Document upload  ──►  bucket  ──►  rule matches  ──►  entity_pii_extraction
                                                              │
                                                              ▼
                                                       Warehouse table
                                                       (entities + PII rows)
                                                              │
                                                              ▼
                                                           SQL / queries

Prerequisites

dodil CLI installed and dodil login done — CLI Basics

A bucket — we’ll use kb-prod:


dodil k3 bucket create kb-prod -d "Document intake"

A warehouse engine + table to write rows into. See Tables — Quickstart. We’ll assume:
- warehouse engine: wh-kb-prod
- table: entities
- the table’s store_entity_id (capture it in step 1)

1. Capture the destination’s `store_entity_id`

The pipeline binds to your warehouse table through its store_entity_id:


dodil k3 table list --bucket kb-prod -o json \
  | jq -r '.tables[] | select(.name == "entities") | .storeEntityId'
 
export TABLE_ID="<the store_entity_id from above>"

2. Inspect the template

entity_pii_extraction extracts structured entities (people, organizations, locations, dates, identifiers) and detects PII (emails, phones, SSNs, credit cards, …). Output schema is fixed by the template’s ScriptContract:


dodil k3 template get entity_pii_extraction -o json | jq '{name, description, labels}'
dodil k3 template get entity_pii_extraction -o json | jq '.contract'

Labels you’ll see: category: analysis, type: entity_extraction, modality: text, pdf, warehouse_compatible: true, vertical: compliance, healthcare, finance, legal, government.

3. Create the pipeline


dodil k3 pipeline create extract-entities \
  --bucket kb-prod \
  --scriptum entity_pii_extraction \
  -o json

Capture pipeline_id:


export PIPELINE_ID="<pipeline_id from response>"

Bind to the warehouse table:


dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
  --store-entity-id "$TABLE_ID"

Verify — storeEntityKind should be "warehouse":


dodil k3 pipeline get "$PIPELINE_ID" --bucket kb-prod -o json \
  | jq '{name, scriptumTemplate, storeEntityKind, storeEntityName}'

4. Look up the bucket’s internal source


export SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-prod/sources" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq -r '.sources[] | select(.name == "internal") | .sourceId')
echo "internal source = $SOURCE_ID"

5. Create the rule

Cover the document types you want extracted. entity_pii_extraction accepts text + PDF — adjust the patterns to your shape:


# All PDFs under intake/ and all .txt / .json under intake/
dodil k3 ingest add intake-entities \
  --bucket kb-prod \
  --source "$SOURCE_ID" \
  --collection "$PIPELINE_ID" \
  --include "intake/**/*.pdf" \
  --include "intake/**/*.txt" \
  --include "intake/**/*.json"

Capture the rule ID for later filtering:


export RULE_ID="<rule_id from the add response>"

To restrict by MIME (e.g. only specific content types despite the path), use ingest update — the CLI now covers all rule mutations:


dodil k3 ingest update "$RULE_ID" -b kb-prod \
  --include-mime application/pdf \
  --include-mime text/plain \
  --include-mime application/json \
  --max-size 10485760

6. Upload test documents


# A PDF
curl -sSL https://example.com/sample-contract.pdf -o contract.pdf
dodil k3 object create ./contract.pdf -b kb-prod -k intake/contracts/acme-2026.pdf
 
# A JSON ticket (treated as text)
cat > ticket.json <<'EOF'
{
  "id": "TKT-1042",
  "customer": { "name": "Jane Doe", "email": "jane.doe@example.com" },
  "subject": "Login issues",
  "body": "Hi, I can't sign in to my account. My phone is +1-555-0142."
}
EOF
dodil k3 object create ./ticket.json -b kb-prod -k intake/tickets/TKT-1042.json

7. Watch the ingest jobs


# Jobs for this pipeline
dodil k3 ingest jobs --bucket kb-prod --pipeline "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {object: .object.key, status, rowsWritten, batchesReceived, updatedAt}'

When jobs complete, rowsWritten is the count of structured rows written to the warehouse table:


[
  {
    "object": "intake/contracts/acme-2026.pdf",
    "status": "INGEST_STATUS_COMPLETED",
    "rowsWritten": 23,
    "batchesReceived": 3
  },
  {
    "object": "intake/tickets/TKT-1042.json",
    "status": "INGEST_STATUS_COMPLETED",
    "rowsWritten": 4,
    "batchesReceived": 1
  }
]

For warehouse pipelines the meaningful counters are rowsWritten + batchesReceived — chunksCreated and embeddingsCreated stay at zero (vector-specific).

8. Query the warehouse

The extracted rows are now in your entities table. Query via the Tables primitive — the template’s ScriptContract documents the output schema (typically columns like source_key, entity_type, entity_value, confidence, start_offset, end_offset, is_pii).

Example:


# All PII detected in the ticket
dodil k3 table query --bucket kb-prod --table entities \
  --sql "SELECT entity_type, entity_value FROM entities \
         WHERE source_key = 'intake/tickets/TKT-1042.json' \
           AND is_pii = true"
 
# Top organizations across all extracted contracts
dodil k3 table query --bucket kb-prod --table entities \
  --sql "SELECT entity_value, COUNT(*) AS n FROM entities \
         WHERE entity_type = 'ORGANIZATION' \
           AND source_key LIKE 'intake/contracts/%' \
         GROUP BY 1 ORDER BY 2 DESC LIMIT 20"

(SQL surface and exact column names are governed by the Tables primitive — confirm the schema there.)

Common gotchas

Symptom	Cause	Fix
Job `COMPLETED` but `rowsWritten == 0`	Document had no extractable entities (e.g. blank scan)	Spot-check with a richer doc; if widespread, inspect the template contract via `template get`
`FAILED` on PDFs but text files work	OCR / parsing failure on the PDF	Pre-process with `ocr_extraction` and chain — or fix the source PDF
`pipeline_kind` shows `""` instead of `"warehouse"`	Pipeline isn’t bound to a `store_entity`	`pipeline update --store-entity-id $TABLE_ID`
Wrong destination kind (“vector” not “warehouse”)	The `store_entity_id` you bound is a vector collection, not a table	Capture the right `store_entity_id` from `dodil k3 table list`, not `vector list`
Rows appear duplicated after re-upload	K3 re-runs the pipeline on every successful PUT (including overwrites)	Idempotency / dedup is the warehouse table’s job — use a primary key in the table schema
Rate of failed jobs spikes after a template upgrade	New template version has stricter inputs	Run a small replay batch with `TriggerIngestion` to confirm before mass replay

Variations

Same recipe shape, different warehouse-compatible template:

Goal	Template	What rows look like
Document type / topic / language tags	`classification`	One row per document with classification labels
Summaries + keywords	`summarization`	Summary text + extracted keywords per document
Sentiment + intent	`sentiment_intent_analysis`	Sentiment scores, intent, topics, urgency, toxicity
Triage (classify + extract)	`document_triage`	Combined output for intake routing
OCR text-only	`ocr_extraction`	Raw text rows per page / region
Translation	`translation`	Translated text rows
Audio → transcript	`audio_transcription`	Speaker-diarized transcript segments
Object detection	`object_detection`	Bounding boxes + class labels per detection
Image understanding	`image_understanding`	Combined OCR + detection + LLM reasoning
Code symbols + deps	`code_intelligence`	Symbol-level rows + dependency edges
Product catalog enrichment	`product_catalog_enrichment`	Enriched attribute rows per SKU
Customer reviews	`review_analysis`	Sentiment + toxicity + keyword rows per review
Video surveillance	`video_surveillance`	Tracked objects + activity classifications

All follow this exact pattern — change the template name and the destination table schema.