Skip to Content
We are live but in Staging 🎉
PipelinesRecipesDocuments → Warehouse

Documents → Warehouse table

Goal: every document (PDF / text / JSON-as-text) uploaded to a bucket gets entities + PII extracted into a structured warehouse table. Query the table directly, build dashboards, run SQL — no separate ETL.

Template: entity_pii_extraction — K3’s production template for structured entity + PII detection from text/PDF documents. Warehouse-compatible (emits rows).

Shape:

Document upload ──► bucket ──► rule matches ──► entity_pii_extraction Warehouse table (entities + PII rows) SQL / queries

Prerequisites

  • dodil CLI installed and dodil login done — CLI Basics
  • A bucket — we’ll use kb-prod:
    dodil k3 bucket create kb-prod -d "Document intake"
  • A warehouse engine + table to write rows into. See Tables — Quickstart. We’ll assume:
    • warehouse engine: wh-kb-prod
    • table: entities
    • the table’s store_entity_id (capture it in step 1)

1. Capture the destination’s store_entity_id

The pipeline binds to your warehouse table through its store_entity_id:

dodil k3 table list --bucket kb-prod -o json \ | jq -r '.tables[] | select(.name == "entities") | .storeEntityId' export TABLE_ID="<the store_entity_id from above>"

2. Inspect the template

entity_pii_extraction extracts structured entities (people, organizations, locations, dates, identifiers) and detects PII (emails, phones, SSNs, credit cards, …). Output schema is fixed by the template’s ScriptContract:

dodil k3 template get entity_pii_extraction -o json | jq '{name, description, labels}' dodil k3 template get entity_pii_extraction -o json | jq '.contract'

Labels you’ll see: category: analysis, type: entity_extraction, modality: text, pdf, warehouse_compatible: true, vertical: compliance, healthcare, finance, legal, government.

3. Create the pipeline

dodil k3 pipeline create extract-entities \ --bucket kb-prod \ --scriptum entity_pii_extraction \ -o json

Capture pipeline_id:

export PIPELINE_ID="<pipeline_id from response>"

Bind to the warehouse table:

dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \ --store-entity-id "$TABLE_ID"

Verify — storeEntityKind should be "warehouse":

dodil k3 pipeline get "$PIPELINE_ID" --bucket kb-prod -o json \ | jq '{name, scriptumTemplate, storeEntityKind, storeEntityName}'

4. Look up the bucket’s internal source

export SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-prod/sources" \ -H "Authorization: Bearer $DODIL_TOKEN" \ | jq -r '.sources[] | select(.name == "internal") | .sourceId') echo "internal source = $SOURCE_ID"

5. Create the rule

Cover the document types you want extracted. entity_pii_extraction accepts text + PDF — adjust the patterns to your shape:

# All PDFs under intake/ and all .txt / .json under intake/ dodil k3 ingest add intake-entities \ --bucket kb-prod \ --source "$SOURCE_ID" \ --collection "$PIPELINE_ID" \ --include "intake/**/*.pdf" \ --include "intake/**/*.txt" \ --include "intake/**/*.json"

Capture the rule ID for later filtering:

export RULE_ID="<rule_id from the add response>"

To restrict by MIME (e.g. only specific content types despite the path), use ingest update — the CLI now covers all rule mutations:

dodil k3 ingest update "$RULE_ID" -b kb-prod \ --include-mime application/pdf \ --include-mime text/plain \ --include-mime application/json \ --max-size 10485760

6. Upload test documents

# A PDF curl -sSL https://example.com/sample-contract.pdf -o contract.pdf dodil k3 object create ./contract.pdf -b kb-prod -k intake/contracts/acme-2026.pdf # A JSON ticket (treated as text) cat > ticket.json <<'EOF' { "id": "TKT-1042", "customer": { "name": "Jane Doe", "email": "jane.doe@example.com" }, "subject": "Login issues", "body": "Hi, I can't sign in to my account. My phone is +1-555-0142." } EOF dodil k3 object create ./ticket.json -b kb-prod -k intake/tickets/TKT-1042.json

7. Watch the ingest jobs

# Jobs for this pipeline dodil k3 ingest jobs --bucket kb-prod --pipeline "$PIPELINE_ID" -o json \ | jq '.jobs[] | {object: .object.key, status, rowsWritten, batchesReceived, updatedAt}'

When jobs complete, rowsWritten is the count of structured rows written to the warehouse table:

[ { "object": "intake/contracts/acme-2026.pdf", "status": "INGEST_STATUS_COMPLETED", "rowsWritten": 23, "batchesReceived": 3 }, { "object": "intake/tickets/TKT-1042.json", "status": "INGEST_STATUS_COMPLETED", "rowsWritten": 4, "batchesReceived": 1 } ]

For warehouse pipelines the meaningful counters are rowsWritten + batchesReceivedchunksCreated and embeddingsCreated stay at zero (vector-specific).

8. Query the warehouse

The extracted rows are now in your entities table. Query via the Tables primitive — the template’s ScriptContract documents the output schema (typically columns like source_key, entity_type, entity_value, confidence, start_offset, end_offset, is_pii).

Example:

# All PII detected in the ticket dodil k3 table query --bucket kb-prod --table entities \ --sql "SELECT entity_type, entity_value FROM entities \ WHERE source_key = 'intake/tickets/TKT-1042.json' \ AND is_pii = true" # Top organizations across all extracted contracts dodil k3 table query --bucket kb-prod --table entities \ --sql "SELECT entity_value, COUNT(*) AS n FROM entities \ WHERE entity_type = 'ORGANIZATION' \ AND source_key LIKE 'intake/contracts/%' \ GROUP BY 1 ORDER BY 2 DESC LIMIT 20"

(SQL surface and exact column names are governed by the Tables primitive — confirm the schema there.)

Common gotchas

SymptomCauseFix
Job COMPLETED but rowsWritten == 0Document had no extractable entities (e.g. blank scan)Spot-check with a richer doc; if widespread, inspect the template contract via template get
FAILED on PDFs but text files workOCR / parsing failure on the PDFPre-process with ocr_extraction and chain — or fix the source PDF
pipeline_kind shows "" instead of "warehouse"Pipeline isn’t bound to a store_entitypipeline update --store-entity-id $TABLE_ID
Wrong destination kind (“vector” not “warehouse”)The store_entity_id you bound is a vector collection, not a tableCapture the right store_entity_id from dodil k3 table list, not vector list
Rows appear duplicated after re-uploadK3 re-runs the pipeline on every successful PUT (including overwrites)Idempotency / dedup is the warehouse table’s job — use a primary key in the table schema
Rate of failed jobs spikes after a template upgradeNew template version has stricter inputsRun a small replay batch with TriggerIngestion to confirm before mass replay

Variations

Same recipe shape, different warehouse-compatible template:

GoalTemplateWhat rows look like
Document type / topic / language tagsclassificationOne row per document with classification labels
Summaries + keywordssummarizationSummary text + extracted keywords per document
Sentiment + intentsentiment_intent_analysisSentiment scores, intent, topics, urgency, toxicity
Triage (classify + extract)document_triageCombined output for intake routing
OCR text-onlyocr_extractionRaw text rows per page / region
TranslationtranslationTranslated text rows
Audio → transcriptaudio_transcriptionSpeaker-diarized transcript segments
Object detectionobject_detectionBounding boxes + class labels per detection
Image understandingimage_understandingCombined OCR + detection + LLM reasoning
Code symbols + depscode_intelligenceSymbol-level rows + dependency edges
Product catalog enrichmentproduct_catalog_enrichmentEnriched attribute rows per SKU
Customer reviewsreview_analysisSentiment + toxicity + keyword rows per review
Video surveillancevideo_surveillanceTracked objects + activity classifications

All follow this exact pattern — change the template name and the destination table schema.

See also