Pipeline-bound Table
Goal: drop unstructured documents into a bucket → K3 extracts structured rows into a Tables table automatically. No manual INSERTs, no schema declaration — the Scriptum template owns both.
Template used: entity_pii_extraction — extracts named entities + PII from text/PDF documents. The pattern works identically for any warehouse-compatible template (document_triage, classification, summarization, sentiment_intent_analysis, ocr_extraction, audio_transcription, object_detection, image_understanding, code_intelligence, product_catalog_enrichment, review_analysis, translation, video_surveillance).
Shape:
document upload ──► bucket ──► auto-generated ingest rule matches
                                        │
                                        ▼
                                  pipeline runs
                        (Scriptum template extracts rows)
                                        │
                                        ▼
                                  Tables table
                            (schema lazy-materialized
                                 on first ingest)
                                        │
                                        ▼
                                   SQL queries
Prerequisites
- dodil CLI + dodil login done (see CLI Basics).
- A bucket, kb-prod: dodil k3 bucket create kb-prod -d "Document intake"
- The Tables engine is already enabled on the bucket (auto-on from CreateBucket).
- The Pipelines primitive is wired; confirm via dodil k3 template list returning results.
1. Browse warehouse-compatible templates
The Tables-pillar CLI returns only the warehouse-compatible templates (pre-filtered server-side). The catalog is org-scoped — no -b flag needed:
# All warehouse-compatible templates
dodil k3 table templates -o json \
| jq '.templates[] | {id, name, modalities, acceptedExtensions}'
# Or narrow by category / search / label — all server-side
dodil k3 table templates --category analysis -o json
dodil k3 table templates --search pii -o json
dodil k3 table templates --label modality=pdf -o json
Pick entity_pii_extraction: it extracts entities and detects PII in text/PDF documents. Modalities: text, pdf. Accepted extensions: pdf, txt, docx.
The full catalog (14 warehouse templates) lives at API Reference → Templates. For the broader catalog across all pillars (including non-warehouse templates such as vector index/search), use dodil k3 template list.
2. Inspect the chosen template’s contract
Look at the template’s typed ScriptContract to understand what it does and what options it accepts:
dodil k3 template get entity_pii_extraction -o json \
| jq '{name, description, labels, acceptedExtensions, acceptedContentTypes}'
You’ll see labels like:
{
"category": "analysis",
"type": "entity_extraction",
"pipeline": "ingestion",
"modality": "text, pdf",
"warehouse_compatible": "true",
"vertical": "compliance, healthcare, finance, legal, government"
}
output_columns is deliberately empty for warehouse templates: the schema materializes lazily on first ingest, so K3 won’t know the exact column layout until the first document flows through.
3. Create the pipeline-bound table
dodil k3 table create entities --bucket kb-prod \
--description "Auto-extracted entities + PII from intake/" \
--source pipeline_generated \
--pipeline-template-id entity_pii_extraction
K3 atomically creates three things for you:
| Created | What it is | Where it lives |
|---|---|---|
| entities table | Empty Delta table; schema materializes on first ingest | Tables service (Table.source = TABLE_SOURCE_PIPELINE_GENERATED) |
| A pipeline | Bound to entity_pii_extraction template + the entities table as destination | Pipelines service (Pipeline) |
| An ingest rule | Globs derived from the template’s acceptedExtensions (**/*.pdf, **/*.txt, etc.) | Pipelines service (IngestRule) |
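The glob derivation in the last row can be sketched as a small shell function. This is an illustration only: derive_globs is a hypothetical helper, not part of the dodil CLI, and it assumes K3 builds one `**/*.<ext>` pattern per accepted extension, with an optional folder prefix placed in front.

```shell
# Hypothetical helper: mirrors how the auto-generated rule's globs appear
# to be derived from a template's acceptedExtensions.
# Usage: derive_globs [-p folder_prefix] ext...
derive_globs() {
  prefix=""
  if [ "$1" = "-p" ]; then
    # Normalize "intake" and "intake/" to the same "intake/" prefix.
    prefix="${2%/}/"
    shift 2
  fi
  for ext in "$@"; do
    printf '%s**/*.%s\n' "$prefix" "$ext"
  done
}

# Extensions listed for entity_pii_extraction in step 1.
derive_globs pdf txt docx
```

Running it with the three extensions prints one recursive pattern per line; adding `-p intake` prefixes each pattern with `intake/`.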
Verify the table row:
dodil k3 table get entities --bucket kb-prod -o json \
| jq '{name, source, pipelineId, pipelineName, status}'
Expect source: TABLE_SOURCE_PIPELINE_GENERATED and pipelineId: pipe_... populated.
Optional folder_prefix: the CLI doesn’t expose it, but the API does. Pass "folderPrefix": "intake" to scope the auto-generated rule’s globs to intake/**/*.pdf rather than the unprefixed globs listed in the table above.
4. Inspect the auto-generated rule
Find the rule via the Pipelines CLI — ingest list now supports -p to filter by pipeline:
PIPELINE_ID=$(dodil k3 table get entities --bucket kb-prod -o json | jq -r '.pipelineId')
dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json \
| jq '.rules[] | {ruleId, name, includePatterns, includeMimeTypes, enabled}'
Expect output like:
{
"ruleId": "rule_a1b2…",
"name": "auto-entity_pii_extraction",
"includePatterns": ["**/*.pdf", "**/*.txt", "**/*.docx"],
"includeMimeTypes": ["application/pdf", "text/plain"],
"enabled": true
}
This is a regular IngestRule, so you manage it via dodil k3 ingest. For instance, to pause ingestion without deleting the table:
RULE_ID="<rule_id from above>"
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=false
# Re-enable later
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=true
5. Upload test documents
# A PDF
curl -sSL https://example.com/sample-contract.pdf -o contract.pdf
dodil k3 object create ./contract.pdf -b kb-prod -k contracts/acme-2026.pdf
# A text-document-as-JSON (treated as text)
cat > ticket.json <<'EOF'
{
"id": "TKT-1042",
"customer": { "name": "Jane Doe", "email": "jane.doe@example.com" },
"subject": "Login issues",
"body": "Hi, I can't sign in to my account. My phone is +1-555-0142."
}
EOF
dodil k3 object create ./ticket.json -b kb-prod -k tickets/TKT-1042.json
Each upload fires through the auto-generated rule → spawns an ingest job → runs entity_pii_extraction → writes rows to the entities table.
Note: entity_pii_extraction accepts txt. JSON is text-like and gets accepted; the template treats the JSON as raw text and extracts entities from the string content. For structured-JSON-aware templates (preserving field structure), use the API with a custom entity_extraction Scriptum script, but that’s beyond this recipe.
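For more than a couple of test files, the single-file upload command from this step can be wrapped in a loop. The sketch below is a convenience wrapper, not a dodil feature: upload_dir and the DRY_RUN switch are hypothetical names, and only the `dodil k3 object create <file> -b <bucket> -k <key>` form shown in this step is used.

```shell
# Hypothetical wrapper around the upload command shown in this step.
# Usage: upload_dir <dir> <bucket> <key_prefix>
# Uploads every regular file directly inside <dir> (non-recursive) under
# <key_prefix>/. With DRY_RUN=1 the commands are printed, not executed.
upload_dir() {
  dir=$1
  bucket=$2
  prefix=$3
  for f in "$dir"/*; do
    # Skip subdirectories and the literal pattern when <dir> is empty.
    [ -f "$f" ] || continue
    key="$prefix/$(basename "$f")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'dodil k3 object create %s -b %s -k %s\n' "$f" "$bucket" "$key"
    else
      dodil k3 object create "$f" -b "$bucket" -k "$key"
    fi
  done
}

# Example (commented out so the block has no side effects):
# DRY_RUN=1 upload_dir ./samples kb-prod contracts
```

A dry run first is a cheap way to confirm the object keys land under the paths your ingest rule covers before any job is spawned.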
6. Watch the ingest jobs
# All jobs for the bound pipeline
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
| jq '.jobs[] | {jobId, object: .object.key, status, rowsWritten, updatedAt}'
Status progression for a successful run:
PENDING ─► PROCESSING ─► COMPLETED
               │
               ▼ transient error
           RETRYING ─► PROCESSING ─► COMPLETED   (automatic retry)
               │
               ▼ permanent error (max attempts)
           FAILED
When COMPLETED, rowsWritten tells you how many extracted rows landed in the table:
{
"jobId": "job_a1b2…",
"object": { "bucket": "kb-prod", "key": "contracts/acme-2026.pdf" },
"status": "INGEST_STATUS_COMPLETED",
"rowsWritten": 23,
"pipelineKind": "warehouse"
}
For replay / retry semantics (failed jobs, partial runs), see Pipelines → Replay & Retry.
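The status progression in this step suggests a simple polling loop: keep checking until a job reaches COMPLETED or FAILED. The sketch below is one way to do that in plain sh. The function names are hypothetical, and it assumes every status carries the INGEST_STATUS_ prefix seen in the sample payload (only INGEST_STATUS_COMPLETED is actually shown above).

```shell
# Strip the enum prefix seen in the sample job payload.
# Assumption: all statuses use the same INGEST_STATUS_ prefix.
normalize_status() {
  printf '%s\n' "${1#INGEST_STATUS_}"
}

# Succeed (exit 0) only for the two terminal states in the diagram.
is_terminal_status() {
  case "$(normalize_status "$1")" in
    COMPLETED|FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll a caller-supplied command until it prints a terminal status.
# Usage: wait_for_job <max_polls> <sleep_seconds> <command...>
# Prints the final status, or TIMEOUT (exit 1) if max_polls is exhausted.
wait_for_job() {
  max_polls=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    status=$("$@")
    if is_terminal_status "$status"; then
      normalize_status "$status"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "TIMEOUT"
  return 1
}

# Example (commented out; assumes the first listed job is the one you want):
# latest_status() {
#   dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json | jq -r '.jobs[0].status'
# }
# wait_for_job 30 2 latest_status
```

Passing the status command as an argument keeps the loop independent of how you select the job, so you can swap in a jq filter on jobId or object key as needed.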
7. Describe the table — see the materialized schema
After the first successful ingest, the schema is known. Describe the table to see what entity_pii_extraction actually produces:
dodil k3 table describe entities --bucket kb-prod -o json \
| jq '{
tableName, version,
columns: [.columns[] | {name, type, nullable}],
rowCount: .rowCount
}'
Sample output (template-defined columns):
{
"tableName": "entities",
"version": "1",
"columns": [
{"name": "source_key", "type": "COLUMN_TYPE_STRING", "nullable": false},
{"name": "entity_type", "type": "COLUMN_TYPE_STRING", "nullable": false},
{"name": "entity_value", "type": "COLUMN_TYPE_STRING", "nullable": false},
{"name": "confidence", "type": "COLUMN_TYPE_DOUBLE", "nullable": true},
{"name": "start_offset", "type": "COLUMN_TYPE_INT", "nullable": true},
{"name": "end_offset", "type": "COLUMN_TYPE_INT", "nullable": true},
{"name": "is_pii", "type": "COLUMN_TYPE_BOOLEAN", "nullable": false},
{"name": "extracted_at", "type": "COLUMN_TYPE_TIMESTAMP", "nullable": false}
],
"rowCount": "23"
}
Exact columns vary by template: every Scriptum template defines its own output schema. The template’s typed ScriptContract (from step 2’s GetTemplate call) describes what to expect.
8. Query the auto-extracted rows
# All PII detected in a specific document
dodil k3 table query \
"SELECT entity_type, entity_value FROM entities
WHERE source_key = 'tickets/TKT-1042.json' AND is_pii = true" \
--bucket kb-prod
# Top organizations across all extracted contracts
dodil k3 table query \
"SELECT entity_value, COUNT(*) AS n FROM entities
WHERE entity_type = 'ORGANIZATION'
AND source_key LIKE 'contracts/%'
GROUP BY 1 ORDER BY 2 DESC LIMIT 20" \
--bucket kb-prod
(Exact entity_type enum values are template-defined; typical values include PERSON, ORGANIZATION, LOCATION, DATE, EMAIL, PHONE, SSN, CREDIT_CARD.)
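When the source_key in a query comes from a variable rather than a hard-coded string, object keys containing a single quote would break the SQL. A minimal sketch of a quoting helper follows. sql_quote and pii_query are hypothetical names, and the sketch assumes the query engine follows the standard SQL rule of doubling single quotes inside string literals.

```shell
# Wrap a value in single quotes, doubling any embedded single quote.
# Assumption: the engine uses standard SQL string-literal escaping.
sql_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# Build the per-document PII query from this step for an arbitrary key.
pii_query() {
  printf "SELECT entity_type, entity_value FROM entities WHERE source_key = %s AND is_pii = true" \
    "$(sql_quote "$1")"
}

# Example (commented out so the block has no side effects):
# dodil k3 table query "$(pii_query 'tickets/TKT-1042.json')" --bucket kb-prod
```

Building the statement in a function also makes it easy to print and review the exact SQL before sending it.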
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| table create succeeds but describe shows no pipelineId | Forgot --source pipeline_generated; created a manual table by mistake | Delete + recreate with --source pipeline_generated --pipeline-template-id ... |
| Documents upload but no ingest jobs spawn | Auto-generated rule’s globs don’t match your paths | Check the rule (step 4) and confirm includePatterns covers your path; remember **/*.pdf matches recursively but *.pdf doesn’t |
| Jobs COMPLETED but entities table is empty in queries | Schema materialized on first ingest; reads are EVENTUAL and the rows are still in the write log | Run dodil k3 table compact entities --bucket kb-prod |
| describe returns columns but rowCount is 0 | Template produced no extractable entities (blank document, unsupported encoding) | Spot-check with a richer document; if widespread, inspect the template’s contract via GetTemplate for required input fields |
| Some jobs FAILED while others COMPLETED | Template can’t process specific documents (corrupt PDF, wrong encoding, unsupported MIME) | Read job.error / job.errorDetails; for permanent failures, exclude the file or pre-process. See Pipelines → Replay & Retry. |
| Wrong destination kind (vector not warehouse) | Wrong template: picked a non-warehouse template | Confirm warehouse_compatible: true on the template before binding; non-warehouse templates target the Vector primitive |
| Re-uploaded same document → duplicate rows | K3 re-runs the pipeline on every successful PUT (incl. overwrites) | Idempotency is the table’s job — use a primary key in the table schema if you want dedup (but pipeline-bound tables use template-defined schemas; this means choosing a template that emits a stable PK or post-processing rows downstream) |
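The "no ingest jobs spawn" row comes down to glob semantics, which can be checked locally before uploading. The sketch below is an approximation, not K3's matcher: glob_to_regex and glob_match are hypothetical helpers, and they assume that `*` stays within one path segment while `**/` spans zero or more directories.

```shell
# Translate a path glob into an extended regex (body only, no anchors).
# Assumed semantics: "*" never crosses "/", "**/" matches any directory depth.
glob_to_regex() {
  printf '%s' "$1" | sed \
    -e 's/\./\\./g' \
    -e 's#\*\*/#@@DS@@#g' \
    -e 's#\*#[^/]*#g' \
    -e 's#@@DS@@#(.*/)?#g'
}

# Usage: glob_match <pattern> <object_key>; exit 0 when the key matches.
glob_match() {
  printf '%s\n' "$2" | grep -Eq "^$(glob_to_regex "$1")\$"
}

# Recursive pattern covers a nested key; the flat pattern does not.
glob_match '**/*.pdf' 'contracts/acme-2026.pdf' && echo "recursive: match"
glob_match '*.pdf' 'contracts/acme-2026.pdf' || echo "flat: no match"
```

This only models includePatterns. A key that fails the glob check may still be handled differently if K3 also considers includeMimeTypes, so treat the result as a first diagnostic rather than a verdict.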
Variations — other warehouse-compatible templates
Same recipe shape, different template:
| Goal | Template | Sample rows produced |
|---|---|---|
| Document type / topic / language tags | classification | One row per document with classification labels |
| Summary + keywords | summarization | Multi-level summaries + extracted keywords per document |
| Sentiment + intent + toxicity | sentiment_intent_analysis | Sentiment scores, intent labels, urgency, toxicity per document |
| Triage (classify + extract) | document_triage | Combined classification + entity output for intake routing |
| OCR-only (no analysis) | ocr_extraction | Raw text rows per page / region |
| Translate documents | translation | Translated text rows |
| Audio → transcripts | audio_transcription | Speaker-diarized transcript segments |
| Object detection in images | object_detection | Bounding boxes + class labels per detection |
| Full image analysis | image_understanding | OCR text + detections + LLM-reasoned description |
| Code symbols + dependencies | code_intelligence | Symbol-level rows + dependency edges |
| Product attribute enrichment | product_catalog_enrichment | Enriched attributes per SKU |
| Customer review analytics | review_analysis | Sentiment + toxicity + keyword rows per review |
| Video surveillance | video_surveillance | Tracked objects + activity classifications |
All follow the same CreateTablePipeline → upload → query shape. Swap the --pipeline-template-id and the table schema materializes from the chosen template on first ingest.
See also
- Manual Table — same primitive but you own the schema
- Pipelines — Recipes → Documents → Warehouse — the Pipelines-flavored telling of the same flow
- API Reference → CreateTablePipeline (including folder_prefix)
- Pipelines → Templates → The catalog: full 24-template list (warehouse + vector)
- Pipelines → Recipes → Replay & Retry — recover from failed ingests