
Pipeline-bound Table

Goal: drop unstructured documents into a bucket → K3 extracts structured rows into a Tables table automatically. No manual INSERTs, no schema declaration — the Scriptum template owns both.

Template used: entity_pii_extraction — extracts named entities + PII from text/PDF documents. The pattern works identically for any warehouse-compatible template (document_triage, classification, summarization, sentiment_intent_analysis, ocr_extraction, audio_transcription, object_detection, image_understanding, code_intelligence, product_catalog_enrichment, review_analysis, translation, video_surveillance).

Shape:

document upload ──► bucket
  ──► auto-generated ingest rule matches
  ──► pipeline runs (Scriptum template extracts rows)
  ──► Tables table (schema lazy-materialized on first ingest)
  ──► SQL queries

Prerequisites

  • dodil CLI + dodil login done — CLI Basics.
  • A bucket — kb-prod:
    dodil k3 bucket create kb-prod -d "Document intake"
  • The Tables engine is already enabled on the bucket (auto-on from CreateBucket).
  • The Pipelines primitive is wired — confirm via dodil k3 template list returning results.

1. Browse warehouse-compatible templates

The Tables-pillar CLI returns only the warehouse-compatible templates (pre-filtered server-side). The catalog is org-scoped — no -b flag needed:

# All warehouse-compatible templates
dodil k3 table templates -o json \
  | jq '.templates[] | {id, name, modalities, acceptedExtensions}'

# Or narrow by category / search / label (all server-side)
dodil k3 table templates --category analysis -o json
dodil k3 table templates --search pii -o json
dodil k3 table templates --label modality=pdf -o json

Pick entity_pii_extraction — extracts entities + PII detection from text/PDF documents. Modalities: text, pdf. Accepted extensions: pdf, txt, docx.

The full catalog (14 warehouse templates) lives at API Reference → Templates. For the broader catalog across all pillars (incl. non-warehouse — vector index/search, etc.), use dodil k3 template list.
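
If you save the catalog JSON to a file, picking a template by file type can be scripted. The sketch below is illustrative only: it assumes the projected fields (id, name, modalities, acceptedExtensions) come back as arrays of strings, and the audio_transcription extensions are made up for the example.

```python
import json

# Illustrative catalog in the shape projected by the jq filter in this step.
# ASSUMPTION: modalities / acceptedExtensions are arrays of strings.
# The audio_transcription extensions below are invented for the example.
CATALOG_JSON = """
{
  "templates": [
    {"id": "entity_pii_extraction", "name": "entity_pii_extraction",
     "modalities": ["text", "pdf"], "acceptedExtensions": ["pdf", "txt", "docx"]},
    {"id": "audio_transcription", "name": "audio_transcription",
     "modalities": ["audio"], "acceptedExtensions": ["mp3", "wav"]}
  ]
}
"""


def templates_accepting(catalog: dict, extension: str) -> list[str]:
    """Return ids of templates whose acceptedExtensions include `extension`."""
    ext = extension.lower().lstrip(".")
    return [
        t["id"]
        for t in catalog.get("templates", [])
        if ext in (e.lower().lstrip(".") for e in t.get("acceptedExtensions", []))
    ]


catalog = json.loads(CATALOG_JSON)
print(templates_accepting(catalog, ".PDF"))  # ['entity_pii_extraction']
```

The lookup is case-insensitive and tolerates a leading dot, so ".PDF" and "pdf" behave the same.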

2. Inspect the chosen template’s contract

Look at the template’s typed ScriptContract to understand what it does and what options it accepts:

dodil k3 template get entity_pii_extraction -o json \
  | jq '{name, description, labels, acceptedExtensions, acceptedContentTypes}'

You’ll see labels like:

{
  "category": "analysis",
  "type": "entity_extraction",
  "pipeline": "ingestion",
  "modality": "text, pdf",
  "warehouse_compatible": "true",
  "vertical": "compliance, healthcare, finance, legal, government"
}

output_columns is deliberately empty for warehouse templates — schema materializes lazily on first ingest. K3 won’t know the exact column layout until the first document flows through.
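
Note that the label values shown above are plain strings: warehouse_compatible is the string "true", and multi-valued labels such as modality are comma-separated. A minimal sketch of reading them in a script (the helper names are hypothetical, not part of any K3 SDK):

```python
def parse_label_list(value: str) -> list[str]:
    """Split a comma-separated label value such as "text, pdf" into items."""
    return [part.strip() for part in value.split(",") if part.strip()]


def is_warehouse_compatible(labels: dict) -> bool:
    """True when the warehouse_compatible label is the string "true"."""
    # The label is a string, not a JSON boolean, so compare text.
    return labels.get("warehouse_compatible", "").strip().lower() == "true"


# Labels copied from the sample output in this step.
labels = {
    "category": "analysis",
    "type": "entity_extraction",
    "pipeline": "ingestion",
    "modality": "text, pdf",
    "warehouse_compatible": "true",
    "vertical": "compliance, healthcare, finance, legal, government",
}

print(is_warehouse_compatible(labels))       # True
print(parse_label_list(labels["modality"]))  # ['text', 'pdf']
```

A check like this before step 3 guards against binding a non-warehouse template by mistake.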

3. Create the pipeline-bound table

dodil k3 table create entities --bucket kb-prod \
  --description "Auto-extracted entities + PII from intake/" \
  --source pipeline_generated \
  --pipeline-template-id entity_pii_extraction

K3 atomically creates three things for you:

| Created | What it is | Where it lives |
|---|---|---|
| entities table | Empty Delta table; schema materializes on first ingest | Tables service (Table.source = TABLE_SOURCE_PIPELINE_GENERATED) |
| A pipeline | Bound to entity_pii_extraction template + the entities table as destination | Pipelines service (Pipeline) |
| An ingest rule | Globs derived from the template’s acceptedExtensions (**/*.pdf, **/*.txt, etc.) | Pipelines service (IngestRule) |

Verify the table row:

dodil k3 table get entities --bucket kb-prod -o json \
  | jq '{name, source, pipelineId, pipelineName, status}'

Expect source: TABLE_SOURCE_PIPELINE_GENERATED and pipelineId: pipe_... populated.

Optional folder_prefix: the CLI doesn’t expose it, but the API does — pass "folderPrefix": "intake" to scope the auto-generated rule’s globs to intake/**/*.pdf rather than **/*.pdf. Useful when multiple pipeline-tables accept the same extension and you want to keep them isolated. See API Reference → CreateTablePipeline.
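
To make the relationship between acceptedExtensions, folderPrefix, and the resulting globs concrete, here is a small sketch of the mapping described above. It is an illustration of the documented behavior, not K3's implementation; derive_globs is a hypothetical helper.

```python
def derive_globs(extensions, folder_prefix=None):
    """Build include patterns from extensions, optionally scoped to a folder.

    Mirrors the behavior described in this step: without a prefix the
    patterns are **/*.<ext>; with a prefix they become <prefix>/**/*.<ext>.
    """
    prefix = (folder_prefix.strip("/") + "/") if folder_prefix else ""
    return [f"{prefix}**/*.{ext.lstrip('.')}" for ext in extensions]


print(derive_globs(["pdf", "txt", "docx"]))
# ['**/*.pdf', '**/*.txt', '**/*.docx']
print(derive_globs(["pdf", "txt", "docx"], folder_prefix="intake"))
# ['intake/**/*.pdf', 'intake/**/*.txt', 'intake/**/*.docx']
```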

4. Inspect the auto-generated rule

Find the rule via the Pipelines CLI — ingest list now supports -p to filter by pipeline:

PIPELINE_ID=$(dodil k3 table get entities --bucket kb-prod -o json | jq -r '.pipelineId')
dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.rules[] | {ruleId, name, includePatterns, includeMimeTypes, enabled}'

Expect output like:

{
  "ruleId": "rule_a1b2…",
  "name": "auto-entity_pii_extraction",
  "includePatterns": ["**/*.pdf", "**/*.txt", "**/*.docx"],
  "includeMimeTypes": ["application/pdf", "text/plain"],
  "enabled": true
}

This is a regular IngestRule, so you manage it via dodil k3 ingest. For instance, to pause ingestion without deleting the table:

RULE_ID="<rule_id from above>"
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=false

# Re-enable later
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=true
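
To sanity-check whether an object key would match the rule's includePatterns before uploading, a local approximation of the glob matching can help. The sketch below assumes common glob semantics (`**/` matches zero or more directories, `*` does not cross `/`); K3's actual matcher may differ in edge cases.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a path glob into a regex.

    ASSUMED semantics: '**/' spans zero or more directories,
    '*' and '?' never match the '/' separator.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")     # any run of characters within one segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")      # exactly one character within a segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches(key: str, patterns: list[str]) -> bool:
    """True when the whole key matches at least one pattern."""
    return any(glob_to_regex(p).fullmatch(key) for p in patterns)


print(matches("contracts/acme-2026.pdf", ["**/*.pdf"]))  # True
print(matches("contracts/acme-2026.pdf", ["*.pdf"]))     # False
```

The second call shows why a non-recursive pattern such as *.pdf misses keys inside folders.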

5. Upload test documents

# A PDF
curl -sSL https://example.com/sample-contract.pdf -o contract.pdf
dodil k3 object create ./contract.pdf -b kb-prod -k contracts/acme-2026.pdf

# A text-document-as-JSON (treated as text)
cat > ticket.json <<'EOF'
{
  "id": "TKT-1042",
  "customer": { "name": "Jane Doe", "email": "jane.doe@example.com" },
  "subject": "Login issues",
  "body": "Hi, I can't sign in to my account. My phone is +1-555-0142."
}
EOF
dodil k3 object create ./ticket.json -b kb-prod -k tickets/TKT-1042.json

Each upload fires through the auto-generated rule → spawns an ingest job → runs entity_pii_extraction → writes rows to the entities table.

Note: step 1 lists pdf, txt, and docx as the accepted extensions for entity_pii_extraction. JSON is text-like and gets accepted; the template treats the JSON as raw text and extracts entities from the string content. For structured-JSON-aware extraction (preserving field structure), use the API with a custom entity_extraction Scriptum script, which is beyond this recipe.

6. Watch the ingest jobs

# All jobs for the bound pipeline
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {jobId, object: .object.key, status, rowsWritten, updatedAt}'

Status progression, including the retry and failure paths:

PENDING ─► PROCESSING ─► COMPLETED
               ▼ transient error
           RETRYING ─► PROCESSING ─► COMPLETED (automatic retry)
               ▼ permanent error (max attempts)
           FAILED
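
If you poll job status from a script, it helps to encode the progression as a small state machine so you know when to stop polling. The transition table below is one reading of the diagram; in particular, allowing FAILED from both PROCESSING and RETRYING is an assumption.

```python
# Allowed transitions, as read from the status diagram.
# ASSUMPTION: FAILED is reachable from both PROCESSING and RETRYING.
TRANSITIONS = {
    "PENDING": {"PROCESSING"},
    "PROCESSING": {"COMPLETED", "RETRYING", "FAILED"},
    "RETRYING": {"PROCESSING", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}

# Terminal states have no outgoing transitions: stop polling once reached.
TERMINAL = {state for state, nxt in TRANSITIONS.items() if not nxt}


def is_valid_progression(states: list[str]) -> bool:
    """Check that each consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS.get(a, set()) for a, b in zip(states, states[1:]))


print(is_valid_progression(["PENDING", "PROCESSING", "COMPLETED"]))  # True
print(is_valid_progression(["PENDING", "COMPLETED"]))                # False
```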

When COMPLETED, rowsWritten tells you how many extracted rows landed in the table:

{
  "jobId": "job_a1b2…",
  "object": { "bucket": "kb-prod", "key": "contracts/acme-2026.pdf" },
  "status": "INGEST_STATUS_COMPLETED",
  "rowsWritten": 23,
  "pipelineKind": "warehouse"
}
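
For a quick health check across many jobs, you can tally statuses and total rows from the jobs JSON. This is a sketch over the job shape shown above; the second job and the INGEST_STATUS_FAILED value are illustrative (the enum name is assumed from the INGEST_STATUS_COMPLETED pattern).

```python
from collections import Counter

STATUS_PREFIX = "INGEST_STATUS_"


def short_status(status: str) -> str:
    """Strip the enum prefix, e.g. INGEST_STATUS_COMPLETED -> COMPLETED."""
    return status[len(STATUS_PREFIX):] if status.startswith(STATUS_PREFIX) else status


def summarize(jobs: list[dict]) -> tuple[dict, int]:
    """Return (count per status, total rowsWritten over COMPLETED jobs)."""
    counts = Counter(short_status(j.get("status", "")) for j in jobs)
    # int() tolerates rowsWritten arriving as either a number or a string.
    rows = sum(
        int(j.get("rowsWritten", 0))
        for j in jobs
        if short_status(j.get("status", "")) == "COMPLETED"
    )
    return dict(counts), rows


jobs = [
    {"jobId": "job_1", "object": {"bucket": "kb-prod", "key": "contracts/acme-2026.pdf"},
     "status": "INGEST_STATUS_COMPLETED", "rowsWritten": 23},
    # Illustrative failed job; the status name is an assumption.
    {"jobId": "job_2", "object": {"bucket": "kb-prod", "key": "contracts/broken.pdf"},
     "status": "INGEST_STATUS_FAILED"},
]

print(summarize(jobs))  # ({'COMPLETED': 1, 'FAILED': 1}, 23)
```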

For replay / retry semantics (failed jobs, partial runs), see Pipelines → Replay & Retry.

7. Describe the table — see the materialized schema

After the first successful ingest, the schema is known. Run describe to see what entity_pii_extraction actually produces:

dodil k3 table describe entities --bucket kb-prod -o json \
  | jq '{
      tableName,
      version,
      columns: [.columns[] | {name, type, nullable}],
      rowCount: .rowCount
    }'

Sample output (template-defined columns):

{
  "tableName": "entities",
  "version": "1",
  "columns": [
    {"name": "source_key", "type": "COLUMN_TYPE_STRING", "nullable": false},
    {"name": "entity_type", "type": "COLUMN_TYPE_STRING", "nullable": false},
    {"name": "entity_value", "type": "COLUMN_TYPE_STRING", "nullable": false},
    {"name": "confidence", "type": "COLUMN_TYPE_DOUBLE", "nullable": true},
    {"name": "start_offset", "type": "COLUMN_TYPE_INT", "nullable": true},
    {"name": "end_offset", "type": "COLUMN_TYPE_INT", "nullable": true},
    {"name": "is_pii", "type": "COLUMN_TYPE_BOOLEAN", "nullable": false},
    {"name": "extracted_at", "type": "COLUMN_TYPE_TIMESTAMP", "nullable": false}
  ],
  "rowCount": "23"
}

Exact columns vary by template — every Scriptum template defines its own output schema. The template’s typed ScriptContract (from step 2’s GetTemplate call) describes what to expect.
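
Because the schema only exists after the first ingest and differs per template, scripts that run queries may want to verify the columns they depend on first. A minimal sketch over the describe output shape shown above (missing_columns is a hypothetical helper):

```python
def missing_columns(describe: dict, required: list[str]) -> list[str]:
    """Return the required column names absent from a describe result."""
    # A table that has not ingested anything yet may have no columns at all.
    present = {col["name"] for col in describe.get("columns", [])}
    return [name for name in required if name not in present]


# Abbreviated copy of the sample describe output in this step.
describe = {
    "tableName": "entities",
    "version": "1",
    "columns": [
        {"name": "source_key", "type": "COLUMN_TYPE_STRING", "nullable": False},
        {"name": "entity_type", "type": "COLUMN_TYPE_STRING", "nullable": False},
        {"name": "entity_value", "type": "COLUMN_TYPE_STRING", "nullable": False},
        {"name": "is_pii", "type": "COLUMN_TYPE_BOOLEAN", "nullable": False},
    ],
    "rowCount": "23",
}

print(missing_columns(describe, ["source_key", "is_pii"]))  # []
print(missing_columns({"tableName": "entities"}, ["is_pii"]))  # ['is_pii']
```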

8. Query the auto-extracted rows

# All PII detected in a specific document
dodil k3 table query \
  "SELECT entity_type, entity_value
   FROM entities
   WHERE source_key = 'tickets/TKT-1042.json' AND is_pii = true" \
  --bucket kb-prod

# Top organizations across all extracted contracts
dodil k3 table query \
  "SELECT entity_value, COUNT(*) AS n
   FROM entities
   WHERE entity_type = 'ORGANIZATION' AND source_key LIKE 'contracts/%'
   GROUP BY 1 ORDER BY 2 DESC LIMIT 20" \
  --bucket kb-prod

(Exact entity_type enum values are template-defined — typical values include PERSON, ORGANIZATION, LOCATION, DATE, EMAIL, PHONE, SSN, CREDIT_CARD.)
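
To try the query shapes without a live bucket, you can replay them against an in-memory SQLite table with the same column names. This is only a local stand-in: the rows are made up for illustration, SQLite's dialect is not K3's query engine, and booleans are stored as integers here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entities (
        source_key   TEXT NOT NULL,
        entity_type  TEXT NOT NULL,
        entity_value TEXT NOT NULL,
        is_pii       INTEGER NOT NULL  -- SQLite stand-in for the boolean column
    )
    """
)

# Made-up rows for illustration; real rows come from the template.
rows = [
    ("tickets/TKT-1042.json", "PERSON", "Jane Doe", 1),
    ("tickets/TKT-1042.json", "EMAIL", "jane.doe@example.com", 1),
    ("tickets/TKT-1042.json", "PHONE", "+1-555-0142", 1),
    ("contracts/acme-2026.pdf", "ORGANIZATION", "Acme", 0),
    ("contracts/acme-2026.pdf", "ORGANIZATION", "Acme", 0),
    ("contracts/other.pdf", "ORGANIZATION", "Globex", 0),
]
conn.executemany("INSERT INTO entities VALUES (?, ?, ?, ?)", rows)

# All PII detected in one document.
pii = conn.execute(
    "SELECT entity_type, entity_value FROM entities "
    "WHERE source_key = ? AND is_pii = 1 ORDER BY entity_type",
    ("tickets/TKT-1042.json",),
).fetchall()

# Top organizations across contracts.
top_orgs = conn.execute(
    "SELECT entity_value, COUNT(*) AS n FROM entities "
    "WHERE entity_type = 'ORGANIZATION' AND source_key LIKE 'contracts/%' "
    "GROUP BY entity_value ORDER BY n DESC LIMIT 20"
).fetchall()

print(pii)
print(top_orgs)  # [('Acme', 2), ('Globex', 1)]
```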

Common gotchas

| Symptom | Cause | Fix |
|---|---|---|
| table create succeeds but describe shows no pipelineId | Forgot --source pipeline_generated; created a manual table by mistake | Delete + recreate with --source pipeline_generated --pipeline-template-id ... |
| Documents upload but no ingest jobs spawn | Auto-generated rule’s globs don’t match your paths | Check the rule (step 4) and confirm includePatterns covers your path; remember **/*.pdf matches recursively but *.pdf doesn’t |
| Jobs COMPLETED but entities table is empty in queries | Schema materialized on first ingest; reads are EVENTUAL and the rows are still in the write log | Run dodil k3 table compact entities --bucket kb-prod |
| describe returns columns but rowCount is 0 | Template produced no extractable entities (blank document, unsupported encoding) | Spot-check with a richer document; if widespread, inspect the template’s contract via GetTemplate for required input fields |
| Some jobs FAILED while others COMPLETED | Template can’t process specific documents (corrupt PDF, wrong encoding, unsupported MIME) | Read job.error / job.errorDetails; for permanent failures, exclude the file or pre-process. See Pipelines → Replay & Retry. |
| Wrong destination kind (vector not warehouse) | Wrong template: picked a non-warehouse template | Confirm warehouse_compatible: true on the template before binding; non-warehouse templates target the Vector primitive |
| Re-uploaded same document → duplicate rows | K3 re-runs the pipeline on every successful PUT (incl. overwrites) | Idempotency is the table’s job: use a primary key in the table schema if you want dedup (but pipeline-bound tables use template-defined schemas; this means choosing a template that emits a stable PK or post-processing rows downstream) |
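
For the duplicate-rows case, downstream post-processing can be as simple as keeping the most recent row per logical entity. The sketch below assumes the column names from step 7, a dedup key of (source_key, entity_type, entity_value, start_offset), and extracted_at values as ISO-8601 strings in a single timezone so they sort lexicographically; adjust all three to your template's actual schema.

```python
def dedupe_latest(rows: list[dict]) -> list[dict]:
    """Keep only the most recently extracted row per entity occurrence."""
    latest = {}
    for row in rows:
        # ASSUMED dedup key; pick whatever identifies a row for your template.
        key = (
            row["source_key"],
            row["entity_type"],
            row["entity_value"],
            row.get("start_offset"),
        )
        kept = latest.get(key)
        if kept is None or row["extracted_at"] > kept["extracted_at"]:
            latest[key] = row
    return list(latest.values())


# Illustrative rows: the same document was uploaded twice.
rows = [
    {"source_key": "tickets/TKT-1042.json", "entity_type": "EMAIL",
     "entity_value": "jane.doe@example.com", "start_offset": 10,
     "extracted_at": "2026-01-01T10:00:00Z"},
    {"source_key": "tickets/TKT-1042.json", "entity_type": "EMAIL",
     "entity_value": "jane.doe@example.com", "start_offset": 10,
     "extracted_at": "2026-01-02T09:30:00Z"},
    {"source_key": "tickets/TKT-1042.json", "entity_type": "PERSON",
     "entity_value": "Jane Doe", "start_offset": 2,
     "extracted_at": "2026-01-01T10:00:00Z"},
]

deduped = dedupe_latest(rows)
print(len(deduped))  # 2
```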

Variations — other warehouse-compatible templates

Same recipe shape, different template:

| Goal | Template | Sample rows produced |
|---|---|---|
| Document type / topic / language tags | classification | One row per document with classification labels |
| Summary + keywords | summarization | Multi-level summaries + extracted keywords per document |
| Sentiment + intent + toxicity | sentiment_intent_analysis | Sentiment scores, intent labels, urgency, toxicity per document |
| Triage (classify + extract) | document_triage | Combined classification + entity output for intake routing |
| OCR-only (no analysis) | ocr_extraction | Raw text rows per page / region |
| Translate documents | translation | Translated text rows |
| Audio → transcripts | audio_transcription | Speaker-diarized transcript segments |
| Object detection in images | object_detection | Bounding boxes + class labels per detection |
| Full image analysis | image_understanding | OCR text + detections + LLM-reasoned description |
| Code symbols + dependencies | code_intelligence | Symbol-level rows + dependency edges |
| Product attribute enrichment | product_catalog_enrichment | Enriched attributes per SKU |
| Customer review analytics | review_analysis | Sentiment + toxicity + keyword rows per review |
| Video surveillance | video_surveillance | Tracked objects + activity classifications |

All follow the same CreateTablePipeline → upload → query shape. Swap the --pipeline-template-id and the table schema materializes from the chosen template on first ingest.
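
If you set up several pipeline-bound tables, generating the create commands from a list keeps them consistent. The sketch below only assembles the flags used in step 3; the table name "reviews" is a hypothetical example, and the script prints the command rather than running it.

```python
import shlex


def table_create_command(table, bucket, template_id, description=None):
    """Build the argv for a pipeline-bound table create (flags from step 3)."""
    argv = ["dodil", "k3", "table", "create", table, "--bucket", bucket]
    if description:
        argv += ["--description", description]
    argv += ["--source", "pipeline_generated", "--pipeline-template-id", template_id]
    return argv


# "reviews" is a made-up table name for the review_analysis template.
cmd = table_create_command("reviews", "kb-prod", "review_analysis")
print(shlex.join(cmd))
# dodil k3 table create reviews --bucket kb-prod --source pipeline_generated --pipeline-template-id review_analysis
```

Building an argv list and joining it with shlex.join keeps descriptions containing spaces or quotes safely escaped.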

See also