Pipeline-bound Table

Goal: drop unstructured documents into a bucket → K3 extracts structured rows into a Tables table automatically. No manual INSERTs, no schema declaration — the Scriptum template owns both.

Template used: entity_pii_extraction — extracts named entities + PII from text/PDF documents. The pattern works identically for any warehouse-compatible template (document_triage, classification, summarization, sentiment_intent_analysis, ocr_extraction, audio_transcription, object_detection, image_understanding, code_intelligence, product_catalog_enrichment, review_analysis, translation, video_surveillance).

Shape:


   document upload  ──►  bucket  ──►  auto-generated ingest rule matches
                                                  │
                                                  ▼
                                         pipeline runs
                                    (Scriptum template extracts rows)
                                                  │
                                                  ▼
                                         Tables table
                                  (schema lazy-materialized
                                   on first ingest)
                                                  │
                                                  ▼
                                              SQL queries

Prerequisites

dodil CLI installed and dodil login completed (see CLI Basics).

A bucket — kb-prod:


dodil k3 bucket create kb-prod -d "Document intake"

The Tables engine is already enabled on the bucket (auto-on from CreateBucket).
The Pipelines primitive is wired — confirm via dodil k3 template list returning results.

1. Browse warehouse-compatible templates

The Tables-pillar CLI returns only the warehouse-compatible templates (pre-filtered server-side). The catalog is org-scoped — no -b flag needed:


# All warehouse-compatible templates
dodil k3 table templates -o json \
  | jq '.templates[] | {id, name, modalities, acceptedExtensions}'
 
# Or narrow by category / search / label — all server-side
dodil k3 table templates --category analysis -o json
dodil k3 table templates --search pii -o json
dodil k3 table templates --label modality=pdf -o json

Pick entity_pii_extraction — extracts entities + PII detection from text/PDF documents. Modalities: text, pdf. Accepted extensions: pdf, txt, docx.

The full catalog (14 warehouse templates) lives at API Reference → Templates. For the broader catalog across all pillars (incl. non-warehouse — vector index/search, etc.), use dodil k3 template list.

2. Inspect the chosen template’s contract

Look at the template’s typed ScriptContract to understand what it does and what options it accepts:


dodil k3 template get entity_pii_extraction -o json \
  | jq '{name, description, labels, acceptedExtensions, acceptedContentTypes}'

You’ll see labels like:


{
  "category": "analysis",
  "type": "entity_extraction",
  "pipeline": "ingestion",
  "modality": "text, pdf",
  "warehouse_compatible": "true",
  "vertical": "compliance, healthcare, finance, legal, government"
}

output_columns is deliberately empty for warehouse templates — schema materializes lazily on first ingest. K3 won’t know the exact column layout until the first document flows through.
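Before binding a table to a template, a small guard can confirm the warehouse_compatible label. The following is a minimal sketch, assuming the template JSON carries a labels object shaped like the sample above and that jq is installed; is_warehouse_compatible is a hypothetical helper name, not part of the dodil CLI.

```shell
# Hypothetical helper: reads a template JSON document on stdin and succeeds
# only when labels.warehouse_compatible is the string "true" (shape assumed
# from the sample labels above). A missing label counts as "not compatible".
is_warehouse_compatible() {
  jq -e '.labels.warehouse_compatible == "true"' >/dev/null
}

# Usage (needs the dodil CLI and a logged-in session):
#   dodil k3 template get entity_pii_extraction -o json | is_warehouse_compatible \
#     && echo "ok to bind" || echo "not a warehouse template"
```

Running this check first avoids the "wrong destination kind" situation listed under Common gotchas.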

3. Create the pipeline-bound table


dodil k3 table create entities --bucket kb-prod \
  --description "Auto-extracted entities + PII from intake/" \
  --source pipeline_generated \
  --pipeline-template-id entity_pii_extraction

K3 atomically creates three things for you:

Created	What it is	Where it lives
`entities` table	Empty Delta table; schema materializes on first ingest	Tables service (`Table.source = TABLE_SOURCE_PIPELINE_GENERATED`)
A pipeline	Bound to `entity_pii_extraction` template + the `entities` table as destination	Pipelines service (`Pipeline`)
An ingest rule	Globs derived from the template’s `acceptedExtensions` (`**/*.pdf`, `**/*.txt`, etc.)	Pipelines service (`IngestRule`)

Verify the table row:


dodil k3 table get entities --bucket kb-prod -o json \
  | jq '{name, source, pipelineId, pipelineName, status}'

Expect source: TABLE_SOURCE_PIPELINE_GENERATED and pipelineId: pipe_... populated.

Optional folder_prefix: the CLI doesn’t expose it, but the API does — pass "folderPrefix": "intake" to scope the auto-generated rule’s globs to intake/**/*.pdf rather than **/*.pdf. Useful when multiple pipeline-tables accept the same extension and you want to keep them isolated. See API Reference → CreateTablePipeline.
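To make the glob shapes concrete, here is a small sketch of how accepted extensions and an optional folder prefix map to patterns. It only illustrates the pattern format described above; K3 derives the real globs server-side, and derive_globs is a hypothetical helper, not a CLI command.

```shell
# Hypothetical helper: prints one glob per accepted extension, optionally
# scoped under a folder prefix. Illustration only; the real rule is
# generated by K3 when the table is created.
derive_globs() (
  prefix=$1
  shift
  for ext in "$@"; do
    if [ -n "$prefix" ]; then
      printf '%s/**/*.%s\n' "${prefix%/}" "$ext"
    else
      printf '**/*.%s\n' "$ext"
    fi
  done
)

derive_globs ""       pdf txt docx   # prints **/*.pdf, **/*.txt, **/*.docx
derive_globs "intake" pdf txt docx   # prints intake/**/*.pdf, intake/**/*.txt, intake/**/*.docx
```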

4. Inspect the auto-generated rule

Find the rule via the Pipelines CLI — ingest list now supports -p to filter by pipeline:


PIPELINE_ID=$(dodil k3 table get entities --bucket kb-prod -o json | jq -r '.pipelineId')
 
dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.rules[] | {ruleId, name, includePatterns, includeMimeTypes, enabled}'

Expect output like:


{
  "ruleId": "rule_a1b2…",
  "name": "auto-entity_pii_extraction",
  "includePatterns": ["**/*.pdf", "**/*.txt", "**/*.docx"],
  "includeMimeTypes": ["application/pdf", "text/plain"],
  "enabled": true
}

This is a regular IngestRule, so you manage it via dodil k3 ingest. For example, to pause ingestion without deleting the table:


RULE_ID="<rule_id from above>"
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=false
 
# Re-enable later
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=true
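
# Optional local pre-flight check (a sketch, not K3's matcher): before
# uploading, test whether an object key is covered by the rule's
# includePatterns. key_matches_patterns is a hypothetical helper that only
# understands patterns shaped like "**/*.ext" or "prefix/**/*.ext", is
# case-sensitive, and ignores includeMimeTypes, so treat a miss as a prompt
# to inspect the rule rather than as proof that no job will spawn.

```shell
key_matches_patterns() (
  key=$1
  shift
  for pattern in "$@"; do
    ext=${pattern##*.}          # "**/*.pdf"        -> "pdf"
    prefix=${pattern%%\*\**}    # "intake/**/*.pdf" -> "intake/", "**/*.pdf" -> ""
    case $key in
      "$prefix"*."$ext") exit 0 ;;
    esac
  done
  exit 1
)

for key in contracts/acme-2026.pdf tickets/TKT-1042.json; do
  if key_matches_patterns "$key" '**/*.pdf' '**/*.txt' '**/*.docx'; then
    echo "$key: covered by includePatterns"
  else
    echo "$key: not covered by includePatterns"
  fi
done
```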

5. Upload test documents


# A PDF
curl -sSL https://example.com/sample-contract.pdf -o contract.pdf
dodil k3 object create ./contract.pdf -b kb-prod -k contracts/acme-2026.pdf
 
# A text-document-as-JSON (treated as text)
cat > ticket.json <<'EOF'
{
  "id": "TKT-1042",
  "customer": { "name": "Jane Doe", "email": "jane.doe@example.com" },
  "subject": "Login issues",
  "body": "Hi, I can't sign in to my account. My phone is +1-555-0142."
}
EOF
dodil k3 object create ./ticket.json -b kb-prod -k tickets/TKT-1042.json

Each upload fires through the auto-generated rule → spawns an ingest job → runs entity_pii_extraction → writes rows to the entities table.

Note: as listed in step 1, entity_pii_extraction accepts pdf, txt, and docx. JSON is text-like and gets accepted; the template treats the JSON as raw text and extracts entities from the string content. If no ingest job spawns for the .json key, check the rule’s includePatterns and includeMimeTypes (step 4). For structured-JSON-aware extraction that preserves field structure, use the API with a custom entity_extraction Scriptum script; that is beyond this recipe.
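For more than a couple of files, the upload command above can be wrapped in a loop. This is a sketch that reuses only the object create invocation shown in this step; upload_dir is a hypothetical helper, and the directory name in the usage line is a placeholder.

```shell
# Hypothetical helper: uploads every regular file in DIR to BUCKET under
# KEY_PREFIX, one `dodil k3 object create` call per file. Stops at the
# first failed upload.
upload_dir() (
  dir=$1 bucket=$2 key_prefix=$3
  for path in "$dir"/*; do
    [ -f "$path" ] || continue
    dodil k3 object create "$path" -b "$bucket" \
      -k "${key_prefix%/}/$(basename "$path")" || exit 1
  done
)

# Usage: upload_dir ./contracts kb-prod contracts
```

Each uploaded file goes through the same rule and pipeline as a single upload, so expect one ingest job per matching file.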

6. Watch the ingest jobs


# All jobs for the bound pipeline
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {jobId, object: .object.key, status, rowsWritten, updatedAt}'

Status progression (successful runs end in COMPLETED; permanent errors end in FAILED):


PENDING ─► PROCESSING ─► COMPLETED
                │
                ▼  transient error
            RETRYING ─► PROCESSING ─► COMPLETED   (automatic retry)
                │
                ▼  permanent error (max attempts)
              FAILED

When COMPLETED, rowsWritten tells you how many extracted rows landed in the table:


{
  "jobId": "job_a1b2…",
  "object": { "bucket": "kb-prod", "key": "contracts/acme-2026.pdf" },
  "status": "INGEST_STATUS_COMPLETED",
  "rowsWritten": 23,
  "pipelineKind": "warehouse"
}

For replay / retry semantics (failed jobs, partial runs), see Pipelines → Replay & Retry.
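In scripts it is often convenient to block until every job for the pipeline reaches a terminal state. The sketch below polls the ingest jobs command from this step. It assumes status strings end in COMPLETED or FAILED (as in the sample job above), that jq is installed, and that an empty job list means jobs have not spawned yet. wait_for_jobs and fetch_statuses are hypothetical helper names.

```shell
# Prints one job status per line, using the commands from this recipe.
fetch_statuses() {
  dodil k3 ingest jobs -b "$1" -p "$2" -o json | jq -r '.jobs[].status'
}

# Hypothetical helper. Exit codes: 0 = all jobs COMPLETED,
# 1 = nothing in flight and at least one job FAILED,
# 2 = gave up after MAX_POLLS polls.
wait_for_jobs() (
  bucket=$1 pipeline=$2 max_polls=${3:-30} delay=${4:-5}
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    statuses=$(fetch_statuses "$bucket" "$pipeline") || statuses=""
    if [ -n "$statuses" ]; then
      # Anything that is neither COMPLETED nor FAILED is still in flight
      # (PENDING, PROCESSING, RETRYING).
      in_flight=$(printf '%s\n' "$statuses" | grep -cvE '(COMPLETED|FAILED)$' || true)
      failed=$(printf '%s\n' "$statuses" | grep -cE 'FAILED$' || true)
      if [ "$in_flight" -eq 0 ]; then
        [ "$failed" -eq 0 ] && exit 0
        exit 1
      fi
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  exit 2
)

# Usage: wait_for_jobs kb-prod "$PIPELINE_ID" 60 5 && echo "all jobs completed"
```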

7. Describe the table — see the materialized schema

After the first successful ingest, the schema is known. Describe the table to see what entity_pii_extraction actually produces:


dodil k3 table describe entities --bucket kb-prod -o json \
  | jq '{
      tableName, version,
      columns: [.columns[] | {name, type, nullable}],
      rowCount: .rowCount
    }'

Sample output (template-defined columns):


{
  "tableName": "entities",
  "version": "1",
  "columns": [
    {"name": "source_key",   "type": "COLUMN_TYPE_STRING",    "nullable": false},
    {"name": "entity_type",  "type": "COLUMN_TYPE_STRING",    "nullable": false},
    {"name": "entity_value", "type": "COLUMN_TYPE_STRING",    "nullable": false},
    {"name": "confidence",   "type": "COLUMN_TYPE_DOUBLE",    "nullable": true},
    {"name": "start_offset", "type": "COLUMN_TYPE_INT",       "nullable": true},
    {"name": "end_offset",   "type": "COLUMN_TYPE_INT",       "nullable": true},
    {"name": "is_pii",       "type": "COLUMN_TYPE_BOOLEAN",   "nullable": false},
    {"name": "extracted_at", "type": "COLUMN_TYPE_TIMESTAMP", "nullable": false}
  ],
  "rowCount": "23"
}

Exact columns vary by template — every Scriptum template defines its own output schema. The template’s typed ScriptContract (from step 2’s GetTemplate call) describes what to expect.

8. Query the auto-extracted rows


# All PII detected in a specific document
dodil k3 table query \
  "SELECT entity_type, entity_value FROM entities
   WHERE source_key = 'tickets/TKT-1042.json' AND is_pii = true" \
  --bucket kb-prod
 
# Top organizations across all extracted contracts
dodil k3 table query \
  "SELECT entity_value, COUNT(*) AS n FROM entities
   WHERE entity_type = 'ORGANIZATION'
     AND source_key LIKE 'contracts/%'
   GROUP BY 1 ORDER BY 2 DESC LIMIT 20" \
  --bucket kb-prod

(Exact entity_type enum values are template-defined — typical values include PERSON, ORGANIZATION, LOCATION, DATE, EMAIL, PHONE, SSN, CREDIT_CARD.)
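When the source key comes from a variable, it helps to build the SQL string in one place and escape quotes. This is a sketch that assumes the column names from the sample schema in step 7 and standard SQL quote doubling; pii_query_for is a hypothetical helper, and whether the query engine accepts doubled quotes should be confirmed against the Tables SQL reference.

```shell
# Hypothetical helper: prints a PII lookup for one source key, doubling any
# single quotes so the key stays a valid SQL string literal.
pii_query_for() (
  escaped=$(printf '%s' "$1" | sed "s/'/''/g")
  printf "SELECT entity_type, entity_value FROM entities WHERE source_key = '%s' AND is_pii = true\n" "$escaped"
)

# Usage:
#   dodil k3 table query "$(pii_query_for "tickets/TKT-1042.json")" --bucket kb-prod
```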

Common gotchas

Symptom	Cause	Fix
`table create` succeeds but `describe` shows no `pipelineId`	Forgot `--source pipeline_generated`; created a manual table by mistake	Delete + recreate with `--source pipeline_generated --pipeline-template-id ...`
Documents upload but no ingest jobs spawn	Auto-generated rule’s globs don’t match your paths	Check the rule (step 4) and confirm `includePatterns` covers your path; remember `**/*.pdf` matches recursively but `*.pdf` doesn’t
Jobs `COMPLETED` but `entities` table is empty in queries	Schema materialized on first ingest; reads are `EVENTUAL` and the rows are still in the write log	Run `dodil k3 table compact entities --bucket kb-prod`
`describe` returns columns but `rowCount` is 0	Template produced no extractable entities (blank document, unsupported encoding)	Spot-check with a richer document; if widespread, inspect the template’s contract via `GetTemplate` for required input fields
Some jobs `FAILED` while others `COMPLETED`	Template can’t process specific documents (corrupt PDF, wrong encoding, unsupported MIME)	Read `job.error` / `job.errorDetails`; for permanent failures, exclude the file or pre-process. See Pipelines → Replay & Retry.
Wrong destination kind (`vector` not `warehouse`)	Wrong template — picked a non-warehouse template	Confirm `warehouse_compatible: true` on the template before binding; non-warehouse templates target the Vector primitive
Re-uploaded same document → duplicate rows	K3 re-runs the pipeline on every successful PUT (incl. overwrites)	Deduplication is the table’s job, normally via a primary key in the table schema. Because pipeline-bound tables use template-defined schemas, that means choosing a template that emits a stable primary key or post-processing rows downstream

Variations — other warehouse-compatible templates

Same recipe shape, different template:

Goal	Template	Sample rows produced
Document type / topic / language tags	`classification`	One row per document with classification labels
Summary + keywords	`summarization`	Multi-level summaries + extracted keywords per document
Sentiment + intent + toxicity	`sentiment_intent_analysis`	Sentiment scores, intent labels, urgency, toxicity per document
Triage (classify + extract)	`document_triage`	Combined classification + entity output for intake routing
OCR-only (no analysis)	`ocr_extraction`	Raw text rows per page / region
Translate documents	`translation`	Translated text rows
Audio → transcripts	`audio_transcription`	Speaker-diarized transcript segments
Object detection in images	`object_detection`	Bounding boxes + class labels per detection
Full image analysis	`image_understanding`	OCR text + detections + LLM-reasoned description
Code symbols + dependencies	`code_intelligence`	Symbol-level rows + dependency edges
Product attribute enrichment	`product_catalog_enrichment`	Enriched attributes per SKU
Customer review analytics	`review_analysis`	Sentiment + toxicity + keyword rows per review
Video surveillance	`video_surveillance`	Tracked objects + activity classifications

All follow the same CreateTablePipeline → upload → query shape. Swap the --pipeline-template-id and the table schema materializes from the chosen template on first ingest.