PDF → Vector collection

Goal: every PDF uploaded to a bucket is automatically chunked, embedded, and indexed into a vector collection — searchable end-to-end without any glue code.

Template: text_embedding_index — K3’s production template for text/PDF/docx/HTML/audio/video embedding. Resolves a source object, chunks it, embeds the chunks, writes them to a vector collection.

Shape:


   PDF upload  ──►  bucket  ──►  rule matches  ──►  text_embedding_index
                                                          │
                                                          ▼
                                                  Vector collection
                                                  (chunks + embeddings)
                                                          │
                                                          ▼
                                                   Vector search

Prerequisites

The dodil CLI is installed and you have run dodil login (see CLI Basics).

A bucket (we’ll use kb-prod):


dodil k3 bucket create kb-prod -d "RAG corpus"

A vector engine + collection in that bucket. See Vector — Quickstart. We’ll assume:
- vector engine: vec-kb-prod
- collection: docs-index
- the collection’s store_entity_id (you’ll capture it in step 1)

1. Capture the destination’s `store_entity_id`

The pipeline binds to your vector collection through its store_entity_id:


# Look up the collection by name and capture its store_entity_id in one step
export COLLECTION_ID=$(dodil k3 vector list --bucket kb-prod -o json \
  | jq -r '.collections[] | select(.name == "docs-index") | .storeEntityId')
echo "collection store_entity_id = $COLLECTION_ID"

2. Inspect the template (optional but recommended)

Before spawning a pipeline, look at the template’s typed contract — it tells you exactly what options it accepts:


dodil k3 template get text_embedding_index -o json | jq '{name, description, labels}'
dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'

Common knobs for this template (defaults are sensible — override only if needed):

| Option | Meaning |
| --- | --- |
| `chunk_size` | Tokens per chunk |
| `chunk_overlap` | Tokens overlapping between adjacent chunks |
| `embed_model` | Embedding model identifier (resolved from the template by default) |
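
To make the two chunking knobs concrete, here is a minimal sketch of a sliding-window chunker. It is an illustration only: the template’s actual chunker and tokenizer are not described here, and `chunk_tokens` is a hypothetical helper that treats any Python list as the token sequence.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens.

    Hypothetical helper for illustration; not the template's real chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


if __name__ == "__main__":
    doc = list(range(2500))  # stand-in for 2,500 tokens
    for i, chunk in enumerate(chunk_tokens(doc, chunk_size=1000, chunk_overlap=150)):
        print(i, chunk[0], chunk[-1], len(chunk))
```

With `chunk_size` 1000 and `chunk_overlap` 150 the window advances 850 tokens at a time, so under this scheme a 2,500-token document yields three chunks, each sharing 150 tokens with its neighbor.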

3. Create the pipeline


dodil k3 pipeline create embed-pdfs \
  --bucket kb-prod \
  --scriptum text_embedding_index \
  -o json

Capture the pipeline_id:


export PIPELINE_ID="<pipeline_id from the response>"

The CLI’s pipeline create doesn’t yet set store_entity_id or options directly — bind the destination and tune options with update:


# Bind to the vector collection
dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
  --store-entity-id "$COLLECTION_ID"
 
# (Optional) tune chunking
dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
  --options-json '{"chunk_size":"1000","chunk_overlap":"150"}'
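
If you generate the options payload from a script, a small helper avoids hand-quoting JSON in the shell. This sketch assumes option values are passed as strings, which is inferred only from the example above; `options_json` is a hypothetical name.

```python
import json
import shlex


def options_json(**options):
    """Serialize options compactly, with every value converted to a string.

    Assumption: the pipeline expects string values, as in the example payload.
    """
    as_strings = {key: str(value) for key, value in options.items()}
    return json.dumps(as_strings, separators=(",", ":"))


payload = options_json(chunk_size=1000, chunk_overlap=150)
# shlex.quote makes the payload safe to paste after --options-json
print("--options-json", shlex.quote(payload))
```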

Verify the binding took:


dodil k3 pipeline get "$PIPELINE_ID" --bucket kb-prod -o json \
  | jq '{name, scriptumTemplate, storeEntityKind, storeEntityId, storeEntityName, options}'

storeEntityKind should be "vector" — that’s K3’s confirmation that your pipeline is a vector pipeline.
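
The same check can be scripted against the JSON returned by pipeline get. The sketch below is a hedged example: `binding_problems` is a hypothetical helper, the IDs are made up, and it assumes the response carries `storeEntityKind` and `storeEntityId` at the top level.

```python
def binding_problems(pipeline, expected_store_entity_id):
    """Return a list of human-readable problems with a pipeline's binding.

    An empty list means the pipeline looks correctly bound to the collection.
    """
    problems = []
    kind = pipeline.get("storeEntityKind")
    if kind != "vector":
        problems.append(f"storeEntityKind is {kind!r}, expected 'vector'")

    bound_id = pipeline.get("storeEntityId")
    if not bound_id:
        problems.append("storeEntityId is not set")
    elif bound_id != expected_store_entity_id:
        problems.append(f"bound to {bound_id!r}, expected {expected_store_entity_id!r}")
    return problems


# Made-up IDs, for illustration only
good = {"storeEntityKind": "vector", "storeEntityId": "se_example"}
unbound = {"name": "embed-pdfs"}
print(binding_problems(good, "se_example"))
print(binding_problems(unbound, "se_example"))
```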

4. Look up the bucket’s internal source

Every bucket gets an auto-created internal source on CreateBucket. We need its source_id to attach a rule:


export SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-prod/sources" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq -r '.sources[] | select(.name == "internal") | .sourceId')
echo "internal source = $SOURCE_ID"

dodil k3 source list isn’t in the CLI yet — that’s why we drop to curl here. See CLI Guide → source.
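
If you would rather script this lookup than pipe curl into jq, the following sketch does the same thing. It assumes the response shape implied by the jq filter above (a `sources` array whose items carry `name` and `sourceId`); `fetch_sources` and `internal_source_id` are hypothetical helper names.

```python
import json
import urllib.request


def internal_source_id(payload):
    """Pick the sourceId of the source named 'internal' from a sources response."""
    for source in payload.get("sources", []):
        if source.get("name") == "internal":
            return source["sourceId"]
    raise LookupError("no source named 'internal' in the response")


def fetch_sources(bucket, token, base_url="https://k3.dev.dodil.io"):
    """GET <base_url>/<bucket>/sources with a bearer token, as the curl call does."""
    request = urllib.request.Request(
        f"{base_url}/{bucket}/sources",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Offline demonstration with a made-up response body
sample = {"sources": [{"name": "external", "sourceId": "src_other"},
                      {"name": "internal", "sourceId": "src_example"}]}
print(internal_source_id(sample))
```

Against a live bucket you would call `internal_source_id(fetch_sources("kb-prod", token))` with the value of `DODIL_TOKEN`.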

5. Create the rule

The rule says “any PDF anywhere in this bucket fires the pipeline”:


dodil k3 ingest add pdf-rule \
  --bucket kb-prod \
  --source "$SOURCE_ID" \
  --collection "$PIPELINE_ID" \
  --include "**/*.pdf"

The CLI’s --collection flag maps to the API’s pipeline_id — historical naming.

Tighter scoping (e.g. only contracts/ PDFs) — replace the include pattern:


dodil k3 ingest add contracts-pdf-rule \
  --bucket kb-prod \
  --source "$SOURCE_ID" \
  --collection "$PIPELINE_ID" \
  --include "contracts/**/*.pdf"

Exclude, MIME, and size filters aren’t available as flags on ingest add. Set them with ingest update after the rule exists:


# Capture the rule ID first (CLI emits the IngestRule on add)
RULE_ID="<rule_id from the add response>"
 
# Exclude drafts, require ≥ 1 KB
dodil k3 ingest update "$RULE_ID" -b kb-prod \
  --exclude "**/draft-*" \
  --min-size 1024
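
To reason about which objects a rule will pick up before uploading anything, it can help to model the filters locally. The sketch below assumes gitignore-style glob semantics (`*` stays within one path segment, `**/` spans zero or more directories); K3’s actual matcher may differ, and `glob_matches` and `rule_matches` are hypothetical helpers.

```python
import re


def glob_matches(pattern, key):
    """Match an object key against a glob, assuming gitignore-style semantics."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.fullmatch("".join(parts), key) is not None


def rule_matches(key, size_bytes, include, exclude=(), min_size=0):
    """True if the key passes the include, exclude, and minimum-size filters."""
    if size_bytes < min_size:
        return False
    if any(glob_matches(p, key) for p in exclude):
        return False
    return any(glob_matches(p, key) for p in include)


if __name__ == "__main__":
    for key, size in [("papers/attention.pdf", 2_000_000),
                      ("contracts/draft-v2.pdf", 5_000),
                      ("papers/tiny.pdf", 512)]:
        print(key, rule_matches(key, size, ["**/*.pdf"], ["**/draft-*"], 1024))
```

Under these assumptions `*.pdf` only matches keys at the bucket root, while `**/*.pdf` matches PDFs at any depth.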

6. Upload a test PDF


# Any PDF — here, the "Attention Is All You Need" paper
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
 
dodil k3 object create ./attention.pdf -b kb-prod -k papers/attention.pdf

The PUT completes immediately; the pipeline runs asynchronously.

7. Watch the ingest job


# All jobs for our rule
dodil k3 ingest jobs --bucket kb-prod --rule "$RULE_ID" -o json
 
# Or filter by pipeline — useful when multiple rules feed the same pipeline
dodil k3 ingest jobs --bucket kb-prod --pipeline "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {jobId, status, chunksCreated, embeddingsCreated, embeddingsWritten}'

Status progression (a healthy run follows the top line; retries and permanent failures branch off below it):


PENDING ─► PROCESSING ─► COMPLETED
                │
                ▼  transient failure
            RETRYING ─► PROCESSING ─► COMPLETED   (K3 retries automatically)
                │
                ▼  permanent failure (max attempts reached)
              FAILED
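
The diagram can be written down as a small state machine, which is handy when polling from a script. This is a sketch under stated assumptions: the `INGEST_STATUS_` prefix is assumed for every status (the sample output only shows it for COMPLETED), the direct PROCESSING to FAILED edge is an assumption, and `fetch_status` is a caller-supplied function (for example, one that wraps the ingest jobs command).

```python
import time

TERMINAL = {"COMPLETED", "FAILED"}

# Allowed transitions, read off the diagram above.
TRANSITIONS = {
    "PENDING": {"PROCESSING"},
    "PROCESSING": {"COMPLETED", "RETRYING", "FAILED"},  # direct FAILED is assumed
    "RETRYING": {"PROCESSING", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}


def normalize(status):
    """Strip the assumed INGEST_STATUS_ prefix, e.g. INGEST_STATUS_COMPLETED."""
    prefix = "INGEST_STATUS_"
    return status[len(prefix):] if status.startswith(prefix) else status


def is_valid_history(history):
    """True if every consecutive pair of statuses is an allowed transition."""
    return all(nxt in TRANSITIONS.get(cur, set())
               for cur, nxt in zip(history, history[1:]))


def wait_for_terminal(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_status() until COMPLETED or FAILED; return the distinct statuses seen."""
    deadline = time.monotonic() + timeout
    history = []
    while True:
        status = normalize(fetch_status())
        if not history or history[-1] != status:
            history.append(status)
        if status in TERMINAL:
            return history
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    # Simulated job; replace the lambda with a real status lookup.
    fake = iter(["INGEST_STATUS_PENDING", "INGEST_STATUS_PROCESSING",
                 "INGEST_STATUS_RETRYING", "INGEST_STATUS_PROCESSING",
                 "INGEST_STATUS_COMPLETED"])
    print(wait_for_terminal(lambda: next(fake), sleep=lambda _: None))
```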

When the job reaches COMPLETED (reported as INGEST_STATUS_COMPLETED in the JSON), the job record looks like this:


{
  "jobId": "job_a1b2…",
  "status": "INGEST_STATUS_COMPLETED",
  "chunksCreated": 47,
  "embeddingsCreated": 47,
  "embeddingsWritten": 47,
  "vectorStatus": "success",
  "batchesReceived": 5
}

chunksCreated == embeddingsCreated == embeddingsWritten is the happy path. If embeddingsWritten < embeddingsCreated, some embeddings didn’t land in the collection — inspect the Vector primitive for collection health.
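
The counter comparison is easy to automate. Here is a minimal sketch that classifies a job record by its three counters; the field names come from the sample output above, while `classify_job` and its return labels are hypothetical.

```python
def classify_job(job):
    """Classify a job by comparing chunk and embedding counters."""
    chunks = job.get("chunksCreated", 0)
    created = job.get("embeddingsCreated", 0)
    written = job.get("embeddingsWritten", 0)

    if chunks > 0 and chunks == created == written:
        return "ok"                # happy path: every chunk embedded and stored
    if chunks > 0 and written == 0:
        return "nothing-written"   # chunks exist but nothing reached the collection
    if written < created:
        return "partial-write"     # some embeddings did not land
    return "check-manually"        # anything else, e.g. zero chunks


sample_job = {
    "status": "INGEST_STATUS_COMPLETED",
    "chunksCreated": 47,
    "embeddingsCreated": 47,
    "embeddingsWritten": 47,
}
print(classify_job(sample_job))
```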

Cross-check the object’s pipeline status without leaving Storage:


dodil k3 object show papers/attention.pdf -b kb-prod -o json \
  | jq '.pipelineStatuses[]'

There is one entry per rule that ran, and its aggregate counts should match the job’s counters.

8. Search the result

The embeddings are now live in docs-index. Query them:


dodil k3 search --bucket kb-prod --collection docs-index \
  --query "what is multi-head attention" -o json

You should see chunks from papers/attention.pdf ranked by similarity. The search side uses the matching text_embedding_search template — run ad-hoc, no separate pipeline needed.
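
For intuition about what “ranked by similarity” means, here is a toy ranking over hand-written 2-dimensional vectors. It assumes cosine similarity, which is a common choice but is not confirmed as the metric K3 uses; real embeddings have far more dimensions, and `cosine` and `rank` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, chunk_vecs, top_k=3):
    """Return (chunk_id, score) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:top_k]]


# Toy vectors, not real embeddings
chunk_vecs = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
for chunk_id, score in rank([1.0, 0.0], chunk_vecs):
    print(chunk_id, round(score, 3))
```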

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| Job stuck in `PENDING` for more than a minute | Pipeline binding missing | Run `dodil k3 pipeline get` and confirm `storeEntityKind == "vector"` and `storeEntityId` is set |
| `FAILED` immediately | Object wasn’t extractable (corrupt PDF, unsupported encoding) | Check `error_details` on the job; try a known-good PDF |
| `chunksCreated > 0` but `embeddingsWritten == 0` | Vector collection misconfigured (wrong dimension, missing index) | Inspect via Vector; recreate the collection if needed |
| No job ever spawns | Rule didn’t match the object’s path | Verify `includePatterns` with `dodil k3 ingest get <rule_id>`; remember `**/*.pdf` ≠ `*.pdf` |
| New rule, but old objects aren’t indexed | Existing objects don’t auto-replay | Run `TriggerIngestion` with `retry_failed: false` to backfill (see Replay & Retry) |
| `embeddingsWritten` drifts below `embeddingsCreated` over time | Background collection compaction or external delete | Check the collection’s health in the Vector primitive; reindex if needed |

Variations

Same recipe shape, different template:

| Want to index | Template | Notes |
| --- | --- | --- |
| Source code (Rust, Python, JS, Go, …) | `code_embedding_index` | AST-aware chunking via tree-sitter; respects function / class boundaries |
| Mixed media (images, video frames, audio spectrograms, PDF page renders) | `visual_embedding_index` | Multimodal; useful when you want visual + textual recall over the same corpus |
| Faces from photos | `face_embedding_index` | SCRFD detection + embedding; one face → one chunk |
| Objects in images (open-vocabulary) | `object_embedding_index` | Open-vocabulary detection, not just YOLO classes |

All four follow this exact recipe: change the --scriptum value in step 3, and re-inspect the template’s contract in step 2 rather than assuming the options listed for text_embedding_index.