PDF → Vector collection
Goal: every PDF uploaded to a bucket is automatically chunked, embedded, and indexed into a vector collection — searchable end-to-end without any glue code.
Template: text_embedding_index — K3’s production template for text/PDF/docx/HTML/audio/video embedding. Resolves a source object, chunks it, embeds the chunks, writes them to a vector collection.
Shape:
PDF upload ──► bucket ──► rule matches ──► text_embedding_index
│
▼
Vector collection
(chunks + embeddings)
│
▼
Vector search
Prerequisites
- dodil CLI installed and dodil login done (see CLI Basics)
- A bucket (we’ll use kb-prod): dodil k3 bucket create kb-prod -d "RAG corpus"
- A vector engine + collection in that bucket. See the Vector Quickstart. We’ll assume:
  - vector engine: vec-kb-prod
  - collection: docs-index
  - the collection’s store_entity_id (you’ll capture it in step 1)
1. Capture the destination’s store_entity_id
The pipeline binds to your vector collection through its store_entity_id:
dodil k3 vector list --bucket kb-prod -o json \
| jq -r '.collections[] | select(.name == "docs-index") | .storeEntityId'
export COLLECTION_ID="<the store_entity_id from above>"
2. Inspect the template (optional but recommended)
Before spawning a pipeline, look at the template’s typed contract — it tells you exactly what options it accepts:
dodil k3 template get text_embedding_index -o json | jq '{name, description, labels}'
dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'
Common knobs for this template (defaults are sensible; override only if needed):
| Option | Meaning |
|---|---|
| chunk_size | Tokens per chunk |
| chunk_overlap | Tokens overlapping between adjacent chunks |
| embed_model | Embedding model identifier (resolved from the template by default) |
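To get a feel for how chunk_size and chunk_overlap interact, here is a rough estimate of how many chunks a document yields. It assumes a plain sliding window in which each chunk starts chunk_size - chunk_overlap tokens after the previous one. The template's actual chunker is not specified on this page and may split differently (for example on sentence boundaries), so treat the result as a ballpark. The name estimate_chunks is hypothetical.

```shell
# Rough chunk-count estimate under an ASSUMED sliding-window chunker.
# Usage: estimate_chunks TOTAL_TOKENS CHUNK_SIZE CHUNK_OVERLAP
estimate_chunks() {
  est_total=$1; est_size=$2; est_overlap=$3
  # The window must advance, so the overlap has to be smaller than the chunk.
  if [ "$est_overlap" -ge "$est_size" ]; then
    echo "chunk_overlap must be smaller than chunk_size" >&2
    return 1
  fi
  # Everything fits in a single chunk.
  if [ "$est_total" -le "$est_size" ]; then
    echo 1
    return 0
  fi
  est_stride=$((est_size - est_overlap))
  # One chunk for the first window, then ceil((total - size) / stride) more.
  echo $((1 + (est_total - est_size + est_stride - 1) / est_stride))
}
```

For example, estimate_chunks 10000 1000 150 prints 12: a larger overlap shrinks the stride and raises the chunk count (and the number of embeddings to store).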
3. Create the pipeline
dodil k3 pipeline create embed-pdfs \
--bucket kb-prod \
--scriptum text_embedding_index \
-o json
Capture the pipeline_id:
export PIPELINE_ID="<pipeline_id from the response>"
The CLI’s pipeline create doesn’t yet set store_entity_id or options directly, so bind the destination and tune options with update:
# Bind to the vector collection
dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
--store-entity-id "$COLLECTION_ID"
# (Optional) tune chunking
dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
--options-json '{"chunk_size":"1000","chunk_overlap":"150"}'
Verify the binding took:
dodil k3 pipeline get "$PIPELINE_ID" --bucket kb-prod -o json \
| jq '{name, scriptumTemplate, storeEntityKind, storeEntityName, options}'
storeEntityKind should be "vector", which is K3’s confirmation that your pipeline is a vector pipeline.
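The tuning call in this step passes option values as JSON strings ("1000", not 1000). A small helper can build that payload from plain integers and reject an overlap that is not smaller than the chunk size. This is only a sketch: build_chunk_options is a hypothetical helper, the string-valued form simply mirrors the example above, and the overlap check is a sanity rule assumed here rather than a documented K3 constraint.

```shell
# Build the --options-json payload for chunk_size / chunk_overlap.
# Usage: build_chunk_options CHUNK_SIZE CHUNK_OVERLAP
build_chunk_options() {
  opt_size=$1; opt_overlap=$2
  # Accept only non-negative integers.
  for opt_value in "$opt_size" "$opt_overlap"; do
    case "$opt_value" in
      ''|*[!0-9]*)
        echo "chunk_size and chunk_overlap must be integers" >&2
        return 1 ;;
    esac
  done
  # ASSUMED sanity rule: the overlap must be smaller than the chunk.
  if [ "$opt_overlap" -ge "$opt_size" ]; then
    echo "chunk_overlap must be smaller than chunk_size" >&2
    return 1
  fi
  # Emit string values, matching the form used in the update call above.
  printf '{"chunk_size":"%s","chunk_overlap":"%s"}\n' "$opt_size" "$opt_overlap"
}

# Example (commented out so the snippet has no side effects):
# dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
#   --options-json "$(build_chunk_options 1000 150)"
```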
4. Look up the bucket’s internal source
Every bucket gets an auto-created internal source on CreateBucket. We need its source_id to attach a rule:
export SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-prod/sources" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq -r '.sources[] | select(.name == "internal") | .sourceId')
echo "internal source = $SOURCE_ID"
dodil k3 source list isn’t in the CLI yet, which is why we drop to curl here. See CLI Guide → source.
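By this point the recipe has captured three IDs (COLLECTION_ID, PIPELINE_ID, SOURCE_ID), and every later command depends on them. jq -r prints the literal string null when a field is missing, so a failed lookup can silently produce a bogus ID. A minimal guard, with require_id as a hypothetical helper name:

```shell
# Fail fast when a captured ID is empty, "null", or still a <placeholder>.
# Usage: require_id NAME VALUE
require_id() {
  case "$2" in
    ''|null|'<'*'>')
      echo "error: $1 is not set to a real ID (got '$2')" >&2
      return 1 ;;
  esac
  echo "$1 = $2"
}

# Example (commented out so the snippet has no side effects):
# require_id COLLECTION_ID "$COLLECTION_ID" &&
#   require_id PIPELINE_ID "$PIPELINE_ID" &&
#   require_id SOURCE_ID "$SOURCE_ID"
```

Running the three checks before step 5 catches a lookup that matched nothing, for instance when the collection or source name is misspelled in the jq select.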
5. Create the rule
The rule says “any PDF anywhere in this bucket fires the pipeline”:
dodil k3 ingest add pdf-rule \
--bucket kb-prod \
--source "$SOURCE_ID" \
--collection "$PIPELINE_ID" \
--include "**/*.pdf"
The CLI’s --collection flag maps to the API’s pipeline_id (historical naming).
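How an include pattern such as **/*.pdf differs from *.pdf is easiest to see by testing object keys against both. The sketch below converts a pattern to a regular expression under assumed gitignore-style semantics: * stays within one path segment, and **/ spans zero or more directories. K3's matcher may differ in edge cases (for instance, whether **/ also matches root-level keys), so verify against a real rule. glob_to_regex and glob_matches are hypothetical helpers and handle only literal text, * and **/.

```shell
# Convert a simple glob to an extended regex (ASSUMED semantics, see above).
# Dots are escaped, "**/" becomes "(.*/)?", and "*" becomes "[^/]*".
glob_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/\./\\./g' \
    -e 's|\*\*/|@DSTAR@|g' \
    -e 's|\*|[^/]*|g' \
    -e 's|@DSTAR@|(.*/)?|g'
}

# Usage: glob_matches PATTERN OBJECT_KEY  (exit status 0 on match)
glob_matches() {
  printf '%s\n' "$2" | grep -Eq "^$(glob_to_regex "$1")\$"
}

# Print a small match matrix for three keys and three patterns.
for key in attention.pdf papers/attention.pdf contracts/2024/nda.pdf; do
  for pattern in '*.pdf' '**/*.pdf' 'contracts/**/*.pdf'; do
    if glob_matches "$pattern" "$key"; then result=match; else result=no-match; fi
    printf '%-24s %-20s %s\n' "$key" "$pattern" "$result"
  done
done
```

Under these assumptions, papers/attention.pdf matches **/*.pdf but not *.pdf, and only contracts/2024/nda.pdf matches contracts/**/*.pdf.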
Tighter scoping (e.g. only contracts/ PDFs) — replace the include pattern:
dodil k3 ingest add contracts-pdf-rule \
--bucket kb-prod \
--source "$SOURCE_ID" \
--collection "$PIPELINE_ID" \
--include "contracts/**/*.pdf"
Exclude, MIME, and size filters aren’t available as ingest add flags; set them with ingest update after creation:
# Capture the rule ID first (CLI emits the IngestRule on add)
RULE_ID="<rule_id from the add response>"
# Exclude drafts, require ≥ 1 KB
dodil k3 ingest update "$RULE_ID" -b kb-prod \
--exclude "**/draft-*" \
--min-size 1024
6. Upload a test PDF
# Any PDF — here, the "Attention Is All You Need" paper
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-prod -k papers/attention.pdf
The PUT completes immediately; the pipeline runs asynchronously.
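For more than one file, the same object create call can be wrapped in a loop. The sketch below walks a local directory and keeps each file's relative path as the object key, so a rule scoped by directory still applies. upload_pdfs is a hypothetical helper; it uses only the object create flags shown above and does not handle file names that contain newlines.

```shell
# Upload every PDF under a local directory, preserving relative paths as keys.
# Usage: upload_pdfs LOCAL_DIR BUCKET KEY_PREFIX
upload_pdfs() {
  up_dir=${1%/}; up_bucket=$2; up_prefix=${3%/}
  find "$up_dir" -type f -name '*.pdf' | while IFS= read -r up_file; do
    # Strip the local directory so the key mirrors the relative path.
    up_key="$up_prefix/${up_file#"$up_dir"/}"
    dodil k3 object create "$up_file" -b "$up_bucket" -k "$up_key" ||
      echo "upload failed: $up_file" >&2
  done
}

# Example (commented out so the snippet has no side effects):
# upload_pdfs ./papers kb-prod papers
```

Each upload should spawn its own ingest job, which the next step shows how to watch.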
7. Watch the ingest job
# All jobs for our rule
dodil k3 ingest jobs --bucket kb-prod --rule "$RULE_ID" -o json
# Or filter by pipeline — useful when multiple rules feed the same pipeline
dodil k3 ingest jobs --bucket kb-prod --pipeline "$PIPELINE_ID" -o json \
| jq '.jobs[] | {jobId, status, chunksCreated, embeddingsCreated, embeddingsWritten}'
Status progression for a healthy run:
PENDING ─► PROCESSING ─► COMPLETED
│
▼ transient failure
RETRYING ─► PROCESSING ─► COMPLETED (K3 retries automatically)
│
▼ permanent failure (max attempts reached)
FAILED
When COMPLETED, you’ll see:
{
"jobId": "job_a1b2…",
"status": "INGEST_STATUS_COMPLETED",
"chunksCreated": 47,
"embeddingsCreated": 47,
"embeddingsWritten": 47,
"vectorStatus": "success",
"batchesReceived": 5
}
chunksCreated == embeddingsCreated == embeddingsWritten is the happy path. If embeddingsWritten < embeddingsCreated, some embeddings didn’t land in the collection; inspect the Vector primitive for collection health.
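The counter comparison is easy to script. A minimal sketch that classifies a finished job from its three counters; check_counters is a hypothetical helper, and the values would come from the jq output in this step:

```shell
# Classify a finished job from its three counters.
# Usage: check_counters CHUNKS_CREATED EMBEDDINGS_CREATED EMBEDDINGS_WRITTEN
check_counters() {
  cc_chunks=$1; cc_created=$2; cc_written=$3
  if [ "$cc_chunks" -eq "$cc_created" ] && [ "$cc_created" -eq "$cc_written" ]; then
    # Happy path: all three counters agree.
    echo "ok: all $cc_chunks chunks embedded and written"
  elif [ "$cc_written" -lt "$cc_created" ]; then
    # Some embeddings were produced but never reached the collection.
    echo "warn: $((cc_created - cc_written)) embeddings not written; check collection health"
    return 1
  else
    echo "warn: counters disagree (chunks=$cc_chunks created=$cc_created written=$cc_written)"
    return 1
  fi
}
```

With the sample job above, check_counters 47 47 47 prints "ok: all 47 chunks embedded and written" and returns 0; any mismatch returns 1, which makes the function usable as a gate in a script.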
Cross-check the object’s pipeline status without leaving Storage:
dodil k3 object show papers/attention.pdf -b kb-prod -o json \
| jq '.pipelineStatuses[]'
One entry per rule that ran; the aggregates align with the job’s counters.
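Because the pipeline runs asynchronously, a script usually has to wait for a terminal status before searching. The sketch below polls the ingest jobs command from this step until the job is COMPLETED or FAILED. Several things are assumptions: that .jobs[0] is the job of interest, that every status value ends in the names from the progression diagram (as INGEST_STATUS_COMPLETED does), and the helper names is_terminal, job_status and wait_for_job themselves.

```shell
# True when a status string is one of the two terminal states.
is_terminal() {
  case "$1" in
    *COMPLETED|*FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the status of the first job for a rule (ASSUMES .jobs[0] is the one you want).
job_status() {
  dodil k3 ingest jobs --bucket "$1" --rule "$2" -o json | jq -r '.jobs[0].status'
}

# Poll until the job is terminal or the attempts run out.
# Usage: wait_for_job BUCKET RULE_ID [MAX_ATTEMPTS] [SLEEP_SECONDS]
# Returns 0 on COMPLETED, 1 on FAILED, 2 on timeout.
wait_for_job() {
  wj_attempts=${3:-30}; wj_delay=${4:-5}
  wj_i=0
  while [ "$wj_i" -lt "$wj_attempts" ]; do
    wj_state=$(job_status "$1" "$2")
    echo "attempt $((wj_i + 1)): $wj_state"
    if is_terminal "$wj_state"; then
      case "$wj_state" in
        *COMPLETED) return 0 ;;
        *) return 1 ;;
      esac
    fi
    wj_i=$((wj_i + 1))
    sleep "$wj_delay"
  done
  echo "timed out waiting for job" >&2
  return 2
}

# Example (commented out so the snippet has no side effects):
# wait_for_job kb-prod "$RULE_ID" && echo "ready to search"
```

RETRYING is deliberately treated as non-terminal, since K3 retries transient failures on its own.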
8. Search the result
The embeddings are now live in docs-index. Query them:
dodil k3 search --bucket kb-prod --collection docs-index \
--query "what is multi-head attention" -o json
You should see chunks from papers/attention.pdf ranked by similarity. The search side uses the matching text_embedding_search template, run ad hoc; no separate pipeline is needed.
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| Job stuck in PENDING for more than a minute | Pipeline binding missing | Run dodil k3 pipeline get and confirm storeEntityKind == "vector" and storeEntityId is set |
| FAILED immediately | Object wasn’t extractable (corrupt PDF, unsupported encoding) | Check error_details on the job; try a known-good PDF |
| chunksCreated > 0 but embeddingsWritten == 0 | Vector collection misconfigured (wrong dimension, missing index) | Inspect via Vector; recreate the collection if needed |
| No job ever spawns | Rule didn’t match the object’s path | Verify includePatterns with dodil k3 ingest get <rule_id>; remember **/*.pdf ≠ *.pdf |
| New rule, but old objects aren’t indexed | Existing objects don’t auto-replay | Run TriggerIngestion with retry_failed: false to backfill — see Replay & Retry |
| embeddingsWritten drifts below embeddingsCreated over time | Background collection compaction or external delete | Check the collection’s health in the Vector primitive; reindex if needed |
Variations
Same recipe shape, different template:
| Want to index | Template | Notes |
|---|---|---|
| Source code (Rust, Python, JS, Go, …) | code_embedding_index | AST-aware chunking via tree-sitter; respects function / class boundaries |
| Mixed media (images, video frames, audio spectrograms, PDF page renders) | visual_embedding_index | Multimodal — useful when you want visual + textual recall over the same corpus |
| Faces from photos | face_embedding_index | SCRFD detection + embedding; one face → one chunk |
| Objects in images (open-vocabulary) | object_embedding_index | Open-vocabulary detection, not just YOLO classes |
All four follow this exact recipe — just change step 3’s --scriptum and step 2’s expectations of the contract.
See also
- Replay & Retry: recover from FAILED jobs, replay after pipeline changes
- Documents → Warehouse: same flow but writing structured rows instead of embeddings
- Quickstart: the abbreviated version of this recipe
- Templates → The catalog: the other 23 production templates
- Vector: the destination primitive (Milvus collections)