PDF → Vector collection

Goal: every PDF uploaded to a bucket is automatically chunked, embedded, and indexed into a vector collection — searchable end-to-end without any glue code.

Template: text_embedding_index, K3’s production template for text/PDF/docx/HTML/audio/video embedding. It resolves a source object, chunks it, embeds the chunks, and writes them to a vector collection.

Shape:

PDF upload ──► bucket ──► rule matches ──► text_embedding_index
                                                  ▼
                                Vector collection (chunks + embeddings)
                                                  ▼
                                            Vector search

Prerequisites

  • dodil CLI installed and authenticated with dodil login (see CLI Basics)
  • A bucket (we’ll use kb-prod):
    dodil k3 bucket create kb-prod -d "RAG corpus"
  • A vector engine + collection in that bucket. See Vector — Quickstart. We’ll assume:
    • vector engine: vec-kb-prod
    • collection: docs-index
    • the collection’s store_entity_id (you’ll capture it in step 1)

1. Capture the destination’s store_entity_id

The pipeline binds to your vector collection through its store_entity_id:

dodil k3 vector list --bucket kb-prod -o json \
  | jq -r '.collections[] | select(.name == "docs-index") | .storeEntityId'

export COLLECTION_ID="<the store_entity_id from above>"
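
If you would rather pull the ID out in a script than with jq, the same selection can be sketched in Python. This assumes the JSON has the shape implied by the jq filter above (a top-level collections array whose entries carry name and storeEntityId). The function name find_store_entity_id and the sample IDs are hypothetical, not part of the CLI.

```python
import json


def find_store_entity_id(listing_json: str, collection_name: str) -> str:
    """Return the storeEntityId of the collection with the given name.

    Assumed shape (taken from the jq filter in this step):
    {"collections": [{"name": ..., "storeEntityId": ...}, ...]}
    """
    listing = json.loads(listing_json)
    matches = [
        c for c in listing.get("collections", [])
        if c.get("name") == collection_name
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one collection named {collection_name!r}, "
            f"found {len(matches)}"
        )
    return matches[0]["storeEntityId"]


# Hypothetical sample output; real IDs will look different.
sample = json.dumps({"collections": [
    {"name": "other", "storeEntityId": "se-000"},
    {"name": "docs-index", "storeEntityId": "se-123"},
]})
print(find_store_entity_id(sample, "docs-index"))
```

Failing loudly when zero or several collections match avoids exporting an empty COLLECTION_ID and only noticing it later at bind time.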

2. Inspect the template contract

Before spawning a pipeline, look at the template’s typed contract. It tells you exactly which options it accepts:

dodil k3 template get text_embedding_index -o json | jq '{name, description, labels}'
dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'

Common knobs for this template (defaults are sensible — override only if needed):

| Option | Meaning |
| --- | --- |
| chunk_size | Tokens per chunk |
| chunk_overlap | Tokens overlapping between adjacent chunks |
| embed_model | Embedding model identifier (resolved from the template by default) |
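
To get a feel for how chunk_size and chunk_overlap interact, here is a minimal sliding-window sketch in Python. It is a generic illustration, not K3's actual chunker: tokens are just list items, and the function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size tokens.

    Each window starts (chunk_size - chunk_overlap) tokens after the
    previous one, so adjacent chunks share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


print(chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A larger overlap keeps more context across chunk boundaries but produces more chunks, and therefore more embeddings, for the same document.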

3. Create the pipeline

dodil k3 pipeline create embed-pdfs \
  --bucket kb-prod \
  --scriptum text_embedding_index \
  -o json

Capture the pipeline_id:

export PIPELINE_ID="<pipeline_id from the response>"

The CLI’s pipeline create doesn’t yet set store_entity_id or options directly — bind the destination and tune options with update:

# Bind to the vector collection
dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
  --store-entity-id "$COLLECTION_ID"

# (Optional) tune chunking
dodil k3 pipeline update "$PIPELINE_ID" --bucket kb-prod \
  --options-json '{"chunk_size":"1000","chunk_overlap":"150"}'
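
If you generate the --options-json payload from a script, a small helper avoids quoting mistakes. The sketch below mirrors the example above, where option values are passed as strings. The helper name build_options_json and the overlap-smaller-than-size check are assumptions for illustration, not documented K3 rules.

```python
import json


def build_options_json(chunk_size: int, chunk_overlap: int) -> str:
    """Build a compact JSON string for --options-json.

    Values are string-encoded because the example in this recipe
    passes them that way; that is an assumption about the contract.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    options = {
        "chunk_size": str(chunk_size),
        "chunk_overlap": str(chunk_overlap),
    }
    # Compact separators keep the payload free of spaces.
    return json.dumps(options, separators=(",", ":"))


print(build_options_json(1000, 150))
# → {"chunk_size":"1000","chunk_overlap":"150"}
```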

Verify the binding took:

dodil k3 pipeline get "$PIPELINE_ID" --bucket kb-prod -o json \
  | jq '{name, scriptumTemplate, storeEntityKind, storeEntityName, options}'

storeEntityKind should be "vector" — that’s K3’s confirmation that your pipeline is a vector pipeline.
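
The same check can be automated, for example in a CI step. This sketch assumes the full pipeline get output carries storeEntityKind and storeEntityId fields (the latter is referenced in the gotchas table but not in the jq projection above). The function name binding_problems is hypothetical.

```python
def binding_problems(pipeline: dict) -> list[str]:
    """Return human-readable problems with a pipeline's vector binding."""
    problems = []
    kind = pipeline.get("storeEntityKind")
    if kind != "vector":
        problems.append(f'storeEntityKind is {kind!r}, expected "vector"')
    if not pipeline.get("storeEntityId"):
        problems.append("storeEntityId is not set")
    return problems


# Hypothetical pipeline documents for illustration.
bound = {"name": "embed-pdfs", "storeEntityKind": "vector", "storeEntityId": "se-123"}
unbound = {"name": "embed-pdfs"}

print(binding_problems(bound))    # an empty list means the binding looks fine
print(binding_problems(unbound))  # lists both missing pieces
```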

4. Look up the bucket’s internal source

Every bucket gets an auto-created internal source on CreateBucket. We need its source_id to attach a rule:

export SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-prod/sources" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq -r '.sources[] | select(.name == "internal") | .sourceId')
echo "internal source = $SOURCE_ID"

dodil k3 source list isn’t in the CLI yet — that’s why we drop to curl here. See CLI Guide → source.

5. Create the rule

The rule says “any PDF anywhere in this bucket fires the pipeline”:

dodil k3 ingest add pdf-rule \
  --bucket kb-prod \
  --source "$SOURCE_ID" \
  --collection "$PIPELINE_ID" \
  --include "**/*.pdf"

The CLI’s --collection flag maps to the API’s pipeline_id — historical naming.

Tighter scoping (e.g. only contracts/ PDFs) — replace the include pattern:

dodil k3 ingest add contracts-pdf-rule \
  --bucket kb-prod \
  --source "$SOURCE_ID" \
  --collection "$PIPELINE_ID" \
  --include "contracts/**/*.pdf"

Exclude / MIME / size filters aren’t on ingest add flags — set them with ingest update after creation:

# Capture the rule ID first (CLI emits the IngestRule on add)
RULE_ID="<rule_id from the add response>"

# Exclude drafts, require ≥ 1 KB
dodil k3 ingest update "$RULE_ID" -b kb-prod \
  --exclude "**/draft-*" \
  --min-size 1024
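
To reason about which object keys a rule would pick up, it can help to model the filters locally. The sketch below assumes gitignore-style glob semantics (`*` stays within one path segment, `**/` spans zero or more directories). K3's real matcher may differ, so treat this as a mental model only; glob_to_regex and rule_matches are hypothetical helpers.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob into a regex under assumed gitignore-like rules."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")        # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")      # one character within a segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def rule_matches(key, size_bytes, include, exclude=(), min_size=0):
    """True if key passes the include, exclude, and min-size filters."""
    if size_bytes < min_size:
        return False
    if any(glob_to_regex(p).fullmatch(key) for p in exclude):
        return False
    return any(glob_to_regex(p).fullmatch(key) for p in include)


print(rule_matches("papers/attention.pdf", 2_000_000, ["**/*.pdf"]))  # True
print(rule_matches("papers/attention.pdf", 2_000_000, ["*.pdf"]))     # False
```

Under these assumed semantics, a bare `*.pdf` only matches PDFs at the bucket root, which is why the recipe uses `**/*.pdf`.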

6. Upload a test PDF

# Any PDF (here, the "Attention Is All You Need" paper)
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-prod -k papers/attention.pdf

The PUT completes immediately; the pipeline runs asynchronously.

7. Watch the ingest job

# All jobs for our rule
dodil k3 ingest jobs --bucket kb-prod --rule "$RULE_ID" -o json

# Or filter by pipeline (useful when multiple rules feed the same pipeline)
dodil k3 ingest jobs --bucket kb-prod --pipeline "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {jobId, status, chunksCreated, embeddingsCreated, embeddingsWritten}'

Status progression for a healthy run:

PENDING ─► PROCESSING ─► COMPLETED
             ▼ transient failure
           RETRYING ─► PROCESSING ─► COMPLETED   (K3 retries automatically)
             ▼ permanent failure (max attempts reached)
           FAILED
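
If you poll jobs from a script, the progression above can be modeled as a small transition table. This is one reading of the diagram, not K3's documented state machine: it assumes failures branch off PROCESSING and that every status carries the INGEST_STATUS_ prefix seen on COMPLETED in the sample output. All names in the sketch are hypothetical.

```python
# Assumed transitions, read off the status diagram.
ALLOWED = {
    "PENDING": {"PROCESSING"},
    "PROCESSING": {"COMPLETED", "RETRYING", "FAILED"},
    "RETRYING": {"PROCESSING"},
    "COMPLETED": set(),
    "FAILED": set(),
}
# Terminal statuses are the ones with no outgoing transition.
TERMINAL = {status for status, nxt in ALLOWED.items() if not nxt}


def normalize(status: str) -> str:
    """Strip the enum prefix, e.g. INGEST_STATUS_COMPLETED -> COMPLETED."""
    return status.removeprefix("INGEST_STATUS_")


def is_terminal(status: str) -> bool:
    """True once a job can no longer change status (stop polling)."""
    return normalize(status) in TERMINAL


def is_valid_history(statuses) -> bool:
    """Check that a sequence of observed statuses follows the diagram."""
    names = [normalize(s) for s in statuses]
    if not names or names[0] != "PENDING":
        return False
    return all(b in ALLOWED.get(a, set()) for a, b in zip(names, names[1:]))


print(is_terminal("INGEST_STATUS_COMPLETED"))  # True
print(is_valid_history(["PENDING", "PROCESSING", "RETRYING", "PROCESSING", "COMPLETED"]))  # True
```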

When COMPLETED, you’ll see:

{
  "jobId": "job_a1b2…",
  "status": "INGEST_STATUS_COMPLETED",
  "chunksCreated": 47,
  "embeddingsCreated": 47,
  "embeddingsWritten": 47,
  "vectorStatus": "success",
  "batchesReceived": 5
}

chunksCreated == embeddingsCreated == embeddingsWritten is the happy path. If embeddingsWritten < embeddingsCreated, some embeddings didn’t land in the collection — inspect the Vector primitive for collection health.
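
That counter check is easy to script. The sketch below classifies a job document using only the fields shown in the sample output. The labels it returns and the function name classify_job are hypothetical, chosen for illustration.

```python
def classify_job(job: dict) -> str:
    """Classify a job by comparing its three counters."""
    if job.get("status") != "INGEST_STATUS_COMPLETED":
        return "not-completed"
    chunks = job.get("chunksCreated", 0)
    created = job.get("embeddingsCreated", 0)
    written = job.get("embeddingsWritten", 0)
    if chunks == created == written:
        return "healthy"
    if written < created:
        return "partial-write"  # some embeddings did not land in the collection
    return "mismatch"


# Counters taken from the sample output above.
job = {
    "status": "INGEST_STATUS_COMPLETED",
    "chunksCreated": 47,
    "embeddingsCreated": 47,
    "embeddingsWritten": 47,
}
print(classify_job(job))  # → healthy
```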

Cross-check the object’s pipeline status without leaving Storage:

dodil k3 object show papers/attention.pdf -b kb-prod -o json \
  | jq '.pipelineStatuses[]'

One entry per rule that ran — aggregates align with the job’s counters.

8. Search the result

The embeddings are now live in docs-index. Query them:

dodil k3 search --bucket kb-prod --collection docs-index \
  --query "what is multi-head attention" -o json

You should see chunks from papers/attention.pdf ranked by similarity. The search side uses the matching text_embedding_search template — run ad-hoc, no separate pipeline needed.
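
For intuition about what "ranked by similarity" means, here is a toy ranking in Python. It assumes cosine similarity, which is a common choice but is not stated in this recipe; the metric your collection uses may differ. The two-dimensional vectors and chunk IDs are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=3):
    """Return chunk IDs ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Made-up embeddings; real ones have hundreds of dimensions.
chunks = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
print(rank_chunks([1.0, 0.0], chunks))
# → ['a', 'c', 'b']
```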

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| Job stuck in PENDING for > a minute | Pipeline binding missing | Run dodil k3 pipeline get and confirm storeEntityKind == "vector" and storeEntityId is set |
| FAILED immediately | Object wasn’t extractable (corrupt PDF, unsupported encoding) | Check error_details on the job; try a known-good PDF |
| chunksCreated > 0 but embeddingsWritten == 0 | Vector collection misconfigured (wrong dimension, missing index) | Inspect via Vector; recreate the collection if needed |
| No job ever spawns | Rule didn’t match the object’s path | Verify includePatterns with dodil k3 ingest get <rule_id>; remember that **/*.pdf ≠ *.pdf |
| New rule, but old objects aren’t indexed | Existing objects don’t auto-replay | Run TriggerIngestion with retry_failed: false to backfill (see Replay & Retry) |
| embeddingsWritten drifts below embeddingsCreated over time | Background collection compaction or external delete | Check the collection’s health in the Vector primitive; reindex if needed |

Variations

Same recipe shape, different template:

| Want to index | Template | Notes |
| --- | --- | --- |
| Source code (Rust, Python, JS, Go, …) | code_embedding_index | AST-aware chunking via tree-sitter; respects function / class boundaries |
| Mixed media (images, video frames, audio spectrograms, PDF page renders) | visual_embedding_index | Multimodal; useful when you want visual + textual recall over the same corpus |
| Faces from photos | face_embedding_index | SCRFD detection + embedding; one face → one chunk |
| Objects in images (open-vocabulary) | object_embedding_index | Open-vocabulary detection, not just YOLO classes |

All four follow this exact recipe — just change step 3’s --scriptum and step 2’s expectations of the contract.

See also