Pipeline Collection

Goal: stand up a searchable vector collection where every uploaded document gets chunked + embedded + indexed automatically — no manual ingestion code.

Template used: text_embedding_index — the canonical text/PDF/docx/HTML/audio/video RAG-ingest template. The same pattern works for any *_embedding_index template in the vector catalog.

Shape:

document upload
  ──► bucket
  ──► auto-generated ingest rule matches
  ──► Scriptum pipeline runs (chunks + embeds via Ignite)
  ──► Milvus collection (lazy-materialized on first ingest)
  ──► Search RPC

Prerequisites

  • dodil CLI installed and dodil login done (see CLI Basics)
  • A bucket — kb-prod:
    dodil k3 bucket create kb-prod -d "RAG corpus"
  • Vector engine configured on the bucket (auto mode):
    dodil k3 vector store create -b kb-prod -m auto
    # Wait for ACTIVE
    dodil k3 vector store get -b kb-prod -o json | jq '.status'
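
The "wait for ACTIVE" step can be scripted instead of re-run by hand. The sketch below is one way to do it and is not part of the dodil CLI: `wait_for_status` and `store_status` are hypothetical helper names. The match accepts a bare `ACTIVE` or a value ending in `_ACTIVE`, on the assumption that the store status may use the same enum-prefix style as the other statuses in this recipe.

```shell
# wait_for_status EXPECTED TRIES DELAY CMD [ARGS...]
# Runs CMD until its stdout is EXPECTED (or ends with _EXPECTED), at most
# TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 on a match and 1 if every attempt fails.
wait_for_status() {
  local expected tries delay status i
  expected=$1
  tries=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$("$@" 2>/dev/null) || status=""
    case "$status" in
      "$expected" | *_"$expected") return 0 ;;
    esac
    i=$((i + 1))
    [ "$i" -lt "$tries" ] && sleep "$delay"
  done
  return 1
}

# Hypothetical wrapper around the status command from the prerequisites.
store_status() {
  dodil k3 vector store get -b kb-prod -o json | jq -r '.status'
}

# Example: poll every 5 seconds, for up to 5 minutes.
# wait_for_status ACTIVE 60 5 store_status || echo "vector store not ACTIVE yet" >&2
```

Passing the status command as arguments keeps the loop reusable for other resources, such as a collection's status.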

1. Browse + pick a template

dodil k3 vector templates -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'

For text + PDF + docx ingest, pick text_embedding_index. Inspect its contract to see what (if any) template_inputs it needs:

dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'

text_embedding_index has no required runtime inputs — everything has a contract default. Other templates may require inputs:

| Template | Required template_inputs | Notes |
| --- | --- | --- |
| text_embedding_index | none | Chunk size + embed model come from contract defaults |
| code_embedding_index | none | Language auto-detected from content_type / extension |
| visual_embedding_index | none | Modality auto-routed (image / video / audio / pdf) |
| face_embedding_index | none | SCRFD face detection, then embed each face crop |
| object_embedding_index | labels: [string] | Open-vocabulary detection; you supply the label vocabulary |

2. Create the collection (pipeline-mode)

For text_embedding_index (no required inputs), the CLI works directly:

dodil k3 vector collection add docs -b kb-prod \
  --description "PDF / docx / HTML embeddings" \
  --template text_embedding_index \
  -o json

export COLLECTION_ID=$(dodil k3 vector collection get docs -b kb-prod -o json | jq -r '.collectionId')

For object_embedding_index (requires labels), use the API:

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/pipelines" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "name": "products",
    "description": "Product image objects",
    "templateId": "object_embedding_index",
    "templateInputs": {
      "labels": ["bottle", "bag", "shoe", "watch", "jewelry"]
    }
  }'

What K3 atomically created

AddVectorPipeline does three things in one call:

| Created | Where it lives | What it does |
| --- | --- | --- |
| docs collection row | Vector service | Schema is lazy: the Milvus collection materializes on first ingest |
| A Scriptum pipeline | Pipelines service | Bound to text_embedding_index; destination = the new collection |
| An auto-generated ingest rule | Pipelines service | Globs from the template’s acceptedExtensions (**/*.pdf, **/*.txt, …) |

Inspect the bound pipeline:

PIPELINE_ID=$(dodil k3 vector collection get docs -b kb-prod -o json | jq -r '.embedPipelineId')

# The pipeline
dodil k3 pipeline get "$PIPELINE_ID" -b kb-prod -o json \
  | jq '{name, scriptumTemplate, storeEntityKind, storeEntityName}'

# The auto-rule
dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.rules[] | {ruleId, name, includePatterns, enabled}'
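
When an upload produces no ingest job, it can help to test the object key against the rule's includePatterns locally. The sketch below approximates that check with the shell's own `case` pattern matching. `key_matches_any` is a hypothetical helper, and K3's actual glob dialect is not specified on this page, so treat the result as a hint, not a guarantee. In the shell, `*` also matches `/`, so `**/*.pdf` behaves like "any key that contains a slash and ends in .pdf".

```shell
# key_matches_any KEY PATTERN...
# Returns 0 if KEY matches at least one glob PATTERN, 1 otherwise.
# Quote the patterns at the call site so the shell does not expand them
# against local files.
key_matches_any() {
  local key pattern
  key=$1
  shift
  for pattern in "$@"; do
    # $pattern is left unquoted on purpose: `case` then treats it as a glob.
    case "$key" in
      $pattern) return 0 ;;
    esac
  done
  return 1
}

# Example with the patterns shown for text_embedding_index:
# key_matches_any "papers/attention.pdf" '**/*.pdf' '**/*.txt' && echo "rule should fire"
```

Under this approximation a root-level key such as attention.pdf does not match `**/*.pdf`. Whether K3 treats it the same way is worth confirming with a real upload.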

3. Upload documents

# A PDF
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-prod -k papers/attention.pdf

# Plain text
cat > intro.txt <<'EOF'
The Transformer architecture introduced multi-head attention as a way for the
model to jointly attend to information from different representation subspaces
at different positions.
EOF
dodil k3 object create ./intro.txt -b kb-prod -k papers/intro.txt

Each upload fires through the auto-generated rule → spawns an ingest job → runs text_embedding_index → writes embeddings to the docs collection.

4. Watch the ingest jobs

dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '[.jobs[] | {object: .object.key, status, chunksCreated, embeddingsWritten}]'

Status path: PENDING → PROCESSING → COMPLETED (the JSON reports the prefixed form, e.g. INGEST_STATUS_COMPLETED). When done:

[
  {"object": "papers/attention.pdf", "status": "INGEST_STATUS_COMPLETED", "chunksCreated": 47, "embeddingsWritten": 47},
  {"object": "papers/intro.txt", "status": "INGEST_STATUS_COMPLETED", "chunksCreated": 1, "embeddingsWritten": 1}
]

chunksCreated == embeddingsWritten is the happy path. If embeddingsWritten is lower, see Pipelines → Replay & Retry.
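
The chunksCreated == embeddingsWritten check is easy to automate once there are more than a handful of jobs. The sketch below is a hypothetical helper, not a dodil feature. `lagging_jobs` reads tab-separated rows (object key, chunksCreated, embeddingsWritten) and reports every job that wrote fewer embeddings than chunks. The commented usage assumes the JSON field names shown above.

```shell
# lagging_jobs: read "key<TAB>chunksCreated<TAB>embeddingsWritten" rows on
# stdin and print the rows where fewer embeddings were written than chunks
# were created. Exit status is 1 if any such row exists, 0 otherwise.
lagging_jobs() {
  awk -F '\t' '
    $3 + 0 < $2 + 0 {
      printf "%s: %d/%d embeddings written\n", $1, $3, $2
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage against the live CLI:
# dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
#   | jq -r '.jobs[] | [.object.key, .chunksCreated, .embeddingsWritten] | @tsv' \
#   | lagging_jobs
```

A non-zero exit status makes the helper usable as a gate in a script: any reported job is a candidate for the replay flow mentioned above.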

5. Search

# Simple text search
dodil k3 search "what is multi-head attention" -b kb-prod -c docs -o json \
  | jq '.results[] | {score, object: .object.key, content}'

For richer queries (hybrid mode, rerank, pre-filter, include_content), drop to the API (see Recipes → Hybrid + Rerank):

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "text": "what is multi-head attention",
    "collectionName": "docs",
    "topK": 5,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true,
    "includeContent": true
  }' | jq '.results[] | {score, object: .object.key, content: (.content | .[0:200])}'

6. Inspect the resolved schema

After first ingest, the Milvus collection has materialized. Confirm the schema (dimensions, embed_model, sparse_mode) — useful for sanity checks before doing multi-collection search:

dodil k3 vector collection get docs -b kb-prod -o json \
  | jq '{
      status, dimensions, embeddingType, distanceMetric,
      sparseMode, embedModel, modality, milvusCollection
    }'

For text_embedding_index at the time of writing:

{
  "status": "COLLECTION_STATUS_ACTIVE",
  "dimensions": 1024,
  "embeddingType": "EMBEDDING_TYPE_FLOAT",
  "distanceMetric": "DISTANCE_METRIC_COSINE",
  "sparseMode": "SPARSE_MODE_BM25",
  "embedModel": "jina-embeddings-v4",
  "modality": "text",
  "milvusCollection": "kb_prod_docs_a1b2"
}
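
For the multi-collection sanity check, one option is to compare a few schema fields across collections before searching them together. The sketch below is a hypothetical helper. It assumes that dimensions, embedModel, and distanceMetric are the fields that need to agree, which this page does not state, and `notes` in the commented usage is a made-up second collection.

```shell
# schema_mismatches: read "name<TAB>dimensions<TAB>embedModel<TAB>distanceMetric"
# rows on stdin, treat the first row as the reference, and print every later
# row whose schema fields differ. Exit status is 1 if any row differs.
schema_mismatches() {
  awk -F '\t' '
    NR == 1 { ref = $2 FS $3 FS $4; ref_name = $1; next }
    ($2 FS $3 FS $4) != ref {
      printf "%s differs from %s: %s/%s/%s\n", $1, ref_name, $2, $3, $4
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage against the live CLI ("notes" is a hypothetical second collection):
# for c in docs notes; do
#   dodil k3 vector collection get "$c" -b kb-prod -o json \
#     | jq -r --arg c "$c" '[$c, .dimensions, .embedModel, .distanceMetric] | @tsv'
# done | schema_mismatches
```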

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| vector collection add fails with engine not active | Engine still PROVISIONING | Run dodil k3 vector store get -b kb-prod -o json and wait for status ACTIVE |
| Upload succeeds but no ingest job spawns | The auto-generated rule’s globs don’t match the path/extension | List the rule (see step 2), confirm includePatterns covers your path |
| Jobs COMPLETED but search returns nothing | Milvus index still building (the first ingest is slower) | Wait 10–30 s after the first ingest, then re-search |
| add rejects with "template requires field 'labels'" | object_embedding_index requires the labels input | Use the API call with templateInputs.labels |
| Need to pause ingestion without deleting | Use the bound rule’s enabled flag | dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=false |
| Re-uploaded same key → duplicate embeddings | K3 fires the pipeline on every PUT (incl. overwrites) | UpsertVectors semantics would avoid this but are not available for pipeline-mode; for dedup, post-process via the external-collection flow |

Variations — other vector templates

Same recipe, different template:

| Want to index | Template | Notes |
| --- | --- | --- |
| Source code | code_embedding_index | AST-aware chunking via tree-sitter (rust / python / js / ts / go / java / cpp / …) |
| Mixed media (images / video / audio / PDF page renders) | visual_embedding_index | Multimodal; useful for combined visual + textual recall |
| Faces in photos | face_embedding_index | SCRFD detection + per-face crops + embed |
| Objects in images (open-vocab) | object_embedding_index | Requires labels runtime input |

All four are pipeline-mode with the same atomic create + auto-rule flow.

Cleanup

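# RULE_ID is the ruleId printed by `dodil k3 ingest list` in step 2
# (this lookup assumes the pipeline has a single auto-generated rule)
RULE_ID=$(dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')

# 1. Disable the rule first (stops new ingests)
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=false

# 2. Delete the rule + pipeline + collection (any order)
dodil k3 ingest delete "$RULE_ID" -b kb-prod
dodil k3 pipeline delete "$PIPELINE_ID" -b kb-prod
dodil k3 vector collection delete "$COLLECTION_ID" -b kb-prod

# 3. (Optional) Delete the bucket (removes the uploaded objects first)
dodil k3 object remove papers/attention.pdf -b kb-prod
dodil k3 object remove papers/intro.txt -b kb-prod
dodil k3 bucket delete kb-prod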
