Pipeline Collection
Goal: stand up a searchable vector collection where every uploaded document gets chunked + embedded + indexed automatically — no manual ingestion code.
Template used: text_embedding_index — the canonical text/PDF/docx/HTML/audio/video RAG-ingest template. The same pattern works for any *_embedding_index template in the vector catalog.
Shape:
document upload ──► bucket ──► auto-generated ingest rule matches
                                        │
                                        ▼
                               Scriptum pipeline runs
                            (chunks + embeds via Ignite)
                                        │
                                        ▼
                                 Milvus collection
                        (lazy-materialized on first ingest)
                                        │
                                        ▼
                                   Search RPC
Prerequisites
- dodil CLI installed and dodil login done (see CLI Basics)
- A bucket, kb-prod:
  dodil k3 bucket create kb-prod -d "RAG corpus"
- Vector engine configured on the bucket (auto mode):
  dodil k3 vector store create -b kb-prod -m auto
  # Wait for ACTIVE
  dodil k3 vector store get -b kb-prod -o json | jq '.status'
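The "wait for ACTIVE" step can be scripted rather than re-run by hand. The following is a minimal sketch of a generic polling helper; wait_for_value and engine_status are hypothetical names, and the assumption that the status field is the bare string ACTIVE (rather than an enum-prefixed value) should be checked against your own output.

```shell
# wait_for_value EXPECTED TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...
# Re-runs COMMAND until its stdout equals EXPECTED or the timeout elapses.
# Returns 0 on a match, 1 on timeout.
wait_for_value() {
  local expected="$1" timeout="$2" interval="$3"
  shift 3
  local deadline current=""
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    # A failing command counts as "no value yet", not as a fatal error.
    current="$("$@" 2>/dev/null)" || current=""
    if [ "$current" = "$expected" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for '$expected' (last value: '$current')" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical wrapper around the status command from the prerequisites.
# Assumption: .status is the bare string "ACTIVE"; adjust if it is prefixed.
engine_status() {
  dodil k3 vector store get -b "$1" -o json | jq -r '.status'
}
```

With both functions loaded, wait_for_value ACTIVE 300 5 engine_status kb-prod would poll every 5 seconds for up to 5 minutes. The same helper can wrap any other status command in this recipe.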
1. Browse + pick a template
dodil k3 vector templates -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'
For text + PDF + docx ingest, pick text_embedding_index. Inspect its contract to see what (if any) template_inputs it needs:
dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'
text_embedding_index has no required runtime inputs; every input has a contract default. Other templates may require inputs:
| Template | Required template_inputs | Notes |
|---|---|---|
| text_embedding_index | none | Chunk size + embed model come from contract defaults |
| code_embedding_index | none | Language auto-detected from content_type / extension |
| visual_embedding_index | none | Modality auto-routed (image / video / audio / pdf) |
| face_embedding_index | none | SCRFD face detection, then embed each face crop |
| object_embedding_index | labels: [string] | Open-vocabulary detection; you supply the label vocabulary |
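object_embedding_index is the only template in the table with a required input. As a minimal sketch, its request body could be assembled with jq so the label list is always valid JSON. The helper name build_pipeline_body is hypothetical, the field names mirror the API call used in step 2 of this recipe, and the --args / $ARGS.positional feature needs jq 1.6 or newer. Rejecting an empty label list is an assumption of this sketch, not a documented rule.

```shell
# build_pipeline_body BUCKET NAME LABEL...
# Hypothetical helper: prints a JSON request body for object_embedding_index.
# Field names (bucket, name, templateId, templateInputs.labels) follow the
# API call in this recipe. Requires jq >= 1.6 for --args / $ARGS.positional.
build_pipeline_body() {
  local bucket="$1" name="$2"
  shift 2
  # Assumption: an empty label vocabulary is treated as a mistake here.
  if [ "$#" -eq 0 ]; then
    echo "object_embedding_index expects at least one label" >&2
    return 1
  fi
  jq -cn --arg bucket "$bucket" --arg name "$name" \
    '{bucket: $bucket, name: $name,
      templateId: "object_embedding_index",
      templateInputs: {labels: $ARGS.positional}}' \
    --args "$@"
}
```

The output can be passed to curl with -d "$(build_pipeline_body kb-prod products bottle bag shoe)", which avoids hand-editing quoted JSON when the vocabulary changes.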
2. Create the collection (pipeline-mode)
For text_embedding_index (no required inputs), the CLI works directly:
dodil k3 vector collection add docs -b kb-prod \
--description "PDF / docx / HTML embeddings" \
--template text_embedding_index \
-o json
export COLLECTION_ID=$(dodil k3 vector collection get docs -b kb-prod -o json | jq -r '.collectionId')
For object_embedding_index (requires labels), use the API:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/pipelines" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"name": "products",
"description": "Product image objects",
"templateId": "object_embedding_index",
"templateInputs": {
"labels": ["bottle", "bag", "shoe", "watch", "jewelry"]
}
}'
What K3 atomically created
AddVectorPipeline does three things in one call:
| Created | Where it lives | What it does |
|---|---|---|
| docs collection row | Vector service | Schema is lazy: the Milvus collection materializes on first ingest |
| A Scriptum pipeline | Pipelines service | Bound to text_embedding_index; destination = the new collection |
| An auto-generated ingest rule | Pipelines service | Globs from the template’s acceptedExtensions (**/*.pdf, **/*.txt, …) |
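To reason about whether a given object key would be picked up by the auto-generated rule, a local check like the one below can help. It is only a sketch: it assumes a leading **/ means "at any depth, including the bucket root" (gitignore-style), which may not be exactly how K3 evaluates its globs, and key_matches_pattern is a hypothetical name.

```shell
# key_matches_pattern KEY PATTERN
# Returns 0 if KEY matches PATTERN, 1 otherwise.
# Assumption: a leading "**/" matches any directory depth, including none.
key_matches_pattern() {
  local key="$1" pattern="$2"
  local base rest
  case "$pattern" in
    '**/'*)
      # Compare only the final path segment with the rest of the pattern.
      base="${key##*/}"
      rest="${pattern#\*\*/}"
      case "$base" in
        $rest) return 0 ;;
      esac
      return 1
      ;;
    *)
      # No "**/" prefix: match the whole key; "*" also crosses "/" here.
      case "$key" in
        $pattern) return 0 ;;
      esac
      return 1
      ;;
  esac
}
```

For example, key_matches_pattern papers/attention.pdf '**/*.pdf' succeeds, while a .png key checked against the same pattern does not. The service's own rule listing remains the source of truth for the patterns actually in effect.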
Inspect the bound pipeline:
PIPELINE_ID=$(dodil k3 vector collection get docs -b kb-prod -o json | jq -r '.embedPipelineId')
# The pipeline
dodil k3 pipeline get "$PIPELINE_ID" -b kb-prod -o json \
| jq '{name, scriptumTemplate, storeEntityKind, storeEntityName}'
# The auto-rule
dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.rules[] | {ruleId, name, includePatterns, enabled}'
3. Upload documents
# A PDF
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-prod -k papers/attention.pdf
# Plain text
cat > intro.txt <<'EOF'
The Transformer architecture introduced multi-head attention as a way for
the model to jointly attend to information from different representation
subspaces at different positions.
EOF
dodil k3 object create ./intro.txt -b kb-prod -k papers/intro.txt
Each upload fires through the auto-generated rule → spawns an ingest job → runs text_embedding_index → writes embeddings to the docs collection.
4. Watch the ingest jobs
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {object: .object.key, status, chunksCreated, embeddingsWritten}'
Status path: PENDING → PROCESSING → COMPLETED. When done:
[
{"object": "papers/attention.pdf", "status": "INGEST_STATUS_COMPLETED", "chunksCreated": 47, "embeddingsWritten": 47},
{"object": "papers/intro.txt", "status": "INGEST_STATUS_COMPLETED", "chunksCreated": 1, "embeddingsWritten": 1}
]
chunksCreated == embeddingsWritten is the happy path. If embeddingsWritten is lower, see Pipelines → Replay & Retry.
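The happy-path check can be automated with a small jq filter over the job listing. This is a sketch that relies only on the fields shown in the output above (status, chunksCreated, embeddingsWritten, object.key); the function name incomplete_jobs is hypothetical.

```shell
# incomplete_jobs: reads the JSON printed by
#   dodil k3 ingest jobs -b <bucket> -p <pipeline> -o json
# on stdin and prints the object key of every job that is either not yet
# COMPLETED or wrote fewer embeddings than it created chunks.
incomplete_jobs() {
  jq -r '.jobs[]
    | select(.status != "INGEST_STATUS_COMPLETED"
             or .embeddingsWritten < .chunksCreated)
    | .object.key'
}
```

Piping the jobs command from this step into incomplete_jobs prints nothing once every job is complete and fully embedded; any key it does print is a candidate for the replay flow.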
5. Search
# Simple text search
dodil k3 search "what is multi-head attention" -b kb-prod -c docs -o json \
  | jq '.results[] | {score, object: .object.key, content}'
For richer queries (hybrid mode, rerank, pre-filter, include_content), drop to the API (see Recipes → Hybrid + Rerank):
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"text": "what is multi-head attention",
"collectionName": "docs",
"topK": 5,
"searchMode": "SEARCH_MODE_AUTO",
"rerank": true,
"includeContent": true
  }' | jq '.results[] | {score, object: .object.key, content: (.content | .[0:200])}'
6. Inspect the resolved schema
After first ingest, the Milvus collection has materialized. Confirm the schema (dimensions, embed_model, sparse_mode) — useful for sanity checks before doing multi-collection search:
dodil k3 vector collection get docs -b kb-prod -o json \
| jq '{
status, dimensions, embeddingType, distanceMetric,
sparseMode, embedModel, modality, milvusCollection
    }'
For text_embedding_index at the time of writing:
{
"status": "COLLECTION_STATUS_ACTIVE",
"dimensions": 1024,
"embeddingType": "EMBEDDING_TYPE_FLOAT",
"distanceMetric": "DISTANCE_METRIC_COSINE",
"sparseMode": "SPARSE_MODE_BM25",
"embedModel": "jina-embeddings-v4",
"modality": "text",
"milvusCollection": "kb_prod_docs_a1b2"
}
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| vector collection add fails with engine not active | Engine still PROVISIONING | dodil k3 vector store get -b kb-prod -o json \| jq .status; wait for ACTIVE |
| Upload succeeds but no ingest job spawns | The auto-generated rule's globs don't match the path/extension | List the rule (see step 2), confirm includePatterns covers your path |
| Jobs COMPLETED but search returns nothing | Milvus index still building (the first ingest is slower) | Wait 10–30 s after the first ingest, then re-search |
| add rejects with "template requires field 'labels'" | object_embedding_index: labels is required | Use the API call with templateInputs.labels |
| Need to pause ingestion without deleting | Use the bound rule's enabled flag | dodil k3 ingest update $RULE_ID -b kb-prod --enabled=false |
| Re-uploaded same key → duplicate embeddings | K3 fires the pipeline on every PUT (incl. overwrites) | UpsertVectors semantics are not available for pipeline-mode collections; to deduplicate, post-process via the external-collection flow |
Variations — other vector templates
Same recipe, different template:
| Want to index | Template | Notes |
|---|---|---|
| Source code | code_embedding_index | AST-aware chunking via tree-sitter (rust / python / js / ts / go / java / cpp / …) |
| Mixed media (images / video / audio / PDF page renders) | visual_embedding_index | Multimodal — useful for combined visual + textual recall |
| Faces in photos | face_embedding_index | SCRFD detection + per-face crops + embed |
| Objects in images (open-vocab) | object_embedding_index | Requires labels runtime input |
All four are pipeline-mode with the same atomic create + auto-rule flow.
Cleanup
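The cleanup commands reference $RULE_ID, which none of the earlier steps set. One way to capture it, sketched here, is to pull ruleId out of the same ingest list output used in step 2. The function name rule_ids is hypothetical, and taking the first rule assumes the pipeline has only its single auto-generated rule.

```shell
# rule_ids: reads the JSON printed by
#   dodil k3 ingest list -b <bucket> -p <pipeline> -o json
# on stdin and prints one ruleId per line.
rule_ids() {
  jq -r '.rules[].ruleId'
}

# Intended use (needs a live bucket, so it is left commented out):
#   RULE_ID=$(dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json \
#     | rule_ids | head -n 1)
```

If more rules have been attached to the pipeline by hand, review the full list instead of taking the first entry.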
# 1. Disable the rule first (stops new ingests)
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=false
# 2. Delete the rule + pipeline + collection (any order)
dodil k3 ingest delete "$RULE_ID" -b kb-prod
dodil k3 pipeline delete "$PIPELINE_ID" -b kb-prod
dodil k3 vector collection delete "$COLLECTION_ID" -b kb-prod
# 3. (Optional) Remove the uploaded objects, then delete the bucket
dodil k3 object remove papers/attention.pdf -b kb-prod
dodil k3 object remove papers/intro.txt -b kb-prod
dodil k3 bucket delete kb-prod
See also
- External Collection — opposite shape: you own the embedding pipeline, push vectors directly
- Multi-collection Search — search docs + code + assets in one query
- Hybrid + Rerank — improve precision on the pipeline-mode collection from this recipe
- Pipelines → Replay & Retry — recover from failed ingests
- Templates — API Reference — vector catalog spec