Pipeline Collection

Goal: stand up a searchable vector collection where every uploaded document gets chunked + embedded + indexed automatically — no manual ingestion code.

Template used: text_embedding_index — the canonical text/PDF/docx/HTML/audio/video RAG-ingest template. The same pattern works for any *_embedding_index template in the vector catalog.

Shape:


   document upload  ──►  bucket  ──►  auto-generated ingest rule matches
                                                  │
                                                  ▼
                                          Scriptum pipeline runs
                                       (chunks + embeds via Ignite)
                                                  │
                                                  ▼
                                          Milvus collection
                                  (lazy-materialized on first ingest)
                                                  │
                                                  ▼
                                              Search RPC

Prerequisites

dodil CLI installed and dodil login completed (see CLI Basics)

A bucket — kb-prod:


dodil k3 bucket create kb-prod -d "RAG corpus"

Vector engine configured on the bucket (auto mode):


dodil k3 vector store create -b kb-prod -m auto
# Wait for ACTIVE
dodil k3 vector store get -b kb-prod -o json | jq '.status'
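
The "Wait for ACTIVE" step can be scripted as a small polling loop. The following is a minimal sketch rather than a CLI feature: wait_for_status and store_status are hypothetical helper names. The suffix match is deliberate, since the exact status string (plain ACTIVE versus a prefixed enum value) is an assumption here.

```shell
# wait_for_status WANT MAX_TRIES INTERVAL_SECS CMD [ARGS...]
# Runs CMD repeatedly until its output ends with WANT, or MAX_TRIES is used up.
wait_for_status() {
  local want="$1" max_tries="$2" interval="$3"
  shift 3
  local try=1 status=""
  while [ "$try" -le "$max_tries" ]; do
    # A failing CMD is treated as "no status yet" and retried.
    status=$("$@" 2>/dev/null) || status=""
    case "$status" in
      *"$want") return 0 ;;
    esac
    if [ "$try" -lt "$max_tries" ]; then
      sleep "$interval"
    fi
    try=$((try + 1))
  done
  echo "gave up waiting for $want (last status: ${status:-unknown})" >&2
  return 1
}

# Hypothetical wrapper around the status command shown above.
store_status() {
  dodil k3 vector store get -b "$1" -o json | jq -r '.status'
}

# Usage: wait_for_status ACTIVE 60 5 store_status kb-prod
```

Passing the status command as arguments keeps the loop reusable for other resources that expose a status field.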

1. Browse + pick a template


dodil k3 vector templates -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'

For text + PDF + docx ingest, pick text_embedding_index. Inspect its contract to see what (if any) template_inputs it needs:


dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'

text_embedding_index has no required runtime inputs — everything has a contract default. Other templates may require inputs:

Template	Required `template_inputs`	Notes
`text_embedding_index`	none	Chunk size + embed model come from contract defaults
`code_embedding_index`	none	Language auto-detected from `content_type` / extension
`visual_embedding_index`	none	Modality auto-routed (image / video / audio / pdf)
`face_embedding_index`	none	SCRFD face detection then embed each face crop
`object_embedding_index`	`labels: [string]`	Open-vocabulary detection — you supply the label vocabulary

2. Create the collection (pipeline-mode)

For text_embedding_index (no required inputs), the CLI works directly:


dodil k3 vector collection add docs -b kb-prod \
  --description "PDF / docx / HTML embeddings" \
  --template text_embedding_index \
  -o json
 
export COLLECTION_ID=$(dodil k3 vector collection get docs -b kb-prod -o json | jq -r '.collectionId')

For object_embedding_index (requires labels), use the API:


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/pipelines" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "name": "products",
    "description": "Product image objects",
    "templateId": "object_embedding_index",
    "templateInputs": {
      "labels": ["bottle", "bag", "shoe", "watch", "jewelry"]
    }
  }'
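
When the label vocabulary comes from a variable or a file, hand-writing JSON inside single quotes gets error-prone. The sketch below builds the same request body with jq so that quoting and escaping are handled for you. pipeline_body is a hypothetical helper; the field names are copied from the request above, and it assumes jq 1.6 or later for --args.

```shell
# pipeline_body BUCKET NAME DESCRIPTION TEMPLATE_ID LABEL...
# Prints the JSON request body; every LABEL becomes one entry in templateInputs.labels.
pipeline_body() {
  local bucket="$1" name="$2" description="$3" template_id="$4"
  shift 4
  jq -n \
    --arg bucket "$bucket" \
    --arg name "$name" \
    --arg description "$description" \
    --arg templateId "$template_id" \
    '{bucket: $bucket, name: $name, description: $description,
      templateId: $templateId, templateInputs: {labels: $ARGS.positional}}' \
    --args "$@"
}

# Usage (pipes the body into the same endpoint as above):
#   pipeline_body kb-prod products "Product image objects" object_embedding_index bottle bag shoe \
#     | curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/pipelines" \
#         -H "Authorization: Bearer $DODIL_TOKEN" \
#         -H "Content-Type: application/json" -d @-
```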

What K3 atomically created

AddVectorPipeline does three things in one call:

Created	Where it lives	What it does
`docs` collection row	Vector service	Schema lazy — Milvus collection materializes on first ingest
A Scriptum pipeline	Pipelines service	Bound to `text_embedding_index`; destination = the new collection
An auto-generated ingest rule	Pipelines service	Globs from the template’s `acceptedExtensions` (`**/*.pdf`, `**/*.txt`, …)

Inspect the bound pipeline:


PIPELINE_ID=$(dodil k3 vector collection get docs -b kb-prod -o json | jq -r '.embedPipelineId')
 
# The pipeline
dodil k3 pipeline get "$PIPELINE_ID" -b kb-prod -o json \
  | jq '{name, scriptumTemplate, storeEntityKind, storeEntityName}'
 
# The auto-rule
dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.rules[] | {ruleId, name, includePatterns, enabled}'
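
Before uploading, it can help to check locally whether an object key would be covered by the rule's includePatterns. The helper below is only a rough approximation using the shell's own pattern matching: key_matches_any is a hypothetical name, the patterns in the usage line are illustrative, and K3's server-side glob semantics may differ (for example, in this sketch a pattern containing "/" never matches a key without one).

```shell
# key_matches_any KEY PATTERN...
# Returns 0 if KEY matches at least one shell-style PATTERN, 1 otherwise.
key_matches_any() {
  local key="$1" pattern
  shift
  for pattern in "$@"; do
    # Unquoted $pattern is treated as a glob by `case`; here `*` also spans "/".
    case "$key" in
      $pattern) return 0 ;;
    esac
  done
  return 1
}

# Usage (patterns would come from .rules[].includePatterns):
#   key_matches_any "papers/attention.pdf" '**/*.pdf' '**/*.txt' && echo "rule should fire"
```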

3. Upload documents


# A PDF
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-prod -k papers/attention.pdf
 
# Plain text
cat > intro.txt <<'EOF'
The Transformer architecture introduced multi-head attention as a way for
the model to jointly attend to information from different representation
subspaces at different positions.
EOF
dodil k3 object create ./intro.txt -b kb-prod -k papers/intro.txt

Each upload fires through the auto-generated rule → spawns an ingest job → runs text_embedding_index → writes embeddings to the docs collection.
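
For more than a handful of files, the same upload command can be wrapped in a loop. This is a sketch under simple assumptions: upload_dir is a hypothetical helper, it only handles files directly inside the directory (no recursion), and it stops at the first failed upload.

```shell
# upload_dir DIR BUCKET PREFIX
# Uploads each regular file directly inside DIR under the key PREFIX/<filename>.
upload_dir() {
  local dir="$1" bucket="$2" prefix="$3" file
  for file in "$dir"/*; do
    # Skips subdirectories and the literal pattern left over when DIR is empty.
    [ -f "$file" ] || continue
    dodil k3 object create "$file" -b "$bucket" -k "$prefix/$(basename "$file")" || return 1
  done
}

# Usage: upload_dir ./papers kb-prod papers
```

Each uploaded file goes through the auto-generated rule independently, so every file gets its own ingest job.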

4. Watch the ingest jobs


dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '[.jobs[] | {object: .object.key, status, chunksCreated, embeddingsWritten}]'

Status path: PENDING → PROCESSING → COMPLETED. In JSON output the value carries an INGEST_STATUS_ prefix. When done:


[
  {"object": "papers/attention.pdf", "status": "INGEST_STATUS_COMPLETED", "chunksCreated": 47, "embeddingsWritten": 47},
  {"object": "papers/intro.txt",      "status": "INGEST_STATUS_COMPLETED", "chunksCreated": 1,  "embeddingsWritten": 1}
]

chunksCreated == embeddingsWritten is the happy path. If embeddingsWritten is lower, see Pipelines → Replay & Retry.
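
The happy-path check can be automated with a jq filter over the jobs response. A minimal sketch, assuming the response shape used in step 4 (.jobs[] with object.key, chunksCreated, embeddingsWritten); incomplete_jobs is a hypothetical helper. It treats a missing counter as 0 and accepts counters encoded as strings, both of which are defensive assumptions.

```shell
# incomplete_jobs: reads the `ingest jobs` JSON on stdin and prints one
# tab-separated line per job that wrote fewer embeddings than it created chunks.
incomplete_jobs() {
  jq -r '.jobs[]
    | ((.chunksCreated // 0) | tonumber) as $chunks
    | ((.embeddingsWritten // 0) | tonumber) as $written
    | select($written < $chunks)
    | "\(.object.key)\t\($chunks - $written) missing"'
}

# Usage:
#   dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json | incomplete_jobs
```

No output means every listed job wrote one embedding per chunk.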

5. Search


# Simple text search
dodil k3 search "what is multi-head attention" -b kb-prod -c docs -o json \
  | jq '.results[] | {score, object: .object.key, content}'

For richer queries (hybrid mode, rerank, pre-filter, include_content), drop to the API (see Recipes → Hybrid + Rerank):


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "text": "what is multi-head attention",
    "collectionName": "docs",
    "topK": 5,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true,
    "includeContent": true
  }' | jq '.results[] | {score, object: .object.key, content: (.content | .[0:200])}'

6. Inspect the resolved schema

After first ingest, the Milvus collection has materialized. Confirm the schema (dimensions, embed_model, sparse_mode) — useful for sanity checks before doing multi-collection search:


dodil k3 vector collection get docs -b kb-prod -o json \
  | jq '{
      status, dimensions, embeddingType, distanceMetric,
      sparseMode, embedModel, modality, milvusCollection
    }'

For text_embedding_index at the time of writing:


{
  "status": "COLLECTION_STATUS_ACTIVE",
  "dimensions": 1024,
  "embeddingType": "EMBEDDING_TYPE_FLOAT",
  "distanceMetric": "DISTANCE_METRIC_COSINE",
  "sparseMode": "SPARSE_MODE_BM25",
  "embedModel": "jina-embeddings-v4",
  "modality": "text",
  "milvusCollection": "kb_prod_docs_a1b2"
}
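
For the multi-collection sanity check, the relevant fields can be reduced to a single comparable string. This is a sketch only: schema_fingerprint and same_schema are hypothetical helpers, and the choice of fields that must agree (dimensions, embedModel, distanceMetric, sparseMode) is an assumption, not a documented requirement.

```shell
# schema_fingerprint: reads `vector collection get ... -o json` output on stdin
# and prints the compared fields joined with "|".
schema_fingerprint() {
  jq -r '[.dimensions, .embedModel, .distanceMetric, .sparseMode]
         | map(tostring) | join("|")'
}

# same_schema FILE_A FILE_B: succeeds when both saved JSON files share a fingerprint.
same_schema() {
  [ "$(schema_fingerprint < "$1")" = "$(schema_fingerprint < "$2")" ]
}

# Usage:
#   dodil k3 vector collection get docs -b kb-prod -o json > docs.json
#   dodil k3 vector collection get other -b kb-prod -o json > other.json
#   same_schema docs.json other.json || echo "schemas differ" >&2
```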

Common gotchas

Symptom	Cause	Fix
`vector collection add` fails with `engine not active`	Engine still PROVISIONING	Run `dodil k3 vector store get -b kb-prod -o json | jq .status` and wait for ACTIVE
Upload succeeds but no ingest job spawns	The auto-generated rule’s globs don’t match the path/extension	List the rule (see step 2), confirm `includePatterns` covers your path
Jobs `COMPLETED` but `search` returns nothing	Milvus index still building (the first ingest is slower)	Wait 10–30 s after the first ingest, then re-search
`add` rejects with “template requires field ‘labels’”	`object_embedding_index` — `labels` is required	Use the API call with `templateInputs.labels`
Need to pause ingestion without deleting	Use the bound rule’s enabled flag	`dodil k3 ingest update $RULE_ID -b kb-prod --enabled=false`
Re-uploaded same key → duplicate embeddings	K3 fires the pipeline on every PUT (incl. overwrites)	`UpsertVectors` semantics are not available in pipeline mode; to deduplicate, post-process via the external-collection flow

Variations — other vector templates

Same recipe, different template:

Want to index	Template	Notes
Source code	`code_embedding_index`	AST-aware chunking via tree-sitter (rust / python / js / ts / go / java / cpp / …)
Mixed media (images / video / audio / PDF page renders)	`visual_embedding_index`	Multimodal — useful for combined visual + textual recall
Faces in photos	`face_embedding_index`	SCRFD detection + per-face crops + embed
Objects in images (open-vocab)	`object_embedding_index`	Requires `labels` runtime input

All four are pipeline-mode with the same atomic create + auto-rule flow.

Cleanup


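# 1. Look up the auto-generated rule (assumes it is the only rule bound to
#    this pipeline), then disable it first (stops new ingests)
RULE_ID=$(dodil k3 ingest list -b kb-prod -p "$PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
dodil k3 ingest update "$RULE_ID" -b kb-prod --enabled=false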
 
# 2. Delete the rule + pipeline + collection (any order)
dodil k3 ingest delete "$RULE_ID" -b kb-prod
dodil k3 pipeline delete "$PIPELINE_ID" -b kb-prod
dodil k3 vector collection delete "$COLLECTION_ID" -b kb-prod
 
# 3. (Optional) Remove the uploaded objects, then delete the bucket
dodil k3 object remove papers/attention.pdf -b kb-prod
dodil k3 object remove papers/intro.txt -b kb-prod
dodil k3 bucket delete kb-prod