Multimodal Search
Goal: search a vector collection by an object in your bucket — an image to find similar images, an audio clip to find similar sounds, a video frame to find similar videos. K3 fetches the file, embeds it server-side, then searches.
Template used: visual_embedding_index — handles image / video frames / audio spectrograms / PDF page renders. Also documented: face_embedding_index (faces) and object_embedding_index (open-vocab object detection).
Shape:
query file (S3 key)
│
▼
K3 fetches the object → routes to the right embedder based on content_type
│
▼
server-side embedding (visual / face / object)
│
▼
Search RPC against the collection
│
▼
results (other objects' chunks ranked by visual similarity)
│
▼
(optional) Jina rerank against `rerank_text` for "look like X AND match description Y"
Prerequisites
- `dodil` CLI + `dodil login`
- A bucket with the vector engine configured (`auto` mode)
- A visual-template-backed collection:
  dodil k3 vector collection add product-images -b kb-prod \
    --description "Product photo library" \
    --template visual_embedding_index
- Some images uploaded. The auto-generated ingest rule fires and embeds them:
  dodil k3 object create ./img-001.jpg -b kb-prod -k products/img-001.jpg
  dodil k3 object create ./img-002.jpg -b kb-prod -k products/img-002.jpg
  # ... etc
Verify the ingest jobs completed:
PIPELINE_ID=$(dodil k3 vector collection get product-images -b kb-prod -o json | jq -r '.embedPipelineId')
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {object: .object.key, status, chunksCreated}'
1. Search by an image-in-the-bucket
Upload a query image (or use one already in the bucket), then search using its S3 key:
# Stage a query image
dodil k3 object create ./example-bag.jpg -b kb-prod -k queries/example-bag.jpg
# Search visually similar product images
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "product-images",
"s3Key": "queries/example-bag.jpg",
"contentType": "image/jpeg",
"topK": 20
}' | jq '.results[] | {score, object: .object.key}'
Sample response (top 5):
[
{"score": 0.94, "object": "products/img-042.jpg"},
{"score": 0.91, "object": "products/img-128.jpg"},
{"score": 0.88, "object": "products/img-007.jpg"},
{"score": 0.85, "object": "products/img-091.jpg"},
{"score": 0.81, "object": "products/img-205.jpg"}
]
Why content_type is a hint
K3 routes the file to different embedders based on content_type:
- image/* → visual embedder
- audio/* → audio embedder (spectrogram-based)
- video/* → video embedder (frame sampling)
- application/pdf → PDF page-render embedder
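The routing table above can be mirrored client-side, for example to check which embedder a request is likely to reach before sending it. This is a minimal sketch: the function name and the returned labels are hypothetical, and K3's actual routing logic is not shown in this document.

```shell
# Illustrative mirror of the routing table; not K3's implementation.
# Prints a hypothetical embedder label for a content type, or fails
# (exit status 1) when the type matches none of the documented routes.
embedder_for_content_type() {
  local ct
  # MIME types are case-insensitive, so normalize to lowercase first.
  ct=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ct" in
    image/*)         echo "visual" ;;
    audio/*)         echo "audio" ;;
    video/*)         echo "video" ;;
    application/pdf) echo "pdf-page-render" ;;
    *)               return 1 ;;
  esac
}
```

A generic type such as application/octet-stream falls through to the failing branch, which is the situation the request-side hint exists for.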
K3 tries the object’s stored content type first; if that is missing or wrong, the contentType field on the request overrides it. A common case: your uploads were stored as application/octet-stream instead of image/jpeg, and the request-side hint corrects the routing:
# Object stored without content_type; tell K3 it's a JPEG
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "product-images",
"s3Key": "raw-uploads/some-file",
"contentType": "image/jpeg",
"topK": 20
}'
2. Combine visual + text: rerank_text
visual_embedding_index only ranks by visual similarity. To also match a textual description (“brown bag with gold hardware”), supply rerank_text (rerankText in the JSON body); the reranker scores each result against the text:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "product-images",
"s3Key": "queries/example-bag.jpg",
"contentType": "image/jpeg",
"topK": 20,
"rerank": true,
"rerankText": "brown leather handbag with gold hardware"
}'
This is the pattern for “products that look like this AND match this description” workflows: image-first retrieval (high recall on visual similarity), then cross-modal reranking (high precision on text match).
Without rerank_text, rerank: true errors out on s3_key queries: the reranker needs text to score against.
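Because a rerank request without text fails server-side, it can be worth catching that case before the request is sent. The sketch below builds the JSON body with plain printf and refuses to produce one when the rerank text is empty. Both function names are hypothetical helpers, and the escaping only covers backslashes and double quotes.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Control characters are NOT handled; this is a sketch, not a full encoder.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: print a file-query search body with rerank enabled.
# Args: bucket collection s3_key content_type top_k rerank_text
build_rerank_search_body() {
  local bucket="$1" collection="$2" s3_key="$3"
  local content_type="$4" top_k="$5" rerank_text="${6-}"
  if [ -z "$rerank_text" ]; then
    echo "rerankText is required when rerank is true on an s3Key query" >&2
    return 1
  fi
  printf '{"bucket":"%s","collectionName":"%s","s3Key":"%s","contentType":"%s","topK":%d,"rerank":true,"rerankText":"%s"}\n' \
    "$(json_escape "$bucket")" "$(json_escape "$collection")" \
    "$(json_escape "$s3_key")" "$(json_escape "$content_type")" \
    "$top_k" "$(json_escape "$rerank_text")"
}
```

The output can be piped into the same curl call used above by replacing the inline body with `-d @-`, which makes curl read the request body from standard input.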
3. Audio + video queries
Same RPC, different modality. Setup:
# Audio collection — uses visual_embedding_index too (spectrogram-based for audio)
dodil k3 vector collection add audio-library -b kb-prod \
--template visual_embedding_index
# Ingest audio clips
dodil k3 object create ./clip-01.mp3 -b kb-prod -k audio/clip-01.mp3
# Wait for ingest...
Query by audio:
dodil k3 object create ./query-clip.wav -b kb-prod -k queries/query-clip.wav
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "audio-library",
"s3Key": "queries/query-clip.wav",
"contentType": "audio/wav",
"topK": 10
}' | jq '.results[] | {score, object: .object.key}'
Video queries work the same way: K3 samples frames and embeds them. This example assumes a video-library collection already exists:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "video-library",
"s3Key": "queries/example-clip.mp4",
"contentType": "video/mp4",
"topK": 10
}'
4. Face search (face_embedding_index)
For a face-recognition use case, use the face-specific template: SCRFD detects faces, and each face crop is then embedded:
# Index a face library
dodil k3 vector collection add faces -b kb-prod \
--template face_embedding_index
# Upload member photos
dodil k3 object create ./member-001.jpg -b kb-prod -k members/member-001.jpg
dodil k3 object create ./member-002.jpg -b kb-prod -k members/member-002.jpg
# ... etc
# Wait for ingest
PIPELINE_ID=$(dodil k3 vector collection get faces -b kb-prod -o json | jq -r '.embedPipelineId')
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json | jq '.jobs[] | .status' | sort -u
# Search — upload a query photo
dodil k3 object create ./unknown-face.jpg -b kb-prod -k queries/unknown-face.jpg
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "faces",
"s3Key": "queries/unknown-face.jpg",
"contentType": "image/jpeg",
"topK": 5,
"minScore": 0.75
}'
Set min_score (minScore in the JSON body) aggressively for face matching: false positives at lower scores are costly. 0.75–0.85 is a typical threshold for “same person”, depending on the embedder.
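When calibrating a threshold, it can help to fetch results once with a permissive minScore and then see what survives at stricter cut-offs locally. The sketch below assumes the matches have been flattened to one "score<TAB>key" line each, for instance with `jq -r '.results[] | [.score, .object.key] | @tsv'`. The function name is hypothetical.

```shell
# Keep only matches whose score is at or above the given threshold.
# stdin:  one "score<TAB>key" line per match
# $1:     minimum score (e.g. 0.80)
filter_by_min_score() {
  # "+ 0" forces numeric comparison rather than string comparison.
  awk -F '\t' -v min="$1" '$1 + 0 >= min + 0 { print }'
}

# Example: count how many candidates survive at each threshold.
# for t in 0.75 0.80 0.85; do
#   printf '%s\t%s\n' "$t" "$(filter_by_min_score "$t" < matches.tsv | wc -l)"
# done
```

Here matches.tsv is a placeholder for wherever the flattened results were saved.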
5. Object detection — open-vocabulary
object_embedding_index lets you detect objects with a caller-defined vocabulary — useful when you want to index products / parts / specific things rather than COCO classes:
# Collection requires `labels` template input — use API
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/pipelines" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"name": "products-by-object",
"templateId": "object_embedding_index",
"templateInputs": {
"labels": ["bottle", "bag", "shoe", "watch", "jewelry", "headphones"]
}
}'
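# Upload and search work the same way as with visual_embedding_index
dodil k3 object create ./product-shelf.jpg -b kb-prod -k products/shelf-1.jpg
# The query below assumes a query image is already staged at queries/find-bags.jpg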
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "products-by-object",
"s3Key": "queries/find-bags.jpg",
"topK": 20
}'
Detected objects and their crops are embedded; the search returns object-level matches, not whole-image matches.
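Since matches are per detected object, one source image may plausibly appear several times in a result list. If a whole-image ranking is wanted, the matches can be collapsed client-side to the best score per source key. This is a sketch under that assumption; the function name is hypothetical, and the input is again one "score<TAB>key" line per match.

```shell
# Collapse object-level matches to the best-scoring match per source key,
# sorted by score in descending order.
# stdin: one "score<TAB>key" line per match
best_match_per_image() {
  awk -F '\t' '
    # First sighting of a key, or a better score than the one stored.
    !($2 in best) || $1 + 0 > best[$2] { best[$2] = $1 + 0; raw[$2] = $1 }
    END { for (k in best) printf "%s\t%s\n", raw[k], k }
  ' | LC_ALL=C sort -k1,1nr
}
```

The original score strings are kept as-is so the output shows exactly what the search returned.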
6. Pre-filter + multimodal
pre_filter works on multimodal queries just like text queries. Filter by metadata before visual retrieval:
# Find similar bags, but only from the SS25 collection
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "product-images",
"s3Key": "queries/example-bag.jpg",
"contentType": "image/jpeg",
"topK": 20,
"preFilter": {
"op": "LOGICAL_OP_AND",
"filters": [
{ "field": "season", "op": "FILTER_OP_EQ", "value": "SS25" },
{ "field": "in_stock", "op": "FILTER_OP_EQ", "value": "true" }
]
}
}'
The metadata fields available depend on the template. For visual_embedding_index, the index template emits standard fields (source_key, chunk_index, mime_type, extracted_at). The season and in_stock fields in the example above are not among them; for richer metadata like that, use external collections and supply your own metadata per record.
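Writing the preFilter object by hand gets repetitive once there are several conditions. The sketch below builds an AND filter of equality conditions from field=value arguments, using the LOGICAL_OP_AND and FILTER_OP_EQ names from the example above. The helper names are hypothetical, and other operators are deliberately left out.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Control characters are NOT handled; this is a sketch, not a full encoder.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: print an AND preFilter of equality conditions.
# Args: one or more field=value pairs (the value may itself contain "=").
build_and_prefilter() {
  local sep="" pair field value
  printf '{"op":"LOGICAL_OP_AND","filters":['
  for pair in "$@"; do
    field="${pair%%=*}"   # text before the first "="
    value="${pair#*=}"    # text after the first "="
    printf '%s{"field":"%s","op":"FILTER_OP_EQ","value":"%s"}' \
      "$sep" "$(json_escape "$field")" "$(json_escape "$value")"
    sep=","
  done
  printf ']}\n'
}
```

Calling `build_and_prefilter season=SS25 in_stock=true` prints the same filter object as the request above, on a single line.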
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| INVALID_ARGUMENT: could not embed file | content_type is wrong or missing and the stored content-type is generic (application/octet-stream) | Pass contentType explicitly on the request; it overrides the stored value |
| Image query returns text-collection results | Multi-collection search picked a text group | Set collectionName explicitly OR check collectionStatuses for which group was active |
| Face query returns 0 results despite a clear face | SCRFD didn’t detect a face (too small, low contrast, angled) | Pre-crop the query image to roughly the face region; try a higher-res version |
| rerank: true on s3_key without rerank_text errors | Reranker needs text for binary queries | Supply rerank_text describing what you’re looking for textually |
| Audio search returns 0 results | Audio embedder fails on unsupported codec (rare formats) | Convert to common codecs (mp3, wav, ogg) before uploading |
| Video query is slow | K3 samples + embeds multiple frames | Shorten the query video clip OR pre-extract a representative frame and use image search instead |
| Object-detection collection returns no matches for an obvious object | The object wasn’t in the labels vocabulary passed at create time | Recreate the collection with the additional label — open-vocabulary models only detect what you tell them to |
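The fix in the first row (passing contentType explicitly) can be automated when object keys carry a file extension. The sketch below guesses a content type from the key; the function name and the extension list are illustrative assumptions, not a list of formats K3 supports.

```shell
# Guess a MIME type from an object key's extension.
# Fails (exit status 1) when the key has no recognized extension.
content_type_for_key() {
  local ext
  case "$1" in
    *.*) ;;             # has at least one dot; inspect the extension
    *)   return 1 ;;    # no extension at all
  esac
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg) echo "image/jpeg" ;;
    png)      echo "image/png" ;;
    mp3)      echo "audio/mpeg" ;;
    wav)      echo "audio/wav" ;;
    ogg)      echo "audio/ogg" ;;
    mp4)      echo "video/mp4" ;;
    pdf)      echo "application/pdf" ;;
    *)        return 1 ;;
  esac
}
```

For a key without an extension, such as raw-uploads/some-file, the function fails and the caller has to supply the type from another source.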
Performance notes
| Operation | Typical latency |
|---|---|
| Single image embed + search | ~150–300 ms (embedding dominates) |
| Audio embed + search (30 s clip) | ~400–800 ms |
| Video embed + search (5 s clip, 8 frames sampled) | ~700–1500 ms |
| + rerank: true (top-K 20) | + 100–250 ms |
Server-side embedding of the query file is the long pole for multimodal queries — much slower than text queries. For latency-sensitive workflows, pre-embed query files on your side and use the vector query shape via External Collection.
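As a worked example of reading the table, the rerank row is additive, so a rough end-to-end budget is the sum of the two ranges. The helper below is a hypothetical convenience for that arithmetic and uses only the figures from the table.

```shell
# Add two latency ranges, each given as "low high" in milliseconds.
# Args: base_low base_high extra_low extra_high
add_latency_ranges() {
  echo "$(( $1 + $3 ))-$(( $2 + $4 )) ms"
}

add_latency_ranges 150 300 100 250    # image search + rerank
add_latency_ranges 700 1500 100 250   # video search + rerank
```

This prints 250-550 ms for an image query with rerank and 800-1750 ms for a video query with rerank.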
See also
- Search — API Reference → Shape 3 (file query): spec for s3_key
- Search — API Reference → Rerank: rerank_text for binary queries
- Pipeline Collection: same recipe shape for text-based collections
- External Collection: for when you want to control query-side embedding yourself
- Hybrid + Rerank: adapt the rerank tier for multimodal use cases
- Templates — API Reference: vector catalog (visual / face / object index templates)