Multimodal Search

Goal: search a vector collection by an object in your bucket — an image to find similar images, an audio clip to find similar sounds, a video frame to find similar videos. K3 fetches the file, embeds it server-side, then searches.

Template used: visual_embedding_index — handles image / video frames / audio spectrograms / PDF page renders. Also documented: face_embedding_index (faces) and object_embedding_index (open-vocab object detection).

Shape:

    query file (S3 key)
      ↓
    K3 fetches the object → routes to the right embedder based on content_type
      ↓
    server-side embedding (visual / face / object)
      ↓
    Search RPC against the collection
      ↓
    results (other objects' chunks ranked by visual similarity)
      ↓
    (optional) Jina rerank against `rerank_text` for "look like X AND match description Y"

Prerequisites

  • dodil CLI + dodil login

  • A bucket with the vector engine configured (auto mode)

  • A visual-template-backed collection:

    dodil k3 vector collection add product-images -b kb-prod \
      --description "Product photo library" \
      --template visual_embedding_index
  • Some images uploaded — the auto-generated ingest rule fires and embeds them:

    dodil k3 object create ./img-001.jpg -b kb-prod -k products/img-001.jpg
    dodil k3 object create ./img-002.jpg -b kb-prod -k products/img-002.jpg
    # ... etc

Verify the ingest jobs completed:

    PIPELINE_ID=$(dodil k3 vector collection get product-images -b kb-prod -o json | jq -r '.embedPipelineId')
    dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
      | jq '.jobs[] | {object: .object.key, status, chunksCreated}'
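Rather than re-running that check by hand, a script can poll the job list until every job reports the same terminal status. The following is a minimal sketch of that idea. The helper names `all_jobs_done` and `wait_for_ingest` are hypothetical, and the status string `COMPLETED` is an assumption (this page does not list the status values), so substitute whatever your jobs actually report.

```shell
# all_jobs_done DONE_STATUS
# Reads one job status per line on stdin. Succeeds only if at least one
# status was read and every status equals DONE_STATUS.
all_jobs_done() {
  local done_status=$1 seen=0 status
  while IFS= read -r status; do
    seen=1
    [ "$status" = "$done_status" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# wait_for_ingest BUCKET PIPELINE_ID [DONE_STATUS]
# Polls the ingest job list every 5 seconds, up to 60 times, and fails if
# the jobs never all reach DONE_STATUS.
# "COMPLETED" is an assumed status value, not one confirmed by this page.
wait_for_ingest() {
  local tries=0
  until dodil k3 ingest jobs -b "$1" -p "$2" -o json \
        | jq -r '.jobs[].status' \
        | all_jobs_done "${3:-COMPLETED}"; do
    tries=$((tries + 1))
    [ "$tries" -lt 60 ] || return 1
    sleep 5
  done
}
```

Typical use would be `wait_for_ingest kb-prod "$PIPELINE_ID"` before running the first search. A job stuck in a failed state makes the loop give up rather than spin forever.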

1. Search by an image-in-the-bucket

Upload a query image (or use one already in the bucket), then search using its S3 key:

    # Stage a query image
    dodil k3 object create ./example-bag.jpg -b kb-prod -k queries/example-bag.jpg

    # Search visually similar product images
    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "product-images",
        "s3Key": "queries/example-bag.jpg",
        "contentType": "image/jpeg",
        "topK": 20
      }' | jq '.results[] | {score, object: .object.key}'

Sample response (top 5; jq prints one object per result, collected into an array here for brevity):

    [
      {"score": 0.94, "object": "products/img-042.jpg"},
      {"score": 0.91, "object": "products/img-128.jpg"},
      {"score": 0.88, "object": "products/img-007.jpg"},
      {"score": 0.85, "object": "products/img-091.jpg"},
      {"score": 0.81, "object": "products/img-205.jpg"}
    ]
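When the same search is issued for many query files, it can help to wrap the request in a small function. The sketch below reuses the endpoint and body fields from the request above. The names `search_body` and `k3_search` are hypothetical, the bucket name is assumed to be the first path segment of the URL (as it is in the example), and the plain `printf` templating assumes that keys and names contain no double quotes or backslashes.

```shell
# search_body BUCKET COLLECTION S3_KEY CONTENT_TYPE [TOP_K]
# Prints the JSON request body. Values are inserted verbatim, so they must
# not contain double quotes or backslashes (an assumption of this sketch).
search_body() {
  printf '{"bucket": "%s", "collectionName": "%s", "s3Key": "%s", "contentType": "%s", "topK": %d}\n' \
    "$1" "$2" "$3" "$4" "${5:-20}"
}

# k3_search BUCKET COLLECTION S3_KEY CONTENT_TYPE [TOP_K]
# Sends the body to the search endpoint used on this page.
# Assumes the bucket name is the first path segment of the URL.
k3_search() {
  curl -sS -X POST "https://k3.dev.dodil.io/$1/vector/search" \
    -H "Authorization: Bearer $DODIL_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(search_body "$@")"
}
```

For example, `k3_search kb-prod product-images queries/example-bag.jpg image/jpeg 5 | jq '.results[] | {score, object: .object.key}'` would issue the same query as above with a top-K of 5.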

Why content_type is a hint

K3 routes the file to different embedders based on content_type:

  • image/* → visual embedder
  • audio/* → audio embedder (spectrogram-based)
  • video/* → video embedder (frame sampling)
  • application/pdf → PDF page-render embedder

K3 uses the object's stored content-type by default; if that is missing or wrong, the contentType field on the request overrides it. A common case: your uploads were stored as application/octet-stream instead of image/jpeg, and the request-side hint corrects that:

    # Object stored without content_type; tell K3 it's a JPEG
    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "product-images",
        "s3Key": "raw-uploads/some-file",
        "contentType": "image/jpeg",
        "topK": 20
      }'
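If query keys carry a file extension, the hint can be derived on the client instead of hard-coded. The sketch below is one way to do that. `guess_content_type` and `embedder_for` are hypothetical helper names, the extension list is illustrative rather than K3's list of supported formats, and the embedder labels simply mirror the routing list above.

```shell
# guess_content_type S3_KEY
# Maps a key's file extension to a MIME type for the contentType hint.
# Fails (non-zero) for keys with no recognised extension.
guess_content_type() {
  local key
  key=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$key" in
    *.jpg|*.jpeg) echo "image/jpeg" ;;
    *.png)        echo "image/png" ;;
    *.mp3)        echo "audio/mpeg" ;;
    *.wav)        echo "audio/wav" ;;
    *.ogg)        echo "audio/ogg" ;;
    *.mp4)        echo "video/mp4" ;;
    *.pdf)        echo "application/pdf" ;;
    *)            return 1 ;;
  esac
}

# embedder_for CONTENT_TYPE
# Mirrors the routing list above, so a script can tell up front which
# embedder a given hint would select.
embedder_for() {
  case "$1" in
    image/*)         echo "visual" ;;
    audio/*)         echo "audio" ;;
    video/*)         echo "video" ;;
    application/pdf) echo "pdf" ;;
    *)               return 1 ;;
  esac
}
```

A key such as `raw-uploads/some-file` has no extension, so `guess_content_type` fails for it and the caller has to supply the type explicitly, as in the request above.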

2. Combine visual + text — rerank_text

visual_embedding_index ranks by visual similarity only. To also match a textual description (“brown bag with gold hardware”), set rerank: true and supply rerank_text (rerankText in the JSON body); the reranker then scores each result against the text:

    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "product-images",
        "s3Key": "queries/example-bag.jpg",
        "contentType": "image/jpeg",
        "topK": 20,
        "rerank": true,
        "rerankText": "brown leather handbag with gold hardware"
      }'

This is the pattern for “products that look like this AND match this description” workflows — image-first retrieval (high recall on visual similarity), then cross-modal reranking (high precision on text match).

Without rerank_text, rerank: true errors out on s3_key queries — the reranker needs text to score against.
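A script that assembles requests from variables can catch this mistake before sending anything. The check below is a small sketch of such a guard; `check_rerank_args` is a hypothetical helper, not part of the dodil CLI.

```shell
# check_rerank_args RERANK RERANK_TEXT
# Fails (non-zero) when RERANK is "true" but RERANK_TEXT is empty, so the
# problem is reported locally instead of by the search endpoint.
check_rerank_args() {
  if [ "$1" = "true" ] && [ -z "$2" ]; then
    echo "rerank: true on an s3Key query needs rerankText" >&2
    return 1
  fi
}
```

Calling `check_rerank_args "$RERANK" "$RERANK_TEXT" || exit 1` just before the curl command keeps a missing description from turning into a failed request.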

3. Audio + video queries

Same RPC, different modality. Setup:

    # Audio collection: uses visual_embedding_index too (spectrogram-based for audio)
    dodil k3 vector collection add audio-library -b kb-prod \
      --template visual_embedding_index

    # Ingest audio clips
    dodil k3 object create ./clip-01.mp3 -b kb-prod -k audio/clip-01.mp3

    # Wait for ingest...

Query by audio:

    dodil k3 object create ./query-clip.wav -b kb-prod -k queries/query-clip.wav

    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "audio-library",
        "s3Key": "queries/query-clip.wav",
        "contentType": "audio/wav",
        "topK": 10
      }' | jq '.results[] | {score, object: .object.key}'

Video queries work the same way: K3 samples frames and embeds them. The example assumes a video-library collection created like audio-library above:

    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "video-library",
        "s3Key": "queries/example-clip.mp4",
        "contentType": "video/mp4",
        "topK": 10
      }'

4. Face search (face_embedding_index)

For a face-recognition use case, use the face-specific template — SCRFD detects faces, then embeds each crop:

    # Index a face library
    dodil k3 vector collection add faces -b kb-prod \
      --template face_embedding_index

    # Upload member photos
    dodil k3 object create ./member-001.jpg -b kb-prod -k members/member-001.jpg
    dodil k3 object create ./member-002.jpg -b kb-prod -k members/member-002.jpg
    # ... etc

    # Wait for ingest
    PIPELINE_ID=$(dodil k3 vector collection get faces -b kb-prod -o json | jq -r '.embedPipelineId')
    dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json | jq '.jobs[] | .status' | sort -u

    # Search: upload a query photo
    dodil k3 object create ./unknown-face.jpg -b kb-prod -k queries/unknown-face.jpg

    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "faces",
        "s3Key": "queries/unknown-face.jpg",
        "contentType": "image/jpeg",
        "topK": 5,
        "minScore": 0.75
      }'

Set min_score (minScore in the JSON body) high for face matching, because false positives at lower scores are costly. 0.75–0.85 is a typical threshold for “same person”, depending on the embedder.
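One way to pick a value inside that range is to run a query once with a low minScore, save the scores, and count how many results survive at each candidate threshold. The sketch below assumes the scores have been written one per line (for instance with `jq -r '.results[].score'`); `count_at_least` is a hypothetical helper name.

```shell
# count_at_least THRESHOLD
# Reads one similarity score per line on stdin and prints how many of them
# are greater than or equal to THRESHOLD.
count_at_least() {
  awk -v t="$1" '$1 + 0 >= t + 0 { n++ } END { print n + 0 }'
}

# Example sweep over candidate thresholds, assuming scores.txt was produced
# by: curl ... | jq -r '.results[].score' > scores.txt
#
#   for t in 0.75 0.80 0.85; do
#     printf '%s: %s\n' "$t" "$(count_at_least "$t" < scores.txt)"
#   done
```

A threshold at which known non-matching photos stop appearing while known matches remain is a reasonable starting point for minScore.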

5. Object detection — open-vocabulary

object_embedding_index lets you detect objects with a caller-defined vocabulary — useful when you want to index products / parts / specific things rather than COCO classes:

    # Collection requires the `labels` template input; use the API
    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/pipelines" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "name": "products-by-object",
        "templateId": "object_embedding_index",
        "templateInputs": {
          "labels": ["bottle", "bag", "shoe", "watch", "jewelry", "headphones"]
        }
      }'

    # Upload + search same as visual_embedding_index
    dodil k3 object create ./product-shelf.jpg -b kb-prod -k products/shelf-1.jpg

    # The query image must already exist at queries/find-bags.jpg
    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "products-by-object",
        "s3Key": "queries/find-bags.jpg",
        "topK": 20
      }'

Detected objects + their crops are embedded; the search returns object-level matches, not whole-image matches.
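Because the vocabulary is fixed when the collection is created, it can be convenient to generate the create request from a list of labels kept in a script. The following sketch builds the same body as the pipeline request above. `labels_json` and `pipeline_body` are hypothetical helpers, and the plain string templating assumes labels and names contain no double quotes or backslashes.

```shell
# labels_json LABEL...
# Prints the arguments as a JSON array of strings. Labels are inserted
# verbatim, so they must not contain double quotes or backslashes.
labels_json() {
  local out="" label
  for label in "$@"; do
    out="$out${out:+, }\"$label\""
  done
  printf '[%s]\n' "$out"
}

# pipeline_body BUCKET NAME LABEL...
# Prints the JSON body for creating an object_embedding_index collection
# with the given label vocabulary.
pipeline_body() {
  local bucket=$1 name=$2
  shift 2
  printf '{"bucket": "%s", "name": "%s", "templateId": "object_embedding_index", "templateInputs": {"labels": %s}}\n' \
    "$bucket" "$name" "$(labels_json "$@")"
}
```

The output of `pipeline_body kb-prod products-by-object bottle bag shoe` can be passed to curl with `-d`, which makes adding a label a one-word change.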

6. Pre-filter + multimodal

pre_filter works on multimodal queries just like text queries. Filter by metadata before visual retrieval:

    # Find similar bags, but only from the SS25 collection
    curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
      -H "Authorization: Bearer $DODIL_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "bucket": "kb-prod",
        "collectionName": "product-images",
        "s3Key": "queries/example-bag.jpg",
        "contentType": "image/jpeg",
        "topK": 20,
        "preFilter": {
          "op": "LOGICAL_OP_AND",
          "filters": [
            { "field": "season", "op": "FILTER_OP_EQ", "value": "SS25" },
            { "field": "in_stock", "op": "FILTER_OP_EQ", "value": "true" }
          ]
        }
      }'

The metadata fields available depend on the template. For visual_embedding_index, the index template emits standard fields (source_key, chunk_index, mime_type, extracted_at). Custom fields such as season and in_stock in the example above are not among them; for richer metadata like that, use external collections and supply your own metadata per record.
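The preFilter object in the request above has a regular shape, so it can be generated from `field=value` pairs. This is a sketch under stated assumptions: `and_eq_filter` is a hypothetical helper, it only covers equality clauses joined with LOGICAL_OP_AND (the two operators that appear on this page), and values are inserted verbatim, so they must not contain double quotes or backslashes.

```shell
# and_eq_filter FIELD=VALUE...
# Prints a preFilter object that ANDs one FILTER_OP_EQ clause per argument.
# Everything before the first "=" is the field, everything after it the
# value; values are emitted as JSON strings, as in the example above.
and_eq_filter() {
  local clauses="" pair
  for pair in "$@"; do
    clauses="$clauses${clauses:+, }{\"field\": \"${pair%%=*}\", \"op\": \"FILTER_OP_EQ\", \"value\": \"${pair#*=}\"}"
  done
  printf '{"op": "LOGICAL_OP_AND", "filters": [%s]}\n' "$clauses"
}
```

For example, `and_eq_filter season=SS25 in_stock=true` prints a filter equivalent to the one used in the request above.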

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| INVALID_ARGUMENT: could not embed file | content_type is wrong / missing and stored content-type is generic (application/octet-stream) | Pass contentType explicitly on the request; it overrides the stored value |
| Image query returns text-collection results | Multi-collection search picked a text group | Set collectionName explicitly OR check collectionStatuses for which group was active |
| Face query returns 0 results despite a clear face | SCRFD didn’t detect a face (too small, low contrast, angled) | Pre-crop the query image to ~ face-only; try a higher-res version |
| rerank: true on s3_key without rerank_text errors | Reranker needs text for binary queries | Supply rerank_text describing what you’re looking for textually |
| Audio search returns 0 results | Audio embedder fails on unsupported codec (rare formats) | Convert to common codecs (mp3, wav, ogg) before uploading |
| Video query is slow | K3 samples + embeds multiple frames | Shorten the query video clip OR pre-extract a representative frame and use image search instead |
| Object-detection collection returns no matches for an obvious object | The object wasn’t in the labels vocabulary passed at create time | Recreate the collection with the additional label; open-vocabulary models only detect what you tell them to |

Performance notes

| Operation | Typical latency |
| --- | --- |
| Single image embed + search | ~150–300 ms (embedding dominates) |
| Audio embed + search (30 s clip) | ~400–800 ms |
| Video embed + search (5 s clip, 8 frames sampled) | ~700–1500 ms |
| + rerank: true (top-K 20) | + 100–250 ms |

Server-side embedding of the query file dominates the latency of multimodal queries, which makes them much slower than text queries. For latency-sensitive workflows, pre-embed query files on your side and use the vector query shape via External Collection.
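The figures in the table add up when reranking is enabled, so it is worth checking the worst case against a latency budget before choosing a modality. The arithmetic can be sketched as below; `add_ranges` and `within_budget` are hypothetical helpers, and the numbers passed to them are the typical ranges from the table, not guarantees.

```shell
# add_ranges LO1 HI1 LO2 HI2
# Adds two latency ranges given in milliseconds and prints "LO HI".
add_ranges() {
  echo "$(( $1 + $3 )) $(( $2 + $4 ))"
}

# within_budget WORST_CASE_MS BUDGET_MS
# Succeeds if the worst case fits inside the budget.
within_budget() {
  [ "$1" -le "$2" ]
}
```

With the table's figures, `add_ranges 150 300 100 250` gives `250 550` for an image query with reranking and `add_ranges 700 1500 100 250` gives `800 1750` for a video query with reranking. Against a 600 ms budget, only the image query fits in the worst case.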

See also