Multimodal Search

Goal: search a vector collection using an object in your bucket as the query. An image finds similar images, an audio clip finds similar sounds, a video clip finds similar videos. K3 fetches the file, embeds it server-side, then searches.

Template used: visual_embedding_index — handles image / video frames / audio spectrograms / PDF page renders. Also documented: face_embedding_index (faces) and object_embedding_index (open-vocab object detection).

Shape:


   query file (S3 key)
          │
          ▼
   K3 fetches the object  →  routes to the right embedder based on content_type
          │
          ▼
   server-side embedding (visual / face / object)
          │
          ▼
   Search RPC against the collection
          │
          ▼
   results (other objects' chunks ranked by visual similarity)
          │
          ▼
   (optional) Jina rerank against `rerank_text` for "look like X AND match description Y"

Prerequisites

dodil CLI + dodil login
A bucket with the vector engine configured (auto mode)

A visual-template-backed collection:


dodil k3 vector collection add product-images -b kb-prod \
  --description "Product photo library" \
  --template visual_embedding_index

Some images uploaded — the auto-generated ingest rule fires and embeds them:


dodil k3 object create ./img-001.jpg -b kb-prod -k products/img-001.jpg
dodil k3 object create ./img-002.jpg -b kb-prod -k products/img-002.jpg
# ... etc

Verify the ingest jobs completed:


PIPELINE_ID=$(dodil k3 vector collection get product-images -b kb-prod -o json | jq -r '.embedPipelineId')
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {object: .object.key, status, chunksCreated}'
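
The check above can be turned into a wait loop. The following is a minimal sketch, assuming each job's status is a plain string and that "COMPLETED" is the success value. The actual status names are not listed here, so treat that string as a placeholder and substitute whatever the jobs output shows. all_jobs_done is a hypothetical helper, not part of the dodil CLI.

```shell
#!/usr/bin/env bash
# all_jobs_done DONE_STATUS
# Hypothetical helper: reads one job status per line on stdin and succeeds
# only if at least one status was read and every status equals DONE_STATUS.
all_jobs_done() {
  local done_status="$1" line count=0
  while IFS= read -r line || [ -n "$line" ]; do
    [ "$line" = "$done_status" ] || return 1
    count=$((count + 1))
  done
  [ "$count" -gt 0 ]
}

# Usage sketch ("COMPLETED" is an assumed status value):
# until dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json \
#     | jq -r '.jobs[].status' | all_jobs_done "COMPLETED"; do
#   sleep 5
# done
```

A loop like this never exits if a job ends in a failure state, so in practice it probably needs a timeout or an explicit check for failed jobs as well.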

1. Search by an image in the bucket

Upload a query image (or use one already in the bucket), then search using its S3 key:


# Stage a query image
dodil k3 object create ./example-bag.jpg -b kb-prod -k queries/example-bag.jpg
 
# Search visually similar product images
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "product-images",
    "s3Key": "queries/example-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 20
  }' | jq '.results[] | {score, object: .object.key}'

Sample output (top 5, gathered into one array for readability; jq prints each result as a separate object):


[
  {"score": 0.94, "object": "products/img-042.jpg"},
  {"score": 0.91, "object": "products/img-128.jpg"},
  {"score": 0.88, "object": "products/img-007.jpg"},
  {"score": 0.85, "object": "products/img-091.jpg"},
  {"score": 0.81, "object": "products/img-205.jpg"}
]

Why `content_type` is a hint

K3 routes the file to different embedders based on content_type:

image/* → visual embedder
audio/* → audio embedder (spectrogram-based)
video/* → video embedder (frame sampling)
application/pdf → PDF page-render embedder

K3 tries the object’s stored content-type first. If that value is missing or wrong, the contentType field on the request overrides it. A common case: your uploads were stored as application/octet-stream instead of image/jpeg, and the request-side hint corrects the routing:


# Object stored without content_type; tell K3 it's a JPEG
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "product-images",
    "s3Key": "raw-uploads/some-file",
    "contentType": "image/jpeg",
    "topK": 20
  }'
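
When many objects were stored with a generic content type, the hint can often be derived from the key's extension on the client side. The sketch below assumes the keys carry meaningful extensions. guess_content_type is a hypothetical helper, and the extension list is illustrative rather than the set K3 supports.

```shell
#!/usr/bin/env bash
# guess_content_type KEY
# Hypothetical helper: prints a MIME type to send as the contentType hint,
# based on the key's file extension. Returns non-zero when the extension is
# unknown or absent, in which case the caller has to supply the type by hand.
guess_content_type() {
  local key="$1" ext
  ext="${key##*.}"
  # Lowercase the extension portably (works without bash 4 case conversion).
  ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    jpg|jpeg) echo "image/jpeg" ;;
    png)      echo "image/png" ;;
    mp3)      echo "audio/mpeg" ;;
    wav)      echo "audio/wav" ;;
    ogg)      echo "audio/ogg" ;;
    mp4)      echo "video/mp4" ;;
    pdf)      echo "application/pdf" ;;
    *)        return 1 ;;
  esac
}

guess_content_type "queries/example-bag.jpg"   # prints image/jpeg
```

An extensionless key such as raw-uploads/some-file makes the helper return non-zero, which is exactly the case where the type must be passed explicitly.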

2. Combine visual + text — `rerank_text`

visual_embedding_index ranks by visual similarity only. To also match a textual description (“brown bag with gold hardware”), set rerank to true and supply rerank_text (rerankText in the JSON body); the reranker then scores each result against the text:


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "product-images",
    "s3Key": "queries/example-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 20,
    "rerank": true,
    "rerankText": "brown leather handbag with gold hardware"
  }'

This is the pattern for “products that look like this AND match this description” workflows — image-first retrieval (high recall on visual similarity), then cross-modal reranking (high precision on text match).

Without rerank_text, rerank: true errors out on s3_key queries — the reranker needs text to score against.
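
A client-side guard can catch this before the request is sent. This is a minimal sketch: check_rerank_args is a hypothetical helper, and its message text is made up rather than copied from the server's error.

```shell
#!/usr/bin/env bash
# check_rerank_args RERANK [RERANK_TEXT]
# Hypothetical pre-flight check for s3Key queries: fails when rerank is
# "true" but no rerank text was given, mirroring the rule described above.
check_rerank_args() {
  local rerank="$1" rerank_text="${2:-}"
  if [ "$rerank" = "true" ] && [ -z "$rerank_text" ]; then
    echo "rerank=true on an s3Key query needs rerankText" >&2
    return 1
  fi
  return 0
}

# Passes: rerank requested and text supplied.
check_rerank_args true "brown leather handbag with gold hardware" && echo "ok to send"
```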

3. Audio + video queries

Same RPC, different modality. Setup:


# Audio collection — uses visual_embedding_index too (spectrogram-based for audio)
dodil k3 vector collection add audio-library -b kb-prod \
  --template visual_embedding_index
 
# Ingest audio clips
dodil k3 object create ./clip-01.mp3 -b kb-prod -k audio/clip-01.mp3
# Wait for ingest...

Query by audio:


dodil k3 object create ./query-clip.wav -b kb-prod -k queries/query-clip.wav
 
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "audio-library",
    "s3Key": "queries/query-clip.wav",
    "contentType": "audio/wav",
    "topK": 10
  }' | jq '.results[] | {score, object: .object.key}'

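Video queries work the same way: K3 samples frames and embeds them. The example assumes a video-library collection created with visual_embedding_index and populated the same way as audio-library above: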


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "video-library",
    "s3Key": "queries/example-clip.mp4",
    "contentType": "video/mp4",
    "topK": 10
  }'

4. Face search (`face_embedding_index`)

For a face-recognition use case, use the face-specific template: SCRFD detects faces, and each face crop is then embedded:


# Index a face library
dodil k3 vector collection add faces -b kb-prod \
  --template face_embedding_index
 
# Upload member photos
dodil k3 object create ./member-001.jpg -b kb-prod -k members/member-001.jpg
dodil k3 object create ./member-002.jpg -b kb-prod -k members/member-002.jpg
# ... etc
 
# Wait for ingest
PIPELINE_ID=$(dodil k3 vector collection get faces -b kb-prod -o json | jq -r '.embedPipelineId')
dodil k3 ingest jobs -b kb-prod -p "$PIPELINE_ID" -o json | jq '.jobs[] | .status' | sort -u
 
# Search — upload a query photo
dodil k3 object create ./unknown-face.jpg -b kb-prod -k queries/unknown-face.jpg
 
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "faces",
    "s3Key": "queries/unknown-face.jpg",
    "contentType": "image/jpeg",
    "topK": 5,
    "minScore": 0.75
  }'

Set min_score (minScore in the JSON body) aggressively for face matching, because false positives at lower scores are costly. 0.75–0.85 is a typical threshold for “same person”, depending on the embedder.
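
To compare candidate thresholds without re-running the search each time, one option is to query once with a permissive threshold and filter the scores locally. The sketch below assumes the results have already been flattened to "score<TAB>key" lines. filter_by_score is a hypothetical helper.

```shell
#!/usr/bin/env bash
# filter_by_score MIN
# Hypothetical helper: reads "score<TAB>key" lines on stdin and prints only
# the lines whose score is greater than or equal to MIN.
filter_by_score() {
  awk -F '\t' -v min="$1" '($1 + 0) >= (min + 0)'
}

# Usage sketch: flatten a saved search response with jq, then count how many
# candidates survive each threshold (response.json is an assumed local file).
# for t in 0.75 0.80 0.85; do
#   n=$(jq -r '.results[] | [.score, .object.key] | @tsv' response.json \
#         | filter_by_score "$t" | wc -l)
#   echo "minScore=$t keeps $n results"
# done
```

Once a threshold looks right on known same-person and different-person pairs, it can go back into the request as minScore so the server does the filtering.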

5. Object detection — open-vocabulary

object_embedding_index lets you detect objects with a caller-defined vocabulary — useful when you want to index products / parts / specific things rather than COCO classes:


# Collection requires `labels` template input — use API
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/pipelines" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "name": "products-by-object",
    "templateId": "object_embedding_index",
    "templateInputs": {
      "labels": ["bottle", "bag", "shoe", "watch", "jewelry", "headphones"]
    }
  }'
 
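# Upload + search same as visual_embedding_index
dodil k3 object create ./product-shelf.jpg -b kb-prod -k products/shelf-1.jpg
 
# Stage the query image referenced by s3Key below (local filename is illustrative)
dodil k3 object create ./find-bags.jpg -b kb-prod -k queries/find-bags.jpg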
 
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "products-by-object",
    "s3Key": "queries/find-bags.jpg",
    "topK": 20
  }'

Each detected object is cropped and embedded, so the search returns object-level matches rather than whole-image matches.

6. Pre-filter + multimodal

pre_filter (preFilter in the JSON body) works on multimodal queries just as it does on text queries. Filter by metadata before visual retrieval:


# Find similar bags, but only from the SS25 collection
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "product-images",
    "s3Key": "queries/example-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 20,
    "preFilter": {
      "op": "LOGICAL_OP_AND",
      "filters": [
        { "field": "season", "op": "FILTER_OP_EQ", "value": "SS25" },
        { "field": "in_stock", "op": "FILTER_OP_EQ", "value": "true" }
      ]
    }
  }'

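Metadata fields available depend on the template. For visual_embedding_index, the index template emits standard fields (source_key, chunk_index, mime_type, extracted_at). The season and in_stock fields filtered on above are not among those standard fields, so that example presumes custom metadata. For richer metadata like this, use external collections and supply your own metadata per record.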
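
When the filter is assembled in a script, small helpers can keep the JSON consistent. This is a rough sketch that only reproduces the shape used in the example above. eq_filter and and_filter are hypothetical names, and the helpers do no JSON escaping, so they are only safe for field names and values without quotes or backslashes.

```shell
#!/usr/bin/env bash
# eq_filter FIELD VALUE
# Hypothetical helper: prints one equality clause in the shape shown above.
# No JSON escaping is done; use only with simple field names and values.
eq_filter() {
  printf '{ "field": "%s", "op": "FILTER_OP_EQ", "value": "%s" }' "$1" "$2"
}

# and_filter CLAUSE...
# Hypothetical helper: joins the given clauses under LOGICAL_OP_AND.
and_filter() {
  local joined="" clause
  for clause in "$@"; do
    joined="${joined:+$joined, }$clause"
  done
  printf '{ "op": "LOGICAL_OP_AND", "filters": [%s] }' "$joined"
}

# Builds the same preFilter as the SS25 example.
and_filter "$(eq_filter season SS25)" "$(eq_filter in_stock true)"
```

For values that come from user input, building the JSON with jq instead of printf would be the safer choice.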

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| `INVALID_ARGUMENT: could not embed file` | `content_type` is wrong or missing and the stored content-type is generic (`application/octet-stream`) | Pass `contentType` explicitly on the request; it overrides the stored value |
| Image query returns text-collection results | Multi-collection search picked a text group | Set `collectionName` explicitly, or check `collectionStatuses` to see which group was active |
| Face query returns 0 results despite a clear face | SCRFD didn’t detect a face (too small, low contrast, angled) | Pre-crop the query image to roughly the face only; try a higher-resolution version |
| `rerank: true` on an s3_key query without `rerank_text` errors | Reranker needs text to score binary queries against | Supply `rerank_text` describing what you’re looking for |
| Audio search returns 0 results | Audio embedder fails on an unsupported codec (rare formats) | Convert to a common codec (mp3, wav, ogg) before uploading |
| Video query is slow | K3 samples and embeds multiple frames | Shorten the query clip, or pre-extract a representative frame and use image search instead |
| Object-detection collection returns no matches for an obvious object | The object wasn’t in the `labels` vocabulary passed at create time | Recreate the collection with the additional label; open-vocabulary models only detect what you tell them to |

Performance notes

| Operation | Typical latency |
| --- | --- |
| Single image embed + search | ~150–300 ms (embedding dominates) |
| Audio embed + search (30 s clip) | ~400–800 ms |
| Video embed + search (5 s clip, 8 frames sampled) | ~700–1500 ms |
| + `rerank: true` (top-K 20) | + 100–250 ms |

Server-side embedding of the query file is the long pole for multimodal queries — much slower than text queries. For latency-sensitive workflows, pre-embed query files on your side and use the vector query shape via External Collection.
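
For budgeting, the rerank cost stacks on top of the embed + search figure. A small worked example follows, assuming the two stages simply add, as the "+" row in the latency table suggests. add_ranges is a hypothetical helper and the numbers are the typical ranges from the table, in milliseconds.

```shell
#!/usr/bin/env bash
# add_ranges LO1 HI1 LO2 HI2
# Hypothetical helper: adds two latency ranges given in whole milliseconds
# and prints the combined range.
add_ranges() {
  echo "$(( $1 + $3 ))-$(( $2 + $4 )) ms"
}

add_ranges 150 300 100 250    # image search + rerank: prints 250-550 ms
add_ranges 700 1500 100 250   # video search + rerank: prints 800-1750 ms
```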