Mixed-Media Library

Goal: a single bucket holds images, videos, audio clips, and PDFs; K3 embeds them all into one Vector collection; you search by file (image-by-image, audio-by-audio) or by text (“a brown leather handbag”). Visual + cross-modal retrieval, no glue code.

Primitives used: Storage (mixed-media uploads via S3 SDK or CLI) → Pipelines (auto-rule wired by Vector’s visual_embedding_index template) → Vector (one collection, multimodal search via s3_key and text shapes).

Shape:

images / video / audio / PDFs ──upload──► Storage bucket
                                                │
                                                ▼  auto-rule fires
                                 visual_embedding_index Scriptum
                                 (handles each modality:
                                    - image → CLIP-style embed
                                    - video → sampled frame embeds
                                    - audio → spectrogram embed
                                    - pdf   → page-render embeds)
                                                │
                                                ▼
                                 Vector: `assets` collection
                                 (one collection, multimodal)
                                                │
                      ┌─────────────────────────┴─────────────────────────┐
                      │                                                   │
                      ▼                                                   ▼
        Search by S3 key (file query)                              Search by text
        "find similar to this image"                               "brown handbag"

Prerequisites

  • dodil CLI + dodil login
  • A bucket — kb-platform:
    dodil k3 bucket create kb-platform -d "Mixed-media asset library"
  • Vector engine configured:
    dodil k3 vector store create -b kb-platform -m auto # Wait for ACTIVE

1. Create the multimodal collection

visual_embedding_index handles all four modalities — image, video frames, audio spectrograms, and PDF page renders — through one collection. Schema and embed_model come from the template’s contract.

dodil k3 vector collection add assets -b kb-platform \
  --description "Mixed-media library — images / video / audio / PDFs" \
  --template visual_embedding_index

export COLLECTION_ID=$(dodil k3 vector collection get assets -b kb-platform -o json | jq -r '.collectionId')
export PIPELINE_ID=$(dodil k3 vector collection get assets -b kb-platform -o json | jq -r '.embedPipelineId')

Inspect the resolved schema after first ingest:

dodil k3 vector collection get assets -b kb-platform -o json \
  | jq '{name, dimensions, embeddingType, distanceMetric, sparseMode, modality, embedModel}'

For visual_embedding_index you’ll see modality: "visual" and sparseMode: "SPARSE_MODE_NONE" (visual collections are dense-only; hybrid BM25 doesn’t apply to non-text modalities). embedModel will be the visual embedder K3 ships (e.g. a CLIP-class model).

Inspect the auto-rule’s globs — they should cover image / video / audio / PDF extensions:

dodil k3 ingest list -b kb-platform -p "$PIPELINE_ID" -o json \
  | jq '.rules[] | {ruleId, includePatterns, includeMimeTypes, enabled}'

2. Upload mixed media — pick your client

Bulk via aws-cli

The fastest option for a real corpus. K3 speaks native S3; see Storage → S3 Compatibility for setup.

# Set up profile once
aws configure --profile dodil-k3
export AWS_PROFILE=dodil-k3
export AWS_ENDPOINT_URL=https://k3.dev.dodil.io

# Sync a folder of mixed assets, one pass per content-type
aws s3 sync ./local-assets/ s3://kb-platform/library/ \
  --content-type "image/jpeg" \
  --exclude "*" --include "*.jpg" --include "*.jpeg"

aws s3 sync ./local-assets/ s3://kb-platform/library/ \
  --content-type "video/mp4" \
  --exclude "*" --include "*.mp4"

aws s3 sync ./local-assets/ s3://kb-platform/library/ \
  --content-type "audio/wav" \
  --exclude "*" --include "*.wav"

Setting --content-type per upload matters. K3’s auto-rule matches on MIME types; the visual pipeline also dispatches to different embedders based on content-type (image vs video vs audio). For mass-uploads where MIME is wrong, fix it before ingest fires (or override on the search side with contentType on the query — limited to per-query corrections).
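Before a bulk sync, it can help to check which content-type each local file would need. The following is a minimal sketch, not part of K3: the extension map and the function names (`content_type_for`, `group_by_content_type`) are illustrative assumptions, and the map only covers the extensions this recipe uploads.

```python
import mimetypes
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed mapping for the extensions used in this recipe; anything else
# falls back to the standard-library guess.
KNOWN_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".mp4": "video/mp4",
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".pdf": "application/pdf",
}


def content_type_for(path: str) -> str:
    """Return the content-type to upload `path` with."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in KNOWN_TYPES:
        return KNOWN_TYPES[suffix]
    guessed, _ = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"


def group_by_content_type(paths):
    """Group paths so each group can be uploaded with one content-type."""
    groups = defaultdict(list)
    for path in paths:
        groups[content_type_for(path)].append(path)
    return dict(groups)
```

Each group then corresponds to one upload pass with a single content-type; files that land under application/octet-stream are the ones worth reviewing before ingest fires.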

Per-file via CLI

# Image (content-type inferred from extension)
dodil k3 object create ./bag-001.jpg -b kb-platform -k library/bags/bag-001.jpg

# Audio
dodil k3 object create ./audio-001.mp3 -b kb-platform -k library/audio/audio-001.mp3

# Video
dodil k3 object create ./clip-001.mp4 -b kb-platform -k library/video/clip-001.mp4

# PDF (treated as page renders by visual_embedding_index)
dodil k3 object create ./catalog.pdf -b kb-platform -k library/pdf/catalog.pdf

3. Watch the ingest jobs

dodil k3 ingest jobs -b kb-platform -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {object: .object.key, status, chunksCreated, embeddingsWritten}'

Modality-specific behavior to expect:

| Modality | chunksCreated (typical) | Notes |
| --- | --- | --- |
| Image (jpg / png) | 1 | one embedding per image |
| Video (mp4) | 8–24 | sampled frames → one embedding each |
| Audio (wav / mp3) | 3–10 | spectrogram windows → one embedding per window |
| PDF | N (= pages) | one page-render embedding per page |

If embeddingsWritten is consistently lower than chunksCreated, see Pipelines → Replay & Retry.
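The check above can be scripted against the JSON that the ingest jobs command prints. This is a rough sketch that assumes the payload has the shape implied by the jq expression (a top-level `jobs` array whose items carry `object.key`, `status`, `chunksCreated`, and `embeddingsWritten`); the function name is hypothetical.

```python
def jobs_with_missing_embeddings(payload: dict) -> list:
    """Return jobs where fewer embeddings were written than chunks created.

    `payload` is the parsed JSON output of the ingest jobs command
    (assumed shape: {"jobs": [{"object": {"key": ...}, ...}]}).
    """
    flagged = []
    for job in payload.get("jobs", []):
        chunks = job.get("chunksCreated") or 0
        written = job.get("embeddingsWritten") or 0
        if written < chunks:
            flagged.append(
                {
                    "key": job.get("object", {}).get("key"),
                    "status": job.get("status"),
                    "missing": chunks - written,
                }
            )
    return flagged
```

Feed it the result of `json.load` on the saved CLI output; an empty list means every chunk produced an embedding.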

4. Search

A. By an image in the bucket (file query)

Upload a query image (or use one already in the bucket), then search using its S3 key:

# Stage the query image
dodil k3 object create ./query-bag.jpg -b kb-platform -k queries/query-bag.jpg

# Search visually similar assets
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 20
  }' | jq '.results[] | {score, object: .object.key}'

contentType is a hint — K3 uses it to route to the right embedder when the object’s stored content-type is missing / ambiguous.

B. By text

CLIP-style visual embedders (and similar models) map text and images into one shared vector space, so text-to-image matching works directly. Send a text description; K3 embeds it into the same space as your images:

curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "text": "brown leather handbag with gold hardware",
    "topK": 20
  }' | jq '.results[] | {score, object: .object.key}'

Quality of text-to-image recall depends on the underlying visual embedder. Test both shapes against your corpus to see which works better for your queries.

C. File + rerank against text — “look like X AND match description Y”

The killer combo for product-image search, content moderation, and visual-Q&A:

curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 50,
    "rerank": true,
    "rerankText": "brown leather handbag with gold hardware"
  }' | jq '.results[] | {score, object: .object.key}'

K3 retrieves the top-50 by visual similarity, then reranks each candidate against the supplied text — promoting candidates that match both. See Vector → Multimodal Search for more rerank-text patterns.
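The file, text, and file-plus-rerank shapes differ in only a few fields, so a small builder can keep them consistent. The sketch below only assembles the request body using the field names from the curl examples; the function name and the rule that exactly one of `s3_key` or `text` is supplied are assumptions made for illustration, not documented K3 behavior.

```python
from typing import Optional


def build_search_body(
    collection: str,
    *,
    bucket: str = "kb-platform",
    s3_key: Optional[str] = None,
    text: Optional[str] = None,
    content_type: Optional[str] = None,
    rerank_text: Optional[str] = None,
    top_k: int = 20,
) -> dict:
    """Assemble a search request body for a file query or a text query."""
    # Assumption: a query is either file-based or text-based, never both.
    if (s3_key is None) == (text is None):
        raise ValueError("pass exactly one of s3_key or text")

    body = {"bucket": bucket, "collectionName": collection, "topK": top_k}
    if s3_key is not None:
        body["s3Key"] = s3_key
        if content_type:
            body["contentType"] = content_type  # routing hint for the embedder
    else:
        body["text"] = text

    # Rerank only when there is text to score candidates against.
    if rerank_text:
        body["rerank"] = True
        body["rerankText"] = rerank_text
    return body
```

For example, `build_search_body("assets", s3_key="queries/query-bag.jpg", content_type="image/jpeg", rerank_text="brown leather handbag with gold hardware", top_k=50)` produces the same body as the rerank request above.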

D. Audio similarity — same shape

dodil k3 object create ./query-sound.wav -b kb-platform -k queries/query-sound.wav

curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-sound.wav",
    "contentType": "audio/wav",
    "topK": 10
  }' | jq '.results[] | {score, object: .object.key}'

E. Pre-filter by sub-folder

If you’ve organized assets by path (library/bags/, library/shoes/, …), filter results to one category:

curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-bag.jpg",
    "topK": 20,
    "preFilter": {
      "op": "LOGICAL_OP_AND",
      "filters": [
        { "field": "source_key", "op": "FILTER_OP_CONTAINS", "value": "library/bags/" }
      ]
    }
  }'
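If the category comes from user input, the preFilter object can be generated rather than hand-written. This is a small sketch using only the operators shown in the request above (LOGICAL_OP_AND and FILTER_OP_CONTAINS on source_key); the helper name is hypothetical, and it assumes every fragment must match.

```python
def key_prefilter(*fragments: str) -> dict:
    """Build a preFilter requiring source_key to contain every fragment."""
    if not fragments:
        raise ValueError("at least one key fragment is required")
    return {
        "op": "LOGICAL_OP_AND",
        "filters": [
            {"field": "source_key", "op": "FILTER_OP_CONTAINS", "value": fragment}
            for fragment in fragments
        ],
    }
```

Because the operator is a substring match, including the trailing slash (library/bags/ rather than library/bags) avoids also matching sibling folders such as library/bags-archive/.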

5. Display results — building a product-search UI

# Sketch: build a "more like this" UI
import os
from typing import Optional
from urllib.parse import quote

import requests

BASE_URL = "https://k3.dev.dodil.io"
TOKEN = os.environ["DODIL_TOKEN"]  # same bearer token as the curl examples


def presigned_url(bucket: str, key: str) -> str:
    # Use Storage's GetObjectUrl; see /storage/api-reference/objects#getobjecturl
    resp = requests.post(
        f"{BASE_URL}/{bucket}/objects/{quote(key)}/url",  # quote() keeps "/" intact
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["url"]


def find_similar(s3_key: str, description: Optional[str] = None, top_k: int = 20):
    body = {
        "bucket": "kb-platform",
        "collectionName": "assets",
        "s3Key": s3_key,
        "contentType": "image/jpeg",
        "topK": top_k,
    }
    if description:
        # Rerank the visual candidates against the text description
        body["rerank"] = True
        body["rerankText"] = description

    resp = requests.post(
        f"{BASE_URL}/kb-platform/vector/search",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        json=body,
        timeout=30,
    )
    resp.raise_for_status()

    # Each result has object.key; turn it into a presigned download URL for the UI
    return [
        {
            "score": hit["score"],
            "key": hit["object"]["key"],
            "url": presigned_url(hit["object"]["bucket"], hit["object"]["key"]),
        }
        for hit in resp.json()["results"]
    ]

The presigned URL flow is critical for UIs — Storage’s GetObjectUrl returns a K3-signed URL good for 3600 s by default; you don’t need to proxy bytes through your app.
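Since each result needs its own signed URL, a result page can trigger many GetObjectUrl calls. One option is to cache URLs for somewhat less than their lifetime. The class below is a sketch under assumptions: `PresignedUrlCache` is a hypothetical helper, `fetch` stands for any function that returns a signed URL for a bucket and key, and the 3000 s default is simply a margin chosen below the 3600 s default validity.

```python
import time
from typing import Callable, Dict, Tuple


class PresignedUrlCache:
    """Reuse presigned URLs until shortly before they would expire."""

    def __init__(
        self,
        fetch: Callable[[str, str], str],
        ttl_seconds: float = 3000.0,  # assumed margin below the 3600 s default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def get(self, bucket: str, key: str) -> str:
        now = self._clock()
        entry = self._entries.get((bucket, key))
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh, no network call
        url = self._fetch(bucket, key)
        self._entries[(bucket, key)] = (url, now + self._ttl)
        return url
```

The injectable clock exists only to make expiry testable; in an application, the default `time.monotonic` is sufficient.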

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| Audio uploaded but no embedding written | Audio embedder rejects rare codecs (e.g. very old AMR) | Transcode to common formats (wav, mp3, ogg) before upload |
| Video search consistently hits one frame; others are ignored | Frame sampling is sparse; corpus has near-identical frames | Increase the sampling rate via template options (use the API; the CLI doesn’t expose template options yet) |
| Text-to-image queries are worse than expected | Visual embedder isn’t CLIP-quality on your domain | Test domain-tuned embedders; consider an External Collection with a custom embedder |
| rerank: true on an s3Key query errors | Reranker needs text to score against; rerankText is required for file (binary) queries | Supply rerankText describing what you’re matching |
| Different objects with the same content (e.g. JPEG + WebP of the same photo) rank apart | Hash differs but embeddings should be close | This is correct: embeddings are content-aware. If you want exact dedup, use object eTags in metadata |
| PDF page-render search returns whole-PDF results without page granularity | PDF chunks are per-page; each result’s chunkIndex is the page number | Display chunkIndex in the UI; use it to deep-link to the page in your viewer |
| Latency for video queries is 700–1500 ms | Server-side video frame extraction + embedding | Pre-extract a representative frame on your side; use an External Collection for query-side control |
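Because videos and PDFs produce several chunks per object, one file can occupy many result slots. A UI may want one row per file, keeping the best-scoring chunk so that chunkIndex can still drive a page or frame deep link. The sketch below assumes each result carries `score`, `object.key`, and `chunkIndex`, and that a higher score means a closer match; the function name is illustrative.

```python
def best_hit_per_object(results: list) -> list:
    """Collapse chunk-level hits to one hit per object key.

    Keeps the highest-scoring chunk for each object (assuming higher
    score = better match) and returns hits ordered best first.
    """
    best = {}
    for hit in results:
        key = hit["object"]["key"]
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)
```

For a PDF hit, the surviving chunkIndex identifies the page to open in the viewer.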

Cleanup

RULE_ID=$(dodil k3 ingest list -b kb-platform -p "$PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
dodil k3 ingest update "$RULE_ID" -b kb-platform --enabled=false
dodil k3 ingest delete "$RULE_ID" -b kb-platform
dodil k3 pipeline delete "$PIPELINE_ID" -b kb-platform
dodil k3 vector collection delete "$COLLECTION_ID" -b kb-platform
dodil k3 vector store delete -b kb-platform
dodil k3 bucket delete kb-platform
