Mixed-Media Library
Goal: a single bucket holds images, videos, audio clips, and PDFs; K3 embeds them all into one Vector collection; you search by file (image-by-image, audio-by-audio) or by text (“a brown leather handbag”). Visual + cross-modal retrieval, no glue code.
Primitives used: Storage (mixed-media uploads via S3 SDK or CLI) → Pipelines (auto-rule wired by Vector’s visual_embedding_index template) → Vector (one collection, multimodal search via s3_key and text shapes).
Shape:
images / video / audio / PDFs ──upload──► Storage bucket
                                               │
                                               ▼ auto-rule fires
                                visual_embedding_index Scriptum
                                (handles each modality:
                                 - image → CLIP-style embed
                                 - video → sampled frame embeds
                                 - audio → spectrogram embed
                                 - pdf   → page-render embeds)
                                               │
                                               ▼
                                  Vector: `assets` collection
                                 (one collection, multimodal)
                                               │
                     ┌─────────────────────────┴─────────────────────────┐
                     │                                                   │
                     ▼                                                   ▼
       Search by S3 key (file query)                              Search by text
       "find similar to this image"                               "brown handbag"
Prerequisites
- dodil CLI + dodil login
- A bucket, kb-platform:
dodil k3 bucket create kb-platform -d "Mixed-media asset library"
- Vector engine configured:
dodil k3 vector store create -b kb-platform -m auto # Wait for ACTIVE
1. Create the multimodal collection
visual_embedding_index handles all four modalities — image, video frames, audio spectrograms, and PDF page renders — through one collection. Schema and embed_model come from the template’s contract.
dodil k3 vector collection add assets -b kb-platform \
--description "Mixed-media library — images / video / audio / PDFs" \
--template visual_embedding_index
export COLLECTION_ID=$(dodil k3 vector collection get assets -b kb-platform -o json | jq -r '.collectionId')
export PIPELINE_ID=$(dodil k3 vector collection get assets -b kb-platform -o json | jq -r '.embedPipelineId')
Inspect the resolved schema after first ingest:
dodil k3 vector collection get assets -b kb-platform -o json \
| jq '{name, dimensions, embeddingType, distanceMetric, sparseMode, modality, embedModel}'
For visual_embedding_index you’ll see modality: "visual" and sparseMode: "SPARSE_MODE_NONE" (visual collections are dense-only; hybrid BM25 doesn’t apply to non-text modalities). embedModel will be the visual embedder K3 ships (e.g. a CLIP-class model).
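The same check can be scripted, for instance as a guard in a setup script. This is a minimal sketch that assumes the JSON printed by the collection get command has already been parsed into a dict carrying the field names from the jq filter above; `check_visual_collection` is a hypothetical helper, not part of any K3 SDK.

```python
def check_visual_collection(info: dict) -> list[str]:
    """Return human-readable problems; an empty list means the schema looks as expected."""
    problems = []
    if info.get("modality") != "visual":
        problems.append(f"modality is {info.get('modality')!r}, expected 'visual'")
    if info.get("sparseMode") != "SPARSE_MODE_NONE":
        problems.append(
            f"sparseMode is {info.get('sparseMode')!r}, expected 'SPARSE_MODE_NONE'"
        )
    if not info.get("embedModel"):
        # Assumption: an empty value means the schema has not been resolved yet.
        problems.append("embedModel is empty; re-check after the first ingest")
    return problems
```

An empty return value means the collection matches what this guide expects; anything else is worth reading before uploading a large corpus.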
Inspect the auto-rule’s globs — they should cover image / video / audio / PDF extensions:
dodil k3 ingest list -b kb-platform -p "$PIPELINE_ID" -o json \
| jq '.rules[] | {ruleId, includePatterns, includeMimeTypes, enabled}'
2. Upload mixed media: pick your client
Bulk via aws-cli
The fastest option for a real corpus. K3 speaks native S3; see Storage → S3 Compatibility for setup.
# Set up profile once
aws configure --profile dodil-k3
export AWS_PROFILE=dodil-k3
export AWS_ENDPOINT_URL=https://k3.dev.dodil.io
# Sync a folder of mixed assets
aws s3 sync ./local-assets/ s3://kb-platform/library/ \
--content-type "image/jpeg" \
--exclude "*" --include "*.jpg" --include "*.jpeg"
aws s3 sync ./local-assets/ s3://kb-platform/library/ \
--content-type "video/mp4" \
--exclude "*" --include "*.mp4"
aws s3 sync ./local-assets/ s3://kb-platform/library/ \
--content-type "audio/wav" \
--exclude "*" --include "*.wav"
Setting --content-type per upload matters. K3’s auto-rule matches on MIME types, and the visual pipeline dispatches to different embedders based on content type (image vs. video vs. audio). For mass uploads where the MIME type is wrong, fix it before ingest fires. You can also override it on the search side with contentType on the query, but that correction applies per query only.
Per-file via CLI
# Image — content-type inferred from extension
dodil k3 object create ./bag-001.jpg -b kb-platform -k library/bags/bag-001.jpg
# Audio
dodil k3 object create ./audio-001.mp3 -b kb-platform -k library/audio/audio-001.mp3
# Video
dodil k3 object create ./clip-001.mp4 -b kb-platform -k library/video/clip-001.mp4
# PDF (treated as page renders by visual_embedding_index)
dodil k3 object create ./catalog.pdf -b kb-platform -k library/pdf/catalog.pdf
3. Watch the ingest jobs
dodil k3 ingest jobs -b kb-platform -p "$PIPELINE_ID" -o json \
| jq '.jobs[] | {object: .object.key, status, chunksCreated, embeddingsWritten}'
Modality-specific behavior to expect:
| Modality | Typical chunksCreated | Notes |
|---|---|---|
| Image (jpg / png) | 1 | one embedding per image |
| Video (mp4) | 8–24 | sampled frames → one embedding each |
| Audio (wav / mp3) | 3–10 | spectrogram windows → one embedding per window |
| PDF | N (= pages) | one page-render embedding per page |
If embeddingsWritten is consistently lower than chunksCreated, see Pipelines → Replay & Retry.
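The comparison can be automated over the jobs JSON. This sketch assumes the payload has the shape implied by the jq filter above (a top-level `jobs` array whose entries carry `object.key`, `status`, `chunksCreated`, and `embeddingsWritten`); `incomplete_jobs` is a hypothetical helper name.

```python
def incomplete_jobs(payload: dict) -> list[dict]:
    """Return one summary per job whose embeddingsWritten is below chunksCreated."""
    flagged = []
    for job in payload.get("jobs", []):
        # int() tolerates counters that arrive as JSON strings.
        chunks = int(job.get("chunksCreated") or 0)
        written = int(job.get("embeddingsWritten") or 0)
        if written < chunks:
            flagged.append({
                "key": job.get("object", {}).get("key"),
                "status": job.get("status"),
                "missing": chunks - written,
            })
    return flagged
```

Feed it the parsed output of the jobs command; a non-empty list names the objects worth replaying.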
4. Search
A. By an image in the bucket — file query
Upload a query image (or use one already in the bucket), then search using its S3 key:
# Stage the query image
dodil k3 object create ./query-bag.jpg -b kb-platform -k queries/query-bag.jpg
# Search visually similar assets
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-platform",
"collectionName": "assets",
"s3Key": "queries/query-bag.jpg",
"contentType": "image/jpeg",
"topK": 20
}' | jq '.results[] | {score, object: .object.key}'
contentType is a hint: K3 uses it to route to the right embedder when the object’s stored content-type is missing or ambiguous.
B. By text
Visual embedders trained CLIP-style (or similar) support text-to-image matching directly. Send a text description; K3 embeds it into the same vector space as your images:
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-platform",
"collectionName": "assets",
"text": "brown leather handbag with gold hardware",
"topK": 20
}' | jq '.results[] | {score, object: .object.key}'
Quality of text-to-image recall depends on the underlying visual embedder. Test both shapes against your corpus to see which works better for your queries.
C. File + rerank against text — “look like X AND match description Y”
The killer combo for product-image search, content moderation, and visual-Q&A:
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-platform",
"collectionName": "assets",
"s3Key": "queries/query-bag.jpg",
"contentType": "image/jpeg",
"topK": 50,
"rerank": true,
"rerankText": "brown leather handbag with gold hardware"
}' | jq '.results[] | {score, object: .object.key}'
K3 retrieves the top 50 by visual similarity, then reranks each candidate against the supplied text, promoting candidates that match both. See Vector → Multimodal Search for more rerank-text patterns.
D. Audio similarity — same shape
dodil k3 object create ./query-sound.wav -b kb-platform -k queries/query-sound.wav
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-platform",
"collectionName": "assets",
"s3Key": "queries/query-sound.wav",
"contentType": "audio/wav",
"topK": 10
}' | jq '.results[] | {score, object: .object.key}'
E. Pre-filter by sub-folder
If you’ve organized assets by path (library/bags/, library/shoes/, …), filter results to one category:
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-platform",
"collectionName": "assets",
"s3Key": "queries/query-bag.jpg",
"topK": 20,
"preFilter": {
"op": "LOGICAL_OP_AND",
"filters": [
{ "field": "source_key", "op": "FILTER_OP_CONTAINS", "value": "library/bags/" }
]
}
}'
5. Display results: building a product-search UI
# Sketch: build a "more like this" UI
import os

import requests

BASE_URL = "https://k3.dev.dodil.io"
TOKEN = os.environ["DODIL_TOKEN"]  # same bearer token the curl examples use


def find_similar(s3_key: str, description: str | None = None, top_k: int = 20):
    body = {
        "bucket": "kb-platform",
        "collectionName": "assets",
        "s3Key": s3_key,
        "contentType": "image/jpeg",
        "topK": top_k,
    }
    # Rerank against text only when a description is supplied
    if description:
        body["rerank"] = True
        body["rerankText"] = description
    resp = requests.post(
        f"{BASE_URL}/kb-platform/vector/search",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        json=body,
        timeout=30,
    )
    resp.raise_for_status()
    # Each result has object.key; turn it into a presigned download URL for the UI
    return [
        {
            "score": hit["score"],
            "key": hit["object"]["key"],
            "url": presigned_url(hit["object"]["bucket"], hit["object"]["key"]),
        }
        for hit in resp.json()["results"]
    ]


def presigned_url(bucket: str, key: str) -> str:
    # Use Storage's GetObjectUrl; see /storage/api-reference/objects#getobjecturl
    resp = requests.post(
        f"{BASE_URL}/{bucket}/objects/{key}/url",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["url"]
The presigned URL flow is critical for UIs: Storage’s GetObjectUrl returns a K3-signed URL good for 3600 s by default, so you don’t need to proxy bytes through your app.
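Because a result page can repeat the same objects across searches, it may be worth reusing signed URLs until shortly before they expire instead of requesting one per render. The sketch below is one way to do that under stated assumptions: `PresignedUrlCache` is a hypothetical class, the 3600 s lifetime is the default quoted above, and the 60 s safety margin is an arbitrary choice.

```python
import time
from typing import Callable


class PresignedUrlCache:
    """Cache signed URLs per (bucket, key) and refresh them shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[str, str], str],
        ttl_s: float = 3600.0,          # default URL lifetime quoted in this guide
        safety_margin_s: float = 60.0,  # assumption: refresh one minute early
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._lifetime = ttl_s - safety_margin_s
        self._clock = clock
        self._entries: dict[tuple[str, str], tuple[str, float]] = {}

    def get(self, bucket: str, key: str) -> str:
        now = self._clock()
        hit = self._entries.get((bucket, key))
        if hit is not None and hit[1] > now:
            return hit[0]  # still valid, reuse it
        url = self._fetch(bucket, key)
        self._entries[(bucket, key)] = (url, now + self._lifetime)
        return url
```

Wrapping the `presigned_url` function from the sketch above looks like `cache = PresignedUrlCache(presigned_url)` followed by `cache.get(bucket, key)` wherever a URL is needed.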
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| Audio uploaded but no embedding written | Audio embedder rejects rare codecs (e.g. very old AMR) | Transcode to common formats (wav, mp3, ogg) before upload |
| Video search hits one frame consistently — others ignored | Frame-sampling is sparse; corpus has near-identical frames | Increase sampling rate via template options (use API — CLI doesn’t expose template options yet) |
| Text-to-image queries are worse than expected | Visual embedder isn’t CLIP-quality on your domain | Test domain-tuned embedders; consider an External Collection with a custom embedder |
| rerank: true on s3_key query errors | Reranker needs text to score against; this is required for binary (file) queries | Supply rerankText describing what you’re matching |
| Different objects with same content (e.g. JPEG + WebP of the same photo) rank apart | Hash differs but embeddings should be close | This is correct — embeddings are content-aware. If you want exact dedup, use object eTags in metadata |
| PDF page-render search returns whole-pdf results without page granularity | PDF chunks are per-page; each result’s chunkIndex IS the page number | Display chunkIndex in the UI; use it to deep-link to the page in your viewer |
| Latency for video queries is 700-1500 ms | Server-side video frame extraction + embedding | Pre-extract a representative frame on your side; use External Collection for query-side control |
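Several rows above come down to one object producing many chunks (video frames, audio windows, PDF pages), so a result list can contain the same key more than once. A UI that shows one tile per asset can collapse hits client-side. This sketch assumes each result carries `score`, `chunkIndex`, and `object.key`, and that a higher score means a better match, which depends on the collection's distance metric; `best_hit_per_object` is a hypothetical helper.

```python
def best_hit_per_object(results: list[dict]) -> list[dict]:
    """Keep the highest-scoring hit per object key, preserving that hit's chunkIndex."""
    best: dict[str, dict] = {}
    for hit in results:
        key = hit["object"]["key"]
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Best matches first, assuming higher score is better.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

The surviving hit keeps its chunkIndex, so a PDF result can still deep-link to the matching page and a video result to the matching frame.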
Cleanup
RULE_ID=$(dodil k3 ingest list -b kb-platform -p "$PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
dodil k3 ingest update "$RULE_ID" -b kb-platform --enabled=false
dodil k3 ingest delete "$RULE_ID" -b kb-platform
dodil k3 pipeline delete "$PIPELINE_ID" -b kb-platform
dodil k3 vector collection delete "$COLLECTION_ID" -b kb-platform
dodil k3 vector store delete -b kb-platform
dodil k3 bucket delete kb-platform
See also
- Vector → Multimodal Search — deeper on the multimodal patterns (face-search, open-vocab object detection)
- Storage → S3 Compatibility — bulk-upload via aws-cli / SDKs
- Storage → Browser Upload — presigned PUT for direct-from-browser uploads (great for user-generated content libraries)
- Vector → External Collection — when you want a different visual embedder than K3’s default
- RAG Knowledge Base — same shape but for text; pair both for a multimodal-RAG product