Mixed-Media Library

Goal: a single bucket holds images, videos, audio clips, and PDFs; K3 embeds them all into one Vector collection; you search by file (image-by-image, audio-by-audio) or by text (“a brown leather handbag”). Visual + cross-modal retrieval, no glue code.

Primitives used: Storage (mixed-media uploads via S3 SDK or CLI) → Pipelines (auto-rule wired by Vector’s visual_embedding_index template) → Vector (one collection, multimodal search via s3_key and text shapes).

Shape:


   images / video / audio / PDFs  ──upload──►  Storage bucket
                                                    │
                                                    ▼  auto-rule fires
                                          visual_embedding_index Scriptum
                                            (handles each modality:
                                             - image → CLIP-style embed
                                             - video → sampled frame embeds
                                             - audio → spectrogram embed
                                             - pdf   → page-render embeds)
                                                    │
                                                    ▼
                                          Vector — `assets` collection
                                          (one collection, multimodal)
                                                    │
                          ┌─────────────────────────┴─────────────────────────┐
                          │                                                   │
                          ▼                                                   ▼
              Search by S3 key (file query)                         Search by text
              "find similar to this image"                         "brown handbag"

Prerequisites

dodil CLI + dodil login

A bucket — kb-platform:


dodil k3 bucket create kb-platform -d "Mixed-media asset library"

Vector engine configured:


dodil k3 vector store create -b kb-platform -m auto
# Wait for ACTIVE

1. Create the multimodal collection

visual_embedding_index handles all four modalities — image, video frames, audio spectrograms, and PDF page renders — through one collection. Schema and embed_model come from the template’s contract.


dodil k3 vector collection add assets -b kb-platform \
  --description "Mixed-media library — images / video / audio / PDFs" \
  --template visual_embedding_index
 
export COLLECTION_ID=$(dodil k3 vector collection get assets -b kb-platform -o json | jq -r '.collectionId')
export PIPELINE_ID=$(dodil k3 vector collection get assets -b kb-platform -o json | jq -r '.embedPipelineId')

Inspect the resolved schema after the first ingest:


dodil k3 vector collection get assets -b kb-platform -o json \
  | jq '{name, dimensions, embeddingType, distanceMetric, sparseMode, modality, embedModel}'

For visual_embedding_index you’ll see modality: "visual" and sparseMode: "SPARSE_MODE_NONE". Visual collections are dense-only, because hybrid BM25 doesn’t apply to non-text modalities. embedModel will be the visual embedder K3 ships (e.g. a CLIP-class model).

Inspect the auto-rule’s globs — they should cover image / video / audio / PDF extensions:


dodil k3 ingest list -b kb-platform -p "$PIPELINE_ID" -o json \
  | jq '.rules[] | {ruleId, includePatterns, includeMimeTypes, enabled}'
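
To make the glob check concrete, here is a minimal sketch that tests a rule's includePatterns against the kinds of keys this guide uploads. It assumes the patterns are plain shell-style globs (K3's actual glob semantics may differ, for example around `**`), and both `SAMPLE_KEYS` and the example patterns are hypothetical; substitute the values from your own rule.

```python
from fnmatch import fnmatchcase

# Hypothetical sample keys, one per modality used in this guide.
SAMPLE_KEYS = [
    "library/bags/bag-001.jpg",
    "library/video/clip-001.mp4",
    "library/audio/audio-001.mp3",
    "library/pdf/catalog.pdf",
]


def uncovered_keys(include_patterns, keys=SAMPLE_KEYS):
    """Return the keys that none of the glob patterns match."""
    return [
        key
        for key in keys
        if not any(fnmatchcase(key, pattern) for pattern in include_patterns)
    ]


# Made-up patterns for illustration; paste the includePatterns from your rule.
patterns = ["*.jpg", "*.png", "*.mp4", "*.pdf"]
print(uncovered_keys(patterns))  # ['library/audio/audio-001.mp3']
```

A non-empty result means at least one modality would never trigger the auto-rule under these assumed semantics.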

2. Upload mixed media — pick your client

Bulk via aws-cli

The fastest option for a real corpus. K3 speaks native S3 — see Storage → S3 Compatibility for setup.


# Set up profile once
aws configure --profile dodil-k3
export AWS_PROFILE=dodil-k3
export AWS_ENDPOINT_URL=https://k3.dev.dodil.io
 
# Sync a folder of mixed assets, one pass per media type so each gets the right Content-Type
aws s3 sync ./local-assets/ s3://kb-platform/library/ \
  --content-type "image/jpeg" \
  --exclude "*" --include "*.jpg" --include "*.jpeg"
aws s3 sync ./local-assets/ s3://kb-platform/library/ \
  --content-type "video/mp4" \
  --exclude "*" --include "*.mp4"
aws s3 sync ./local-assets/ s3://kb-platform/library/ \
  --content-type "audio/wav" \
  --exclude "*" --include "*.wav"

Setting --content-type on each upload matters: K3’s auto-rule matches on MIME type, and the visual pipeline dispatches to a different embedder depending on content-type (image, video, or audio). If a bulk upload lands with the wrong MIME type, fix it before ingest fires. Overriding with contentType on a search request is only a per-query correction.
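
Before a bulk sync it can help to see which MIME type each local file would map to. The sketch below groups files by the type Python's standard `mimetypes` module guesses from the extension. This is a local pre-flight check only, not a K3 API; the function names are illustrative.

```python
import mimetypes
from collections import defaultdict
from pathlib import Path


def group_by_content_type(paths):
    """Group file paths by the MIME type guessed from their extension."""
    groups = defaultdict(list)
    for path in paths:
        mime, _encoding = mimetypes.guess_type(str(path))
        groups[mime or "unknown"].append(str(path))
    return dict(groups)


def audit_folder(folder):
    """Walk a local folder and group every file by guessed MIME type."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return group_by_content_type(files)


# Example with bare file names; use audit_folder("./local-assets") on a real tree.
print(group_by_content_type(["bag-001.jpg", "clip-001.mp4", "catalog.pdf"]))
```

Files that land in the "unknown" group are the ones to give an explicit --content-type. The guessed value for some extensions may not be the exact string you intend to upload with, so compare the two rather than assuming they agree.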

Per-file via CLI


# Image — content-type inferred from extension
dodil k3 object create ./bag-001.jpg -b kb-platform -k library/bags/bag-001.jpg
 
# Audio
dodil k3 object create ./audio-001.mp3 -b kb-platform -k library/audio/audio-001.mp3
 
# Video
dodil k3 object create ./clip-001.mp4 -b kb-platform -k library/video/clip-001.mp4
 
# PDF (treated as page renders by visual_embedding_index)
dodil k3 object create ./catalog.pdf -b kb-platform -k library/pdf/catalog.pdf

3. Watch the ingest jobs


dodil k3 ingest jobs -b kb-platform -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {object: .object.key, status, chunksCreated, embeddingsWritten}'

Modality-specific behavior to expect:

Modality	Typical `chunksCreated`	Notes
Image (jpg / png)	1	one embedding per image
Video (mp4)	8–24	sampled frames → one embedding each
Audio (wav / mp3)	3–10	spectrogram windows → one embedding per window
PDF	N (= pages)	one page-render embedding per page

If embeddingsWritten is consistently lower than chunksCreated, see Pipelines → Replay & Retry.
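
The comparison between the two counters can be automated. A minimal sketch, assuming the jobs JSON has the fields used in the jq projection above (`object.key`, `chunksCreated`, `embeddingsWritten`); the sample payload and its status values are invented for illustration.

```python
def jobs_missing_embeddings(jobs):
    """Return (key, chunksCreated, embeddingsWritten) for jobs that wrote
    fewer embeddings than they created chunks."""
    flagged = []
    for job in jobs:
        chunks = job.get("chunksCreated") or 0
        written = job.get("embeddingsWritten") or 0
        if written < chunks:
            flagged.append((job["object"]["key"], chunks, written))
    return flagged


# Hypothetical payload shaped like the jq projection; not real K3 output.
sample = {
    "jobs": [
        {"object": {"key": "library/bags/bag-001.jpg"}, "status": "COMPLETED",
         "chunksCreated": 1, "embeddingsWritten": 1},
        {"object": {"key": "library/video/clip-001.mp4"}, "status": "COMPLETED",
         "chunksCreated": 12, "embeddingsWritten": 9},
    ]
}
print(jobs_missing_embeddings(sample["jobs"]))
# [('library/video/clip-001.mp4', 12, 9)]
```

Feed it the parsed output of `dodil k3 ingest jobs ... -o json` to list the objects worth replaying.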

4. Search

A. By an image in the bucket — file query

Upload a query image (or use one already in the bucket), then search using its S3 key:


# Stage the query image
dodil k3 object create ./query-bag.jpg -b kb-platform -k queries/query-bag.jpg
 
# Search visually similar assets
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 20
  }' | jq '.results[] | {score, object: .object.key}'

contentType is a hint: K3 uses it to route to the right embedder when the object’s stored content-type is missing or ambiguous.

B. By text

Visual embedders trained CLIP-style (or similar) support text-to-image matching directly. Send a text description; K3 embeds it into the same vector space as your images:


curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "text": "brown leather handbag with gold hardware",
    "topK": 20
  }' | jq '.results[] | {score, object: .object.key}'

Quality of text-to-image recall depends on the underlying visual embedder. Test both shapes against your corpus to see which works better for your queries.
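
To compare the two shapes side by side, a small helper can build either request body from the same call site. This is a sketch using only the field names from the curl examples above. The rule that a request carries exactly one of `s3Key` or `text` is an assumption made here to keep the comparison clean, not a documented K3 constraint, and `build_search_body` is a hypothetical name.

```python
def build_search_body(bucket, collection, *, s3_key=None, text=None,
                      content_type=None, top_k=20):
    """Build a search body for either the file shape or the text shape."""
    # Assumption: one query shape per request (see lead-in).
    if (s3_key is None) == (text is None):
        raise ValueError("pass exactly one of s3_key or text")
    body = {"bucket": bucket, "collectionName": collection, "topK": top_k}
    if s3_key is not None:
        body["s3Key"] = s3_key
        if content_type is not None:
            body["contentType"] = content_type  # routing hint for the embedder
    else:
        body["text"] = text
    return body


file_query = build_search_body("kb-platform", "assets",
                               s3_key="queries/query-bag.jpg",
                               content_type="image/jpeg")
text_query = build_search_body("kb-platform", "assets",
                               text="brown leather handbag with gold hardware")
```

POST both bodies to the search endpoint and compare the returned keys for the same intent to see which shape recalls better on your corpus.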

C. File + rerank against text — “look like X AND match description Y”

The killer combo for product-image search, content moderation, and visual-Q&A:


curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 50,
    "rerank": true,
    "rerankText": "brown leather handbag with gold hardware"
  }' | jq '.results[] | {score, object: .object.key}'

K3 retrieves the top-50 by visual similarity, then reranks each candidate against the supplied text — promoting candidates that match both. See Vector → Multimodal Search for more rerank-text patterns.
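
The two-stage idea (retrieve by one signal, then reorder by another) can be illustrated with a toy example. This is a conceptual sketch only: it uses cosine similarity on hand-made 2-D vectors for both stages, whereas K3's actual reranker is not described in this guide and may work differently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, text_vec, corpus, top_k):
    """Stage 1: take the top_k keys by similarity to query_vec.
    Stage 2: reorder only those candidates by similarity to text_vec."""
    candidates = sorted(
        corpus, key=lambda k: cosine(query_vec, corpus[k]), reverse=True
    )[:top_k]
    return sorted(candidates, key=lambda k: cosine(text_vec, corpus[k]), reverse=True)


# Toy vectors, invented for illustration.
corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.7, 0.7]}
print(retrieve_then_rerank([1.0, 0.0], [0.0, 1.0], corpus, top_k=3))
# ['d', 'b', 'a']
```

Note that "c" matches the text best but never appears, because reranking only reorders the first-stage candidates. That is the reason to request a larger topK (50 in the example above) when reranking.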

D. Audio similarity — same shape


dodil k3 object create ./query-sound.wav -b kb-platform -k queries/query-sound.wav
 
curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-sound.wav",
    "contentType": "audio/wav",
    "topK": 10
  }' | jq '.results[] | {score, object: .object.key}'

E. Pre-filter by sub-folder

If you’ve organized assets by path (library/bags/, library/shoes/, …), filter results to one category:


curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-bag.jpg",
    "topK": 20,
    "preFilter": {
      "op": "LOGICAL_OP_AND",
      "filters": [
        { "field": "source_key", "op": "FILTER_OP_CONTAINS", "value": "library/bags/" }
      ]
    }
  }'
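
If the UI lets users pick a category, the preFilter object can be generated rather than hand-written. A minimal sketch that reuses only the field and operator names from the request above; `path_prefilter` is a hypothetical helper, and it assumes FILTER_OP_CONTAINS is a plain substring match.

```python
def path_prefilter(*fragments):
    """AND together one FILTER_OP_CONTAINS clause on source_key per fragment."""
    if not fragments:
        raise ValueError("at least one path fragment is required")
    return {
        "op": "LOGICAL_OP_AND",
        "filters": [
            {"field": "source_key", "op": "FILTER_OP_CONTAINS", "value": fragment}
            for fragment in fragments
        ],
    }


body = {
    "bucket": "kb-platform",
    "collectionName": "assets",
    "s3Key": "queries/query-bag.jpg",
    "topK": 20,
    "preFilter": path_prefilter("library/bags/"),
}
```

Under the substring assumption, "library/bags/" would also match a key such as "archive/library/bags/x.jpg", so choose fragments that are unambiguous in your key layout.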

5. Display results — building a product-search UI


# Sketch: build a "more like this" UI
import os
from urllib.parse import quote

import requests

TOKEN = os.environ["DODIL_TOKEN"]  # same token the curl examples use


def find_similar(s3_key: str, description: str | None = None, top_k: int = 20):
    body = {
        "bucket": "kb-platform",
        "collectionName": "assets",
        "s3Key": s3_key,
        "contentType": "image/jpeg",
        "topK": top_k,
    }
    if description:
        # Rerank the visual candidates against the text description
        body["rerank"] = True
        body["rerankText"] = description

    resp = requests.post(
        "https://k3.dev.dodil.io/kb-platform/vector/search",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        json=body,
        timeout=30,
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key

    # Each result has object.key; turn it into a presigned download URL for the UI
    return [
        {
            "score": hit["score"],
            "key": hit["object"]["key"],
            "url": presigned_url(hit["object"]["bucket"], hit["object"]["key"]),
        }
        for hit in resp.json()["results"]
    ]


def presigned_url(bucket: str, key: str) -> str:
    # Use Storage's GetObjectUrl; see /storage/api-reference/objects#getobjecturl
    resp = requests.post(
        # quote() percent-encodes unsafe characters in the key but keeps "/"
        f"https://k3.dev.dodil.io/{bucket}/objects/{quote(key)}/url",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["url"]

The presigned-URL flow matters for UIs: Storage’s GetObjectUrl returns a K3-signed URL valid for 3600 s by default, so you don’t need to proxy bytes through your app.

Common gotchas

Symptom	Cause	Fix
Audio uploaded but no embedding written	Audio embedder rejects rare codecs (e.g. very old AMR)	Transcode to common formats (wav, mp3, ogg) before upload
Video search hits one frame consistently — others ignored	Frame-sampling is sparse; corpus has near-identical frames	Increase sampling rate via template options (use API — CLI doesn’t expose template options yet)
Text-to-image queries are worse than expected	Visual embedder isn’t CLIP-quality on your domain	Test domain-tuned embedders; consider an External Collection with a custom embedder
`rerank: true` on `s3Key` query errors	Reranker needs text to score against; `rerankText` is required for file (binary) queries	Supply `rerankText` describing what you’re matching
Different objects with the same content (e.g. JPEG + WebP of the same photo) rank at different positions	The bytes (and hashes) differ, so the embeddings are close but not identical	This is expected: embeddings are content-aware, not byte-exact. If you want exact dedup, use object eTags in metadata
PDF page-render search returns whole-pdf results without page granularity	PDF chunks are per-page; each result’s `chunkIndex` IS the page number	Display `chunkIndex` in the UI; use it to deep-link to the page in your viewer
Latency for video queries is 700–1500 ms	Server-side video frame extraction + embedding	Pre-extract a representative frame on your side; use External Collection for query-side control
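
Two of the rows above (video frames crowding the results, and PDF page numbers carried in `chunkIndex`) can be handled in the UI layer by collapsing hits per object. A minimal sketch, assuming each result carries `score`, `object.key`, and `chunkIndex`, and that a higher score means a closer match; the sample hits are invented.

```python
def collapse_by_object(results):
    """Keep one entry per object key with its best score and the list of
    matching chunk indexes (PDF pages, video frames, audio windows)."""
    best = {}
    for hit in results:
        key = hit["object"]["key"]
        entry = best.setdefault(key, {"key": key, "score": hit["score"], "chunks": []})
        entry["score"] = max(entry["score"], hit["score"])
        entry["chunks"].append(hit.get("chunkIndex"))
    # Assumption: higher score is better; flip reverse= for distance-style scores.
    return sorted(best.values(), key=lambda e: e["score"], reverse=True)


# Hypothetical hits; not real K3 output.
hits = [
    {"score": 0.91, "object": {"key": "library/pdf/catalog.pdf"}, "chunkIndex": 3},
    {"score": 0.88, "object": {"key": "library/bags/bag-001.jpg"}, "chunkIndex": 0},
    {"score": 0.95, "object": {"key": "library/pdf/catalog.pdf"}, "chunkIndex": 7},
]
for entry in collapse_by_object(hits):
    print(entry["key"], entry["score"], entry["chunks"])
```

For PDFs, the collected `chunks` list gives the pages to deep-link to in your viewer.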

Cleanup


RULE_ID=$(dodil k3 ingest list -b kb-platform -p "$PIPELINE_ID" -o json | jq -r '.rules[0].ruleId')
 
dodil k3 ingest update "$RULE_ID" -b kb-platform --enabled=false
dodil k3 ingest delete "$RULE_ID" -b kb-platform
dodil k3 pipeline delete "$PIPELINE_ID" -b kb-platform
dodil k3 vector collection delete "$COLLECTION_ID" -b kb-platform
dodil k3 vector store delete -b kb-platform
dodil k3 bucket delete kb-platform