Hybrid + Rerank

Goal: get the highest-precision top-K results from a vector collection. The trick is a 3-tier stack:

query
┌───────────────────────────────────────┐
│ Tier 1: Dense vector retrieval        │  recall via semantic similarity
│ (Milvus HNSW / IVF over embeddings)   │
└───────────────────────────────────────┘
┌───────────────────────────────────────┐
│ Tier 2: Sparse BM25 retrieval         │  recall via token overlap
│ (Milvus BM25 function)                │
└───────────────────────────────────────┘
                    │
  ──── fused via RRF (k=60) ────► merged candidates
┌───────────────────────────────────────┐
│ Tier 3: Jina cross-encoder rerank     │  precision via deep pairwise scoring
│ (over the top-K candidates)           │
└───────────────────────────────────────┘
final top-K

Each tier improves a different metric:

| Tier | Latency cost | What it gives you |
| --- | --- | --- |
| Dense only (SEARCH_MODE_VECTOR) | ~10–50 ms | Strong semantic recall; misses exact-keyword matches |
| Dense + BM25 (SEARCH_MODE_HYBRID) | ~15–80 ms | Catches both semantic + exact-token matches; balanced |
| + Jina rerank | + 50–200 ms | High top-K precision; best for RAG / agent grounding |

When to use each tier

| Use case | Recommended stack |
| --- | --- |
| RAG with strict top-3 quality | Hybrid + rerank |
| RAG with broad top-20 candidates | Hybrid only |
| Bulk-recall analytic queries (top-500+) | Dense only |
| Pure-semantic comparison studies | Dense only (avoid BM25 contamination) |
| Latency-sensitive search-as-you-type | Dense only (no rerank) |
| Image / audio / video retrieval | Dense + rerank_text (see Multimodal Search) |

Prerequisites

  • A pipeline-mode collection with sparse_mode = BM25 (default for text_embedding_index):

    dodil k3 vector collection get docs -b kb-prod -o json | jq '{sparseMode, dimensions, embedModel}'

    Expect sparseMode: "SPARSE_MODE_BM25". If it’s SPARSE_MODE_NONE, hybrid mode falls back to dense (you’ll see a warning); recreate the collection from a BM25-enabled template.

  • Some content already ingested. See Pipeline Collection for the setup.

1. Dense-only baseline

The default — searchMode omitted or set to VECTOR:

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "what is multi-head attention",
    "topK": 5,
    "searchMode": "SEARCH_MODE_VECTOR",
    "includeContent": true
  }' | jq '{
    searchModeUsed,
    tookMs,
    results: [.results[] | {score, object: .object.key, source}]
  }'

Sample response:

{
  "searchModeUsed": "vector",
  "tookMs": "42",
  "results": [
    {"score": 0.87, "object": "papers/attention.pdf", "source": "SEARCH_SOURCE_VECTOR"},
    {"score": 0.81, "object": "papers/transformer.pdf", "source": "SEARCH_SOURCE_VECTOR"}
  ]
}

Semantic recall is strong, but a document whose only match is the exact phrase “multi-head” can be missed if the embedding model didn’t preserve that token.

2. Add BM25 — hybrid mode

searchMode: "SEARCH_MODE_HYBRID" runs both signals and fuses via RRF (k=60):

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "what is multi-head attention",
    "topK": 5,
    "searchMode": "SEARCH_MODE_HYBRID",
    "includeContent": true
  }' | jq '.results[] | {score, object: .object.key, source}'

Notice the source field on results:

[
  {"score": 0.91, "object": "papers/attention.pdf", "source": "SEARCH_SOURCE_HYBRID"},
  {"score": 0.84, "object": "papers/bert.pdf", "source": "SEARCH_SOURCE_HYBRID"},
  {"score": 0.77, "object": "blog/multi-head-101.md", "source": "SEARCH_SOURCE_HYBRID"}
]

SEARCH_SOURCE_HYBRID confirms both signals contributed. Pure-BM25 matches (no dense hit) come back as SEARCH_SOURCE_FULLTEXT; pure-dense as SEARCH_SOURCE_VECTOR.

How RRF k=60 works

Reciprocal Rank Fusion combines two ranked lists into one:

rrf_score(doc) = 1/(k + dense_rank(doc)) + 1/(k + bm25_rank(doc))

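With k=60 (the Milvus default), the fusion is forgiving: a doc ranked #1 by one signal and #50 by the other still scores 1/61 + 1/110 ≈ 0.025, about 78% of the maximum possible score of 2/61 ≈ 0.033. Because the score depends only on ranks, it rewards consensus between the dense and sparse signals, and it is parameter-light compared to learned-weight fusion.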
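
The fusion formula can be tried out locally with a short sketch. This is an illustration of plain RRF over two ranked lists, not K3's or Milvus's actual implementation; the helper name rrf_fuse and the document ids are hypothetical, and a document missing from one list simply gets no contribution from that list.

```shell
#!/bin/sh
# rrf_fuse K DENSE_FILE BM25_FILE   (hypothetical helper, for illustration)
# Each input file lists document ids, best match first, one id per line.
# Prints "score<TAB>id" lines, highest fused score first.
rrf_fuse() {
  LC_ALL=C awk -v k="$1" '
    # FNR is the 1-based line number within the current file, i.e. the rank.
    { score[$0] += 1 / (k + FNR) }
    END { for (id in score) printf "%.6f\t%s\n", score[id], id }
  ' "$2" "$3" | LC_ALL=C sort -k1,1nr -k2,2
}

# Made-up rankings, only to exercise the formula.
dense_list=$(mktemp)
bm25_list=$(mktemp)
printf '%s\n' attention.pdf transformer.pdf bert.pdf > "$dense_list"
printf '%s\n' multi-head-101.md attention.pdf bert.pdf > "$bm25_list"

rrf_fuse 60 "$dense_list" "$bm25_list"
# 0.032522	attention.pdf
# 0.031746	bert.pdf
# 0.016393	multi-head-101.md
# 0.016129	transformer.pdf

rm -f "$dense_list" "$bm25_list"
```

In this toy input, bert.pdf is only third in each list yet lands well above the documents that top a single list, which is the consensus effect described above.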

SEARCH_MODE_AUTO — the safe default

AUTO picks HYBRID when the collection has a sparse index (BM25 or EXTERNAL) and VECTOR otherwise. This lets you write the same query against collections with different sparse configs:

{ "bucket": "kb-prod", "text": "...", "searchMode": "SEARCH_MODE_AUTO" }

For most production code, AUTO is the right setting.
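
The selection rule is applied server-side, so client code normally never needs it. As a sketch of what AUTO resolves to, assuming the sparseMode values reported by the collection: resolve_auto_mode is a hypothetical helper, and SPARSE_MODE_EXTERNAL is an assumed spelling for the EXTERNAL config.

```shell
#!/bin/sh
# resolve_auto_mode SPARSE_MODE   (hypothetical helper, for illustration)
# Mirrors the AUTO rule as described: hybrid when a sparse index exists,
# dense-only otherwise.
# SPARSE_MODE_EXTERNAL is an assumed enum spelling, not confirmed by the docs.
resolve_auto_mode() {
  case "$1" in
    SPARSE_MODE_BM25 | SPARSE_MODE_EXTERNAL) echo "SEARCH_MODE_HYBRID" ;;
    *) echo "SEARCH_MODE_VECTOR" ;;
  esac
}

resolve_auto_mode SPARSE_MODE_BM25   # prints SEARCH_MODE_HYBRID
resolve_auto_mode SPARSE_MODE_NONE   # prints SEARCH_MODE_VECTOR
```

This can be handy for predicting which tier a benchmark run will exercise before sending the request.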

3. Add rerank — top-K precision

rerank: true runs a Jina cross-encoder over the top-K candidates from tier 1+2. The cross-encoder scores each (query, candidate) pair directly — slower than bi-encoder retrieval but materially more accurate at the top.

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "what is multi-head attention",
    "topK": 5,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true,
    "includeContent": true
  }' | jq '{
    tookMs,
    results: [.results[] | {score, object: .object.key}]
  }'

Expect ~100–250 ms wall time for a topK: 5 with rerank, vs ~40–80 ms for hybrid-only.

Strategy — over-recall for rerank

Rerank works best when you give it more candidates to choose from. A common pattern:

# Request top-5 with rerank; K3 pulls a larger hybrid candidate set (e.g. 50) internally
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "...",
    "topK": 5,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true
  }'

K3 internally fetches a larger candidate set, runs the reranker, and returns the requested topK. You don’t need to manually over-fetch.

4. rerank_text for binary queries

When the query is an image, audio, or video object (the s3_key request shape), the reranker has no query text to score against. Supply rerank_text (rerankText in the JSON body) to give it a textual handle:

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "product-images",
    "s3Key": "queries/example-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 20,
    "rerank": true,
    "rerankText": "brown leather handbag with gold hardware"
  }'

The reranker scores each image-search result against the supplied text — useful for “images that look like X AND match this description Y”. See Multimodal Search for the full multimodal flow.

5. Worked benchmark — what each tier costs

Run all three tiers against the same query and compare:

QUERY="explain the attention mechanism"

# Tier 1: dense only
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
  -d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_VECTOR\"}" \
  > /dev/null

# Tier 1+2: hybrid
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
  -d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_HYBRID\"}" \
  > /dev/null

# Tier 1+2+3: hybrid + rerank
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
  -d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_HYBRID\",\"rerank\":true}" \
  > /dev/null

Typical numbers on a ~10K-chunk collection (your mileage varies with corpus + model + hardware):

| Tier | Wall time | Top-1 precision |
| --- | --- | --- |
| Dense | ~40 ms | baseline |
| Hybrid | ~70 ms | +5–10% over dense |
| Hybrid + rerank | ~200 ms | +10–20% over hybrid |

Rerank spends its latency on top-K precision. Even at small topK (below roughly 100), the rerank step takes most of the request budget, as the ~130 ms gap between the hybrid and hybrid + rerank rows shows.
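
A single timed run is noisy (connection setup, cold caches), so it can help to repeat each tier and compare medians. The helper below is a minimal sketch, not part of the K3 tooling: median_ms is a hypothetical name, and it assumes GNU date for nanosecond timestamps.

```shell
#!/bin/sh
# median_ms N CMD [ARGS...]   (hypothetical helper, for illustration)
# Runs CMD N times and prints the median wall time in milliseconds
# (the lower middle value when N is even).
# Assumes GNU date for %N (nanoseconds); on macOS use gdate from coreutils.
median_ms() {
  runs="$1"; shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(date +%s%N)
    "$@" > /dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
    i=$((i + 1))
  done | sort -n | awk '{ v[NR] = $1 } END { print v[int((NR + 1) / 2)] }'
}

# Stand-in command; prints a value a little above 100.
median_ms 3 sleep 0.1
```

To time a tier, pass the same curl command and arguments used in the benchmark in place of sleep 0.1. The measured time includes network round-trip, so compare it with the tookMs field in the response to separate server time from transport.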

6. When to skip each tier

| Skip BM25 if | Skip rerank if |
| --- | --- |
| Collection is SPARSE_MODE_NONE (hybrid falls back anyway with a warning) | Bulk-recall query (topK > 100), where rerank latency dominates |
| You’re A/B-testing pure-semantic search vs hybrid | Latency-sensitive paths (search-as-you-type, interactive UI) |
| Token-overlap noise is hurting recall (queries with lots of common words) | Your embedder + collection corpus are well-aligned (e.g. domain-tuned) |
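
The two mechanical rerank rules in the table (bulk recall and latency-sensitive paths) can be folded into a small client-side guard. This is only a sketch: should_rerank is a hypothetical helper, the threshold of 100 is taken from the table, and the embedder-alignment rule is a judgment call that the function does not try to encode.

```shell
#!/bin/sh
# should_rerank TOPK LATENCY_SENSITIVE   (hypothetical helper, for illustration)
# Exit status 0 means "send rerank: true", 1 means "skip rerank".
# Skips rerank for bulk recall (topK above 100) and when the caller
# marks the path as latency-sensitive by passing "yes".
should_rerank() {
  [ "$1" -le 100 ] && [ "$2" != "yes" ]
}

if should_rerank 5 no;   then echo "rerank"; else echo "skip"; fi   # prints rerank
if should_rerank 500 no; then echo "rerank"; else echo "skip"; fi   # prints skip
if should_rerank 5 yes;  then echo "rerank"; else echo "skip"; fi   # prints skip
```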

Common gotchas

| Symptom | Cause | Fix |
| --- | --- | --- |
| searchMode: HYBRID + warnings: ["BM25 not available..."] | Collection’s sparse_mode = NONE | Recreate the collection from a BM25-enabled template OR switch to SEARCH_MODE_VECTOR to skip the warning |
| rerank: true adds latency but precision didn’t visibly improve | Top-K is already good (small corpus / domain-tuned embedder); rerank is overkill | Skip rerank for this collection; cost is wasted |
| Different score magnitudes between dense / hybrid / rerank tiers | Each tier produces scores on its own scale; rerank produces a cross-encoder score | Don’t compare scores across tiers; use them as within-tier rankings only |
| rerank: true on an image query returns error: "no text available for rerank" | Binary query (s3_key) without rerank_text | Supply rerank_text (see step 4) |
| Hybrid results dominated by BM25 keyword matches | RRF k=60 is fixed; no caller-tuneable fusion weight today | Drop to SEARCH_MODE_VECTOR for queries where you want pure-semantic ranking |

See also