Hybrid + Rerank
Goal: get the highest-precision top-K results from a vector collection. The trick is a 3-tier stack:
query
│
▼
┌───────────────────────────────────────┐
│ Tier 1 — Dense vector retrieval │ recall via semantic similarity
│ (Milvus HNSW / IVF over embeddings) │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ Tier 2 — Sparse BM25 retrieval │ recall via token overlap
│ (Milvus BM25 function) │
└───────────────────────────────────────┘
│ ──── fused via RRF (k=60) ────► merged candidates
▼
┌───────────────────────────────────────┐
│ Tier 3 — Jina cross-encoder rerank │ precision via deep pairwise scoring
│ (over the top-K candidates) │
└───────────────────────────────────────┘
│
▼
final top-K
Each tier improves a different metric:
| Tier | Latency cost | What it gives you |
|---|---|---|
| Dense only (SEARCH_MODE_VECTOR) | ~10–50 ms | Strong semantic recall; misses exact-keyword matches |
| Dense + BM25 (SEARCH_MODE_HYBRID) | ~15–80 ms | Catches both semantic + exact-token matches; balanced |
| + Jina rerank | + 50–200 ms | High top-K precision — best for RAG / agent grounding |
When to use each tier
| Use case | Recommended stack |
|---|---|
| RAG with strict top-3 quality | Hybrid + rerank |
| RAG with broad top-20 candidates | Hybrid only |
| Bulk-recall analytic queries (top-500+) | Dense only |
| Pure-semantic comparison studies | Dense only (avoid BM25 contamination) |
| Latency-sensitive search-as-you-type | Dense only (no rerank) |
| Image / audio / video retrieval | Dense + rerank_text (see Multimodal Search) |
Prerequisites
- A pipeline-mode collection with sparse_mode = BM25 (default for text_embedding_index):
  dodil k3 vector collection get docs -b kb-prod -o json | jq '{sparseMode, dimensions, embedModel}'
  Expect sparseMode: "SPARSE_MODE_BM25". If it’s SPARSE_MODE_NONE, hybrid mode falls back to dense (you’ll see a warning); recreate the collection from a BM25-enabled template.
- Some content already ingested. See Pipeline Collection for the setup.
1. Dense-only baseline
The default — searchMode omitted or set to VECTOR:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "docs",
"text": "what is multi-head attention",
"topK": 5,
"searchMode": "SEARCH_MODE_VECTOR",
"includeContent": true
}' | jq '{
searchModeUsed,
tookMs,
results: [.results[] | {score, object: .object.key, source}]
}'
Sample response:
{
"searchModeUsed": "vector",
"tookMs": "42",
"results": [
{"score": 0.87, "object": "papers/attention.pdf", "source": "SEARCH_SOURCE_VECTOR"},
{"score": 0.81, "object": "papers/transformer.pdf", "source": "SEARCH_SOURCE_VECTOR"}
]
}
Dense-only search gives strong semantic recall, but it can miss documents whose exact phrasing is “multi-head” if the embedding model didn’t preserve that token.
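The requests on this page hard-code the query inside a JSON literal. If the text comes from a shell variable instead, any quotes or backslashes in it will break the body. Below is a minimal sketch of one way to handle that in plain POSIX shell. json_escape and build_search_body are hypothetical helper names (not part of any dodil tooling), and the escaping covers only backslashes and double quotes, not newlines or other control characters.

```shell
# Escape backslashes and double quotes so a string can sit inside a JSON literal.
# Does NOT handle newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: $1 = query text, $2 = topK, $3 = searchMode.
# bucket and collectionName are the example values used on this page.
build_search_body() {
  printf '{"bucket":"kb-prod","collectionName":"docs","text":"%s","topK":%d,"searchMode":"%s"}\n' \
    "$(json_escape "$1")" "$2" "$3"
}

# The embedded quotes come out as \" in the body.
build_search_body 'what is "multi-head" attention' 5 SEARCH_MODE_VECTOR
```

To use it, pass the result to curl as -d "$(build_search_body "$QUERY" 5 SEARCH_MODE_VECTOR)". Since jq is already used on this page, jq -n --arg text "$QUERY" '{text: $text}' is an alternative that applies the full JSON escaping rules.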
2. Add BM25 — hybrid mode
searchMode: "SEARCH_MODE_HYBRID" runs both signals and fuses via RRF (k=60):
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "docs",
"text": "what is multi-head attention",
"topK": 5,
"searchMode": "SEARCH_MODE_HYBRID",
"includeContent": true
}' | jq '[.results[] | {score, object: .object.key, source}]'
Notice the source field on results:
[
{"score": 0.91, "object": "papers/attention.pdf", "source": "SEARCH_SOURCE_HYBRID"},
{"score": 0.84, "object": "papers/bert.pdf", "source": "SEARCH_SOURCE_HYBRID"},
{"score": 0.77, "object": "blog/multi-head-101.md", "source": "SEARCH_SOURCE_HYBRID"}
]
SEARCH_SOURCE_HYBRID confirms both signals contributed. Pure-BM25 matches (no dense hit) come back as SEARCH_SOURCE_FULLTEXT; pure-dense as SEARCH_SOURCE_VECTOR.
How RRF k=60 works
Reciprocal Rank Fusion combines two ranked lists into one:
rrf_score(doc) = 1/(k + dense_rank(doc)) + 1/(k + bm25_rank(doc))
With k=60 (Milvus default), the fusion is forgiving: a doc ranked #1 by one signal and #50 by the other still scores well. A doc that ranks high in both lists scores highest, so RRF rewards consensus between the dense and sparse signals, and with k as its only parameter it is light to configure compared to learned-weight fusion.
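To make the formula concrete, here is a small sketch that evaluates it with awk. rrf_score is a hypothetical helper for illustration only (the fusion itself happens server-side), and it assumes the document appears in both ranked lists with 1-based ranks.

```shell
# Reciprocal Rank Fusion score for one document.
# Usage: rrf_score K DENSE_RANK BM25_RANK   (ranks are 1-based)
rrf_score() {
  LC_ALL=C awk -v k="$1" -v d="$2" -v b="$3" \
    'BEGIN { printf "%.6f\n", 1 / (k + d) + 1 / (k + b) }'
}

rrf_score 60 1 1     # top of both lists      → 0.032787
rrf_score 60 1 50    # #1 dense, #50 BM25     → 0.025484
rrf_score 60 25 25   # mid-ranked in both     → 0.023529
```

With k=60, the #1/#50 document still outscores one ranked #25 by both signals, which is the forgiving behaviour described above.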
SEARCH_MODE_AUTO — the safe default
AUTO picks HYBRID when the collection has sparse (BM25 or EXTERNAL), VECTOR otherwise. Lets you write the same query against collections with different sparse configs:
{
"bucket": "kb-prod",
"text": "...",
"searchMode": "SEARCH_MODE_AUTO"
}
For most production code, AUTO is the right setting.
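The AUTO rule can be written down as a two-branch lookup. The sketch below mirrors it client-side, which can be handy for logging which mode a request is expected to run in. resolve_auto_mode is a hypothetical name, and SPARSE_MODE_EXTERNAL is an assumed enum spelling (this page only names SPARSE_MODE_BM25 and SPARSE_MODE_NONE explicitly).

```shell
# Predict what SEARCH_MODE_AUTO resolves to for a given sparseMode value.
# SPARSE_MODE_EXTERNAL is an assumed enum name; adjust it to the real one.
resolve_auto_mode() {
  case "$1" in
    SPARSE_MODE_BM25|SPARSE_MODE_EXTERNAL) echo "SEARCH_MODE_HYBRID" ;;
    *)                                     echo "SEARCH_MODE_VECTOR" ;;
  esac
}

resolve_auto_mode SPARSE_MODE_BM25   # → SEARCH_MODE_HYBRID
resolve_auto_mode SPARSE_MODE_NONE   # → SEARCH_MODE_VECTOR
```

The input can come from the prerequisite check, for example by appending | jq -r .sparseMode to the collection get command.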
3. Add rerank — top-K precision
rerank: true runs a Jina cross-encoder over the top-K candidates from tier 1+2. The cross-encoder scores each (query, candidate) pair directly — slower than bi-encoder retrieval but materially more accurate at the top.
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "docs",
"text": "what is multi-head attention",
"topK": 5,
"searchMode": "SEARCH_MODE_AUTO",
"rerank": true,
"includeContent": true
}' | jq '{
tookMs,
results: [.results[] | {score, object: .object.key}]
}'
Expect ~100–250 ms wall time for a topK: 5 request with rerank, vs ~40–80 ms for hybrid-only.
Strategy — over-recall for rerank
Rerank works best when you give it more candidates to choose from. A common pattern:
# Pull 50 candidates with hybrid, rerank to top-5
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "docs",
"text": "...",
"topK": 5,
"searchMode": "SEARCH_MODE_AUTO",
"rerank": true
}'
K3 internally fetches a larger candidate set, runs the reranker, and returns the requested topK. You don’t need to manually over-fetch.
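The last step of that flow (rescore a wide pool, keep the best K) is easy to picture locally. The sketch below is only an illustration of the idea, not K3's implementation: top_k_by_score is a hypothetical helper, and the scores in the demo are made up.

```shell
# Keep the K best rows from "score id" lines on stdin, highest score first.
top_k_by_score() {
  LC_ALL=C sort -k1,1nr | head -n "$1"
}

# Four rescored candidates (made-up scores), cut down to the top 2.
printf '%s\n' \
  '0.41 blog/multi-head-101.md' \
  '0.93 papers/attention.pdf' \
  '0.12 papers/bert.pdf' \
  '0.77 papers/transformer.pdf' | top_k_by_score 2
```

The wider the pool fed into the rescoring step, the more chances a strong candidate that retrieval ranked low has to reach the final top-K.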
4. rerank_text for binary queries
When the query is an image / audio / video (s3_key shape), the reranker has no text to score against. Supply rerank_text to give it a textual handle:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"collectionName": "product-images",
"s3Key": "queries/example-bag.jpg",
"contentType": "image/jpeg",
"topK": 20,
"rerank": true,
"rerankText": "brown leather handbag with gold hardware"
}'
The reranker scores each image-search result against the supplied text, which is useful for “images that look like X AND match this description Y”. See Multimodal Search for the full multimodal flow.
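A request that sets rerank: true with neither query text nor rerank_text has nothing for the reranker to score against, so it can be caught before it is sent. The sketch below is a client-side guard under that assumption; check_rerank_inputs is a hypothetical helper and its message is not a K3 error string.

```shell
# Client-side pre-check before sending a rerank request.
# $1 = query text ("" for binary s3Key queries)
# $2 = rerank flag (true|false)
# $3 = rerankText ("" if not supplied)
check_rerank_inputs() {
  if [ "$2" = "true" ] && [ -z "$1" ] && [ -z "$3" ]; then
    echo "rerank needs text: set rerankText for s3Key queries" >&2
    return 1
  fi
  return 0
}

check_rerank_inputs "" true "" || echo "blocked locally"
check_rerank_inputs "" true "brown leather handbag with gold hardware" && echo "ok to send"
```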
5. Worked benchmark — what each tier costs
Run all three tiers against the same query and compare:
QUERY="explain the attention mechanism"
# Tier 1 — dense only
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
-d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_VECTOR\"}" \
> /dev/null
# Tier 1+2 — hybrid
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
-d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_HYBRID\"}" \
> /dev/null
# Tier 1+2+3 — hybrid + rerank
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
-H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
-d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_HYBRID\",\"rerank\":true}" \
> /dev/null
Typical numbers on a ~10K-chunk collection (your mileage varies with corpus + model + hardware):
| Tier | Wall time | Top-1 precision |
|---|---|---|
| Dense | ~40 ms | baseline |
| Hybrid | ~70 ms | +5–10% over dense |
| Hybrid + rerank | ~200 ms | +10–20% over hybrid |
Rerank spends its extra latency on top-K precision; for requests below roughly topK: 100, the rerank step accounts for most of the request time.
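A single time curl run is noisy, so it can help to repeat each tier a few times and compare medians. The sketch below adds a small median helper (a hypothetical name, not part of any dodil tooling). The timings in the demo are made-up values, not measurements, and $BODY in the commented loop stands for whichever request body is being timed.

```shell
# Median of newline-separated numbers on stdin.
median() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Demo with made-up timings in seconds:
printf '%s\n' 0.071 0.043 0.212 0.040 0.069 | median   # → 0.069

# Real usage (needs network and $DODIL_TOKEN, so left commented out).
# curl -w '%{time_total}\n' prints the total request time in seconds.
# for i in 1 2 3 4 5; do
#   curl -sS -o /dev/null -w '%{time_total}\n' -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
#     -H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
#     -d "$BODY"
# done | median
```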
6. When to skip each tier
| Skip BM25 if | Skip rerank if |
|---|---|
| Collection is SPARSE_MODE_NONE (hybrid falls back anyway with a warning) | Bulk-recall query (top_k > 100), where rerank latency dominates |
| You’re A/B-testing pure-semantic search vs hybrid | Latency-sensitive paths (search-as-you-type, interactive UI) |
| Token-overlap noise is hurting recall (queries with lots of common words) | Your embedder + collection corpus are well-aligned (e.g. domain-tuned) |
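The two mechanical rows of the rerank column (large top_k, latency-sensitive path) can be folded into a request builder as a simple guard. This is a sketch only: should_rerank is a hypothetical helper, the 100 cutoff is the rule of thumb from the table, and the embedder-alignment row is a judgment call that code like this cannot make.

```shell
# Decide whether to set "rerank": true.
# $1 = topK (integer), $2 = "true" if the path is latency-sensitive
should_rerank() {
  if [ "$1" -gt 100 ] || [ "$2" = "true" ]; then
    echo "false"
  else
    echo "true"
  fi
}

should_rerank 5 false     # strict top-5 RAG query      → true
should_rerank 500 false   # bulk-recall analytic query  → false
should_rerank 5 true      # search-as-you-type          → false
```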
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| searchMode: HYBRID + warnings: ["BM25 not available..."] | Collection’s sparse_mode = NONE | Recreate the collection from a BM25-enabled template OR switch to SEARCH_MODE_VECTOR to skip the warning |
| rerank: true adds latency but precision didn’t visibly improve | Top-K is already good (small corpus / domain-tuned embedder); rerank is overkill | Skip rerank for this collection; cost is wasted |
| Different score magnitudes between dense / hybrid / rerank tiers | Each tier produces scores on its own scale; rerank produces a cross-encoder score | Don’t compare scores across tiers; use them as within-tier rankings only |
| rerank: true on an image query returns error: "no text available for rerank" | Binary query (s3_key) without rerank_text | Supply rerank_text (see step 4) |
| Hybrid results dominated by BM25 keyword matches | RRF k=60 is fixed; no caller-tuneable fusion weight today | Drop to SEARCH_MODE_VECTOR for queries where you want pure-semantic ranking |
See also
- Search: API Reference → Rerank (rerank semantics + rerank_text)
- Search: API Reference → Three search modes (VECTOR / HYBRID / AUTO matrix)
- Pipeline Collection (how to get a BM25-enabled collection in the first place)
- Multimodal Search (uses rerank_text for binary queries)
- External Collection → SPLADE (using EXTERNAL sparse, an alternative to BM25)