Hybrid + Rerank

Goal: get the highest-precision top-K results from a vector collection. The trick is a 3-tier stack:


   query
     │
     ▼
   ┌───────────────────────────────────────┐
   │ Tier 1 — Dense vector retrieval       │   recall via semantic similarity
   │ (Milvus HNSW / IVF over embeddings)   │
   └───────────────────────────────────────┘
     │
     ▼
   ┌───────────────────────────────────────┐
   │ Tier 2 — Sparse BM25 retrieval        │   recall via token overlap
   │ (Milvus BM25 function)                │
   └───────────────────────────────────────┘
     │  ──── fused via RRF (k=60) ────►   merged candidates
     ▼
   ┌───────────────────────────────────────┐
   │ Tier 3 — Jina cross-encoder rerank    │   precision via deep pairwise scoring
   │ (over the top-K candidates)           │
   └───────────────────────────────────────┘
     │
     ▼
   final top-K

Each tier improves a different metric:

Tier	Latency cost	What it gives you
Dense only (`SEARCH_MODE_VECTOR`)	~10–50 ms	Strong semantic recall; misses exact-keyword matches
Dense + BM25 (`SEARCH_MODE_HYBRID`)	~15–80 ms	Catches both semantic + exact-token matches; balanced
+ Jina rerank	+ 50–200 ms	High top-K precision — best for RAG / agent grounding

When to use each tier

Use case	Recommended stack
RAG with strict top-3 quality	Hybrid + rerank
RAG with broad top-20 candidates	Hybrid only
Bulk-recall analytic queries (top-500+)	Dense only
Pure-semantic comparison studies	Dense only (avoid BM25 contamination)
Latency-sensitive search-as-you-type	Dense only (no rerank)
Image / audio / video retrieval	Dense + `rerank_text` (see Multimodal Search)

Prerequisites

A pipeline-mode collection with sparse_mode = BM25 (default for text_embedding_index):
```
dodil k3 vector collection get docs -b kb-prod -o json | jq '{sparseMode, dimensions, embedModel}'
```
Expect sparseMode: "SPARSE_MODE_BM25". If it’s SPARSE_MODE_NONE, hybrid mode falls back to dense (you’ll see a warning); recreate the collection from a BM25-enabled template.
Some content already ingested. See Pipeline Collection for the setup.

1. Dense-only baseline

The default: searchMode omitted or set to SEARCH_MODE_VECTOR.


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "what is multi-head attention",
    "topK": 5,
    "searchMode": "SEARCH_MODE_VECTOR",
    "includeContent": true
  }' | jq '{
       searchModeUsed,
       tookMs,
       results: [.results[] | {score, object: .object.key, source}]
     }'

Sample response:


{
  "searchModeUsed": "vector",
  "tookMs": "42",
  "results": [
    {"score": 0.87, "object": "papers/attention.pdf",     "source": "SEARCH_SOURCE_VECTOR"},
    {"score": 0.81, "object": "papers/transformer.pdf",   "source": "SEARCH_SOURCE_VECTOR"}
  ]
}

Dense retrieval gives strong semantic recall, but it can miss documents that contain the exact phrase “multi-head” when the embedding does not capture that token.

2. Add BM25 — hybrid mode

searchMode: "SEARCH_MODE_HYBRID" runs both signals and fuses via RRF (k=60):


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "what is multi-head attention",
    "topK": 5,
    "searchMode": "SEARCH_MODE_HYBRID",
    "includeContent": true
  }' | jq '.results[] | {score, object: .object.key, source}'

Notice the source field on results:


[
  {"score": 0.91, "object": "papers/attention.pdf",   "source": "SEARCH_SOURCE_HYBRID"},
  {"score": 0.84, "object": "papers/bert.pdf",        "source": "SEARCH_SOURCE_HYBRID"},
  {"score": 0.77, "object": "blog/multi-head-101.md", "source": "SEARCH_SOURCE_HYBRID"}
]

SEARCH_SOURCE_HYBRID confirms both signals contributed. Pure-BM25 matches (no dense hit) come back as SEARCH_SOURCE_FULLTEXT; pure-dense as SEARCH_SOURCE_VECTOR.
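
To see how much each signal contributes across a result set, you can tally the source field. The snippet below is a minimal Python sketch that uses the sample response above as inline data instead of a live call; `count_sources` is a hypothetical helper name, and the fourth row is invented purely to show a second source value.

```python
from collections import Counter

# The first three rows mirror the sample response above. The last row is a
# hypothetical pure-BM25 hit, added only to show a second source value.
results = [
    {"score": 0.91, "object": "papers/attention.pdf", "source": "SEARCH_SOURCE_HYBRID"},
    {"score": 0.84, "object": "papers/bert.pdf", "source": "SEARCH_SOURCE_HYBRID"},
    {"score": 0.77, "object": "blog/multi-head-101.md", "source": "SEARCH_SOURCE_HYBRID"},
    {"score": 0.52, "object": "notes/hypothetical.md", "source": "SEARCH_SOURCE_FULLTEXT"},
]


def count_sources(results):
    """Count how many results came from each retrieval source."""
    return Counter(r["source"] for r in results)


for source, n in count_sources(results).most_common():
    print(f"{source}: {n}")
```

A result set that is almost entirely SEARCH_SOURCE_VECTOR on a hybrid request is a hint that BM25 is contributing little for that query.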

How RRF k=60 works

Reciprocal Rank Fusion combines two ranked lists into one:


rrf_score(doc) = 1/(k + dense_rank(doc)) + 1/(k + bm25_rank(doc))

With k=60 (the Milvus default), the fusion is forgiving: a doc ranked #1 by one signal and #50 by the other still scores well (1/61 + 1/110 ≈ 0.025, versus 2/61 ≈ 0.033 for a doc ranked #1 by both). RRF rewards consensus between the dense and sparse signals, and it needs only the single constant k, whereas learned-weight fusion needs tuned weights.
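
To make the formula concrete, here is a minimal Python sketch of RRF. It illustrates the formula above and is not K3's or Milvus's actual implementation: the function name `rrf_fuse` and the sample document keys are made up, ranks are assumed to be 1-based, and a document missing from one list is assumed to contribute nothing for that list.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document keys with Reciprocal Rank Fusion.

    Assumes 1-based ranks; a document absent from a list adds 0 for that list.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by key for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up rankings for illustration.
dense = ["attention.pdf", "transformer.pdf", "bert.pdf"]
bm25 = ["multi-head-101.md", "attention.pdf", "bert.pdf"]

fused = rrf_fuse([dense, bm25])
for doc, score in fused:
    print(f"{doc}: {score:.5f}")
```

In this toy input, attention.pdf and bert.pdf appear in both lists and end up ahead of multi-head-101.md and transformer.pdf, each of which ranks highly in only one list.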

`SEARCH_MODE_AUTO` — the safe default

AUTO picks HYBRID when the collection has a sparse index (BM25 or EXTERNAL) and VECTOR otherwise. This lets you send the same query to collections with different sparse configurations:


{
  "bucket": "kb-prod",
  "text": "...",
  "searchMode": "SEARCH_MODE_AUTO"
}

For most production code, AUTO is the right setting.
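
The mode selection described on this page can be sketched as a small function. This is an illustration of the documented behavior, not K3's server code: `resolve_search_mode` is a hypothetical helper, `SPARSE_MODE_EXTERNAL` is an assumed spelling of the EXTERNAL sparse mode, and the warning text is a placeholder.

```python
def resolve_search_mode(requested, sparse_mode):
    """Return (effective_mode, warnings) for a request, per the rules on this page."""
    # Assumption: these are the two sparse modes that enable hybrid search.
    has_sparse = sparse_mode in ("SPARSE_MODE_BM25", "SPARSE_MODE_EXTERNAL")
    if requested == "SEARCH_MODE_AUTO":
        mode = "SEARCH_MODE_HYBRID" if has_sparse else "SEARCH_MODE_VECTOR"
        return mode, []
    if requested == "SEARCH_MODE_HYBRID" and not has_sparse:
        # Hybrid on a collection without sparse data falls back to dense.
        return "SEARCH_MODE_VECTOR", ["BM25 not available (placeholder warning text)"]
    return requested, []


print(resolve_search_mode("SEARCH_MODE_AUTO", "SPARSE_MODE_BM25"))
print(resolve_search_mode("SEARCH_MODE_AUTO", "SPARSE_MODE_NONE"))
print(resolve_search_mode("SEARCH_MODE_HYBRID", "SPARSE_MODE_NONE"))
```

The practical difference is in the last two calls: AUTO on a collection without sparse data resolves quietly to dense, while an explicit HYBRID request falls back with a warning.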

3. Add rerank — top-K precision

rerank: true runs a Jina cross-encoder over the candidates from tiers 1+2. The cross-encoder scores each (query, candidate) pair directly, which is slower than bi-encoder retrieval but materially more accurate at the top of the list.


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "what is multi-head attention",
    "topK": 5,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true,
    "includeContent": true
  }' | jq '{
       tookMs,
       results: [.results[] | {score, object: .object.key}]
     }'

Expect ~100–250 ms wall time for a topK: 5 with rerank, vs ~40–80 ms for hybrid-only.

Strategy — over-recall for rerank

Rerank works best when it has more candidates to choose from than the topK you ultimately return. You still request only the final topK:


# Request top-5; K3 pulls a larger candidate set (e.g. 50) and reranks it down to 5
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "docs",
    "text": "...",
    "topK": 5,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true
  }'

K3 internally fetches a larger candidate set, runs the reranker, and returns the requested topK. You don’t need to manually over-fetch.
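
Conceptually, that over-fetch-then-rerank flow looks like the sketch below. It is not K3's implementation: the candidate multiplier of 10, the `retrieve` and `score_pair` callables, and the toy word-overlap scorer standing in for the cross-encoder are all assumptions made for illustration.

```python
def search_with_rerank(query, retrieve, score_pair, top_k=5, overfetch=10):
    """Fetch top_k * overfetch candidates, rescore each (query, text) pair, keep top_k."""
    candidates = retrieve(query, top_k * overfetch)
    rescored = [(score_pair(query, text), text) for text in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in rescored[:top_k]]


# Toy stand-ins so the sketch runs without a vector store or a model.
corpus = [
    "attention mechanisms in vision",
    "multi-head attention splits the query into heads",
    "convolutional networks for images",
    "what multi-head attention is and why it helps",
]


def retrieve(query, limit):
    # Pretend first-stage retrieval: returns documents in index order.
    return corpus[:limit]


def score_pair(query, text):
    # Toy scorer: fraction of query words that also appear in the text.
    q, t = set(query.split()), set(text.split())
    return len(q & t) / len(q)


print(search_with_rerank("what is multi-head attention", retrieve, score_pair, top_k=2))
```

The document that first-stage retrieval placed last is promoted to the top once every candidate is scored against the query, which is the reason for fetching more candidates than the caller asked for.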

4. `rerank_text` for binary queries

When the query is an image, audio, or video object (the s3_key request shape), the reranker has no query text to score against. Supply rerank_text (rerankText in the JSON body) to give it a textual handle:


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "collectionName": "product-images",
    "s3Key": "queries/example-bag.jpg",
    "contentType": "image/jpeg",
    "topK": 20,
    "rerank": true,
    "rerankText": "brown leather handbag with gold hardware"
  }'

The reranker scores each image-search result against the supplied text — useful for “images that look like X AND match this description Y”. See Multimodal Search for the full multimodal flow.
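
A client-side guard can catch the missing-text case before the request is sent. The Python sketch below builds the JSON body used in the curl example. `build_search_body` is a hypothetical helper, and the rules it enforces (exactly one of text or s3Key, and rerankText required for binary queries with rerank) are inferred from this page rather than taken from an API schema.

```python
import json


def build_search_body(bucket, collection, *, text=None, s3_key=None,
                      content_type=None, top_k=5, rerank=False, rerank_text=None):
    """Build a search request body, rejecting binary rerank queries without text."""
    if (text is None) == (s3_key is None):
        raise ValueError("provide exactly one of text or s3_key")
    if s3_key is not None and rerank and not rerank_text:
        raise ValueError("binary queries need rerank_text when rerank is enabled")

    body = {"bucket": bucket, "collectionName": collection,
            "topK": top_k, "rerank": rerank}
    if text is not None:
        body["text"] = text
    else:
        body["s3Key"] = s3_key
        if content_type:
            body["contentType"] = content_type
    if rerank_text:
        body["rerankText"] = rerank_text
    return body


body = build_search_body(
    "kb-prod", "product-images",
    s3_key="queries/example-bag.jpg", content_type="image/jpeg",
    top_k=20, rerank=True,
    rerank_text="brown leather handbag with gold hardware",
)
print(json.dumps(body, indent=2))
```

Failing fast in the client avoids a round trip that would only come back with the "no text available for rerank" error.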

5. Worked benchmark — what each tier costs

Run all three tiers against the same query and compare:


QUERY="explain the attention mechanism"
 
# Tier 1 — dense only
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
  -d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_VECTOR\"}" \
  > /dev/null
 
# Tier 1+2 — hybrid
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
  -d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_HYBRID\"}" \
  > /dev/null
 
# Tier 1+2+3 — hybrid + rerank
time curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" -H "Content-Type: application/json" \
  -d "{\"bucket\":\"kb-prod\",\"collectionName\":\"docs\",\"text\":\"$QUERY\",\"topK\":10,\"searchMode\":\"SEARCH_MODE_HYBRID\",\"rerank\":true}" \
  > /dev/null

Typical numbers on a ~10K-chunk collection (your mileage varies with corpus + model + hardware):

Tier	Wall time	Top-1 precision
Dense	~40 ms	baseline
Hybrid	~70 ms	+5–10% over dense
Hybrid + rerank	~200 ms	+10–20% over hybrid

Rerank buys top-K precision with latency. Its cost grows with the number of candidates it scores, so above ~topK: 100 the rerank time dominates the request budget.
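
A single `time curl` run is noisy, so it can help to repeat each request and compare medians. The sketch below is a generic timing helper in Python. `median_latency_ms` is a hypothetical name, and `fake_search` is a placeholder for whatever function issues your real HTTP request.

```python
import statistics
import time


def median_latency_ms(call, runs=5):
    """Run `call` several times and return the median wall time in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_search():
    # Placeholder for a real request; sleeps for roughly 10 ms.
    time.sleep(0.01)


print(f"median: {median_latency_ms(fake_search):.1f} ms")
```

Using the median rather than the mean keeps one slow outlier (a cold connection, for example) from skewing the comparison between tiers.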

6. When to skip each tier

Skip BM25 if	Skip rerank if
Collection is `SPARSE_MODE_NONE` — hybrid falls back anyway with a warning	Bulk-recall query (`top_k > 100`) — rerank latency dominates
You’re A/B-testing pure-semantic search vs hybrid	Latency-sensitive paths (search-as-you-type, interactive UI)
Token-overlap noise is hurting recall (queries with lots of common words)	Your embedder + collection corpus are well-aligned (e.g. domain-tuned)

Common gotchas

Symptom	Cause	Fix
`searchMode: HYBRID` + `warnings: ["BM25 not available..."]`	Collection’s `sparse_mode = NONE`	Recreate the collection from a BM25-enabled template OR switch to `SEARCH_MODE_VECTOR` to skip the warning
`rerank: true` adds latency but precision didn’t visibly improve	Top-K is already good (small corpus / domain-tuned embedder); rerank is overkill	Skip rerank for this collection; cost is wasted
Different `score` magnitudes between dense / hybrid / rerank tiers	Each tier produces scores on its own scale; rerank produces a cross-encoder score	Don’t compare scores across tiers — use them as within-tier rankings only
`rerank: true` on an image query returns `error: "no text available for rerank"`	Binary query (`s3_key`) without `rerank_text`	Supply `rerank_text` — see step 4
Hybrid results dominated by BM25 keyword matches	RRF k=60 is fixed; no caller-tuneable fusion weight today	Drop to `SEARCH_MODE_VECTOR` for queries where you want pure-semantic ranking
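
Because scores are not comparable across tiers, one way to see what a tier changed is to compare rank positions instead. A minimal Python sketch, with a hypothetical helper name and made-up result keys:

```python
def rank_shift(before, after):
    """Map each key in `after` to (old_rank, new_rank).

    Ranks are 1-based; old_rank is None when the key was not in `before`.
    """
    old = {key: i for i, key in enumerate(before, start=1)}
    return {key: (old.get(key), i) for i, key in enumerate(after, start=1)}


# Made-up result keys for illustration.
hybrid = ["a.pdf", "b.pdf", "c.md"]
reranked = ["c.md", "a.pdf", "d.md"]

print(rank_shift(hybrid, reranked))
# {'c.md': (3, 1), 'a.pdf': (1, 2), 'd.md': (None, 3)}
```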