Sparse Inverted Index (SPARSE_INVERTED_INDEX)
SPARSE_INVERTED_INDEX is the default sparse-vector index type used by Dodil VBase to store and search sparse vectors efficiently. It uses an inverted index under the hood (think “term → list of matching documents”), which makes it a natural fit for vectors where most dimensions are zero.
This index is commonly used for:
- Sparse embeddings (e.g., bag-of-words style, SPLADE-like outputs)
- Lexical / keyword scoring (BM25-style search)
- Hybrid workflows where you combine sparse retrieval with dense re-ranking
Assumption: you already have a connected vbase client. If not, see your VBase connection guide.
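To make the "term → list of matching documents" idea concrete, here is a minimal pure-Python sketch of an inverted index over sparse vectors, scored by inner product. It is a toy illustration only: `build_inverted_index` and `search_ip` are hypothetical names, not part of the VBase client, and the real index is certainly more sophisticated.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each dimension to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in docs.items():
        for dim, weight in vector.items():
            if weight != 0.0:  # zero entries are never stored
                index[dim].append((doc_id, weight))
    return index

def search_ip(index, query, limit=10):
    """Score documents by inner product, visiting only the query's posting lists."""
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

# Toy collection: {doc_id: {dimension: weight}}
docs = {
    "a": {1: 0.5, 50: 0.1},
    "b": {50: 0.9, 1000: 0.3},
    "c": {7: 1.0},
}
index = build_inverted_index(docs)
hits = search_ip(index, {1: 0.2, 50: 0.4, 1000: 0.7})
# "b" ranks above "a"; "c" shares no dimension with the query and is never visited.
print(hits)
```

The point of the structure is that work scales with the number of non-zero query dimensions and the length of their posting lists, not with the full dimensionality of the vector space.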
When should I use it?
Use SPARSE_INVERTED_INDEX when:
- Your vector field is a sparse vector (most entries are 0)
- Your query contains many non-zero dimensions (many “terms”)
- You want predictable performance as the collection grows
If your data is dense (most dimensions non-zero), this index is the wrong tool—use a dense vector index instead.
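If you are unsure which side of that line your data falls on, a quick density check can help. The sketch below is plain Python with hypothetical helper names (`density`, `to_sparse`); what counts as "sparse enough" depends on your data, so no threshold is suggested here.

```python
def density(dense_vector):
    """Fraction of entries that are non-zero (0.0 for an empty vector)."""
    if not dense_vector:
        return 0.0
    nonzero = sum(1 for value in dense_vector if value != 0)
    return nonzero / len(dense_vector)

def to_sparse(dense_vector):
    """Keep only the non-zero entries as a {dimension: value} mapping."""
    return {dim: value for dim, value in enumerate(dense_vector) if value != 0}

dense = [0.0, 0.2, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0]
print(density(dense))    # 0.25
print(to_sparse(dense))  # {1: 0.2, 4: 0.4}
```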
Supported similarity metrics
For sparse vectors, VBase supports the same two primary metrics exposed by Milvus:
- IP (Inner Product): dot-product similarity. Works well for most sparse embedding models.
- BM25: text‑retrieval style scoring. Useful when your sparse vector represents tokens/terms and weights.
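For readers unfamiliar with BM25, the sketch below computes the textbook Okapi BM25 weight of a single term in a single document, using the common "plus one" IDF variant. Whether VBase uses exactly this variant or these default constants (`k1=1.2`, `b=0.75`) is an assumption; the function name is hypothetical and the code only illustrates the shape of the score.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Rare terms get a larger inverse document frequency.
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Term frequency saturates, and longer documents are penalised via b.
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)

rare = bm25_term_weight(tf=2, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=5)
common = bm25_term_weight(tf=2, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=500)
print(rare > common)  # True: the rarer term contributes more
```

A document's BM25 score for a query is the sum of these weights over the query's terms, which is why it maps naturally onto a sparse vector of per-term weights.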
Build the index
You build a sparse inverted index on a sparse vector field.
The most important configuration is the algorithm used for query processing:
- DAAT_MAXSCORE (default)
- DAAT_WAND
- TAAT_NAIVE
Example (Python):
# Assumes: vbase is already connected
vbase.create_index(
    collection_name="your_collection",
    field_name="your_sparse_vector_field",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",  # or "BM25"
    params={
        "inverted_index_algo": "DAAT_MAXSCORE",
    },
)

Index building parameters
| Parameter | What it controls | Allowed values | Practical guidance |
|---|---|---|---|
| inverted_index_algo | How queries are executed against the inverted index | DAAT_MAXSCORE (default), DAAT_WAND, TAAT_NAIVE | Start with DAAT_MAXSCORE. Prefer DAAT_WAND for small k / short queries. Use TAAT_NAIVE when you want behavior that adapts more directly to collection-level changes over time. |
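The DAAT_ and TAAT_ prefixes refer to two standard traversal orders from information retrieval: document-at-a-time and term-at-a-time. As a rough illustration, the sketch below shows a plain document-at-a-time top-k loop in Python. It deliberately omits the upper-bound skipping that MaxScore and WAND add on top, and the function name and data layout are assumptions made for the example, not VBase internals.

```python
import heapq

def daat_top_k(posting_lists, query, k):
    """Document-at-a-time inner-product search without pruning.

    posting_lists -- {dim: [(doc_id, weight), ...]}, each list sorted by doc_id
    query         -- {dim: weight}
    Returns up to k (score, doc_id) pairs, best first.
    """
    if k <= 0:
        return []
    # One stream of (doc_id, partial_score) per query dimension.
    streams = [
        [(doc_id, q_weight * d_weight) for doc_id, d_weight in posting_lists.get(dim, [])]
        for dim, q_weight in query.items()
    ]
    top = []  # min-heap of (score, doc_id), never larger than k

    def offer(doc_id, score):
        if len(top) < k:
            heapq.heappush(top, (score, doc_id))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, doc_id))

    current_doc, current_score = None, 0.0
    # Merging by doc_id means each document is fully scored before the next one starts.
    for doc_id, partial in heapq.merge(*streams, key=lambda posting: posting[0]):
        if doc_id != current_doc:
            if current_doc is not None:
                offer(current_doc, current_score)
            current_doc, current_score = doc_id, 0.0
        current_score += partial
    if current_doc is not None:
        offer(current_doc, current_score)
    return sorted(top, reverse=True)

posting_lists = {1: [(0, 0.5)], 50: [(0, 0.1), (1, 0.9)], 1000: [(1, 0.3)]}
print(daat_top_k(posting_lists, {1: 0.2, 50: 0.4, 1000: 0.7}, k=1))
```

Because each document's score is complete before the next document is touched, a DAAT loop can compare it against the current k-th best score right away. That is the property MaxScore and WAND exploit to skip documents that cannot make the top k, whereas a term-at-a-time loop accumulates partial scores for every matching document first.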
Search on the index
Once the index is built (and you’ve inserted data), you can run a sparse vector search.
A sparse vector is typically represented as a mapping of dimension → value (only for non-zero entries):
query_vector = [
    {1: 0.2, 50: 0.4, 1000: 0.7}
]

results = vbase.search(
    collection_name="your_collection",
    anns_field="your_sparse_vector_field",
    data=query_vector,
    limit=10,
)
print(results)

Search parameters
| Parameter | What it controls | Allowed values | Practical guidance |
|---|---|---|---|
| drop_ratio_search | Drops the smallest‑magnitude values in the query vector before searching (noise filtering) | Float from 0.0 to 1.0 | If your sparse queries are noisy, try 0.1–0.3. Higher values may improve precision and speed, but can reduce recall. |
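To build intuition for what dropping a ratio of the query means, the sketch below removes the smallest-magnitude entries from a query vector on the client side. The exact rule the engine applies (for example, how it rounds the number of dropped entries) is an assumption here, and `drop_smallest` is a hypothetical helper, so treat this as an approximation for experimentation rather than a description of the server's behavior.

```python
import math

def drop_smallest(query, drop_ratio):
    """Return a copy of `query` without its smallest-magnitude entries.

    query      -- {dimension: weight}
    drop_ratio -- fraction of entries to drop, between 0.0 and 1.0
    """
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be between 0.0 and 1.0")
    # Sort ascending by magnitude so the weakest entries come first.
    ranked = sorted(query.items(), key=lambda item: abs(item[1]))
    num_to_drop = math.floor(len(ranked) * drop_ratio)  # rounding rule is an assumption
    return dict(ranked[num_to_drop:])

query = {1: 0.2, 50: 0.4, 1000: 0.7, 7: 0.01}
print(drop_smallest(query, 0.5))  # keeps dimensions 50 and 1000
```

Running your own queries through a helper like this shows which terms would be discarded at a given ratio before you commit to a value.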
Example with search params:
results = vbase.search(
    collection_name="your_collection",
    anns_field="your_sparse_vector_field",
    data=query_vector,
    limit=10,
    search_params={
        "params": {
            "drop_ratio_search": 0.2,
        }
    },
)

Tuning cheat sheet
- Many terms, larger top‑k → DAAT_MAXSCORE
- Short queries, small top‑k → DAAT_WAND
- You want the simplest behavior (often slower) → TAAT_NAIVE
- Noisy sparse weights → raise drop_ratio_search gradually (e.g., 0.05 → 0.1 → 0.2)
Notes and gotchas
- This index applies only to sparse vector fields.
- For BM25-style use cases, make sure your pipeline produces a meaningful sparse representation (token weights), not a dense embedding.
- Index build time and query speed both depend heavily on how many non-zero entries your vectors have.
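Since non-zero counts drive both build time and query speed, it can be worth profiling them before indexing. A minimal sketch, assuming your vectors are available as `{dimension: value}` mappings; `nnz_profile` is a hypothetical helper, not a VBase function.

```python
from statistics import mean

def nnz_profile(vectors):
    """Summarise the number of non-zero entries per sparse vector."""
    counts = [sum(1 for value in vec.values() if value != 0) for vec in vectors]
    if not counts:
        return {"count": 0, "min": 0, "mean": 0.0, "max": 0}
    return {
        "count": len(counts),
        "min": min(counts),
        "mean": mean(counts),
        "max": max(counts),
    }

sample = [{1: 0.2}, {1: 0.5, 2: 0.1, 3: 0.0}, {4: 0.3, 9: 0.8, 12: 0.4}]
print(nnz_profile(sample))
```

A long tail in `max` relative to `mean` suggests a few very heavy vectors, which are the first place to look if builds or queries are slower than expected.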