Sparse Inverted Index (SPARSE_INVERTED_INDEX)

SPARSE_INVERTED_INDEX is the default sparse-vector index type used by Dodil VBase to store and search sparse vectors efficiently. It uses an inverted index under the hood (think “term → list of matching documents”), which makes it a natural fit for vectors where most dimensions are zero.
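
To make the "term → list of matching documents" idea concrete, here is a minimal pure-Python sketch of an inverted index over sparse vectors. It only illustrates the data structure and is not how VBase stores data internally. The names `build_inverted_index` and `docs` are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(vectors):
    """Map each dimension to a list of (doc_id, value) postings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for dim, value in vec.items():
            if value != 0:  # zero entries are never stored
                index[dim].append((doc_id, value))
    return dict(index)

# Three toy documents, each a {dimension: value} mapping
docs = [
    {1: 0.5, 50: 0.3},
    {50: 0.8, 1000: 0.1},
    {1: 0.2, 1000: 0.9},
]
inverted = build_inverted_index(docs)
print(inverted[50])  # postings for dimension 50
```

A query only needs to read the posting lists for its own non-zero dimensions, which is why this layout suits vectors that are mostly zeros.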

This index is commonly used for:

Sparse embeddings (e.g., bag-of-words style, SPLADE-like outputs)
Lexical / keyword scoring (BM25-style search)
Hybrid workflows where you combine sparse retrieval with dense re-ranking

Assumption: you already have a connected vbase client. If not, see the VBase connection guide first.

When should I use it?

Use SPARSE_INVERTED_INDEX when:

Your vector field is a sparse vector (most entries are 0)
Your query contains many non-zero dimensions (many “terms”)
You want predictable performance as the collection grows

If your data is dense (most dimensions non-zero), this index is the wrong tool; use a dense vector index instead.
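
If you are unsure which case applies, a quick check is to measure what fraction of entries is non-zero. The sketch below assumes the vector is available as a plain Python list. The helper name `density` is hypothetical, and no particular cut-off value is implied.

```python
def density(dense_vector):
    """Fraction of entries that are non-zero (0.0 = all zeros, 1.0 = fully dense)."""
    if not dense_vector:
        return 0.0
    nonzero = sum(1 for x in dense_vector if x != 0)
    return nonzero / len(dense_vector)

# 100 dimensions, only two of them non-zero
example = [0.0] * 98 + [0.4, 0.7]
print(density(example))
```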

Supported similarity metrics

For sparse vectors, VBase supports the same two primary metrics exposed by Milvus:

IP (Inner Product): dot-product similarity. Works well for most sparse embedding models.
BM25: text‑retrieval style scoring. Useful when your sparse vector represents tokens/terms and weights.
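
As a rough illustration of what the two metrics compute, the sketch below shows an inner product over `{dimension: value}` mappings and the textbook Okapi BM25 weight of a single term. The BM25 formula and the defaults `k1=1.2` and `b=0.75` are the common textbook form and are assumptions here; VBase's internal scoring may differ in detail. Both function names are hypothetical.

```python
import math

def sparse_inner_product(a, b):
    """Dot product of two sparse vectors stored as {dimension: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(value * b[dim] for dim, value in a.items() if dim in b)

def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 contribution of one term to one document's score."""
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm

query = {1: 0.2, 50: 0.4, 1000: 0.7}
doc = {50: 0.5, 1000: 1.0, 7: 3.0}
print(sparse_inner_product(query, doc))  # only dimensions 50 and 1000 overlap
```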

Build the index

You build a sparse inverted index on a sparse vector field.

The most important configuration is the algorithm used for query processing:

DAAT_MAXSCORE (default)
DAAT_WAND
TAAT_NAIVE

Example (Python):


# Assumes: vbase is already connected
 
vbase.create_index(
    collection_name="your_collection",
    field_name="your_sparse_vector_field",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",  # or "BM25"
    params={
        "inverted_index_algo": "DAAT_MAXSCORE",
    },
)

Index building parameters

Parameter	What it controls	Allowed values	Practical guidance
`inverted_index_algo`	How queries are executed against the inverted index	`DAAT_MAXSCORE` (default), `DAAT_WAND`, `TAAT_NAIVE`	Start with `DAAT_MAXSCORE`. Prefer `DAAT_WAND` for small `k` / short queries. Use `TAAT_NAIVE` when the collection changes over time and you want query behavior to adapt to those changes directly; it is the simplest of the three and often slower.
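
To give a feel for what "term at a time" means, here is a minimal pure-Python sketch of TAAT scoring with inner product. It is an illustration only, not VBase's implementation. The DAAT variants instead walk the posting lists document by document and use score upper bounds to skip documents that cannot reach the top `k`. The names `taat_search` and `inverted_index` are hypothetical.

```python
import heapq
from collections import defaultdict

def taat_search(inverted_index, query, k):
    """Term-at-a-time top-k search using inner-product scoring."""
    scores = defaultdict(float)
    for dim, q_value in query.items():           # process one query term at a time
        for doc_id, d_value in inverted_index.get(dim, []):
            scores[doc_id] += q_value * d_value  # accumulate partial scores per doc
    # Return the k (doc_id, score) pairs with the highest scores
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Toy index: dimension -> list of (doc_id, value) postings
inverted_index = {
    1: [(0, 0.5), (2, 0.2)],
    50: [(0, 0.3), (1, 0.8)],
    1000: [(1, 0.1), (2, 0.9)],
}
query = {1: 0.2, 50: 0.4, 1000: 0.7}
top = taat_search(inverted_index, query, k=2)
print(top)
```

Because every posting of every query term is visited, the cost grows with the total length of the matching posting lists, which is why this approach is simple but often slower.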

Search on the index

Once the index is built (and you’ve inserted data), you can run a sparse vector search.

A sparse vector is typically represented as a mapping of dimension → value (only for non-zero entries):


query_vector = [
    {1: 0.2, 50: 0.4, 1000: 0.7}
]
 
results = vbase.search(
    collection_name="your_collection",
    anns_field="your_sparse_vector_field",
    data=query_vector,
    limit=10,
)
 
print(results)
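
If your model returns a full-length list rather than a mapping, a small conversion helper may be useful. This is a minimal sketch assuming the dimension → value format shown above; `to_sparse_dict` is a hypothetical helper, not part of the VBase client.

```python
def to_sparse_dict(dense_vector):
    """Keep only non-zero entries as {dimension_index: value}."""
    return {i: float(x) for i, x in enumerate(dense_vector) if x != 0}

dense = [0.0, 0.2, 0.0, 0.0, 0.4]
query_vector = [to_sparse_dict(dense)]  # one query, wrapped in a list
print(query_vector)
```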

Search parameters

Parameter	What it controls	Allowed values	Practical guidance
`drop_ratio_search`	Fraction of the smallest‑magnitude values in the query vector to drop before searching (noise filtering)	Float from `0.0` to `1.0`	If your sparse queries are noisy, try `0.1`–`0.3`. Higher values may improve precision and speed, but can reduce recall.

Example with search params:


results = vbase.search(
    collection_name="your_collection",
    anns_field="your_sparse_vector_field",
    data=query_vector,
    limit=10,
    search_params={
        "params": {
            "drop_ratio_search": 0.2,
        }
    },
)
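
To build intuition for what `drop_ratio_search` does to a query, the sketch below prunes a query vector on the client side. It assumes the ratio is applied as a fraction of the query's non-zero entries, ranked by absolute value. The exact rule VBase applies internally may differ, and `drop_smallest` is a hypothetical helper.

```python
def drop_smallest(query, drop_ratio):
    """Drop the given fraction of entries with the smallest absolute values."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be between 0.0 and 1.0")
    n_drop = int(len(query) * drop_ratio)  # number of entries to discard
    ranked = sorted(query.items(), key=lambda item: abs(item[1]))
    return dict(ranked[n_drop:])           # keep the larger-magnitude entries

query = {1: 0.2, 50: 0.4, 1000: 0.7, 7: 0.05, 9: 0.01}
pruned = drop_smallest(query, 0.4)  # drops the two weakest of five entries
print(pruned)
```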

Tuning cheat sheet

Many terms, larger top‑k → DAAT_MAXSCORE
Short queries, small top‑k → DAAT_WAND
You want the simplest behavior (often slower) → TAAT_NAIVE
Noisy sparse weights → raise drop_ratio_search gradually (e.g., 0.05 → 0.1 → 0.2)
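
When raising the drop ratio step by step, it helps to check how much the result set changes at each step. The following is a self-contained sketch that uses brute-force scoring on a toy collection and client-side pruning as a stand-in for the real index. All helper names and the data are hypothetical, and real measurements should be taken against your own collection.

```python
def top_k(docs, query, k):
    """Brute-force top-k document ids by inner product."""
    def score(doc):
        return sum(v * doc.get(d, 0.0) for d, v in query.items())
    ranked = sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)
    return ranked[:k]

def drop_smallest(query, ratio):
    """Drop the given fraction of smallest-magnitude query entries."""
    n_drop = int(len(query) * ratio)
    ranked = sorted(query.items(), key=lambda item: abs(item[1]))
    return dict(ranked[n_drop:])

def overlap_at_k(docs, query, k, ratio):
    """Share of the unpruned top-k that survives pruning at this ratio."""
    baseline = set(top_k(docs, query, k))
    pruned = set(top_k(docs, drop_smallest(query, ratio), k))
    return len(baseline & pruned) / k

docs = [
    {1: 0.5, 50: 0.3},
    {50: 0.8, 1000: 0.1},
    {1: 0.2, 1000: 0.9},
    {7: 0.6, 9: 0.4},
]
query = {1: 0.2, 50: 0.4, 1000: 0.7, 7: 0.05, 9: 0.01}

for ratio in (0.0, 0.05, 0.1, 0.2):
    print(ratio, overlap_at_k(docs, query, k=2, ratio=ratio))
```

Stop increasing the ratio once the overlap falls below what your application can tolerate.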

Notes and gotchas

This index applies only to sparse vector fields.
For BM25-style use cases, make sure your pipeline produces a meaningful sparse representation (token weights), not a dense embedding.
Index build time and query speed both depend heavily on how many non-zero entries your vectors have.
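
Since the number of non-zero entries drives cost, it can be worth summarizing it before indexing. The sketch below assumes your vectors are `{dimension: value}` mappings; `nnz_summary` is a hypothetical helper.

```python
from statistics import mean

def nnz_summary(vectors):
    """Mean and maximum count of non-zero entries per sparse vector."""
    counts = [sum(1 for v in vec.values() if v != 0) for vec in vectors]
    return {"mean_nnz": mean(counts), "max_nnz": max(counts)}

vectors = [
    {1: 0.5, 50: 0.3},
    {50: 0.8},
    {1: 0.2, 1000: 0.9, 3: 0.1},
]
summary = nnz_summary(vectors)
print(summary)
```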