Sparse Inverted Index (SPARSE_INVERTED_INDEX)

SPARSE_INVERTED_INDEX is the default sparse-vector index type used by Dodil VBase to store and search sparse vectors efficiently. It uses an inverted index under the hood (think “term → list of matching documents”), which makes it a natural fit for vectors where most dimensions are zero.

This index is commonly used for:

  • Sparse embeddings (e.g., bag-of-words style, SPLADE-like outputs)
  • Lexical / keyword scoring (BM25-style search)
  • Hybrid workflows where you combine sparse retrieval with dense re-ranking

Assumption: you already have a connected vbase client. If not, see your VBase connection guide.

When should I use it?

Use SPARSE_INVERTED_INDEX when:

  • Your vector field is a sparse vector (most entries are 0)
  • Your query contains many non-zero dimensions (many “terms”)
  • You want predictable performance as the collection grows

If your data is dense (most dimensions non-zero), this index is the wrong tool—use a dense vector index instead.

Supported similarity metrics

For sparse vectors, VBase supports the same two primary metrics exposed by Milvus:

  • IP (Inner Product): dot-product similarity. Works well for most sparse embedding models.
  • BM25: text‑retrieval style scoring. Useful when your sparse vector represents tokens/terms and weights.
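To make the IP metric concrete, here is a minimal sketch of an inner product over two sparse vectors stored as dimension → value mappings. The helper name `sparse_inner_product` is hypothetical and is not part of the VBase client; it only illustrates the arithmetic.

```python
def sparse_inner_product(a: dict, b: dict) -> float:
    """Dot product over the dimensions that are non-zero in both vectors."""
    # Iterate over the smaller mapping so the cost tracks the shorter vector.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[dim] for dim, value in a.items() if dim in b)


query = {1: 0.2, 50: 0.4, 1000: 0.7}
doc = {50: 0.5, 1000: 1.0, 2048: 0.3}

# Only dimensions 50 and 1000 overlap: 0.4 * 0.5 + 0.7 * 1.0, roughly 0.9
score = sparse_inner_product(query, doc)
print(score)
```

Dimensions that appear in only one of the two vectors contribute nothing, which is why sparse scoring cost depends on the non-zero entries rather than on the full dimensionality.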

Build the index

You build a sparse inverted index on a sparse vector field.

The most important configuration is the algorithm used for query processing:

  • DAAT_MAXSCORE (default)
  • DAAT_WAND
  • TAAT_NAIVE
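For intuition, the sketch below shows the general term-at-a-time (TAAT) idea on a toy in-memory inverted index: each query dimension's posting list is walked in turn and partial scores are accumulated per document. Document-at-a-time (DAAT) strategies such as MaxScore and WAND are commonly described as adding score upper bounds to skip documents that cannot reach the top k. This is an illustration only; the function names are hypothetical, and how VBase implements these algorithms internally is not described here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each dimension to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in docs.items():
        for dim, weight in vector.items():
            index[dim].append((doc_id, weight))
    return index


def taat_search(index, query, k):
    """Score term by term, accumulating partial inner products per document."""
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties are broken by doc_id for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


docs = {
    "a": {1: 1.0, 50: 0.5},
    "b": {50: 1.0, 1000: 1.0},
    "c": {7: 0.9},
}
index = build_inverted_index(docs)
top = taat_search(index, {1: 0.2, 50: 0.4, 1000: 0.7}, k=2)
print(top)  # "b" ranks first, then "a"; "c" shares no dimension with the query
```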

Example (Python):

# Assumes: vbase is already connected
vbase.create_index(
    collection_name="your_collection",
    field_name="your_sparse_vector_field",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",  # or "BM25"
    params={
        "inverted_index_algo": "DAAT_MAXSCORE",
    },
)

Index building parameters

Parameter: inverted_index_algo

  • What it controls: How queries are executed against the inverted index
  • Allowed values: DAAT_MAXSCORE (default), DAAT_WAND, TAAT_NAIVE
  • Practical guidance: Start with DAAT_MAXSCORE. Prefer DAAT_WAND for small k / short queries. Use TAAT_NAIVE when you want behavior that adapts more directly to collection-level changes over time.

Search on the index

Once the index is built (and you’ve inserted data), you can run a sparse vector search.

A sparse vector is typically represented as a mapping of dimension → value (only for non-zero entries):

query_vector = [
    {1: 0.2, 50: 0.4, 1000: 0.7}
]
results = vbase.search(
    collection_name="your_collection",
    anns_field="your_sparse_vector_field",
    data=query_vector,
    limit=10,
)
print(results)
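If your upstream model emits a dense list or array, you need to convert it to the dimension → value form first. The following is a minimal sketch; `to_sparse_dict` and its `eps` threshold are hypothetical names used for illustration and are not part of the VBase client.

```python
def to_sparse_dict(dense, eps=0.0):
    """Keep only entries whose magnitude exceeds eps, keyed by dimension index."""
    return {i: float(v) for i, v in enumerate(dense) if abs(v) > eps}


dense = [0.0, 0.2, 0.0, 0.0, 0.4]

# Wrap in a list because search takes a batch of query vectors.
query_vector = [to_sparse_dict(dense)]
print(query_vector)  # [{1: 0.2, 4: 0.4}]
```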

Search parameters

Parameter: drop_ratio_search

  • What it controls: Drops the smallest‑magnitude values in the query vector before searching (noise filtering)
  • Allowed values: Float from 0.0 to 1.0
  • Practical guidance: If your sparse queries are noisy, try 0.1–0.3. Higher values may improve precision and speed, but can reduce recall.
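To build intuition for what this parameter does to a query, the sketch below removes a fraction of the smallest-magnitude entries on the client side. It assumes the ratio is applied to the number of non-zero entries; the exact rule the server applies is not specified here, so treat `drop_smallest` as a hypothetical illustration rather than a replica of the server-side behavior.

```python
def drop_smallest(query, drop_ratio):
    """Remove the given fraction of entries with the smallest magnitude."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be between 0.0 and 1.0")
    n_drop = int(len(query) * drop_ratio)
    # Sort dimensions by ascending magnitude and skip the first n_drop of them.
    by_magnitude = sorted(query, key=lambda dim: abs(query[dim]))
    return {dim: query[dim] for dim in by_magnitude[n_drop:]}


query = {1: 0.02, 50: 0.4, 1000: 0.7, 7: 0.05, 9: 0.3}

# With 5 entries and a ratio of 0.4, the two weakest entries (dims 1 and 7) go.
trimmed = drop_smallest(query, 0.4)
print(sorted(trimmed))  # [9, 50, 1000]
```

Fewer query dimensions means fewer posting lists to visit, which is where the speed gain comes from; the recall risk comes from discarding weak but relevant terms.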

Example with search params:

results = vbase.search(
    collection_name="your_collection",
    anns_field="your_sparse_vector_field",
    data=query_vector,
    limit=10,
    search_params={
        "params": {
            "drop_ratio_search": 0.2,
        }
    },
)

Tuning cheat sheet

  • Many terms, larger top‑k → DAAT_MAXSCORE
  • Short queries, small top‑k → DAAT_WAND
  • You want the simplest behavior (often slower) → TAAT_NAIVE
  • Noisy sparse weights → raise drop_ratio_search gradually (e.g., 0.05 → 0.1 → 0.2)
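When raising drop_ratio_search step by step, it helps to measure how much of the untuned result set survives each step. Here is a minimal sketch of such a check, assuming you have already extracted the ranked result IDs from two searches (one with drop_ratio_search at 0.0 as the baseline, one with the candidate value). The helper `overlap_at_k` and the ID lists are hypothetical placeholders.

```python
def overlap_at_k(baseline_ids, candidate_ids, k):
    """Fraction of the baseline top-k that also appears in the candidate top-k."""
    baseline = set(baseline_ids[:k])
    if not baseline:
        return 0.0
    return len(baseline & set(candidate_ids[:k])) / len(baseline)


# Placeholder IDs standing in for results of two searches on the same query.
baseline_ids = [11, 42, 7, 93, 5]
candidate_ids = [11, 7, 42, 18, 5]

print(overlap_at_k(baseline_ids, candidate_ids, k=5))  # 0.8
```

If the overlap drops sharply between two steps, the previous value is a reasonable place to stop.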

Notes and gotchas

  • This index applies only to sparse vector fields.
  • For BM25-style use cases, make sure your pipeline produces a meaningful sparse representation (token weights), not a dense embedding.
  • Index build time and query speed both depend heavily on how many non-zero entries your vectors have.
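Because the non-zero count matters so much, it can be worth profiling your vectors before indexing. A minimal sketch, assuming the vectors are held as dimension → value mappings; `nnz_stats` is a hypothetical helper, not a VBase API.

```python
from statistics import mean


def nnz_stats(vectors):
    """Summarize how many non-zero entries each sparse vector carries."""
    counts = [len(v) for v in vectors]
    return {"min": min(counts), "mean": mean(counts), "max": max(counts)}


vectors = [
    {1: 0.2},
    {1: 0.1, 2: 0.4, 3: 0.9},
    {5: 0.3, 6: 0.7},
]
stats = nnz_stats(vectors)
print(stats)  # {'min': 1, 'mean': 2, 'max': 3}
```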