Sparse vectors

Sparse vectors are a practical way to represent keyword signals (exact terms, codes, IDs, names) in a format that can be searched efficiently, especially when you want predictable matching behavior.

Dense vectors are great at semantics (“meaning”), but they can miss rare tokens or exact identifiers. Sparse vectors complement that by encoding which tokens matter and how much they matter.

DODIL VBase is Milvus-backed and exposes sparse vectors through the same DODIL SDK interface you use for dense vectors.

What is a sparse vector?

A sparse vector is a very high‑dimensional vector where most values are zero.

Instead of storing every dimension (which would be wasteful), sparse vectors store only the non‑zero entries as:

{dimension_index: value} pairs (dictionary format), or
[(dimension_index, value), ...] tuples, or
a sparse matrix type such as SciPy csr_matrix.

Example (dictionary format):


# Only a few dimensions exist; the rest are implicitly 0
sparse_vec = {27: 0.5, 100: 0.3, 5369: 0.6}
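
To make the format concrete, here is a minimal sketch of how two dictionary-format vectors can be scored against each other. It assumes inner-product scoring, a common choice for sparse retrieval. `sparse_dot` is a hypothetical helper written for illustration and is not part of the DODIL SDK.

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Inner product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller dict; only indices present in both contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[index] for index, value in a.items() if index in b)


doc = {27: 0.5, 100: 0.3, 5369: 0.6}
query = {100: 0.2, 5369: 0.4}

# Shared indices are 100 and 5369, so the score is 0.3*0.2 + 0.6*0.4
score = sparse_dot(doc, query)
```

Because only overlapping indices contribute, a document that shares no terms with the query scores zero, which is what makes the matching behavior predictable.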

Why sparse vectors matter

Sparse vectors are commonly used for surface-level term matching in information retrieval:

Exact term matching (“invoice”, “passport”, “SKU-1839”)
Names, codes, IDs, and structured strings
Keyword-style retrieval where precision matters
Hybrid retrieval: dense semantic search + sparse keyword scoring

When should you use sparse vectors?

Use sparse vectors when one or more of these is true:

Your users search for specific terms (IDs, part numbers, emails, ticket references).
You want predictable keyword matching in addition to semantic similarity.
You want to combine dense and sparse signals (hybrid retrieval).

If you only need semantic matching (“find documents about onboarding”), dense vectors are often enough.
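
As a rough illustration of the hybrid case, the sketch below merges a dense result list and a sparse result list with reciprocal rank fusion (RRF). This is a generic technique and not necessarily how DODIL combines results. The function name `rrf_fuse`, the constant `k=60`, and the document IDs are assumptions made for the example.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists, best match first
dense_hits = ["doc_007", "doc_001", "doc_003"]
sparse_hits = ["doc_001", "doc_002", "doc_007"]

fused = rrf_fuse([dense_hits, sparse_hits])
for doc_id, fused_score in fused:
    print(doc_id, round(fused_score, 5))
```

Documents that appear in both lists rise to the top, while documents found by only one signal are still kept.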

Data formats supported

VBase accepts the same sparse vector input formats as its underlying Milvus engine. The most common options are:

1) List of dictionaries


sparse_vectors = [
    {27: 0.5, 100: 0.3, 5369: 0.6},
    {3: 0.8, 100: 0.1},
]

2) List of tuple iterables


sparse_vectors = [
    [(27, 0.5), (100, 0.3), (5369, 0.6)],
    [(3, 0.8), (100, 0.1)],
]

3) SciPy sparse matrix

If you already produce sparse vectors from a model pipeline (e.g., SPLADE-like outputs), you can pass them as sparse matrices.


from scipy.sparse import csr_matrix
 
# Example: one vector with indices [27, 100, 5369] and values [0.5, 0.3, 0.6]
idx = [27, 100, 5369]
val = [0.5, 0.3, 0.6]
# One row; the column count only needs to exceed the largest index
vec = csr_matrix((val, ([0] * len(idx), idx)), shape=(1, max(idx) + 1))
 
sparse_vectors = [vec]
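
If different parts of your pipeline emit different formats, it can help to normalize them before inserting. The following is a small sketch for the dictionary and tuple formats only. `to_dict` and `to_pairs` are hypothetical helpers, not SDK functions.

```python
from typing import Iterable, Union

SparseInput = Union[dict[int, float], Iterable[tuple[int, float]]]


def to_dict(vec: SparseInput) -> dict[int, float]:
    """Normalize a dict or an iterable of (index, value) pairs to a dict."""
    items = vec.items() if isinstance(vec, dict) else vec
    out: dict[int, float] = {}
    for index, value in items:
        index = int(index)
        if index in out:
            # A repeated index is ambiguous, so fail instead of overwriting.
            raise ValueError(f"duplicate index {index}")
        out[index] = float(value)
    return out


def to_pairs(vec: SparseInput) -> list[tuple[int, float]]:
    """Normalize to a list of (index, value) tuples sorted by index."""
    return sorted(to_dict(vec).items())


dict_form = {100: 0.3, 27: 0.5, 5369: 0.6}
pairs_form = [(27, 0.5), (100, 0.3), (5369, 0.6)]

same_vector = to_dict(pairs_form) == dict_form
sorted_pairs = to_pairs(dict_form)
```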

Create a schema with a sparse vector field

To store sparse vectors, your collection schema typically includes:

A primary key field (string or int)
A SPARSE_FLOAT_VECTOR field (the sparse vector)
The original text (optional but recommended)

Here is an example using the DODIL SDK types:


from dodil import Client
from dodil.vbase import VBaseConfig, CollectionSchema, FieldSchema, DataType
 
# Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
 
# Connect
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
 
schema = CollectionSchema(
    fields=[
        FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, max_length=100),
        FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
        # Store original text for debugging, display, filtering, etc.
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    ],
    description="Documents with sparse vectors for keyword matching",
)
 
vbase.create_collection(
    collection_name="docs_sparse",
    schema=schema,
    # Milvus needs an index on a vector field before it can be searched;
    # pass index_params here if your setup does not create one for you.
)

Notes

A sparse vector field does not take an explicit dimension parameter the way a dense field does, because each vector is defined by the indices it contains.
Keep your text field if you want to filter, debug, or display the original content.

Insert data

Insert rows where sparse_vector is provided in one of the supported formats:


rows = [
    {
        "pk": "doc_001",
        "text": "Invoice INV-1938 was paid on 2026-01-21",
        "sparse_vector": {27: 0.5, 100: 0.3, 5369: 0.6},
    },
    {
        "pk": "doc_002",
        "text": "Passport verification for user 88492",
        "sparse_vector": {3: 0.8, 100: 0.1},
    },
]
 
vbase.insert("docs_sparse", rows)
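
Before inserting, a quick client-side check can catch malformed vectors early. The sketch below is conservative and assumes dictionary-format vectors with non-negative integer indices and finite numeric values. The exact limits enforced by the engine may differ, and `check_sparse_row` is a hypothetical helper, not an SDK function.

```python
import math


def check_sparse_row(row: dict, field: str = "sparse_vector") -> list[str]:
    """Return a list of problems found in one row's sparse vector."""
    vec = row.get(field)
    if not isinstance(vec, dict):
        return [f"{field} is missing or not a dict"]

    problems = []
    for index, value in vec.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(index, bool) or not isinstance(index, int) or index < 0:
            problems.append(f"bad index {index!r}")
        if (
            isinstance(value, bool)
            or not isinstance(value, (int, float))
            or not math.isfinite(value)
        ):
            problems.append(f"bad value {value!r} at index {index!r}")
    return problems


good_row = {"pk": "doc_001", "sparse_vector": {27: 0.5, 100: 0.3}}
bad_row = {"pk": "doc_003", "sparse_vector": {-1: 0.5}}

good_problems = check_sparse_row(good_row)
bad_problems = check_sparse_row(bad_row)
```

Running the check over all rows and skipping or logging those with a non-empty result keeps one bad vector from failing a whole batch.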

Search with sparse vectors

Sparse vectors are most useful when your query can be represented as a sparse vector too.

If you generate sparse query vectors externally, pass them directly.
For keyword-first experiences, sparse retrieval is often paired with filtering and/or a hybrid strategy.

Example search (vector-only):


query_sparse = {100: 0.2, 5369: 0.4}
 
res = vbase.search(
    collection_name="docs_sparse",
    data=[query_sparse],
    anns_field="sparse_vector",
    limit=5,
    output_fields=["pk", "text"],
)
 
print(res)

Best practices

Use sparse vectors for exact terms: codes, SKUs, invoice IDs, ticket IDs, emails, and names.
Pair with dense vectors for hybrid retrieval: dense = semantics, sparse = keywords.
Keep the original text in the collection for filtering and display.
Validate your pipeline: ensure your sparse vectors share a consistent vocabulary/tokenization strategy.
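
To illustrate what a consistent vocabulary and tokenization strategy means in practice, here is a deliberately simple sketch that encodes documents and queries with the same tokenizer and the same token-to-index mapping. It uses raw term counts as weights, which is an assumption for the example; a real pipeline would more likely use BM25-style or learned (SPLADE-like) weights. All function names are hypothetical.

```python
import re
from collections import Counter

# Keep hyphenated identifiers such as "inv-1938" together as one token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign each distinct token a stable dimension index."""
    vocab: dict[str, int] = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> dict[int, float]:
    """Encode text as {index: term count}; unknown tokens are dropped."""
    counts = Counter(tokenize(text))
    return {vocab[t]: float(n) for t, n in counts.items() if t in vocab}


docs = [
    "Invoice INV-1938 was paid on 2026-01-21",
    "Passport verification for user 88492",
]
vocab = build_vocab(docs)

doc_vectors = [encode(text, vocab) for text in docs]
query_vector = encode("invoice INV-1938 refund", vocab)
```

The query shares indices with the first document only because both went through the same `tokenize` and `vocab`. If the query side lowercased differently or split "INV-1938" at the hyphen, the indices would no longer line up and the match would be lost.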

What’s next

If you’re planning hybrid retrieval (dense + sparse + reranking), check the Hybrid Search and Reranking docs.