Sparse vectors are a practical way to represent keyword signals (exact terms, codes, IDs, names) in a format that can be searched efficiently, especially when you want predictable matching behavior.
Dense vectors are great at semantics (“meaning”), but they can miss rare tokens or exact identifiers. Sparse vectors complement that by encoding which tokens matter and how much they matter.
DODIL VBase is Milvus-backed and exposes sparse vectors through the same DODIL SDK interface you use for dense vectors.
What is a sparse vector?
A sparse vector is a very high‑dimensional vector where most values are zero.
Instead of storing every dimension (which would be wasteful), sparse vectors store only the non‑zero entries as:
- {dimension_index: value} pairs (dictionary format), or
- [(dimension_index, value), ...] tuples, or
- a sparse matrix type such as SciPy csr_matrix.
Example (dictionary format):
# Only a few dimensions exist; the rest are implicitly 0
sparse_vec = {27: 0.5, 100: 0.3, 5369: 0.6}
Why sparse vectors matter
Sparse vectors are commonly used for surface-level term matching in information retrieval:
- Exact term matching (“invoice”, “passport”, “SKU-1839”)
- Names, codes, IDs, and structured strings
- Keyword-style retrieval where precision matters
- Hybrid retrieval: dense semantic search + sparse keyword scoring
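To make term matching concrete, here is a minimal sketch of how two sparse vectors can be scored against each other. It assumes an inner-product style score, which is a common choice for sparse retrieval; the helper name `sparse_dot` is hypothetical and not part of the DODIL SDK. Only dimensions present in both vectors contribute, which is why a document that shares more weighted terms with the query scores higher.

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Inner product of two sparse vectors stored as {index: weight} dicts."""
    # Iterate over the shorter vector so the cost scales with its length.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


# Illustrative vectors; the indices stand for vocabulary entries.
doc_invoice = {27: 0.5, 100: 0.3, 5369: 0.6}
doc_passport = {3: 0.8, 100: 0.1}
query = {100: 0.2, 5369: 0.4}

score_invoice = sparse_dot(query, doc_invoice)    # shares dimensions 100 and 5369
score_passport = sparse_dot(query, doc_passport)  # shares only dimension 100
```

Dimensions that appear in only one of the two vectors are ignored, so a rare identifier that occurs in both the query and a document can dominate the score.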
When should you use sparse vectors?
Use sparse vectors when one or more of these is true:
- Your users search for specific terms (IDs, part numbers, emails, ticket references).
- You want predictable keyword matching in addition to semantic similarity.
- You want to combine dense and sparse signals (hybrid retrieval).
If you only need semantic matching (“find documents about onboarding”), dense vectors are often enough.
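If you do combine dense and sparse signals, one generic way to merge the two result lists is reciprocal rank fusion (RRF). The sketch below is an illustration of that general technique, not a description of how DODIL's hybrid search is implemented; the function name, the constant `k = 60`, and the document IDs are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists, best hit first.
dense_hits = ["doc_007", "doc_001", "doc_002"]   # semantic search
sparse_hits = ["doc_001", "doc_002"]             # keyword search

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists accumulate score from each, so an exact keyword hit can move above a result that only the dense search returned.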
Data formats supported
DODIL accepts the same sparse vector input formats as the underlying VBase engine. The most common options are:
1) List of dictionaries
sparse_vectors = [
    {27: 0.5, 100: 0.3, 5369: 0.6},
    {3: 0.8, 100: 0.1},
]
2) List of tuple iterables
sparse_vectors = [
    [(27, 0.5), (100, 0.3), (5369, 0.6)],
    [(3, 0.8), (100, 0.1)],
]
3) SciPy sparse matrix
If you already produce sparse vectors from a model pipeline (e.g., SPLADE-like outputs), you can pass them as sparse matrices.
from scipy.sparse import csr_matrix
# Example: one vector with indices [27, 100, 5369] and values [0.5, 0.3, 0.6]
idx = [27, 100, 5369]
val = [0.5, 0.3, 0.6]
vec = csr_matrix((val, ([0] * len(idx), idx)), shape=(1, 5369 + 1))
sparse_vectors = [vec]
Create a schema with a sparse vector field
To store sparse vectors, your collection schema typically includes:
- A primary key field (string or int)
- A SPARSE_FLOAT_VECTOR field (the sparse vector)
- The original text (optional but recommended)
Here is an example using the DODIL SDK types:
from dodil import Client
from dodil.vbase import VBaseConfig, CollectionSchema, FieldSchema, DataType
# Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
# Connect
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
schema = CollectionSchema(
    fields=[
        FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, max_length=100),
        FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
        # Store original text for debugging, display, filtering, etc.
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    ],
    description="Documents with sparse vectors for keyword matching",
)
vbase.create_collection(
    collection_name="docs_sparse",
    schema=schema,
    # You can also pass index_params depending on your setup.
)
Notes
- Sparse vectors don’t usually need an explicit dimension parameter, because the vector is defined by its indices.
- Keep your text field if you want to filter, debug, or display the original content.
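Because a sparse vector is defined entirely by its indices and values, a small check before insertion can catch malformed input early. The following is a minimal sketch, not part of the DODIL SDK: the helper name `to_sparse_dict` is hypothetical, and the rules it enforces (non-negative integer indices, finite values, no duplicate indices) are assumptions about what a reasonable pipeline would want rather than documented engine constraints. It handles the dictionary and tuple formats only.

```python
import math
from collections.abc import Mapping


def to_sparse_dict(vec) -> dict[int, float]:
    """Normalize a {index: value} mapping or an iterable of (index, value) pairs."""
    pairs = vec.items() if isinstance(vec, Mapping) else vec
    seen: set[int] = set()
    out: dict[int, float] = {}
    for idx, val in pairs:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(idx, bool) or not isinstance(idx, int) or idx < 0:
            raise ValueError(f"index must be a non-negative int, got {idx!r}")
        if idx in seen:
            raise ValueError(f"duplicate index {idx}")
        seen.add(idx)
        val = float(val)
        if not math.isfinite(val):
            raise ValueError(f"value at index {idx} is not finite")
        if val != 0.0:  # explicit zeros carry no information in a sparse vector
            out[idx] = val
    return out


from_tuples = to_sparse_dict([(27, 0.5), (100, 0.3), (5369, 0.6)])
from_dict = to_sparse_dict({3: 0.8, 100: 0.1, 7: 0.0})
```

Normalizing to one canonical form also makes it easier to log or compare vectors produced by different parts of a pipeline.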
Insert data
Insert rows where sparse_vector is provided in one of the supported formats:
rows = [
    {
        "pk": "doc_001",
        "text": "Invoice INV-1938 was paid on 2026-01-21",
        "sparse_vector": {27: 0.5, 100: 0.3, 5369: 0.6},
    },
    {
        "pk": "doc_002",
        "text": "Passport verification for user 88492",
        "sparse_vector": {3: 0.8, 100: 0.1},
    },
]
vbase.insert("docs_sparse", rows)
Search with sparse vectors
Sparse vectors are most useful when your query can be represented as a sparse vector too.
- If you generate sparse query vectors externally, pass them directly.
- For keyword-first experiences, sparse retrieval is often paired with filtering and/or a hybrid strategy.
Example search (vector-only):
query_sparse = {100: 0.2, 5369: 0.4}
res = vbase.search(
    collection_name="docs_sparse",
    data=[query_sparse],
    anns_field="sparse_vector",
    limit=5,
    output_fields=["pk", "text"],
)
print(res)
Best practices
- Use sparse vectors for exact terms: codes, SKUs, invoice IDs, ticket IDs, emails, and names.
- Pair with dense vectors for hybrid retrieval: dense = semantics, sparse = keywords.
- Keep the original text in the collection for filtering and display.
- Validate your pipeline: ensure your sparse vectors share a consistent vocabulary/tokenization strategy.
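To illustrate what a consistent vocabulary and tokenization strategy means in practice, here is a minimal sketch of a toy encoder that uses the same tokenizer and the same token-to-index mapping for both documents and queries. Everything in it is an assumption made for illustration: the function names, the regular expression, and the use of raw term counts as weights. Real pipelines typically use BM25-style weighting or a learned model such as SPLADE.

```python
import re
from collections import Counter

# Keep hyphenated identifiers such as "inv-1938" together as one token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign each distinct token a stable dimension index."""
    vocab: dict[str, int] = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> dict[int, float]:
    """Encode text as {index: term count}; tokens outside the vocabulary are dropped."""
    counts = Counter(t for t in tokenize(text) if t in vocab)
    return {vocab[token]: float(count) for token, count in counts.items()}


docs = [
    "Invoice INV-1938 was paid on 2026-01-21",
    "Passport verification for user 88492",
]
vocab = build_vocab(docs)
doc_vecs = [encode(d, vocab) for d in docs]
query_vec = encode("invoice inv-1938", vocab)
```

Because the query goes through the same `tokenize` and `vocab`, its non-zero dimensions line up with the first document's dimensions. If the query were tokenized differently (for example, splitting on hyphens), the identifier would land on different indices and never match.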
What’s next
If you’re planning hybrid retrieval (dense + sparse + reranking), check the Hybrid Search and Reranking docs.