Full‑text search is the “keyword search” side of VBase.
Instead of comparing dense embeddings, full‑text search ranks documents by how well their text matches a query (think: “traditional search”, but built into your vector database workflow).
Under the hood, VBase uses a BM25 pipeline:
- You store raw text in a VarChar field (e.g. text).
- A built‑in BM25 function converts that text into a sparse vector (e.g. stored in a SparseFloatVector field called sparse).
- You create a sparse inverted index on the sparse field.
- You search by passing a query string, and Milvus/BM25 returns the best matching documents.
This is useful when:
- Your users expect keyword search (exact words, product codes, IDs, names).
- You want strong relevance for short queries (“invoice overdue”, “return policy”, “SOC2”).
- You plan to combine it later with embeddings for hybrid retrieval.
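To make the BM25 pipeline described above concrete, here is a toy, pure-Python version of the same four stages: tokenize text, turn it into sparse term counts, build an inverted index, and score a query. It is a sketch for intuition only, not how VBase implements it. The tokenizer, the helper names (tokenize, build_index, bm25_search), and the exact BM25 variant (the non-negative IDF form popularized by Lucene) are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # assumed values, matching the defaults used later in this guide


def tokenize(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_index(docs):
    """Return per-document term counts, an inverted index, and the average length."""
    term_counts = [Counter(tokenize(doc)) for doc in docs]  # the "sparse vectors"
    inverted = defaultdict(set)  # term -> ids of the documents that contain it
    for doc_id, counts in enumerate(term_counts):
        for term in counts:
            inverted[term].add(doc_id)
    avg_len = sum(sum(c.values()) for c in term_counts) / len(docs)
    return term_counts, inverted, avg_len


def bm25_search(query, docs, limit=5):
    """Score every document that shares a term with the query; best match first."""
    term_counts, inverted, avg_len = build_index(docs)
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = inverted.get(term, ())
        if not postings:
            continue  # term appears in no document
        n = len(postings)
        idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
        for doc_id in postings:
            tf = term_counts[doc_id][term]
            doc_len = sum(term_counts[doc_id].values())
            norm = tf + K1 * (1 - B + B * doc_len / avg_len)
            scores[doc_id] += idf * tf * (K1 + 1) / norm
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(docs[doc_id], score) for doc_id, score in ranked[:limit]]


sample_docs = [
    "How to reset your service account secret",
    "Troubleshooting: Kafka consumer group rebalance",
    "Create a collection and build an index in VBase",
]
for text, score in bm25_search("create collection index", sample_docs):
    print(round(score, 3), text)
```

Documents that share no term with the query never receive a score, which is why keyword search is precise for product codes and IDs but cannot match synonyms on its own.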
Prerequisites
- Python 3.10+
- A working VBase database connection (service account + VBase endpoint)
If you haven’t connected yet, see the Connect to VBase guide.
Step 1: Define a schema for full‑text search
A minimal full‑text schema has three fields:
- id: primary key
- text: where you store the document text (enable analyzer)
- sparse: where BM25 will store the generated sparse vectors
You also attach a BM25 function that takes text as input and writes into sparse.
from dodil import Client
from dodil.vbase import VBaseConfig
# 1) Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
# 2) Connect
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
# 3) Define the collection schema (conceptual)
schema = {
    "auto_id": True,
    "dynamic_fields": False,
    "fields": [
        {"name": "id", "dtype": "int64", "is_primary": True},
        {
            "name": "text",
            "dtype": "varchar",
            "max_length": 1000,
            # Important: enables tokenization/analyzer so BM25 can work
            "enable_analyzer": True,
        },
        {"name": "sparse", "dtype": "sparse_float_vector"},
    ],
    "functions": [
        {
            "name": "text_bm25_emb",
            "type": "bm25",
            "input_fields": ["text"],
            "output_fields": ["sparse"],
            "params": {},
        }
    ],
}
Notes:
- enable_analyzer should be enabled on the text field.
- The BM25 function must map one text field → one sparse field. If you have multiple text fields, define a BM25 function per field.
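If you want to catch these two mistakes before creating a collection, a small local check can help. The sketch below works only on a plain dict shaped like the conceptual schema above. check_bm25_schema is a hypothetical helper, not part of the dodil client, and the key names it reads are assumed to match that dict.

```python
def check_bm25_schema(schema):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    fields = {f["name"]: f for f in schema.get("fields", [])}
    problems = []
    for fn in schema.get("functions", []):
        if fn.get("type") != "bm25":
            continue  # only BM25 functions are checked here
        name = fn.get("name", "<unnamed>")
        inputs = fn.get("input_fields", [])
        outputs = fn.get("output_fields", [])
        # Rule: one text field in, one sparse field out.
        if len(inputs) != 1 or len(outputs) != 1:
            problems.append(f"{name}: needs exactly one input and one output field")
            continue
        src, dst = fields.get(inputs[0]), fields.get(outputs[0])
        if src is None or src.get("dtype") != "varchar":
            problems.append(f"{name}: input '{inputs[0]}' must be a varchar field")
        elif not src.get("enable_analyzer"):
            problems.append(f"{name}: input '{inputs[0]}' needs enable_analyzer set to True")
        if dst is None or dst.get("dtype") != "sparse_float_vector":
            problems.append(f"{name}: output '{outputs[0]}' must be a sparse_float_vector field")
    return problems


example_schema = {
    "fields": [
        {"name": "id", "dtype": "int64", "is_primary": True},
        {"name": "text", "dtype": "varchar", "max_length": 1000, "enable_analyzer": True},
        {"name": "sparse", "dtype": "sparse_float_vector"},
    ],
    "functions": [
        {
            "name": "text_bm25_emb",
            "type": "bm25",
            "input_fields": ["text"],
            "output_fields": ["sparse"],
        }
    ],
}
print(check_bm25_schema(example_schema))
```

For a collection with several text fields, the same check runs once per BM25 function, so each text field needs its own function and its own sparse output field to pass.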
Step 2: Create the index (BM25 sparse inverted index)
BM25 search needs an inverted index on the generated sparse field.
index = {
    "field": "sparse",
    "index_type": "SPARSE_INVERTED_INDEX",
    "metric_type": "BM25",
    "params": {
        # You can start with these defaults
        "inverted_index_algo": "DAAT_MAXSCORE",
        "bm25_k1": 1.2,
        "bm25_b": 0.75,
    },
}
What these mean (simplified):
- SPARSE_INVERTED_INDEX: index optimized for sparse vectors.
- BM25: scoring method used for ranking.
- bm25_k1 and bm25_b: tuning knobs for BM25 relevance. bm25_k1 controls how quickly repeated occurrences of a term stop adding to the score; bm25_b controls how strongly scores are normalized by document length. The defaults above are a good starting point.
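To get a feel for what the two knobs do, it can help to compute the term-frequency part of the textbook BM25 formula by hand. The sketch below assumes the standard Okapi BM25 form; whether VBase's engine uses exactly this variant is an assumption, and bm25_tf_weight is an illustrative helper, not a VBase API.

```python
def bm25_tf_weight(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for one term in one document (IDF left out)."""
    length_norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# k1: repeated terms add less and less; the weight never reaches k1 + 1.
for tf in (1, 2, 5, 20):
    print("tf =", tf, "weight =", round(bm25_tf_weight(tf, doc_len=100, avg_len=100), 3))

# b: the same term count is worth less in a document that is longer than average.
for b in (0.0, 0.75, 1.0):
    short = bm25_tf_weight(2, doc_len=50, avg_len=100, b=b)
    long_ = bm25_tf_weight(2, doc_len=400, avg_len=100, b=b)
    print("b =", b, "short doc:", round(short, 3), "long doc:", round(long_, 3))
```

As a rough guide under this formula, raising bm25_k1 rewards documents that repeat the query terms, and lowering bm25_b reduces the penalty on long documents.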
Step 3: Create and load the collection
Create the collection using your schema + index. Then load it so it becomes searchable.
collection_name = "docs_full_text"
vbase.create_collection(
    name=collection_name,
    schema=schema,
    indexes=[index],
)
# Full‑text search requires the index to exist.
# Load makes the indexed data available for search.
vbase.load_collection(collection_name)
If you try to load/search without an index, you’ll typically see errors. Create the index first.
Step 4: Insert documents
Insert plain text documents. BM25 sparse vectors are generated automatically by the BM25 function.
docs = [
    {"text": "How to reset your service account secret"},
    {"text": "Troubleshooting: Kafka consumer group rebalance"},
    {"text": "Create a collection and build an index in VBase"},
]
vbase.insert(collection_name, docs)
Step 5: Search with a query string
To search, pass a query string and target the sparse field (the BM25 output).
query = "create collection index"
results = vbase.search(
    collection_name=collection_name,
    data=[query],
    anns_field="sparse",
    limit=5,
    output_fields=["text"],
)
for hit in results[0]:
    print(hit.score, hit.entity.get("text"))
Optional: filter + full‑text search
You can combine full‑text scoring with filters (for example: only documents from a tenant, a product, or a time range).
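# Assumes the collection defines a tenant_id scalar field. The Step 1 schema
# does not include one, so add it there before using this filter.
results = vbase.search(
    collection_name=collection_name,
    data=["kafka rebalance"],
    anns_field="sparse",
    limit=5,
    filter='tenant_id == "acme"',
    output_fields=["text"],
)
Practical tips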
- Keep text clean: strip boilerplate, remove repetitive headers, and keep meaningful content.
- Tune BM25 later: start with defaults, then adjust bm25_k1/bm25_b if ranking feels off.
- Pair with vector search: BM25 is great for exact terms; embeddings are great for meaning. Together they’re even better.
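The first tip can be sketched as a small pre-processing step that runs before the insert in Step 4. The version below drops lines that repeat across many documents (a rough stand-in for shared headers and footers), collapses whitespace, and truncates to the max_length used in the schema. clean_documents and its header_ratio threshold are illustrative assumptions, and so is treating max_length as a character count; check how VBase measures it before relying on the truncation.

```python
import re
from collections import Counter


def clean_documents(raw_docs, max_length=1000, header_ratio=0.5):
    """Return insert-ready {"text": ...} rows with repeated lines and extra whitespace removed."""
    split_docs = [
        [line.strip() for line in doc.splitlines() if line.strip()]
        for doc in raw_docs
    ]
    # In how many documents does each distinct line appear?
    line_doc_freq = Counter(line for lines in split_docs for line in set(lines))
    # A line counts as boilerplate once it shows up in this many documents (at least 2).
    threshold = max(2, int(len(raw_docs) * header_ratio))
    rows = []
    for lines in split_docs:
        kept = [line for line in lines if line_doc_freq[line] < threshold]
        text = re.sub(r"\s+", " ", " ".join(kept)).strip()
        if text:  # skip documents that were nothing but boilerplate
            rows.append({"text": text[:max_length]})
    return rows


raw = [
    "ACME Docs\nHow to reset   your service account secret",
    "ACME Docs\nTroubleshooting: Kafka consumer group rebalance",
    "ACME Docs\n",
]
for row in clean_documents(raw):
    print(row)
```

Removing shared headers matters for BM25 in particular: a term that appears in every document carries almost no weight, but it still inflates document length and so shifts the length normalization.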