Full-text search (BM25)

Full‑text search is the “keyword search” side of VBase.

Instead of comparing dense embeddings, full‑text search ranks documents by how well their text matches a query (think: “traditional search”, but built into your vector database workflow).

Under the hood, VBase uses a BM25 pipeline:

You store raw text in a VarChar field (e.g. text).
A built‑in BM25 function converts that text into a sparse vector (e.g. stored in a SparseFloatVector field called sparse).
You create a sparse inverted index on the sparse field.
You search by passing a query string, and VBase returns the best-matching documents ranked by BM25 score.
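To make the ranking step concrete, here is a minimal pure-Python sketch of the textbook BM25 formula. It is illustrative only: the whitespace tokenizer, the function names, and the exact IDF variant are assumptions, and VBase's built-in analyzer and scoring may differ in detail.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: naive lowercase whitespace tokenizer, not VBase's analyzer.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with textbook BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for t in tokenized for term in set(t))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "How to reset your service account secret",
    "Troubleshooting: Kafka consumer group rebalance",
    "Create a collection and build an index in VBase",
]
scores = bm25_scores("create collection index", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the third document contains any of the query terms, so it is the only one with a non-zero score.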

This is useful when:

Your users expect keyword search (exact words, product codes, IDs, names).
You want strong relevance for short queries (“invoice overdue”, “return policy”, “SOC2”).
You plan to combine it later with embeddings for hybrid retrieval.

Prerequisites

Python 3.10+
A working VBase database connection (service account + VBase endpoint)

If you haven’t connected yet, see the Connect to VBase guide.

Step 1: Define a schema for full‑text search

A minimal full‑text schema has three fields:

id: primary key
text: where you store the document text (enable analyzer)
sparse: where BM25 will store the generated sparse vectors

You also attach a BM25 function that takes text as input and writes into sparse.


from dodil import Client
from dodil.vbase import VBaseConfig
 
# 1) Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
 
# 2) Connect
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
 
# 3) Define the collection schema (conceptual)
schema = {
    "auto_id": True,
    "dynamic_fields": False,
    "fields": [
        {"name": "id", "dtype": "int64", "is_primary": True},
        {
            "name": "text",
            "dtype": "varchar",
            "max_length": 1000,
            # Important: enables tokenization/analyzer so BM25 can work
            "enable_analyzer": True,
        },
        {"name": "sparse", "dtype": "sparse_float_vector"},
    ],
    "functions": [
        {
            "name": "text_bm25_emb",
            "type": "bm25",
            "input_fields": ["text"],
            "output_fields": ["sparse"],
            "params": {},
        }
    ],
}

Notes:

enable_analyzer must be set to True on the text field; without it the text is not tokenized and BM25 has nothing to score.
The BM25 function must map one text field → one sparse field. If you have multiple text fields, define a BM25 function per field.
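As a sketch of the one-function-per-field rule, the fragment below builds a conceptual schema with two text fields. The field names title, body, title_sparse, and body_sparse are hypothetical, and the dict layout simply mirrors the conceptual schema shown earlier in this step.

```python
# Hypothetical field names; the dict layout mirrors the conceptual schema above.
text_fields = ["title", "body"]

multi_schema = {
    "auto_id": True,
    "dynamic_fields": False,
    "fields": [{"name": "id", "dtype": "int64", "is_primary": True}],
    "functions": [],
}

for name in text_fields:
    # One analyzed VarChar field and one sparse output field per text source.
    multi_schema["fields"].append(
        {"name": name, "dtype": "varchar", "max_length": 1000, "enable_analyzer": True}
    )
    multi_schema["fields"].append(
        {"name": f"{name}_sparse", "dtype": "sparse_float_vector"}
    )
    # One BM25 function per text field, each with exactly one input and one output.
    multi_schema["functions"].append(
        {
            "name": f"{name}_bm25_emb",
            "type": "bm25",
            "input_fields": [name],
            "output_fields": [f"{name}_sparse"],
            "params": {},
        }
    )
```

Each sparse field would then need its own index, following Step 2.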

Step 2: Create the index (BM25 sparse inverted index)

BM25 search needs an inverted index on the generated sparse field.


index = {
    "field": "sparse",
    "index_type": "SPARSE_INVERTED_INDEX",
    "metric_type": "BM25",
    "params": {
        # You can start with these defaults
        "inverted_index_algo": "DAAT_MAXSCORE",
        "bm25_k1": 1.2,
        "bm25_b": 0.75,
    },
}

What these mean (simplified):

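SPARSE_INVERTED_INDEX: index type optimized for sparse vectors.
BM25: the scoring method used for ranking.
inverted_index_algo: the query-processing strategy for the inverted index. DAAT_MAXSCORE is a document-at-a-time algorithm that skips documents that cannot reach the top results.
bm25_k1: controls term-frequency saturation. Higher values let repeated occurrences of a term keep adding to the score for longer.
bm25_b: controls document-length normalization, from 0 (length ignored) to 1 (full normalization).
The defaults above are a good starting point.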
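To build intuition for the two tuning knobs, the sketch below evaluates only the term-frequency part of the textbook BM25 formula. The function name tf_component is illustrative, and VBase's exact scoring may differ; the point is the shape of the curve, not the absolute numbers.

```python
def tf_component(tf: int, doc_len: int, avg_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of textbook BM25 (the IDF factor is left out)."""
    length_norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# k1 controls saturation: repeating a term helps less and less,
# and the value never exceeds k1 + 1.
for tf in (1, 2, 5, 20):
    print(tf, round(tf_component(tf, doc_len=100, avg_len=100), 3))

# b controls length normalization: with b=0 document length is ignored,
# with b=1 a document twice the average length is penalized the most.
for b in (0.0, 0.75, 1.0):
    print(b, round(tf_component(2, doc_len=200, avg_len=100, b=b), 3))
```

If long documents seem to rank too low, lowering bm25_b is the first thing to try; if keyword-stuffed documents rank too high, lowering bm25_k1 makes repetition count for less.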

Step 3: Create and load the collection

Create the collection using your schema + index. Then load it so it becomes searchable.


collection_name = "docs_full_text"
 
vbase.create_collection(
    name=collection_name,
    schema=schema,
    indexes=[index],
)
 
# Full‑text search requires the index to exist.
# Load makes the indexed data available for search.
vbase.load_collection(collection_name)

If you try to load or search a collection that has no index on the sparse field, the call will typically fail with an error. Create the index first.

Step 4: Insert documents

Insert plain text documents. BM25 sparse vectors are generated automatically by the BM25 function.


docs = [
    {"text": "How to reset your service account secret"},
    {"text": "Troubleshooting: Kafka consumer group rebalance"},
    {"text": "Create a collection and build an index in VBase"},
]
 
vbase.insert(collection_name, docs)
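Because the text field in Step 1 was declared with max_length 1000, it can help to normalize and truncate documents before inserting them. The helper below is a minimal sketch: prepare_text is a hypothetical name, and it assumes the limit is counted in UTF-8 bytes, which you should verify for your VBase deployment.

```python
import re

MAX_LENGTH = 1000  # matches max_length on the text field in Step 1


def prepare_text(raw: str, max_bytes: int = MAX_LENGTH) -> str:
    """Collapse whitespace and truncate to at most max_bytes UTF-8 bytes."""
    cleaned = re.sub(r"\s+", " ", raw).strip()
    encoded = cleaned.encode("utf-8")
    if len(encoded) <= max_bytes:
        return cleaned
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return encoded[:max_bytes].decode("utf-8", errors="ignore").rstrip()


raw_docs = ["  How to reset\n\nyour service   account secret  ", "x" * 1500]
prepared_docs = [{"text": prepare_text(t)} for t in raw_docs]
```

The resulting prepared_docs list has the same shape as docs above and could be passed to vbase.insert in the same way. Truncation loses content, so for long documents splitting into several rows is usually the better choice.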

Step 5: Search with a query string

To search, pass a query string and target the sparse field (the BM25 output).


query = "create collection index"
 
results = vbase.search(
    collection_name=collection_name,
    data=[query],
    anns_field="sparse",
    limit=5,
    output_fields=["text"],
)
 
for hit in results[0]:
    print(hit.score, hit.entity.get("text"))

Optional: filter + full‑text search

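You can combine full‑text scoring with filters (for example: only documents from a tenant, a product, or a time range). A filter can only reference fields stored in the collection. The example assumes a tenant_id VarChar field, which the Step 1 schema does not define, so add it to the schema before using this filter.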


results = vbase.search(
    collection_name=collection_name,
    data=["kafka rebalance"],
    anns_field="sparse",
    limit=5,
    filter='tenant_id == "acme"',
    output_fields=["text"],
)

Practical tips

Keep text clean: strip boilerplate, remove repetitive headers, and keep meaningful content.
Tune BM25 later: start with defaults, then adjust bm25_k1 / bm25_b if ranking feels off.
Pair with vector search: BM25 is great for exact terms; embeddings are great for meaning. Combining both in a hybrid retrieval setup usually handles more kinds of queries than either one alone.
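One common way to combine keyword and vector results is reciprocal rank fusion (RRF), which merges ranked ID lists without needing comparable scores. The sketch below is independent of VBase: the function name, the example IDs, and the constant k=60 are assumptions, and VBase may offer its own hybrid search facility.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ids = [3, 1, 7]     # hypothetical IDs from a full-text search
vector_ids = [7, 3, 9]   # hypothetical IDs from an embedding search
fused = rrf_fuse([bm25_ids, vector_ids])
print(fused)
```

Documents that appear near the top of both lists rise to the front, while documents found by only one method are kept but ranked lower.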