Full‑text search is the “keyword search” side of VBase.

Instead of comparing dense embeddings, full‑text search ranks documents by how well their text matches a query (think: “traditional search”, but built into your vector database workflow).

Under the hood, VBase uses a BM25 pipeline:

  1. You store raw text in a VarChar field (e.g. text).
  2. A built‑in BM25 function converts that text into a sparse vector (e.g. stored in a SparseFloatVector field called sparse).
  3. You create a sparse inverted index on the sparse field.
  4. You search by passing a query string, and VBase returns the best-matching documents, ranked by BM25 score.
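To make steps 1 to 3 concrete, here is a toy sketch in plain Python of what "text becomes a sparse vector, which is then indexed" means. It is not VBase's implementation: the regex tokenizer stands in for the real analyzer, and the helper names (`tokenize`, `to_sparse`, `build_inverted_index`, `candidates`) are made up for illustration. Scoring is left out on purpose; the sketch only shows how an inverted index finds the documents that share terms with a query.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (a toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_sparse(text: str) -> dict[str, int]:
    """Represent a document as a sparse term -> frequency mapping."""
    return dict(Counter(tokenize(text)))


def build_inverted_index(docs: dict[int, str]) -> dict[str, dict[int, int]]:
    """Map each term to the documents containing it, with term frequencies."""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in to_sparse(text).items():
            index[term][doc_id] = tf
    return dict(index)


def candidates(index: dict[str, dict[int, int]], query: str) -> set[int]:
    """Return the IDs of documents that contain at least one query term."""
    hits: set[int] = set()
    for term in tokenize(query):
        hits.update(index.get(term, {}))
    return hits


docs = {
    1: "How to reset your service account secret",
    2: "Troubleshooting: Kafka consumer group rebalance",
    3: "Create a collection and build an index in VBase",
}
index = build_inverted_index(docs)

# Only document 3 shares any term with this query.
print(candidates(index, "create collection index"))
```

Because only terms that actually occur are stored, lookups touch just the postings for the query terms rather than every document.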

This is useful when:

  • Your users expect keyword search (exact words, product codes, IDs, names).
  • You want strong relevance for short queries (“invoice overdue”, “return policy”, “SOC2”).
  • You plan to combine it later with embeddings for hybrid retrieval.

Prerequisites

  • Python 3.10+
  • A working VBase database connection (service account + VBase endpoint)

If you haven’t connected yet, see the Connect to VBase guide.

Step 1: Define the schema

A minimal full‑text schema has three fields:

  • id: primary key
  • text: where you store the document text (enable analyzer)
  • sparse: where BM25 will store the generated sparse vectors

You also attach a BM25 function that takes text as input and writes into sparse.

```python
from dodil import Client
from dodil.vbase import VBaseConfig

# 1) Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)

# 2) Connect
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)

# 3) Define the collection schema (conceptual)
schema = {
    "auto_id": True,
    "dynamic_fields": False,
    "fields": [
        {"name": "id", "dtype": "int64", "is_primary": True},
        {
            "name": "text",
            "dtype": "varchar",
            "max_length": 1000,
            # Important: enables tokenization/analyzer so BM25 can work
            "enable_analyzer": True,
        },
        {"name": "sparse", "dtype": "sparse_float_vector"},
    ],
    "functions": [
        {
            "name": "text_bm25_emb",
            "type": "bm25",
            "input_fields": ["text"],
            "output_fields": ["sparse"],
            "params": {},
        }
    ],
}
```

Notes:

  • Set enable_analyzer to True on the text field; BM25 needs tokenized text to work.
  • The BM25 function must map one text field → one sparse field. If you have multiple text fields, define a BM25 function per field.
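If you do have several text fields, the one-to-one rule can be applied mechanically. The sketch below builds one sparse field and one BM25 function per text field, using the same conceptual dictionary layout as the schema above. The helper name `bm25_functions`, the `<field>_sparse` naming convention, and the `title`/`body` field names are illustrative assumptions, not part of VBase.

```python
def bm25_functions(text_fields: list[str]) -> tuple[list[dict], list[dict]]:
    """Build one sparse field and one BM25 function per text field."""
    sparse_fields: list[dict] = []
    functions: list[dict] = []
    for name in text_fields:
        sparse_name = f"{name}_sparse"  # hypothetical naming convention
        sparse_fields.append(
            {"name": sparse_name, "dtype": "sparse_float_vector"}
        )
        functions.append(
            {
                "name": f"{name}_bm25_emb",
                "type": "bm25",
                "input_fields": [name],          # exactly one text field in
                "output_fields": [sparse_name],  # exactly one sparse field out
                "params": {},
            }
        )
    return sparse_fields, functions


# Hypothetical text fields; each gets its own sparse field and function.
sparse_fields, functions = bm25_functions(["title", "body"])
for fn in functions:
    print(fn["input_fields"], "->", fn["output_fields"])
```

You would then append `sparse_fields` to the schema's "fields" list, set "functions" to the generated list, and create one index per sparse field.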

Step 2: Create the index (BM25 sparse inverted index)

BM25 search needs an inverted index on the generated sparse field.

```python
index = {
    "field": "sparse",
    "index_type": "SPARSE_INVERTED_INDEX",
    "metric_type": "BM25",
    "params": {
        # You can start with these defaults
        "inverted_index_algo": "DAAT_MAXSCORE",
        "bm25_k1": 1.2,
        "bm25_b": 0.75,
    },
}
```

What these mean (simplified):

  • SPARSE_INVERTED_INDEX: index optimized for sparse vectors.
  • BM25: scoring method used for ranking.
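  • bm25_k1 and bm25_b: tuning knobs for BM25 relevance. bm25_k1 controls how quickly repeated occurrences of a term stop adding to the score; bm25_b controls how strongly long documents are penalized (0 turns length normalization off). The defaults above are a good starting point.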
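To get a feel for the two knobs, the sketch below evaluates the term-frequency part of the textbook Okapi BM25 formula (the IDF factor is omitted). This is the standard formula, shown for intuition only; VBase's internal scoring may differ in detail, and the function name `bm25_tf_weight` is made up for this example.

```python
def bm25_tf_weight(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Term-frequency part of the standard BM25 score (IDF omitted)."""
    # b blends between no length normalization (b=0) and full (b=1).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # The result saturates toward k1 + 1 as tf grows.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: 10 occurrences are worth far less than 10x one occurrence.
for tf in (1, 2, 5, 10):
    print(tf, round(bm25_tf_weight(tf, doc_len=100, avg_doc_len=100), 3))

# Length normalization: the same tf counts for less in a longer document,
# and b=0 removes that penalty.
print(bm25_tf_weight(2, doc_len=400, avg_doc_len=100))
print(bm25_tf_weight(2, doc_len=400, avg_doc_len=100, b=0.0))
```

In practice, a higher bm25_k1 lets repeated terms keep adding to the score for longer, and a lower bm25_b is worth trying when long documents are being ranked too low.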

Step 3: Create and load the collection

Create the collection using your schema + index. Then load it so it becomes searchable.

```python
collection_name = "docs_full_text"

vbase.create_collection(
    name=collection_name,
    schema=schema,
    indexes=[index],
)

# Full‑text search requires the index to exist.
# Load makes the indexed data available for search.
vbase.load_collection(collection_name)
```

Loading or searching a collection that has no index typically fails with an error, so create the index first.

Step 4: Insert documents

Insert plain text documents. BM25 sparse vectors are generated automatically by the BM25 function.

```python
docs = [
    {"text": "How to reset your service account secret"},
    {"text": "Troubleshooting: Kafka consumer group rebalance"},
    {"text": "Create a collection and build an index in VBase"},
]

vbase.insert(collection_name, docs)
```
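The schema above caps the text field at max_length 1000, so longer documents need to be split before insertion. Here is a minimal sketch of a whitespace-based splitter. The `chunk_text` helper is hypothetical, and it assumes the limit is counted in characters; if VBase counts bytes instead, compare `len(chunk.encode("utf-8"))` rather than `len(chunk)`.

```python
def chunk_text(text: str, max_len: int = 1000) -> list[str]:
    """Split text on whitespace into chunks of at most max_len characters."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than max_len is hard-split.
        while len(word) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_len])
            word = word[max_len:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


long_doc = "reset service account secret " * 100  # about 2,900 characters
chunked_docs = [{"text": chunk} for chunk in chunk_text(long_doc, max_len=1000)]
print(len(chunked_docs), max(len(d["text"]) for d in chunked_docs))
```

The resulting list has the same shape as `docs` above, so it can be passed to `vbase.insert` in the same way.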

Step 5: Search with a query string

To search, pass a query string and target the sparse field (the BM25 output).

```python
query = "create collection index"

results = vbase.search(
    collection_name=collection_name,
    data=[query],
    anns_field="sparse",
    limit=5,
    output_fields=["text"],
)

for hit in results[0]:
    print(hit.score, hit.entity.get("text"))
```

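You can combine full‑text scoring with filters (for example: only documents from a tenant, a product, or a time range). The filtered field must exist in the collection; the example below assumes a tenant_id field, which the minimal schema above does not define.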

```python
results = vbase.search(
    collection_name=collection_name,
    data=["kafka rebalance"],
    anns_field="sparse",
    limit=5,
    filter='tenant_id == "acme"',
    output_fields=["text"],
)
```

Practical tips

  • Keep text clean: strip boilerplate, remove repetitive headers, and keep meaningful content.
  • Tune BM25 later: start with defaults, then adjust bm25_k1 / bm25_b if ranking feels off.
  • Pair with vector search: BM25 is great for exact terms; embeddings are great for meaning. Together they’re even better.
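One common way to pair the two is Reciprocal Rank Fusion (RRF), which merges ranked lists using only each document's position in them. The sketch below is a generic client-side illustration, not a VBase feature: `rrf_fuse` is a made-up helper, the ID lists are invented sample data, and k=60 is the constant conventionally used with RRF.

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ids = [3, 1, 2]    # hypothetical IDs from a full-text search, best first
vector_ids = [3, 4, 1]  # hypothetical IDs from an embedding search, best first

print(rrf_fuse([bm25_ids, vector_ids]))  # [3, 1, 4, 2]
```

Because RRF uses ranks rather than raw scores, it avoids having to put BM25 scores and vector distances on a common scale.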