NGRAM index

The NGRAM index is a scalar index designed to accelerate LIKE filters on text.

It’s useful when you want fast filtering on:

Prefix matches (e.g. "data%")
Suffix matches (e.g. "%json")
Infix / contains matches (e.g. "%vector%")
Wildcard patterns that would otherwise require scanning many rows

In Dodil, you typically combine NGRAM filtering with vector search to build context-aware retrieval, for example:

Search vectors for semantic similarity and restrict results to title LIKE "%release%".
Search vectors but only inside documents where path LIKE "%/policies/%".
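
To make the pattern concrete, here is a minimal, self-contained sketch of "filter first, then rank by similarity". It does not use the Dodil SDK: the in-memory `docs` list, the `cosine` helper, and the `filtered_search` function are all hypothetical names introduced for illustration, and a plain substring test stands in for the LIKE filter.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy data (assumption): each document has a title and a 2-d embedding.
docs = [
    {"title": "release notes 2.1", "vec": [0.9, 0.1]},
    {"title": "vector basics", "vec": [1.0, 0.0]},
    {"title": "upcoming release plan", "vec": [0.2, 0.8]},
]

def filtered_search(query_vec: list[float], substring: str, top_k: int) -> list[str]:
    # Step 1: keep only rows whose title contains the substring (stand-in for LIKE "%...%").
    candidates = [d for d in docs if substring in d["title"]]
    # Step 2: rank the surviving rows by vector similarity.
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["title"] for d in ranked[:top_k]]

print(filtered_search([1.0, 0.0], "release", 2))
# ['release notes 2.1', 'upcoming release plan']
```

Note that "vector basics" is the closest vector to the query but is excluded because its title does not pass the filter.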

What NGRAM does

NGRAM indexing works by splitting text into short, overlapping substrings called n-grams.

Example (n = 3):

"Milvus" → "Mil", "ilv", "lvu", "vus"

During ingestion, these n-grams are stored in a lookup structure so that at query time, Dodil can quickly narrow down to a small set of candidate rows before doing an exact LIKE check.
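
The splitting step can be sketched in a few lines of Python. The `ngrams` helper below is a hypothetical illustration of the idea, not the engine's actual implementation.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every overlapping substring of length n, in order of appearance."""
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("Milvus", 3))  # ['Mil', 'ilv', 'lvu', 'vus']
```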

How it works

NGRAM indexing happens in two phases:

Build time (ingest / index build)
- Split each string into n-grams.
- Store which rows contain which grams.
Query time (filter acceleration)
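- Extract the literal (non-wildcard) substring from your LIKE pattern.
- Break it into grams.
- Look up the rows that contain each gram and intersect those lists to get a small candidate set.
- Apply the original LIKE filter only to the candidates, which removes any false positives.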
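
The two phases above can be sketched as a toy in-memory index. This is an illustration under simplifying assumptions, not how Dodil stores data: it uses a single gram size `n`, handles only the `%` wildcard, and picks the longest literal run in the pattern. All function names are hypothetical.

```python
import re
from collections import defaultdict

def grams(text: str, n: int) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(rows: list[str], n: int) -> dict[str, set[int]]:
    """Build time: map each gram to the set of row ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for row_id, text in enumerate(rows):
        for gram in grams(text, n):
            index[gram].add(row_id)
    return index

def like_matches(pattern: str, text: str) -> bool:
    """Exact LIKE check, supporting only the % wildcard (assumption)."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None

def like_query(rows: list[str], index: dict[str, set[int]], pattern: str, n: int) -> list[int]:
    """Query time: narrow to candidates via the index, then verify with LIKE."""
    literal = max(pattern.split("%"), key=len)  # longest non-wildcard run
    if len(literal) < n:
        candidates = set(range(len(rows)))      # index cannot help: scan everything
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams(literal, n)))
    return [i for i in sorted(candidates) if like_matches(pattern, rows[i])]

rows = ["vector search basics", "release notes", "intro to vectors", "Vector databases"]
index = build_index(rows, 3)
print(like_query(rows, index, "%vector%", 3))  # [0, 2]
print(like_query(rows, index, "release%", 3))  # [1]
```

Row 3 ("Vector databases") is never a candidate for "%vector%" because it has no "vec" gram, which reflects the case-sensitive matching.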

Key settings

When creating an NGRAM index you configure a range:

min_gram: the smallest gram size to generate
max_gram: the largest gram size to generate

Practical rules:

If the query substring length is smaller than min_gram, the index can’t help and the query may fall back to a broader scan.
If the query substring length is between min_gram and max_gram (inclusive), the index can serve the lookup directly.
If the query substring length is larger than max_gram, it is decomposed into overlapping grams using a sliding window of size max_gram.
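
One way to read these three rules is as a small decision function. The sketch below is an interpretation for illustration only: `query_grams` is a hypothetical helper, and treating an in-range substring as a single lookup key is an assumption about how the engine behaves.

```python
from typing import Optional

def query_grams(literal: str, min_gram: int, max_gram: int) -> Optional[list[str]]:
    """Return the grams to look up, or None when the index cannot help."""
    length = len(literal)
    if length < min_gram:
        return None                  # too short: fall back to a broader scan
    if length <= max_gram:
        return [literal]             # in range: look up the substring itself
    # Too long: slide a window of size max_gram across the substring.
    return [literal[i:i + max_gram] for i in range(length - max_gram + 1)]

print(query_grams("v", 2, 3))       # None
print(query_grams("ve", 2, 3))      # ['ve']
print(query_grams("vector", 2, 3))  # ['vec', 'ect', 'cto', 'tor']
```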

Important behavior notes

NGRAM decomposition is character-based and language-agnostic.
Spaces and punctuation are treated as characters.
Matching is case-sensitive. ("Database" and "database" are different.)
Wider ranges (larger max_gram, smaller min_gram) create more grams and may increase index size.
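
A short, illustrative check of the first three notes, using a hypothetical `grams` helper rather than any Dodil API:

```python
def grams(text: str, n: int) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Spaces and punctuation are ordinary characters inside a gram.
print(sorted(grams("a, b", 2)))        # [' b', ', ', 'a,']

# Case matters: "Database" produces "Dat", not "dat".
print("dat" in grams("Database", 3))   # False
print("dat" in grams("database", 3))   # True
```

Because the gram "dat" is missing for "Database", a filter such as title LIKE "%database%" would not return that row.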

Create an NGRAM index

You can create an NGRAM index on:

a VARCHAR field (common)
a specific path inside a JSON field (advanced)

Below is a Dodil-style example showing the flow and typical parameters.

Example: Create on a VARCHAR field


from dodil import Client
from dodil.vbase import VBaseConfig
 
# Python 3.10+
 
# 1) Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
 
# 2) Connect to VBase
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
 
# 3) Get a collection and create an NGRAM index on a text field
col = vbase.collection("my_collection")
 
# Note: API names may differ slightly depending on your SDK version.
col.create_index(
    field_name="title",
    index_type="NGRAM",
    params={
        "min_gram": 2,
        "max_gram": 3,
    },
)
 
# 4) (Optional) Load the collection so queries are fast right away
col.load()

Example: Use NGRAM in a filter


# Find only rows where `title` contains the substring "vector"
results = col.query(
    filter='title LIKE "%vector%"',
    limit=20,
)
 
for row in results:
    print(row["title"])

When should you use NGRAM?

Use NGRAM when:

You rely on LIKE filters on text fields and they’re becoming slow.
You want to combine keyword-like filtering with vector similarity search.
You need efficient wildcard filtering on metadata, paths, titles, tags, or identifiers.

Avoid NGRAM when:

You only do exact matches (=). (A different scalar index may be a better fit.)
Your LIKE patterns mostly contain very short literal substrings (shorter than min_gram), since the index can’t be used for them.

Tips for choosing min_gram / max_gram

Start with min_gram=2, max_gram=3 for general-purpose substring filtering.
If users search for longer substrings (e.g. 5–10 characters), increasing max_gram can help, because longer grams are more selective and yield fewer candidate rows.
If memory/index size becomes a concern, keep the range tight and avoid very large max_gram.
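
To get a rough feel for how the range affects index size before committing to it, you can count the distinct grams a sample of your data would produce. This is only a proxy (a real index also stores row lists per gram), and `gram_count` and the `sample` values are hypothetical names chosen for illustration.

```python
def gram_count(values: list[str], min_gram: int, max_gram: int) -> int:
    """Count distinct grams across all values for every size in the range."""
    seen: set[str] = set()
    for text in values:
        for n in range(min_gram, max_gram + 1):
            for i in range(len(text) - n + 1):
                seen.add(text[i:i + n])
    return len(seen)

# Replace with a representative sample of your own field values.
sample = ["release notes", "vector search basics", "/policies/security.md"]

for lo, hi in [(3, 3), (2, 3), (2, 5)]:
    print(f"min_gram={lo} max_gram={hi}: {gram_count(sample, lo, hi)} distinct grams")
```

On any non-trivial sample the count grows as the range widens, which is why a tight range is the safer default when size is a concern.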