NGRAM index
The NGRAM index is a scalar index designed to accelerate LIKE filters on text.
It’s useful when you want fast filtering on:
- Prefix matches (e.g. "data%")
- Suffix matches (e.g. "%json")
- Infix / contains matches (e.g. "%vector%")
- Wildcard patterns that would otherwise require scanning many rows
In Dodil, you typically combine NGRAM filtering with vector search to build context-aware retrieval, for example:
- Search vectors for semantic similarity and restrict results to title LIKE "%release%".
- Search vectors but only inside documents where path LIKE "%/policies/%".
What NGRAM does
NGRAM indexing works by splitting text into short, overlapping substrings called n-grams.
Example (n = 3):
"Milvus" → "Mil", "ilv", "lvu", "vus"
During ingestion, these n-grams are stored in a lookup structure so that at query time, Dodil can quickly narrow down to a small set of candidate rows before doing an exact LIKE check.
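The splitting step can be sketched in a few lines of plain Python. This is an illustration only, not Dodil's internal implementation; the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every overlapping substring of length n, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n produces no grams at all.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("Milvus", 3))  # ['Mil', 'ilv', 'lvu', 'vus']
```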
How it works
NGRAM indexing happens in two phases:
- Build time (ingest / index build)
  - Split each string into n-grams.
  - Store which rows contain which grams.
- Query time (filter acceleration)
  - Extract the non-wildcard substring from your LIKE pattern.
  - Break it into grams.
  - Intersect candidate lists.
  - Apply the original LIKE filter only to the candidates.
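The two phases can be modeled with an in-memory inverted index. The sketch below is a simplified toy under stated assumptions: a single gram size `n`, a Python dict as the lookup structure, and hypothetical helper names (`build_index`, `like_to_regex`, `like_query`). The real index structure in Dodil is not described here and is likely different.

```python
import re
from collections import defaultdict


def char_ngrams(text: str, n: int) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(rows: list[str], n: int) -> dict[str, set[int]]:
    """Build time: map each gram to the ids of the rows that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for row_id, text in enumerate(rows):
        for gram in char_ngrams(text, n):
            index[gram].add(row_id)
    return index


def like_to_regex(pattern: str) -> re.Pattern:
    """Exact LIKE check: % matches any run of characters, _ matches one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like_query(rows: list[str], index: dict[str, set[int]], n: int, pattern: str) -> list[int]:
    """Query time: narrow to candidates via grams, then verify with LIKE."""
    # Literal pieces between wildcards; pieces shorter than n yield no grams.
    literals = [p for p in re.split(r"[%_]", pattern) if len(p) >= n]
    grams = set().union(*(char_ngrams(p, n) for p in literals))
    if grams:
        # A matching row must contain every gram of every literal.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # No usable gram: fall back to checking every row.
        candidates = set(range(len(rows)))
    matcher = like_to_regex(pattern)
    return sorted(r for r in candidates if matcher.fullmatch(rows[r]))


rows = ["vector search", "release notes", "Vector DB", "sector"]
index = build_index(rows, 3)
print(like_query(rows, index, 3, "%vector%"))  # [0]
```

The final exact check matters because gram intersection only guarantees that each gram occurs somewhere in the row, not that the grams occur contiguously and in order.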
Key settings
When creating an NGRAM index you configure a range:
- min_gram: the smallest gram size to generate
- max_gram: the largest gram size to generate
Practical rules:
- If the query substring length is smaller than min_gram, the index can’t help and the query may fall back to a broader scan.
- If the query substring length is between min_gram and max_gram, it can be handled efficiently.
- If the query substring length is larger than max_gram, it is decomposed using a window size of max_gram.
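The three rules above can be written as a small decision function. This is a sketch of the described behavior, not engine code: `query_grams` is a hypothetical name, and treating an in-range substring as a single lookup gram is an assumption about what "handled efficiently" means.

```python
from typing import Optional


def query_grams(literal: str, min_gram: int, max_gram: int) -> Optional[list[str]]:
    """Grams used to look up a LIKE literal, or None if the index cannot help."""
    length = len(literal)
    if length < min_gram:
        return None  # too short: expect a broader scan
    if length <= max_gram:
        return [literal]  # assumption: the whole literal is itself an indexed gram
    # Longer than max_gram: slide a window of size max_gram over the literal.
    return [literal[i:i + max_gram] for i in range(length - max_gram + 1)]


print(query_grams("v", 2, 3))       # None
print(query_grams("vec", 2, 3))     # ['vec']
print(query_grams("vector", 2, 3))  # ['vec', 'ect', 'cto', 'tor']
```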
Important behavior notes
- NGRAM decomposition is character-based and language-agnostic.
- Spaces and punctuation are treated as characters.
- Matching is case-sensitive. ("Database" and "database" are different.)
- Wider ranges (larger max_gram, smaller min_gram) create more grams and may increase index size.
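A short experiment makes these notes concrete. The helper `all_grams` is hypothetical and simply enumerates every substring whose length falls in the configured range; it is meant to illustrate the behavior, not to mirror how Dodil stores grams.

```python
def all_grams(text: str, min_gram: int, max_gram: int) -> list[str]:
    """Every substring whose length is between min_gram and max_gram."""
    return [
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    ]


title = "Data base"

# Spaces are ordinary characters and appear inside grams.
print(all_grams(title, 3, 3))
# ['Dat', 'ata', 'ta ', 'a b', ' ba', 'bas', 'ase']

# Case matters: 'Dat' is indexed, 'dat' is not.
print("dat" in all_grams(title, 3, 3))  # False

# A wider range produces more grams for the same string.
print(len(all_grams(title, 2, 3)), len(all_grams(title, 2, 5)))  # 15 26
```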
Create an NGRAM index
You can create an NGRAM index on:
- a VARCHAR field (common)
- a specific path inside a JSON field (advanced)
Below is a Dodil-style example showing the flow and typical parameters.
Example: Create on a VARCHAR field
from dodil import Client
from dodil.vbase import VBaseConfig

# Python 3.10+

# 1) Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)

# 2) Connect to VBase
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)

# 3) Get a collection and create an NGRAM index on a text field
col = vbase.collection("my_collection")

# Note: API names may differ slightly depending on your SDK version.
col.create_index(
    field_name="title",
    index_type="NGRAM",
    params={
        "min_gram": 2,
        "max_gram": 3,
    },
)

# 4) (Optional) Load the collection so queries are fast right away
col.load()

Example: Use NGRAM in a filter
# Find only rows where `title` contains the substring "vector"
results = col.query(
filter='title LIKE "%vector%"',
limit=20,
)
for row in results:
    print(row["title"])

When should you use NGRAM?
Use NGRAM when:
- You rely on LIKE filters on text fields and they’re becoming slow.
- You want to combine keyword-like filtering with vector similarity search.
- You need efficient wildcard filtering on metadata, paths, titles, tags, or identifiers.
Avoid NGRAM when:
- You only do exact matches (=). (A different scalar index may be a better fit.)
- You filter on very short strings (shorter than min_gram).
Tips for choosing min_gram / max_gram
- Start with min_gram=2, max_gram=3 for general-purpose substring filtering.
- If users search for longer substrings (e.g. 5–10 chars), increasing max_gram can help.
- If memory/index size becomes a concern, keep the range tight and avoid very large max_gram.
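One way to compare candidate settings before building an index is to run them against a sample of your own data and typical search substrings. The sketch below is a rough offline aid under stated assumptions: `tuning_report` is a hypothetical helper, and the gram count is only a proxy for index size, not a measurement of what Dodil actually stores.

```python
def tuning_report(rows: list[str], literals: list[str], min_gram: int, max_gram: int) -> dict[str, int]:
    """Rough cost/benefit numbers for one (min_gram, max_gram) choice."""
    # Total grams generated across the sample rows (proxy for index size).
    gram_occurrences = sum(
        max(len(text) - n + 1, 0)
        for text in rows
        for n in range(min_gram, max_gram + 1)
    )
    # Search substrings too short for the index to help.
    queries_too_short = sum(len(s) < min_gram for s in literals)
    return {
        "gram_occurrences": gram_occurrences,
        "queries_too_short": queries_too_short,
    }


sample_rows = ["release notes", "vector db"]
sample_literals = ["v", "db", "release"]

for lo, hi in [(2, 3), (3, 5)]:
    print((lo, hi), tuning_report(sample_rows, sample_literals, lo, hi))
```

A setting that keeps `queries_too_short` near zero while adding few extra grams is usually a reasonable starting point; actual index size and latency should still be checked on the real collection.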