NGRAM index

The NGRAM index is a scalar index designed to accelerate LIKE filters on text.

It’s useful when you want fast filtering on:

  • Prefix matches (e.g. "data%")
  • Suffix matches (e.g. "%json")
  • Infix / contains matches (e.g. "%vector%")
  • Wildcard patterns that would otherwise require scanning many rows

In Dodil, you typically combine NGRAM filtering with vector search to build context-aware retrieval, for example:

  • Search vectors for semantic similarity and restrict results to title LIKE "%release%".
  • Search vectors but only inside documents where path LIKE "%/policies/%".

What NGRAM does

NGRAM indexing works by splitting text into short, overlapping substrings called n-grams.

Example (n = 3):

  • "Milvus" → "Mil", "ilv", "lvu", "vus"
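
The decomposition above can be reproduced in a few lines of plain Python. This is only an illustrative sketch: `ngrams` is a hypothetical helper, not part of the Dodil SDK, and it says nothing about how the index is implemented internally.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every overlapping substring of length n, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("Milvus", 3))  # ['Mil', 'ilv', 'lvu', 'vus']
```

A string of length L produces L - n + 1 grams of size n, so "Milvus" (6 characters) yields 4 trigrams.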

During ingestion, these n-grams are stored in a lookup structure that records which rows contain each gram. At query time, Dodil uses it to narrow the search to a small set of candidate rows before running the exact LIKE check.

How it works

NGRAM indexing happens in two phases:

  1. Build time (ingest / index build)

    • Split each string into n-grams.
    • Store which rows contain which grams.
  2. Query time (filter acceleration)

    • Extract the non-wildcard substring from your LIKE pattern.
    • Break it into grams.
    • Look up the rows that contain each gram and intersect those candidate lists.
    • Apply the original LIKE filter only to the remaining candidates.
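
The two phases can be sketched with an in-memory dictionary. This is a toy model under simplifying assumptions (a single gram size `n`, patterns of the form %literal% only); the names `build_index` and `contains_query` are hypothetical and do not reflect Dodil's internals.

```python
from collections import defaultdict


def grams(text, n):
    """Set of distinct overlapping substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(rows, n):
    """Build time: map each gram to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for g in grams(text, n):
            index[g].add(row_id)
    return index


def contains_query(rows, index, n, literal):
    """Query time for a pattern of the form %literal%."""
    if len(literal) < n:
        # Too short to look up: every row is a candidate (full scan).
        candidates = set(range(len(rows)))
    else:
        # Rows must contain every gram of the literal, so intersect.
        posting_lists = [index.get(g, set()) for g in grams(literal, n)]
        candidates = set.intersection(*posting_lists)
    # The exact check runs on the candidates only.
    return sorted(r for r in candidates if literal in rows[r])


rows = ["vector search", "release notes", "victor hugo", "invectorize"]
index = build_index(rows, 3)
print(contains_query(rows, index, 3, "vector"))  # [0, 3]
```

The final exact check matters because sharing all grams does not guarantee a match: a row can contain every gram of the literal without containing them contiguously and in order.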

Key settings

When creating an NGRAM index, you configure a range of gram sizes:

  • min_gram: the smallest gram size to generate
  • max_gram: the largest gram size to generate

Practical rules:

  • If the query substring is shorter than min_gram, the index can’t help and the query may fall back to a broader scan.
  • If the query substring length is between min_gram and max_gram (inclusive), the index can handle it efficiently.
  • If the query substring is longer than max_gram, it is decomposed into grams using a window of size max_gram.
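
The three rules can be written as a small decision function. This is a sketch, not the engine's actual planner: `query_grams` is a hypothetical name, and treating an in-range substring as a single gram of its own length is an assumption made here for illustration.

```python
def query_grams(literal: str, min_gram: int, max_gram: int):
    """Return the grams to look up, or None if the index cannot help."""
    length = len(literal)
    if length < min_gram:
        return None  # shorter than any stored gram: broader scan
    # In range: one gram of the literal's own length (assumption).
    # Longer than max_gram: slide a window of size max_gram.
    n = min(length, max_gram)
    return [literal[i:i + n] for i in range(length - n + 1)]


print(query_grams("v", 2, 3))       # None
print(query_grams("ve", 2, 3))      # ['ve']
print(query_grams("vector", 2, 3))  # ['vec', 'ect', 'cto', 'tor']
```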

Important behavior notes

  • NGRAM decomposition is character-based and language-agnostic.
  • Spaces and punctuation are treated as characters.
  • Matching is case-sensitive. ("Database" and "database" are different.)
  • Wider ranges (larger max_gram, smaller min_gram) create more grams and may increase index size.
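
A quick way to see the character-based and case-sensitive behavior is to look at the grams directly. The `ngrams` helper below is hypothetical and only mirrors the decomposition described earlier.

```python
def ngrams(text, n):
    """Overlapping substrings of length n, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Spaces and punctuation end up inside grams like any other character.
print(ngrams("a, b", 2))  # ['a,', ', ', ' b']

# Case matters: "Dat" is a gram of "Database" but not of "database".
print("Dat" in ngrams("Database", 3))  # True
print("Dat" in ngrams("database", 3))  # False
```

In practice this means a pattern such as "%Dat%" will not match a row that stores "database" in lowercase.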

Create an NGRAM index

You can create an NGRAM index on:

  • a VARCHAR field (common)
  • a specific path inside a JSON field (advanced)

Below is a Dodil-style example showing the flow and typical parameters.

Example: Create on a VARCHAR field

    from dodil import Client
    from dodil.vbase import VBaseConfig

    # Python 3.10+

    # 1) Authorize
    c = Client(
        service_account_id="...",
        service_account_secret="...",
    )

    # 2) Connect to VBase
    vbase = c.vbase.connect(
        VBaseConfig(
            host="vbase-db-<id>.infra.dodil.cloud",
            port=443,
            scheme="https",
            db_name="db_<id>",
        )
    )

    # 3) Get a collection and create an NGRAM index on a text field
    col = vbase.collection("my_collection")

    # Note: API names may differ slightly depending on your SDK version.
    col.create_index(
        field_name="title",
        index_type="NGRAM",
        params={
            "min_gram": 2,
            "max_gram": 3,
        },
    )

    # 4) (Optional) Load the collection so queries are fast right away
    col.load()

Example: Use NGRAM in a filter

    # Find only rows where `title` contains the substring "vector"
    results = col.query(
        filter='title LIKE "%vector%"',
        limit=20,
    )

    for row in results:
        print(row["title"])

When should you use NGRAM?

Use NGRAM when:

  • You rely on LIKE filters on text fields and they’re becoming slow.
  • You want to combine keyword-like filtering with vector similarity search.
  • You need efficient wildcard filtering on metadata, paths, titles, tags, or identifiers.

Avoid NGRAM when:

  • You only do exact matches (=). (A different scalar index may be a better fit.)
  • Your LIKE patterns contain only very short substrings (shorter than min_gram), which the index cannot accelerate.

Tips for choosing min_gram / max_gram

  • Start with min_gram=2, max_gram=3 for general-purpose substring filtering.
  • If users search for longer substrings (e.g. 5–10 chars), increasing max_gram can help.
  • If memory/index size becomes a concern, keep the range tight and avoid very large max_gram.
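
To get a feel for how the range affects index size, you can count how many grams a single string produces. This is a rough back-of-the-envelope sketch: `grams_per_string` is a hypothetical helper, it assumes every size from min_gram to max_gram is generated, and it counts gram occurrences rather than distinct grams, so it is only a proxy for real index size.

```python
def grams_per_string(length: int, min_gram: int, max_gram: int) -> int:
    """Number of grams generated for one string of the given length."""
    # A string of length L has L - n + 1 grams of size n (0 if L < n).
    return sum(max(0, length - n + 1) for n in range(min_gram, max_gram + 1))


print(grams_per_string(20, 3, 3))  # 18
print(grams_per_string(20, 2, 3))  # 37
print(grams_per_string(20, 2, 5))  # 70
```

For a 20-character value, widening the range from 2–3 to 2–5 roughly doubles the number of grams per row, which is why a tight range is a sensible default.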