Phrase match

Phrase match lets you filter records by checking whether a specific phrase appears inside a text field.

This is useful when you need exact or near-exact wording in addition to vector similarity.

Examples:

Support tickets: retrieve similar tickets, but only when the text contains “refund policy”.
Legal / compliance: only return documents containing “terms of service”.
Product search: search by embeddings, but keep only results that mention “wireless charging”.

In Dodil VBase, phrase match is used as a scalar filter:

In a query, it returns only the records that match the phrase.
In a vector search, it filters candidates first, then vectors are used to rank the remaining records.

How phrase match works

When phrase match is enabled for a string field, VBase stores an inverted index with token positions. That means VBase can efficiently answer questions like:

Do the tokens appear in order?
Are the tokens adjacent?
How many extra tokens can appear in between?

That last point is controlled by slop.
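
To make the position checks concrete, here is a toy model in plain Python. It is not VBase's implementation: the tokenizer, the phrase_matches helper, and the slop rule (shift each matched position back by the token's offset in the phrase, then require the spread of the shifted positions to be at most slop) are illustrative assumptions chosen to agree with the behavior described on this page.

```python
import re
from collections import defaultdict
from itertools import product


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positions(text):
    """Map each token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, token in enumerate(tokenize(text)):
        positions[token].append(pos)
    return positions


def phrase_matches(text, phrase, slop=0):
    """Toy sloppy-phrase check (an assumption, not the VBase algorithm)."""
    positions = build_positions(text)
    terms = tokenize(phrase)
    if not terms or any(term not in positions for term in terms):
        return False
    # Try every way of picking one position per phrase token.
    for combo in product(*(positions[term] for term in terms)):
        if len(set(combo)) != len(combo):
            continue  # the same position cannot serve two phrase tokens
        # An exact phrase puts token i at start + i, so all shifted values are equal.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


doc = "Machine deep learning is fun"
print(phrase_matches(doc, "machine learning", slop=0))  # False
print(phrase_matches(doc, "machine learning", slop=1))  # True
```

In this model, reversed adjacent tokens ("learning machine" against the phrase "machine learning") need a slop of 2, which is consistent with the slop behavior described later on this page.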

Enable phrase match

Phrase match works on VARCHAR fields (string fields).

To enable it, set both of the following on the field:

enable_analyzer: true (tokenizes the text)
enable_match: true (builds the inverted index needed for matching)

Example schema


# Example: enable phrase match on a VARCHAR field called "text"
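# VBaseConfig is used below; its import location is an assumption, adjust it to match your SDK
from dodil import Client, VBaseConfig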
 
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
 
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
 
schema = vbase.create_schema(enable_dynamic_field=False)
 
schema.add_field(
    field_name="id",
    datatype="INT64",
    is_primary=True,
    auto_id=True,
)
 
schema.add_field(
    field_name="text",
    datatype="VARCHAR",
    max_length=1000,
    enable_analyzer=True,
    enable_match=True,
)
 
schema.add_field(
    field_name="embeddings",
    datatype="FLOAT_VECTOR",
    dim=768,
)
 
vbase.create_collection(
    collection_name="tech_articles",
    schema=schema,
)

Optional: choose an analyzer

Your analyzer decides how text is tokenized, and that directly affects phrase matching.

Default analyzer: typically lowercases and tokenizes by whitespace/punctuation.
Language analyzers (e.g., English): apply language-specific stemming and stop-word handling, so the stored tokens can differ from the raw words.

If you need a specific analyzer, pass analyzer_params.


schema.add_field(
    field_name="text",
    datatype="VARCHAR",
    max_length=1000,
    enable_analyzer=True,
    analyzer_params={"type": "english"},
    enable_match=True,
)
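
To see why the analyzer choice matters, the sketch below compares two toy tokenizers. Both are stand-ins written for illustration: the stop-word list and the crude suffix stripping are assumptions and do not reproduce the actual default or English analyzers in VBase.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "is"}


def standard_tokens(text):
    """Lowercase and split on whitespace/punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def english_like_tokens(text):
    """Drop stop words and strip a couple of common suffixes (very crude stemming)."""
    tokens = []
    for token in standard_tokens(text):
        if token in STOP_WORDS:
            continue
        for suffix in ("ing", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        tokens.append(token)
    return tokens


print(standard_tokens("Terms of Service"))      # ['terms', 'of', 'service']
print(english_like_tokens("Terms of Service"))  # ['term', 'service']
```

With the first tokenizer, "term of services" and "Terms of Service" produce different token sequences; with the second, both reduce to the same tokens. Phrase match compares tokens, not raw text, so the same query can match different records depending on the analyzer.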

Use phrase match

Use the PHRASE_MATCH expression inside your filter.

The expression name is case-insensitive, so PHRASE_MATCH and phrase_match are equivalent.

Syntax


PHRASE_MATCH(field_name, phrase, slop)

field_name: the VARCHAR field to match on.
phrase: the phrase you want to find.
slop (optional): an integer that controls how many extra token positions are tolerated between the phrase's tokens. Defaults to 0.

slop behavior:

0 (default): exact phrase only (same order, adjacent tokens).
1: allows a small gap (e.g., one extra token in between).
2+: increasingly flexible (may allow bigger gaps and, depending on token positions, even reversed order).

Examples

Assume your collection has a text field and an embeddings vector field.

Query with phrase match

PHRASE_MATCH can be used in a query as a plain filter.


filter_expr = "PHRASE_MATCH(text, 'machine learning')"  # slop defaults to 0
 
rows = vbase.query(
    collection_name="tech_articles",
    filter=filter_expr,
    output_fields=["id", "text"],
)
 
for r in rows:
    print(r["id"], r["text"])

Vector search with phrase match

In search operations, phrase match narrows the candidates first, then embeddings rank the results.


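# Placeholder vector; use a real embedding with the same dimension as the schema (768)
query_vector = [0.01] * 768
 
# slop=1 tolerates one extra token between "learning" and "machine"
filter_expr = "PHRASE_MATCH(text, 'learning machine', 1)"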
 
hits = vbase.search(
    collection_name="tech_articles",
    anns_field="embeddings",
    data=[query_vector],
    filter=filter_expr,
    search_params={"params": {"nprobe": 10}},
    limit=10,
    output_fields=["id", "text"],
)
 
for hit in hits[0]:
    print(hit["id"], hit["score"], hit["text"])
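
The filter-then-rank order can be pictured with a small in-memory sketch. It is a conceptual model only: the filtered_search helper is hypothetical, a case-insensitive substring test stands in for phrase match, and brute-force cosine similarity stands in for the vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_search(records, query_vector, phrase, limit):
    """Step 1: keep records containing the phrase. Step 2: rank survivors by similarity."""
    candidates = [r for r in records if phrase.lower() in r["text"].lower()]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["embeddings"], query_vector),
        reverse=True,
    )
    return ranked[:limit]


records = [
    {"id": 1, "text": "Refund policy for annual plans", "embeddings": [1.0, 0.0]},
    {"id": 2, "text": "Shipping times", "embeddings": [0.9, 0.1]},
    {"id": 3, "text": "Our refund policy changed", "embeddings": [0.0, 1.0]},
]

for r in filtered_search(records, [0.8, 0.6], "refund policy", limit=10):
    print(r["id"], r["text"])
```

Record 2 is close to the query vector but never reaches the ranking step because it does not contain the phrase. This is the practical consequence of filtering first: a search can return fewer than limit results when few records match the phrase.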

Slop examples


# exact phrase only
"PHRASE_MATCH(text, 'machine learning', 0)"
 
# allow a small gap
"PHRASE_MATCH(text, 'machine learning', 1)"
 
# more flexibility
"PHRASE_MATCH(text, 'machine learning', 2)"

Considerations

Storage cost: enabling phrase match builds an inverted index, which increases storage usage.
Analyzer is sticky: once you create a collection with a specific analyzer, changing it usually requires creating a new collection.
Escaping: your filter is a string; make sure to escape quotes and backslashes properly.

Examples:


"PHRASE_MATCH(text, 'It\\'s fast')"          # single quote inside single quotes
"PHRASE_MATCH(text, \"He said \\\"Hi\\\"\")"  # double quotes inside double quotes
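
If phrases come from user input, a small helper can apply the escaping consistently. The sketch below is an assumption, not part of the VBase SDK: the phrase_match_filter name is hypothetical, it always wraps the phrase in single quotes, and it assumes a backslash is escaped by doubling it, in line with the single-quote example above.

```python
import re

# Conservative check so a field name cannot smuggle extra expression syntax.
_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def phrase_match_filter(field_name: str, phrase: str, slop: int = 0) -> str:
    """Build a PHRASE_MATCH filter string with the phrase safely quoted."""
    if not _FIELD_NAME.fullmatch(field_name):
        raise ValueError(f"unexpected field name: {field_name!r}")
    if slop < 0:
        raise ValueError("slop must be zero or a positive integer")
    # Escape backslashes first, then single quotes, so added backslashes are not doubled again.
    escaped = phrase.replace("\\", "\\\\").replace("'", "\\'")
    return f"PHRASE_MATCH({field_name}, '{escaped}', {slop})"


print(phrase_match_filter("text", "machine learning"))
print(phrase_match_filter("text", "It's fast", slop=1))
```

The returned string can be passed as the filter argument in the query and search examples above.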