Phrase match lets you filter records by checking whether a specific phrase appears inside a text field.
This is useful when you want precise wording (or near-precise wording) in addition to vector similarity.
Examples:
- Support tickets: retrieve similar tickets, but only when the text contains “refund policy”.
- Legal / compliance: only return documents containing “terms of service”.
- Product search: search by embeddings, but keep results that mention “wireless charging”.
In Dodil VBase, phrase match is used as a scalar filter:
- In a query, it returns only the records that match the phrase.
- In a vector search, it filters candidates first, then vectors are used to rank the remaining records.
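The filter-then-rank order described above can be sketched in plain Python. This is only an illustration and does not use the VBase API: the `filtered_search` helper, the record layout, and the lowercase substring test standing in for phrase match are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(records, query_vector, phrase, limit=10):
    # Step 1: scalar filter. A lowercase substring test stands in for
    # real phrase matching (assumption for this sketch).
    candidates = [r for r in records if phrase.lower() in r["text"].lower()]
    # Step 2: rank only the surviving candidates by vector similarity.
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["embedding"], query_vector),
        reverse=True,
    )
    return ranked[:limit]

records = [
    {"id": 1, "text": "Our refund policy covers 30 days", "embedding": [0.9, 0.1]},
    {"id": 2, "text": "Shipping times vary by region", "embedding": [1.0, 0.0]},
    {"id": 3, "text": "See the Refund Policy for details", "embedding": [0.2, 0.8]},
]
top = filtered_search(records, [1.0, 0.0], "refund policy")
print([r["id"] for r in top])  # [1, 3]
```

Record 2 is the closest vector to the query, but it never reaches the ranking step because it does not contain the phrase.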
How phrase match works
When phrase match is enabled for a string field, VBase stores an inverted index with token positions. That means VBase can efficiently answer questions like:
- Do the tokens appear in order?
- Are the tokens adjacent?
- How many extra tokens can appear in-between?
That last point is controlled by slop.
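To make the idea of an inverted index with token positions concrete, here is a minimal sketch in plain Python. It is not VBase's implementation: the tokenizer, the `build_index` and `phrase_match` helpers, and the simplified rule (tokens must appear in order, with at most `slop` extra tokens in total between them) are assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # token -> doc_id -> list of positions where the token occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            index[tok][doc_id].append(pos)
    return index

def _extends(postings, doc_id, i, prev_pos, budget):
    # Try to place phrase token i after prev_pos, spending gap tokens from budget.
    if i == len(postings):
        return True
    for pos in postings[i][doc_id]:
        gap = pos - prev_pos - 1
        if 0 <= gap <= budget and _extends(postings, doc_id, i + 1, pos, budget - gap):
            return True
    return False

def phrase_match(index, phrase, slop=0):
    tokens = tokenize(phrase)
    if not tokens:
        return set()
    postings = [index.get(t, {}) for t in tokens]
    # Only documents containing every phrase token can match.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    return {
        doc_id
        for doc_id in candidates
        if any(_extends(postings, doc_id, 1, start, slop) for start in postings[0][doc_id])
    }

docs = {
    1: "Machine learning is fun",
    2: "Machine and deep learning",
    3: "Learning machine basics",
}
index = build_index(docs)
print(sorted(phrase_match(index, "machine learning", 0)))  # [1]
print(sorted(phrase_match(index, "machine learning", 2)))  # [1, 2]
```

Because positions are stored alongside each token, the order, adjacency, and gap questions are answered by comparing integers rather than rescanning the text. This simplified rule never matches reversed order, so document 3 is excluded at any slop.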
Enable phrase match
Phrase match works on VARCHAR fields (string fields).
To enable it, set both of the following on the field:
- enable_analyzer: true (tokenizes the text)
- enable_match: true (builds the inverted index needed for matching)
Example schema
# Example: enable phrase match on a VARCHAR field called "text"
from dodil import Client  # VBaseConfig (used below) must be imported too; its import path is not shown here
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
schema = vbase.create_schema(enable_dynamic_field=False)
schema.add_field(
    field_name="id",
    datatype="INT64",
    is_primary=True,
    auto_id=True,
)
schema.add_field(
    field_name="text",
    datatype="VARCHAR",
    max_length=1000,
    enable_analyzer=True,
    enable_match=True,
)
schema.add_field(
    field_name="embeddings",
    datatype="FLOAT_VECTOR",
    dim=768,
)
vbase.create_collection(
    collection_name="tech_articles",
    schema=schema,
)
Optional: choose an analyzer
Your analyzer decides how text is tokenized, and that directly affects phrase matching.
- Default analyzer: typically lowercases and tokenizes by whitespace/punctuation.
- Language analyzers (e.g., English): handle stemming/stop-words differently.
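The effect of the analyzer choice can be illustrated with two toy tokenizers. These are not VBase's analyzers: `default_like`, `english_like`, the stop-word list, and the crude plural stripping are hypothetical stand-ins meant only to show that the same text can produce different token sequences, and therefore different phrase matches.

```python
import re

# Tiny illustrative stop-word list (assumption; real analyzers use longer lists).
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in"}

def default_like(text):
    # Lowercase, then split on whitespace and punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    # Strip a plural "s" from longer words; real stemmers are far more careful.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def english_like(text):
    # Drop stop words, then stem what remains.
    return [crude_stem(t) for t in default_like(text) if t not in STOP_WORDS]

print(default_like("The Terms of Service"))  # ['the', 'terms', 'of', 'service']
print(english_like("The Terms of Service"))  # ['term', 'service']
```

With the first tokenizer, "terms" and "service" are separated by "of"; with the second they are adjacent and "terms" has become "term". A phrase and slop value that match under one analyzer may not match under the other.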
If you need a specific analyzer, pass analyzer_params.
schema.add_field(
    field_name="text",
    datatype="VARCHAR",
    max_length=1000,
    enable_analyzer=True,
    analyzer_params={"type": "english"},
    enable_match=True,
)
Use phrase match
Use the PHRASE_MATCH expression inside your filter.
The expression is case-insensitive.
Syntax
PHRASE_MATCH(field_name, phrase, slop)
- field_name: the VARCHAR field to match on.
- phrase: the phrase you want to find.
- slop (optional): how much flexibility is allowed.
slop behavior:
- 0 (default): exact phrase only (same order, adjacent tokens).
- 1: allows a small gap (e.g., one extra token in-between).
- 2+: increasingly flexible (may allow bigger gaps and, depending on token positions, even reversed order).
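One way to build intuition for these values is the position-displacement model used by Lucene-style sloppy phrase queries, where slop counts how many positions a token must move to line up with the phrase. The sketch below applies that model to a two-token phrase. Whether VBase follows exactly this rule is an assumption, and `required_slop` is a hypothetical helper, not part of any API.

```python
def required_slop(text_tokens, first, second):
    """Smallest slop at which the two-token phrase `first second` would match,
    under a position-displacement model. Returns None if either token is absent.
    Assumes `first` and `second` are different tokens."""
    first_pos = [i for i, t in enumerate(text_tokens) if t == first]
    second_pos = [i for i, t in enumerate(text_tokens) if t == second]
    if not first_pos or not second_pos:
        return None
    # An exact match puts `second` at position f + 1; measure how far off it is.
    return min(abs(s - (f + 1)) for f in first_pos for s in second_pos)

print(required_slop("machine learning is fun".split(), "machine", "learning"))  # 0
print(required_slop("machine deep learning".split(), "machine", "learning"))    # 1
print(required_slop("learning machine".split(), "machine", "learning"))         # 2
```

Under this model, swapping two adjacent tokens costs two moves, which is consistent with reversed order only becoming possible at slop 2 or higher.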
Examples
Assume your collection has a text field and an embeddings vector field.
Query with phrase match
PHRASE_MATCH can be used in a query as a plain filter.
filter_expr = "PHRASE_MATCH(text, 'machine learning')"  # slop defaults to 0
rows = vbase.query(
    collection_name="tech_articles",
    filter=filter_expr,
    output_fields=["id", "text"],
)
for r in rows:
    print(r["id"], r["text"])
Vector search with phrase match
In search operations, phrase match narrows the candidates first, then embeddings rank the results.
query_vector = [0.01] * 768
filter_expr = "PHRASE_MATCH(text, 'learning machine', 1)"
hits = vbase.search(
    collection_name="tech_articles",
    anns_field="embeddings",
    data=[query_vector],
    filter=filter_expr,
    search_params={"params": {"nprobe": 10}},
    limit=10,
    output_fields=["id", "text"],
)
for hit in hits[0]:
    print(hit["id"], hit["score"], hit["text"])
Slop examples
# exact phrase only
"PHRASE_MATCH(text, 'machine learning', 0)"
# allow a small gap
"PHRASE_MATCH(text, 'machine learning', 1)"
# more flexibility
"PHRASE_MATCH(text, 'machine learning', 2)"
Considerations
- Storage cost: enabling phrase match builds an inverted index, which increases storage usage.
- Analyzer is sticky: once you create a collection with a specific analyzer, changing it usually requires creating a new collection.
- Escaping: your filter is a string; make sure to escape quotes and backslashes properly.
Examples:
"PHRASE_MATCH(text, 'It\\'s fast')" # single quote inside single quotes
"PHRASE_MATCH(text, \"He said \\\"Hi\\\"\")" # double quotes inside double quotes