Phrase match lets you filter records by checking whether a specific phrase appears inside a text field.

This is useful when you want precise wording (or near-precise wording) in addition to vector similarity.

Examples:

  • Support tickets: retrieve similar tickets, but only when the text contains “refund policy”.
  • Legal / compliance: only return documents containing “terms of service”.
  • Product search: search by embeddings, but keep only results that mention “wireless charging”.

In Dodil VBase, phrase match is used as a scalar filter:

  • In a query, it returns only the records that match the phrase.
  • In a vector search, it filters candidates first; vector similarity then ranks the remaining records.

How phrase match works

When phrase match is enabled for a string field, VBase stores an inverted index with token positions. That means VBase can efficiently answer questions like:

  • Do the tokens appear in order?
  • Are the tokens adjacent?
  • How many extra tokens can appear in between?

That last point is controlled by slop.
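
The positional index and the role of slop can be illustrated with a small, self-contained sketch. This is a toy model and not VBase's implementation: the tokenizer, the function names, and the slop rule (the spread of token positions after shifting each one back by its index in the phrase, a model commonly used for sloppy phrase queries) are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import product


def tokenize(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Map token -> {doc_id: [positions]} for a dict of {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase, slop=0):
    """Return ids of documents matching the phrase within `slop` (toy model)."""
    terms = tokenize(phrase)
    if not terms:
        return set()
    postings = [index.get(term, {}) for term in terms]
    # Only documents that contain every term can match.
    candidates = set(postings[0]).intersection(*postings[1:])
    matched = set()
    for doc_id in candidates:
        # Try every way of picking one position per term.
        for combo in product(*(p[doc_id] for p in postings)):
            if len(set(combo)) < len(combo):
                continue  # the same position cannot serve two terms
            # Shift each position back by the term's index in the phrase.
            # An exact phrase yields identical shifted values (spread 0).
            shifted = [pos - i for i, pos in enumerate(combo)]
            if max(shifted) - min(shifted) <= slop:
                matched.add(doc_id)
                break
    return matched


docs = {
    1: "Machine learning is fun",
    2: "machine deep learning",
    3: "learning machine",
    4: "machine translation",
}
index = build_positional_index(docs)
for slop in (0, 1, 2):
    print(slop, sorted(phrase_match(index, "machine learning", slop)))
# 0 [1]
# 1 [1, 2]
# 2 [1, 2, 3]
```

In this model, one extra token between the terms needs a slop of 1, and a reversed pair needs a slop of 2. The exact thresholds VBase applies may differ.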

Enable phrase match

Phrase match works on VARCHAR fields (string fields).

To enable it, set both of the following on the field:

  • enable_analyzer: true (tokenizes the text)
  • enable_match: true (builds the inverted index needed for matching)

Example schema

    # Example: enable phrase match on a VARCHAR field called "text"
    from dodil import Client
    # Note: VBaseConfig must also be imported from the Dodil SDK
    # (the import path is not shown in this example).

    c = Client(
        service_account_id="...",
        service_account_secret="...",
    )

    vbase = c.vbase.connect(
        VBaseConfig(
            host="vbase-db-<id>.infra.dodil.cloud",
            port=443,
            scheme="https",
            db_name="db_<id>",
        )
    )

    schema = vbase.create_schema(enable_dynamic_field=False)

    # Primary key
    schema.add_field(
        field_name="id",
        datatype="INT64",
        is_primary=True,
        auto_id=True,
    )

    # Text field with phrase match enabled
    schema.add_field(
        field_name="text",
        datatype="VARCHAR",
        max_length=1000,
        enable_analyzer=True,
        enable_match=True,
    )

    # Vector field used for similarity search
    schema.add_field(
        field_name="embeddings",
        datatype="FLOAT_VECTOR",
        dim=768,
    )

    vbase.create_collection(
        collection_name="tech_articles",
        schema=schema,
    )

Optional: choose an analyzer

Your analyzer decides how text is tokenized, and that directly affects phrase matching.

  • Default analyzer: typically lowercases and tokenizes by whitespace/punctuation.
  • Language analyzers (e.g., English): handle stemming/stop-words differently.

If you need a specific analyzer, pass analyzer_params.

    schema.add_field(
        field_name="text",
        datatype="VARCHAR",
        max_length=1000,
        enable_analyzer=True,
        analyzer_params={"type": "english"},
        enable_match=True,
    )
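
To see why the analyzer choice matters, the sketch below compares two toy analyzers on the same text. Neither is a VBase analyzer: the function names, the stop-word list, and the crude plural stripping are hypothetical stand-ins for what a default and an English analyzer might do.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def default_like_analyzer(text):
    """Lowercase and split on whitespace/punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def english_like_analyzer(text):
    """Same split, then drop stop words and strip a trailing 's'.

    This is a crude stand-in for real stemming, not an actual stemmer.
    """
    tokens = []
    for token in default_like_analyzer(text):
        if token in STOP_WORDS:
            continue
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        tokens.append(token)
    return tokens


text = "Terms of the Services"
print(default_like_analyzer(text))  # ['terms', 'of', 'the', 'services']
print(english_like_analyzer(text))  # ['term', 'service']
```

Different analyzers produce different tokens and token positions, so the same phrase and slop value can match different records depending on the analyzer configured for the field.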

Use phrase match

Use the PHRASE_MATCH expression inside your filter.

The expression is case-insensitive.

Syntax

PHRASE_MATCH(field_name, phrase, slop)
  • field_name: the VARCHAR field to match on.
  • phrase: the phrase you want to find.
  • slop (optional): how much flexibility is allowed.

slop behavior:

  • 0 (default): exact phrase only (same order, adjacent tokens).
  • 1: allows a small gap (e.g., one extra token in between).
  • 2+: increasingly flexible (may allow bigger gaps and, depending on token positions, even reversed order).

Examples

Assume your collection has a text field and an embeddings vector field.

Query with phrase match

PHRASE_MATCH can be used in a query as a plain filter.

    filter_expr = "PHRASE_MATCH(text, 'machine learning')"  # slop defaults to 0

    rows = vbase.query(
        collection_name="tech_articles",
        filter=filter_expr,
        output_fields=["id", "text"],
    )

    for r in rows:
        print(r["id"], r["text"])

Vector search with phrase match

In search operations, phrase match narrows the candidates first, then embeddings rank the results.

    query_vector = [0.01] * 768
    filter_expr = "PHRASE_MATCH(text, 'learning machine', 1)"

    hits = vbase.search(
        collection_name="tech_articles",
        anns_field="embeddings",
        data=[query_vector],
        filter=filter_expr,
        search_params={"params": {"nprobe": 10}},
        limit=10,
        output_fields=["id", "text"],
    )

    for hit in hits[0]:
        print(hit["id"], hit["score"], hit["text"])
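
The filter-then-rank order can be sketched in plain Python without a VBase connection. This is only a conceptual illustration: the in-memory records, the exact-phrase check, and the cosine ranking are assumptions, and VBase's actual index and similarity metric may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contains_phrase(text, phrase):
    """Exact phrase check (slop 0) on lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def filtered_search(records, query_vector, phrase, limit=10):
    """Step 1: keep records containing the phrase. Step 2: rank them by similarity."""
    candidates = [r for r in records if contains_phrase(r["text"], phrase)]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["embeddings"], query_vector),
        reverse=True,
    )
    return ranked[:limit]


records = [
    {"id": 1, "text": "Intro to machine learning", "embeddings": [1.0, 0.0]},
    {"id": 2, "text": "Machine learning at scale", "embeddings": [0.6, 0.8]},
    {"id": 3, "text": "Cooking with cast iron", "embeddings": [0.6, 0.8]},
]

hits = filtered_search(records, [0.6, 0.8], "machine learning")
print([h["id"] for h in hits])  # [2, 1]
```

Record 3 has the same vector as the query but is never ranked, because it was removed by the phrase filter before similarity was considered.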

Slop examples

    # exact phrase only
    "PHRASE_MATCH(text, 'machine learning', 0)"

    # allow a small gap
    "PHRASE_MATCH(text, 'machine learning', 1)"

    # more flexibility
    "PHRASE_MATCH(text, 'machine learning', 2)"

Considerations

  • Storage cost: enabling phrase match builds an inverted index, which increases storage usage.
  • Analyzer is sticky: once you create a collection with a specific analyzer, changing it usually requires creating a new collection.
  • Escaping: your filter is a string; make sure to escape quotes and backslashes properly.

Examples:

    # single quote inside single quotes
    "PHRASE_MATCH(text, 'It\\'s fast')"

    # double quotes inside double quotes
    "PHRASE_MATCH(text, \"He said \\\"Hi\\\"\")"
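
Escaping by hand is easy to get wrong, so one option is a small helper that builds the filter string. The sketch below is not part of the Dodil SDK: `build_phrase_filter` and its field-name check are hypothetical, and the escaping rule (backslash before backslashes and single quotes) is inferred from the single-quote example above.

```python
import re

# Hypothetical rule for a safe field name: letters, digits, underscores.
_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_phrase_filter(field_name, phrase, slop=0):
    """Build a PHRASE_MATCH filter string with the phrase safely quoted."""
    if not _FIELD_NAME.fullmatch(field_name):
        raise ValueError(f"unexpected field name: {field_name!r}")
    if isinstance(slop, bool) or not isinstance(slop, int) or slop < 0:
        raise ValueError("slop must be a non-negative integer")
    # Escape backslashes first so the quote escapes are not doubled.
    escaped = phrase.replace("\\", "\\\\").replace("'", "\\'")
    if slop == 0:
        # slop defaults to 0, so it can be omitted.
        return f"PHRASE_MATCH({field_name}, '{escaped}')"
    return f"PHRASE_MATCH({field_name}, '{escaped}', {slop})"


print(build_phrase_filter("text", "machine learning", 1))
# PHRASE_MATCH(text, 'machine learning', 1)
print(build_phrase_filter("text", "It's fast"))
# PHRASE_MATCH(text, 'It\'s fast')
```

The returned string can be passed as the filter argument in the query and search examples above.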