Skip to Content
We are live but in Staging 🎉

MinHash LSH

MinHash LSH is a specialized indexing technique for near-duplicate detection and set-similarity search.

Unlike semantic embeddings (which capture meaning), MinHash is designed to answer questions like:

  • “Do these two items share mostly the same tokens?”
  • “Is this record a duplicate or near-duplicate of something we’ve already stored?”

This is extremely useful for deduplication, de-duplicating ingestion pipelines, and detecting repeated / highly similar structured records.

VBase supports MinHash LSH (backed by Milvus) through a MINHASH_LSH index and the MHJACCARD metric.


When should I use MinHash LSH?

Use MinHash LSH when your similarity is naturally expressed as overlap between sets.

Common examples:

  • Document deduplication: avoid storing the same PDF/text multiple times.
  • Support ticket deduplication: group tickets that reuse the same phrases/keywords.
  • Product catalog cleanup: detect duplicate listings that share many attributes/tokens.
  • Log / event deduplication: collapse repeated event payloads with tiny variations.

If you want “meaning similarity” (synonyms, paraphrases, intent), use dense embeddings + vector search instead.


The idea in one minute

  1. Convert each item into a token set (words, shingles, or concatenated structured fields).
  2. Compute a MinHash signature that approximates Jaccard similarity between token sets.
  3. Store that signature as a binary vector.
  4. Build a MINHASH_LSH index.
  5. Search before insert to find near-duplicates.

Prerequisites

  • Python 3.10+
  • Install a MinHash implementation (this guide uses datasketch) and NumPy:
pip install datasketch numpy

Generate MinHash signatures

In this guide we generate 256 hash values, each represented as a 64-bit integer.

  • Total signature size = 256 Ă— 64 bits = 2048 bytes
  • VBase stores this as a BINARY_VECTOR with dimension in bits:
    • VECTOR_DIM = 256 Ă— 64 = 8192
from datasketch import MinHash import numpy as np MINHASH_DIM = 256 HASH_BIT_WIDTH = 64 def generate_minhash_signature(text: str, num_perm: int = MINHASH_DIM) -> bytes: """Return a MinHash signature as big-endian uint64 bytes.""" m = MinHash(num_perm=num_perm) for token in text.lower().split(): m.update(token.encode("utf8")) # 256 uint64 values -> 2048 bytes return m.hashvalues.astype(">u8").tobytes()

By default, MinHash LSH gives you a fast approximate candidate set.

If you want more accurate Jaccard similarity (fewer false positives), you can store the raw token set and enable refinement:

  • Store token_set as a VARCHAR field
  • Build the index with with_raw_data: True
  • Search with mh_search_with_jaccard: True
def extract_token_set(text: str) -> str: tokens = set(text.lower().split()) return " ".join(tokens)

Create a collection for MinHash LSH

You need a schema with:

  • A primary key (e.g., doc_id)
  • A BINARY_VECTOR field for MinHash signatures
  • (Optional, recommended) a VARCHAR field to store token_set for refined search
  • (Optional) a document field for storing the original text

Example schema (conceptual):

  • VECTOR_DIM = MINHASH_DIM Ă— HASH_BIT_WIDTH = 8192
MINHASH_DIM = 256 HASH_BIT_WIDTH = 64 VECTOR_DIM = MINHASH_DIM * HASH_BIT_WIDTH # bits schema = { "auto_id": False, "enable_dynamic_field": False, "fields": [ {"name": "doc_id", "dtype": "INT64", "is_primary": True}, {"name": "minhash_signature", "dtype": "BINARY_VECTOR", "dim": VECTOR_DIM}, {"name": "token_set", "dtype": "VARCHAR", "max_length": 1000}, {"name": "document", "dtype": "VARCHAR", "max_length": 1000}, ], } vbase.create_collection("minhash_demo", schema=schema)

Build the MINHASH_LSH index

To use MinHash search, create an index with:

  • index_type: MINHASH_LSH
  • metric_type: MHJACCARD
  • Parameters:
    • mh_element_bit_width: must match the bit width used in your signature (typically 64)
    • mh_lsh_band: controls the recall/speed trade-off (more bands typically increases recall but can cost more)
    • with_raw_data: set True if you want refined Jaccard search using token_set
index_params = { "index_type": "MINHASH_LSH", "metric_type": "MHJACCARD", "params": { "mh_element_bit_width": HASH_BIT_WIDTH, "mh_lsh_band": 16, "with_raw_data": True, }, } vbase.create_index( collection_name="minhash_demo", field_name="minhash_signature", index_params=index_params, )

Insert data

documents = [ "machine learning algorithms process data automatically", "deep learning uses neural networks to model patterns", ] rows = [] for i, doc in enumerate(documents): sig = generate_minhash_signature(doc) token_str = extract_token_set(doc) rows.append( { "doc_id": i, "minhash_signature": sig, "token_set": token_str, "document": doc, } ) vbase.insert("minhash_demo", rows)

Search for near-duplicates

Fast approximate search (LSH only)

This finds candidates quickly using only the MinHash signature + LSH index.

query = "machine learning algorithms process data" q_sig = generate_minhash_signature(query) results = vbase.search( collection_name="minhash_demo", data=[q_sig], anns_field="minhash_signature", limit=10, output_fields=["doc_id", "document"], ) print(results)

Refined search (true Jaccard on your stored tokens)

If you built the index with with_raw_data: True and you store token_set, you can refine the search:

  • mh_search_with_jaccard: True tells VBase to compute a more accurate Jaccard similarity using the raw token set.
  • refine_k controls how many candidates to re-check with the refined calculation.
query = "machine learning algorithms process data" q_sig = generate_minhash_signature(query) results = vbase.search( collection_name="minhash_demo", data=[q_sig], anns_field="minhash_signature", limit=10, output_fields=["doc_id", "document"], search_params={ "mh_search_with_jaccard": True, "refine_k": 50, }, ) print(results)

Practical recommendation

For production dedup pipelines, a common pattern is:

  1. Search-before-insert using MinHash LSH
  2. If the top hit is above your similarity threshold (or within your tolerance), treat as duplicate
  3. Otherwise, insert the new item and continue

This keeps your dataset clean and prevents duplicates from wasting storage, indexing time, and search quality.

Last updated on