MinHash LSH
MinHash LSH is a specialized indexing technique for near-duplicate detection and set-similarity search.
Unlike semantic embeddings (which capture meaning), MinHash is designed to answer questions like:
- “Do these two items share mostly the same tokens?”
- “Is this record a duplicate or near-duplicate of something we’ve already stored?”
This makes it well suited to deduplicating ingestion pipelines and detecting repeated or highly similar structured records.
VBase supports MinHash LSH (backed by Milvus) through a MINHASH_LSH index and the MHJACCARD metric.
When should I use MinHash LSH?
Use MinHash LSH when your similarity is naturally expressed as overlap between sets.
Common examples:
- Document deduplication: avoid storing the same PDF/text multiple times.
- Support ticket deduplication: group tickets that reuse the same phrases/keywords.
- Product catalog cleanup: detect duplicate listings that share many attributes/tokens.
- Log / event deduplication: collapse repeated event payloads with tiny variations.
If you want “meaning similarity” (synonyms, paraphrases, intent), use dense embeddings + vector search instead.
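As a rough illustration of what "overlap between sets" means here, the sketch below computes exact Jaccard similarity over whitespace tokens. The helper names (token_set, jaccard) and the sample sentences are made up for this example and are not part of VBase.

```python
def token_set(text: str) -> set[str]:
    """Lowercase and split on whitespace, the same tokenization used later in this guide."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets count as identical
    return len(a & b) / len(a | b)


original = token_set("machine learning algorithms process data automatically")
near_dup = token_set("machine learning algorithms process data")
paraphrase = token_set("ML models crunch information without human help")

print(jaccard(original, near_dup))    # 5 shared tokens out of 6 distinct tokens
print(jaccard(original, paraphrase))  # no shared tokens, so the overlap is zero
```

The paraphrase scores zero even though a human would call it related, which is the case where dense embeddings are the better tool.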
The idea in one minute
- Convert each item into a token set (words, shingles, or concatenated structured fields).
- Compute a MinHash signature that approximates Jaccard similarity between token sets.
- Store that signature as a binary vector.
- Build a MINHASH_LSH index.
- Search before insert to find near-duplicates.
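To make the first two steps concrete, here is a minimal standard-library sketch of the MinHash idea: for each of several salted hash functions, keep the smallest hash seen over the token set, then estimate Jaccard similarity as the fraction of signature slots that agree. The functions toy_minhash and estimate_jaccard are hypothetical helpers for illustration. They use salted BLAKE2b rather than the hashing scheme in datasketch, so their output is not compatible with the signatures stored in VBase.

```python
import hashlib


def toy_minhash(tokens: set[str], num_perm: int = 256) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash over all tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt stands in for a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(tok.encode("utf8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for tok in tokens
            )
        )
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the two signatures hold the same minimum."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = set("machine learning algorithms process data automatically".split())
b = set("machine learning algorithms process data".split())

exact = len(a & b) / len(a | b)
estimated = estimate_jaccard(toy_minhash(a), toy_minhash(b))
print(f"exact={exact:.3f} estimated={estimated:.3f}")
```

With 256 hash functions the estimate usually lands within a few hundredths of the exact value; fewer hash functions give a noisier estimate.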
Prerequisites
- Python 3.10+
- Install a MinHash implementation (this guide uses datasketch) and NumPy:
pip install datasketch numpy

Generate MinHash signatures
In this guide we generate 256 hash values, each represented as a 64-bit integer.
- Total signature size = 256 × 64 bits = 16384 bits = 2048 bytes
- VBase stores this as a BINARY_VECTOR with its dimension given in bits: VECTOR_DIM = 256 × 64 = 16384
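A quick way to sanity-check the bit and byte counts is to pack placeholder values with the standard library. This sketch uses struct instead of NumPy and fills the signature with dummy integers, so it only verifies the sizes, not real MinHash output.

```python
import struct

MINHASH_DIM = 256     # number of hash values per signature
HASH_BIT_WIDTH = 64   # bits per hash value

VECTOR_DIM = MINHASH_DIM * HASH_BIT_WIDTH  # binary vector dimension, in bits
SIGNATURE_BYTES = VECTOR_DIM // 8          # the same size expressed in bytes

# Pack 256 placeholder values as big-endian unsigned 64-bit integers.
placeholder_values = list(range(MINHASH_DIM))
packed = struct.pack(f">{MINHASH_DIM}Q", *placeholder_values)

print(VECTOR_DIM, SIGNATURE_BYTES, len(packed))  # 16384 2048 2048
```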
from datasketch import MinHash
import numpy as np

MINHASH_DIM = 256
HASH_BIT_WIDTH = 64

def generate_minhash_signature(text: str, num_perm: int = MINHASH_DIM) -> bytes:
    """Return a MinHash signature as big-endian uint64 bytes."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    # 256 uint64 values -> 2048 bytes
    return m.hashvalues.astype(">u8").tobytes()

Optional: keep the original token set (for refined search)
By default, MinHash LSH gives you a fast approximate candidate set.
If you want more accurate Jaccard similarity (fewer false positives), you can store the raw token set and enable refinement:
- Store token_set as a VARCHAR field
- Build the index with with_raw_data: True
- Search with mh_search_with_jaccard: True
def extract_token_set(text: str) -> str:
    tokens = set(text.lower().split())
    # Sort so the stored string is the same on every run (set order is not stable)
    return " ".join(sorted(tokens))

Create a collection for MinHash LSH
You need a schema with:
- A primary key (e.g., doc_id)
- A BINARY_VECTOR field for MinHash signatures
- (Optional, recommended) a VARCHAR field to store token_set for refined search
- (Optional) a document field for storing the original text
Example schema (conceptual):
VECTOR_DIM = MINHASH_DIM × HASH_BIT_WIDTH = 16384
MINHASH_DIM = 256
HASH_BIT_WIDTH = 64
VECTOR_DIM = MINHASH_DIM * HASH_BIT_WIDTH  # bits

schema = {
    "auto_id": False,
    "enable_dynamic_field": False,
    "fields": [
        {"name": "doc_id", "dtype": "INT64", "is_primary": True},
        {"name": "minhash_signature", "dtype": "BINARY_VECTOR", "dim": VECTOR_DIM},
        {"name": "token_set", "dtype": "VARCHAR", "max_length": 1000},
        {"name": "document", "dtype": "VARCHAR", "max_length": 1000},
    ],
}

vbase.create_collection("minhash_demo", schema=schema)

Build the MINHASH_LSH index
To use MinHash search, create an index with:
- index_type: MINHASH_LSH
- metric_type: MHJACCARD
- Parameters:
  - mh_element_bit_width: must match the bit width used in your signature (typically 64)
  - mh_lsh_band: controls the recall/speed trade-off (more bands typically increase recall but can cost more)
  - with_raw_data: set True if you want refined Jaccard search using token_set
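To build intuition for the band count, the sketch below evaluates the textbook LSH banding formula: with b bands of r hash values each, two items with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b. This assumes the signature is split evenly into bands; whether VBase (or Milvus underneath) partitions signatures in exactly this way is an assumption, and candidate_probability is a hypothetical helper rather than a VBase function.

```python
def candidate_probability(similarity: float, bands: int, rows_per_band: int) -> float:
    """Probability that at least one band matches exactly, under textbook LSH banding."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** bands


NUM_HASHES = 256  # signature length used in this guide

for bands in (8, 16, 32, 64):
    rows = NUM_HASHES // bands  # hash values per band when the signature is split evenly
    probs = [round(candidate_probability(s, bands, rows), 3) for s in (0.5, 0.8, 0.9)]
    print(f"bands={bands:>2} rows={rows:>2} P(candidate) at s=0.5/0.8/0.9: {probs}")
```

Under this model, raising the band count for a fixed signature length shortens each band, so pairs at a given similarity are more likely to collide in at least one band. That raises recall and also admits more low-similarity candidates.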
index_params = {
    "index_type": "MINHASH_LSH",
    "metric_type": "MHJACCARD",
    "params": {
        "mh_element_bit_width": HASH_BIT_WIDTH,
        "mh_lsh_band": 16,
        "with_raw_data": True,
    },
}

vbase.create_index(
    collection_name="minhash_demo",
    field_name="minhash_signature",
    index_params=index_params,
)

Insert data
documents = [
    "machine learning algorithms process data automatically",
    "deep learning uses neural networks to model patterns",
]

rows = []
for i, doc in enumerate(documents):
    sig = generate_minhash_signature(doc)
    token_str = extract_token_set(doc)
    rows.append(
        {
            "doc_id": i,
            "minhash_signature": sig,
            "token_set": token_str,
            "document": doc,
        }
    )

vbase.insert("minhash_demo", rows)

Search for near-duplicates
Fast approximate search (LSH only)
This finds candidates quickly using only the MinHash signature + LSH index.
query = "machine learning algorithms process data"
q_sig = generate_minhash_signature(query)

results = vbase.search(
    collection_name="minhash_demo",
    data=[q_sig],
    anns_field="minhash_signature",
    limit=10,
    output_fields=["doc_id", "document"],
)
print(results)

Refined search (true Jaccard on your stored tokens)
If you built the index with with_raw_data: True and you store token_set, you can refine the search:
- mh_search_with_jaccard: True tells VBase to compute a more accurate Jaccard similarity using the raw token set.
- refine_k controls how many candidates to re-check with the refined calculation.
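Conceptually, refinement is a re-ranking step: take the first refine_k approximate candidates, re-score each one with exact Jaccard on its stored tokens, and return the best matches. The sketch below mimics that on the client side with plain Python. It illustrates the idea only and is not how VBase implements it internally; the refine function and the candidate dictionaries are hypothetical.

```python
def exact_jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity; two empty sets are treated as identical in this sketch."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def refine(query_text: str, candidates: list[dict], refine_k: int, limit: int) -> list[tuple[float, int]]:
    """Re-score the first refine_k candidates by exact Jaccard and keep the best `limit`."""
    query_tokens = set(query_text.lower().split())
    rescored = [
        (exact_jaccard(query_tokens, set(c["token_set"].split())), c["doc_id"])
        for c in candidates[:refine_k]
    ]
    rescored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return rescored[:limit]


# Hypothetical candidates, in the arbitrary order an approximate search might return them.
candidates = [
    {"doc_id": 1, "token_set": "deep learning uses neural networks to model patterns"},
    {"doc_id": 0, "token_set": "machine learning algorithms process data automatically"},
]

ranked = refine("machine learning algorithms process data", candidates, refine_k=50, limit=10)
print(ranked)  # doc_id 0 ranks first with similarity 5/6
```

A larger refine_k re-checks more candidates, which can recover matches the approximate ranking placed too low, at the cost of more exact comparisons per query.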
query = "machine learning algorithms process data"
q_sig = generate_minhash_signature(query)

results = vbase.search(
    collection_name="minhash_demo",
    data=[q_sig],
    anns_field="minhash_signature",
    limit=10,
    output_fields=["doc_id", "document"],
    search_params={
        "mh_search_with_jaccard": True,
        "refine_k": 50,
    },
)
print(results)

Practical recommendation
For production dedup pipelines, a common pattern is:
- Search-before-insert using MinHash LSH
- If the top hit meets your similarity threshold, treat the new item as a duplicate
- Otherwise, insert the new item and continue
This keeps your dataset clean and stops duplicates from wasting storage and indexing time and from degrading search quality.
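The search-before-insert pattern can be sketched without a running database. In the example below, InMemoryStore is a hypothetical stand-in that scores with exact Jaccard over stored token sets, and the 0.8 threshold is an illustrative value rather than a recommendation. In a real pipeline the lookup would be the VBase search call shown earlier. Check whether the score it returns is a similarity or a distance before reusing the comparison.

```python
from typing import Optional

SIMILARITY_THRESHOLD = 0.8  # illustrative value; tune it for your data


class InMemoryStore:
    """Hypothetical stand-in for a collection: exact Jaccard over stored token sets."""

    def __init__(self) -> None:
        self.rows: dict[int, set[str]] = {}

    def best_match(self, tokens: set[str]) -> Optional[tuple[int, float]]:
        """Return (doc_id, similarity) of the closest stored item, or None if empty."""
        best = None
        for doc_id, stored in self.rows.items():
            score = len(tokens & stored) / len(tokens | stored)
            if best is None or score > best[1]:
                best = (doc_id, score)
        return best

    def insert(self, doc_id: int, tokens: set[str]) -> None:
        self.rows[doc_id] = tokens


def ingest(store: InMemoryStore, doc_id: int, text: str,
           threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Insert the document unless a stored item is at least `threshold` similar."""
    tokens = set(text.lower().split())
    if not tokens:
        raise ValueError("cannot ingest an empty document")
    hit = store.best_match(tokens)
    if hit is not None and hit[1] >= threshold:
        return False  # near-duplicate found, skip the insert
    store.insert(doc_id, tokens)
    return True


store = InMemoryStore()
outcomes = [
    ingest(store, 0, "machine learning algorithms process data automatically"),
    ingest(store, 1, "machine learning algorithms process data"),  # 5/6 overlap with doc 0
    ingest(store, 2, "deep learning uses neural networks to model patterns"),
]
print(outcomes)            # [True, False, True]
print(sorted(store.rows))  # [0, 2]
```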