Binary vectors are bit-packed vectors (0/1 values) stored as bytes. They're ideal when your "embedding" is really a fingerprint or hash rather than a floating-point embedding.
Common uses:
- Image / video deduplication with perceptual hashes (pHash, aHash, dHash)
- Near-duplicate detection for files or pages
- Malware / binary similarity fingerprints
- Fast set-like similarity comparisons
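As an illustration of where such fingerprints come from, here is a minimal sketch of an average hash (aHash) in plain Python. It assumes the image has already been reduced to an 8x8 grid of grayscale values (a real pipeline would do the resizing and grayscale conversion with an image library first), and the function name `average_hash_bits` is hypothetical, not part of any SDK.

```python
def average_hash_bits(pixels: list[list[int]]) -> list[int]:
    """Return one bit per pixel: 1 if the pixel is brighter than the mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


# Assumed input: an 8x8 grayscale grid (values 0..63 here, a simple gradient).
grid = [[r * 8 + c for c in range(8)] for r in range(8)]
fingerprint_bits = average_hash_bits(grid)

print(len(fingerprint_bits))  # 64 bits, so the matching dim would be 64
```

An 8x8 grid yields 64 bits, which is a multiple of 8 and therefore packs cleanly into 8 bytes.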
How binary vectors work
A binary vector has a dimension (dim) measured in bits.
- dim must be a multiple of 8 (because 8 bits = 1 byte).
- When inserting or searching, the binary vector value must be provided as a byte array whose length is dim / 8.
So if dim = 128, each vector is 16 bytes.
Distance metrics
Binary vectors are compared using metrics designed for bits. The most common is:
- HAMMING: counts how many bit positions differ (smaller = more similar).
Milvus-compatible backends may also support other binary metrics (for example Jaccard variants), but Hamming distance is the default choice for most fingerprint use cases.
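To make the metric concrete, the following sketch computes the Hamming distance between two packed vectors locally. It is only an illustration of the definition, not how the backend implements the metric; `hamming_distance` is a hypothetical helper name.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same byte length")
    # XOR leaves a 1 in every position where the two inputs differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


a = bytes([0b10110000, 0xFF])
b = bytes([0b10010000, 0x0F])
print(hamming_distance(a, b))  # 1 differing bit in the first byte + 4 in the second = 5
```

A distance of 0 means the two fingerprints are identical; the maximum possible distance equals dim.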
Create a binary-vector collection
Below is an end-to-end example using the Dodil SDK (Python 3.10+).
from dodil import Client
from dodil.vbase import (
    VBaseConfig,
    FieldSchema,
    CollectionSchema,
    DataType,
    IndexParams,
)
# Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
# Connect to your VBase database
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
# 1) Define schema
DIM = 128 # bits (must be divisible by 8)
schema = CollectionSchema(
    fields=[
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema(name="binary_vector", dtype=DataType.BINARY_VECTOR, dim=DIM),
        # Optional: any scalar metadata you want
        FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
    ],
    description="Binary fingerprints for near-duplicate detection",
)
# 2) Define index params
index_params = IndexParams()
index_params.add_index(
    field_name="binary_vector",
    index_type="AUTOINDEX",
    metric_type="HAMMING",
    index_name="binary_vector_index",
)
# 3) Create collection
vbase.create_collection(
    collection_name="fingerprints",
    schema=schema,
    index_params=index_params,
)
print("Collections:", vbase.list_collections())
Notes:
- AUTOINDEX simplifies setup for most users.
- If you later choose a manual index type, the metric must remain compatible with binary vectors.
Insert binary vectors
Binary vectors must be passed as bytes. A handy pattern is to maintain your fingerprint as a list of bits (0/1), then pack it into bytes.
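def bits_to_bytes(bits: list[int]) -> bytes:
    """Pack a list of 0/1 values into bytes (bit i goes to byte i // 8, position i % 8)."""
    if len(bits) % 8 != 0:
        raise ValueError("binary vector dim must be a multiple of 8")
    out = bytearray(len(bits) // 8)
    for i, b in enumerate(bits):
        if b:
            # Least-significant bit first within each byte
            out[i // 8] |= (1 << (i % 8))
    return bytes(out)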
DIM = 128
row1_bits = ([1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0] + [0] * 112)
row2_bits = ([0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1] + [0] * 112)
rows = [
    {"id": 1, "binary_vector": bits_to_bytes(row1_bits), "source": "img/a.png"},
    {"id": 2, "binary_vector": bits_to_bytes(row2_bits), "source": "img/b.png"},
]
vbase.insert(collection_name="fingerprints", data=rows)
This "bit packing" approach matches how Milvus-compatible engines expect binary vectors to be inserted.
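When debugging packed values, it can help to unpack them again and confirm the round trip. The sketch below repeats bits_to_bytes so that it runs on its own and adds a hypothetical inverse, `bytes_to_bits`, which is not part of the Dodil SDK. It assumes the same bit order as bits_to_bytes (least-significant bit first within each byte).

```python
def bits_to_bytes(bits: list[int]) -> bytes:
    """Pack 0/1 values into bytes, least-significant bit first within each byte."""
    if len(bits) % 8 != 0:
        raise ValueError("binary vector dim must be a multiple of 8")
    out = bytearray(len(bits) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= (1 << (i % 8))
    return bytes(out)


def bytes_to_bits(data: bytes) -> list[int]:
    """Inverse of bits_to_bytes: expand each byte into 8 bits, lowest bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(8)]


bits = [1, 0, 0, 1, 1, 0, 1, 1] + [0] * 8
packed = bits_to_bytes(bits)

print(packed.hex())                    # d900
print(bytes_to_bits(packed) == bits)   # True
```

If your fingerprint library already emits bytes in a different bit order, use that order consistently for both inserted and query vectors, since Hamming distance only depends on the two sides agreeing.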
Search by binary vector
To search, provide a query vector as bytes and set search params (for IVF-style indexes, nprobe is common).
query_bits = row1_bits # pretend row1 is our query
query_vec = bits_to_bytes(query_bits)
res = vbase.search(
    collection_name="fingerprints",
    data=[query_vec],
    anns_field="binary_vector",
    search_params={"params": {"nprobe": 10}},
    limit=5,
    output_fields=["id", "source"],
)
print(res)
During search, the query binary vector must also be a byte array with the same dim (in bits) you declared in the schema.
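For small datasets, a local brute-force search is a convenient way to sanity-check what the index returns. The sketch below is an assumption-laden reference, not part of the SDK: `brute_force_search` is a hypothetical helper, and it expects you to hold the packed vectors in a plain dict keyed by id.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bit positions between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same byte length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def brute_force_search(query: bytes, vectors: dict[int, bytes], limit: int = 5):
    """Return up to `limit` (distance, id) pairs, closest first."""
    scored = [(hamming_distance(query, vec), row_id) for row_id, vec in vectors.items()]
    scored.sort()
    return scored[:limit]


# Toy 8-bit fingerprints keyed by id (illustrative data only).
vectors = {
    1: bytes([0b00001111]),
    2: bytes([0b00001110]),
    3: bytes([0b11110000]),
}
top = brute_force_search(bytes([0b00001111]), vectors, limit=2)
print(top)  # [(0, 1), (1, 2)]
```

Approximate indexes may legitimately miss some neighbors, so small differences from the brute-force ranking do not necessarily indicate a bug.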
When to use binary vs dense vectors
Use binary vectors when:
- you already have (or can compute) a fingerprint/hash
- you want extremely compact storage
- you care about "bit difference" similarity (Hamming distance)
Use dense vectors when:
- you rely on ML embeddings (text/image/audio) and semantic similarity
- your vectors are float arrays (e.g., 768-dim embeddings)
If you're unsure, start with dense vectors for semantic search, and use binary vectors for deduplication / near-duplicate detection workflows.
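To put "compact storage" in numbers, the sketch below compares raw per-vector sizes. It assumes dense vectors are stored as 32-bit floats (4 bytes per dimension) and ignores index and metadata overhead; both helper names are hypothetical.

```python
def binary_vector_bytes(dim_bits: int) -> int:
    """Raw size of a binary vector: one bit per dimension."""
    if dim_bits % 8 != 0:
        raise ValueError("dim must be a multiple of 8")
    return dim_bits // 8


def float32_vector_bytes(dim: int) -> int:
    """Raw size of a dense vector, assuming 4-byte float32 components."""
    return dim * 4


binary_size = binary_vector_bytes(128)   # 16 bytes
dense_size = float32_vector_bytes(768)   # 3072 bytes
print(binary_size, dense_size, dense_size // binary_size)  # 16 3072 192
```

Under these assumptions, a 128-bit fingerprint takes 192 times less raw space than a 768-dim float32 embedding, though the two representations capture very different notions of similarity.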
Troubleshooting
- "dim must be a multiple of 8": ensure your dim is divisible by 8 and you pack bits into bytes.
- Empty / poor search results: confirm you created an index and chose a compatible metric (e.g., HAMMING).
- Timeouts: for large collections, tune your index/search params (and ensure the collection is loaded if your cluster requires explicit load/release).
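Many dimension errors can be caught before a request is sent. The following is a minimal client-side check, assuming only the rules stated on this page (dim divisible by 8, value length equal to dim / 8); `validate_binary_vector` is a hypothetical helper, not an SDK function.

```python
def validate_binary_vector(value: bytes, dim: int) -> None:
    """Raise a descriptive error if `value` cannot be a binary vector of `dim` bits."""
    if dim <= 0 or dim % 8 != 0:
        raise ValueError(f"dim must be a positive multiple of 8, got {dim}")
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError(f"binary vector must be bytes, got {type(value).__name__}")
    expected = dim // 8
    if len(value) != expected:
        raise ValueError(
            f"expected {expected} bytes for dim={dim}, got {len(value)}"
        )


validate_binary_vector(bytes(16), 128)  # passes silently: 128 bits = 16 bytes

try:
    validate_binary_vector(bytes(15), 128)
except ValueError as exc:
    print(exc)  # expected 16 bytes for dim=128, got 15
```

Running such a check on every row and query vector before calling insert or search turns a server-side rejection into an immediate, local error message.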