
Binary vectors are bit-packed vectors (0/1 values) stored as bytes. They’re ideal when your “embedding” is really a fingerprint or hash rather than a floating‑point embedding.

Common uses:

  • Image / video deduplication with perceptual hashes (pHash, aHash, dHash)
  • Near-duplicate detection for files or pages
  • Malware / binary similarity fingerprints
  • Fast set-like similarity comparisons

How binary vectors work

A binary vector has a dimension (dim) measured in bits.

  • dim must be a multiple of 8 (because 8 bits = 1 byte).
  • When inserting/searching, the binary vector value must be provided as a byte array whose length is dim / 8.

So if dim = 128, each vector is 16 bytes.
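
The size rule can be checked in plain Python before any data reaches the database. This is a minimal sketch; `bytes_per_vector` is a hypothetical helper name, not part of the Dodil SDK.

```python
def bytes_per_vector(dim: int) -> int:
    """Return the byte length of one binary vector with `dim` bits."""
    if dim <= 0 or dim % 8 != 0:
        raise ValueError(f"dim must be a positive multiple of 8, got {dim}")
    return dim // 8


print(bytes_per_vector(128))  # 16
print(bytes_per_vector(64))   # 8
```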

Distance metrics

Binary vectors are compared using metrics designed for bits. The most common is:

  • HAMMING: counts how many bit positions differ (smaller = more similar).

Milvus-compatible backends may also support other binary metrics (for example Jaccard variants), but Hamming distance is the default choice for most fingerprint use cases.
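
To make the two metric families concrete, here is a small pure-Python sketch of Hamming distance and one common form of Jaccard distance over packed bytes. It illustrates the definitions only and is not how any backend computes them. The function names are hypothetical, and treating two all-zero vectors as identical is an assumed convention.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions where two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 wherever the bits differ; count those 1s per byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def jaccard_distance(a: bytes, b: bytes) -> float:
    """1 - |A and B| / |A or B|, treating each set bit as a set member."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(bin(x & y).count("1") for x, y in zip(a, b))
    either = sum(bin(x | y).count("1") for x, y in zip(a, b))
    # Assumed convention: two all-zero vectors count as identical.
    return 0.0 if either == 0 else 1.0 - both / either


a = bytes([0b10110000, 0b00001111])
b = bytes([0b10010000, 0b00001101])
print(hamming_distance(a, b))  # 2
print(jaccard_distance(a, b))  # 1 - 5/7
```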

Create a binary-vector collection

Below is an end-to-end example using the Dodil SDK (Python 3.10+).

    from dodil import Client
    from dodil.vbase import (
        VBaseConfig,
        FieldSchema,
        CollectionSchema,
        DataType,
        IndexParams,
    )

    # Authorize
    c = Client(
        service_account_id="...",
        service_account_secret="...",
    )

    # Connect to your VBase database
    vbase = c.vbase.connect(
        VBaseConfig(
            host="vbase-db-<id>.infra.dodil.cloud",
            port=443,
            scheme="https",
            db_name="db_<id>",
        )
    )

    # 1) Define schema
    DIM = 128  # bits (must be divisible by 8)

    schema = CollectionSchema(
        fields=[
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
            FieldSchema(name="binary_vector", dtype=DataType.BINARY_VECTOR, dim=DIM),
            # Optional: any scalar metadata you want
            FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
        ],
        description="Binary fingerprints for near-duplicate detection",
    )

    # 2) Define index params
    index_params = IndexParams()
    index_params.add_index(
        field_name="binary_vector",
        index_type="AUTOINDEX",
        metric_type="HAMMING",
        index_name="binary_vector_index",
    )

    # 3) Create collection
    vbase.create_collection(
        collection_name="fingerprints",
        schema=schema,
        index_params=index_params,
    )

    print("Collections:", vbase.list_collections())

Notes:

  • AUTOINDEX leaves the choice of index type to the backend, which simplifies setup for most users.
  • If you later choose a manual index type, the metric must remain compatible with binary vectors.

Insert binary vectors

Binary vectors must be passed as bytes. A handy pattern is to maintain your fingerprint as a list of bits (0/1), then pack it into bytes.

    def bits_to_bytes(bits: list[int]) -> bytes:
        """Pack a list of 0/1 values into bytes, 8 bits per byte.

        Bit i goes to byte i // 8 at bit position i % 8 (least significant bit first).
        """
        if len(bits) % 8 != 0:
            raise ValueError("binary vector dim must be a multiple of 8")
        out = bytearray(len(bits) // 8)
        for i, b in enumerate(bits):
            if b:
                out[i // 8] |= (1 << (i % 8))
        return bytes(out)

    DIM = 128

    # 16 example bits followed by 112 zero bits = 128 bits per row
    row1_bits = ([1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0] + [0] * 112)
    row2_bits = ([0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1] + [0] * 112)
    assert len(row1_bits) == len(row2_bits) == DIM

    rows = [
        {"id": 1, "binary_vector": bits_to_bytes(row1_bits), "source": "img/a.png"},
        {"id": 2, "binary_vector": bits_to_bytes(row2_bits), "source": "img/b.png"},
    ]

    vbase.insert(collection_name="fingerprints", data=rows)

This “bit packing” approach matches how Milvus-compatible engines expect binary vectors to be inserted.
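
To inspect a packed vector, or to load a hash that is already hex-encoded, the packing can be reversed. The sketch below assumes the same bit order as the packing function above (least significant bit of each byte first). `bytes_to_bits` and the sample hex value are illustrative and not part of the SDK.

```python
def bytes_to_bits(data: bytes) -> list[int]:
    """Unpack bytes into a 0/1 list, least significant bit of each byte first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(8)]


# A 64-bit perceptual hash is often stored as 16 hex characters.
# bytes.fromhex yields the 8 bytes that a dim=64 binary vector needs.
phash_hex = "f0e1d2c3b4a59687"  # made-up value for illustration
vec = bytes.fromhex(phash_hex)

print(len(vec))                # 8
print(bytes_to_bits(vec)[:8])  # bits of 0xf0, LSB first: [0, 0, 0, 0, 1, 1, 1, 1]
```

Because Hamming distance only counts differing positions, the bit order within a byte does not change results as long as every vector in the collection and every query uses the same order.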

Search by binary vector

To search, provide a query vector as bytes and set search params (for IVF-style indexes, nprobe is common).

    query_bits = row1_bits  # pretend row1 is our query
    query_vec = bits_to_bytes(query_bits)

    res = vbase.search(
        collection_name="fingerprints",
        data=[query_vec],
        anns_field="binary_vector",
        search_params={"params": {"nprobe": 10}},
        limit=5,
        output_fields=["id", "source"],
    )

    print(res)

During search, the query binary vector must also be a byte array with the same dim (in bits) you declared in the schema.
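
For small collections, a brute-force ranking in plain Python gives a reference to compare search results against. This sketch shows what a Hamming search computes conceptually, not how an index works internally. `top_k_hamming` and the sample rows are hypothetical.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def top_k_hamming(query: bytes, rows: dict[int, bytes], k: int) -> list[tuple[int, int]]:
    """Return up to k (id, distance) pairs, closest first."""
    for row_id, vec in rows.items():
        # Every stored vector must have the same byte length as the query.
        if len(vec) != len(query):
            raise ValueError(f"row {row_id}: expected {len(query)} bytes, got {len(vec)}")
    scored = [(row_id, hamming(query, vec)) for row_id, vec in rows.items()]
    # Sort by distance, then by id so that ties are ordered deterministically.
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))[:k]


rows = {
    1: bytes([0b11011001, 0b00101010]),
    2: bytes([0b00101010, 0b10110011]),
    3: bytes([0b11011000, 0b00101010]),
}
print(top_k_hamming(rows[1], rows, k=2))  # [(1, 0), (3, 1)]
```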

When to use binary vs dense vectors

Use binary vectors when:

  • you already have (or can compute) a fingerprint/hash
  • you want extremely compact storage
  • you care about “bit difference” similarity (Hamming distance)

Use dense vectors when:

  • you rely on ML embeddings (text/image/audio) and semantic similarity
  • your vectors are float arrays (e.g., 768-dim embeddings)

If you’re unsure, start with dense vectors for semantic search, and use binary vectors for deduplication / near-duplicate detection workflows.
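
The storage difference is easy to quantify. The arithmetic below counts raw vector bytes only (no index or metadata overhead) and assumes float32 components for the dense case. The collection size of one million rows is an arbitrary example.

```python
def binary_vector_bytes(dim_bits: int) -> int:
    """Raw size of one binary vector: 8 bits per byte."""
    return dim_bits // 8


def dense_vector_bytes(dim: int, bytes_per_component: int = 4) -> int:
    """Raw size of one dense vector; float32 uses 4 bytes per component."""
    return dim * bytes_per_component


n = 1_000_000  # example collection size
print(binary_vector_bytes(128) * n)  # 16000000 bytes, about 16 MB
print(dense_vector_bytes(768) * n)   # 3072000000 bytes, about 3.07 GB
```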

Troubleshooting

  • “dim must be a multiple of 8”: ensure your dim is divisible by 8 and you pack bits into bytes.
  • Empty / poor search results: confirm you created an index and chose a compatible metric (e.g., HAMMING).
  • Timeouts: for large collections, tune your index/search params (and ensure the collection is loaded if your cluster requires explicit load/release).