Binary vectors

Binary vectors are bit-packed vectors (0/1 values) stored as bytes. They’re ideal when your “embedding” is really a fingerprint or hash rather than a floating‑point embedding.

Common uses:

Image / video deduplication with perceptual hashes (pHash, aHash, dHash)
Near-duplicate detection for files or pages
Malware / binary similarity fingerprints
Fast set-like similarity comparisons
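
To illustrate where such a fingerprint can come from, here is a minimal sketch of an average hash (aHash) in plain Python. It assumes the image has already been reduced to an 8x8 grid of grayscale values; a real pipeline would do that resize with an image library. The function name average_hash and the sample grid are hypothetical and not part of the Dodil SDK.

```python
def average_hash(pixels: list[list[int]]) -> list[int]:
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


# Hypothetical 8x8 grayscale grid: a simple gradient with values 0..63.
grid = [[r * 8 + c for c in range(8)] for r in range(8)]

bits = average_hash(grid)
print(len(bits))  # 64 bits, so the matching binary vector would use dim = 64
```

An 8x8 grid yields 64 bits, which is a multiple of 8 and therefore packs into 8 bytes.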

How binary vectors work

A binary vector has a dimension (dim) measured in bits.

dim must be a multiple of 8 (because 8 bits = 1 byte).
When inserting/searching, the binary vector value must be provided as a byte array whose length is dim / 8.

So if dim = 128, each vector is 16 bytes.
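
The two rules above can be checked on the client before any data is sent. The sketch below is a hypothetical helper pair (expected_byte_length and check_vector are illustrative names, not SDK functions) that applies the dim / 8 arithmetic.

```python
def expected_byte_length(dim: int) -> int:
    """Return the number of bytes a binary vector of `dim` bits occupies."""
    if dim <= 0 or dim % 8 != 0:
        raise ValueError(f"dim must be a positive multiple of 8, got {dim}")
    return dim // 8


def check_vector(vec: bytes, dim: int) -> None:
    """Raise ValueError if `vec` is not exactly dim / 8 bytes long."""
    expected = expected_byte_length(dim)
    if len(vec) != expected:
        raise ValueError(f"expected {expected} bytes for dim={dim}, got {len(vec)}")


print(expected_byte_length(128))  # 16
check_vector(bytes(16), 128)      # passes silently
```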

Distance metrics

Binary vectors are compared using metrics designed for bits. The most common is:

HAMMING: counts how many bit positions differ (smaller = more similar).

Milvus-compatible backends may also support other binary metrics (for example Jaccard variants), but Hamming distance is the default choice for most fingerprint use cases.
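
To make the metrics concrete, here is a small pure-Python sketch that computes both distances on packed bytes. It is for illustration only and is not how the backend computes them. The Jaccard function uses the usual set definition (1 minus intersection over union of the set bits) and treats two all-zero vectors as distance 0, which is an assumption rather than documented backend behavior.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def jaccard_distance(a: bytes, b: bytes) -> float:
    """Return 1 - |A and B| / |A or B| over the set bits of both vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    intersection = sum(bin(x & y).count("1") for x, y in zip(a, b))
    union = sum(bin(x | y).count("1") for x, y in zip(a, b))
    # Assumption: two all-zero vectors are treated as identical.
    return 0.0 if union == 0 else 1.0 - intersection / union


a = bytes([0b11110000])
b = bytes([0b11001100])
print(hamming_distance(a, b))  # 4
```

In this example the two bytes share 2 set bits out of 6 set in either, so the Jaccard distance is 1 - 2/6.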

Create a binary-vector collection

Below is an end-to-end example using the Dodil SDK (Python 3.10+).


from dodil import Client
from dodil.vbase import (
    VBaseConfig,
    FieldSchema,
    CollectionSchema,
    DataType,
    IndexParams,
)
 
# Authorize
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
 
# Connect to your VBase database
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
 
# 1) Define schema
DIM = 128  # bits (must be divisible by 8)
 
schema = CollectionSchema(
    fields=[
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema(name="binary_vector", dtype=DataType.BINARY_VECTOR, dim=DIM),
        # Optional: any scalar metadata you want
        FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
    ],
    description="Binary fingerprints for near-duplicate detection",
)
 
# 2) Define index params
index_params = IndexParams()
index_params.add_index(
    field_name="binary_vector",
    index_type="AUTOINDEX",
    metric_type="HAMMING",
    index_name="binary_vector_index",
)
 
# 3) Create collection
vbase.create_collection(
    collection_name="fingerprints",
    schema=schema,
    index_params=index_params,
)
 
print("Collections:", vbase.list_collections())

Notes:

AUTOINDEX leaves the choice of index type and tuning to the backend, which simplifies setup for most users.
If you later choose a manual index type, the metric must remain compatible with binary vectors.

Insert binary vectors

Binary vectors must be passed as bytes. A handy pattern is to maintain your fingerprint as a list of bits (0/1), then pack it into bytes.


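def bits_to_bytes(bits: list[int]) -> bytes:
    """Pack a list of 0/1 values into bytes, eight bits per byte.

    Bit i of the list is stored in bit (i % 8) of byte (i // 8),
    so the first bit of each group lands in the least significant bit.
    """
    if len(bits) % 8 != 0:
        raise ValueError("binary vector dim must be a multiple of 8")
 
    out = bytearray(len(bits) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= (1 << (i % 8))
    return bytes(out)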
 
DIM = 128
row1_bits = ([1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0] + [0] * 112)
row2_bits = ([0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1] + [0] * 112)
 
rows = [
    {"id": 1, "binary_vector": bits_to_bytes(row1_bits), "source": "img/a.png"},
    {"id": 2, "binary_vector": bits_to_bytes(row2_bits), "source": "img/b.png"},
]
 
vbase.insert(collection_name="fingerprints", data=rows)

This “bit packing” approach matches how Milvus-compatible engines expect binary vectors to be inserted.
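
When debugging, it can help to turn packed bytes back into bits. The sketch below is a hypothetical inverse of bits_to_bytes (the name bytes_to_bits is not part of the SDK). It assumes the same bit order as the packing code above: bit i of the list lives in bit (i % 8) of byte (i // 8).

```python
def bytes_to_bits(data: bytes) -> list[int]:
    """Unpack bytes into a list of 0/1 values, least significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(8)]


# 0xD9 is the byte that the first eight bits of row1_bits pack into.
print(bytes_to_bits(bytes([0xD9])))  # [1, 0, 0, 1, 1, 0, 1, 1]
```

Whatever bit order you choose, use the same one for inserts and queries. Otherwise the stored and query vectors will not line up bit for bit.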

Search by binary vector

To search, pass the query vector as bytes and set the search params. For IVF-style indexes, nprobe (the number of clusters to probe) is the usual tuning parameter.


query_bits = row1_bits  # pretend row1 is our query
query_vec = bits_to_bytes(query_bits)
 
res = vbase.search(
    collection_name="fingerprints",
    data=[query_vec],
    anns_field="binary_vector",
    search_params={"params": {"nprobe": 10}},
    limit=5,
    output_fields=["id", "source"],
)
 
print(res)

During search, the query binary vector must also be a byte array, and its length must match the dim declared in the schema (dim / 8 bytes).

When to use binary vs dense vectors

Use binary vectors when:

you already have (or can compute) a fingerprint/hash
you want extremely compact storage
you care about “bit difference” similarity (Hamming distance)

Use dense vectors when:

you rely on ML embeddings (text/image/audio) and semantic similarity
your vectors are float arrays (e.g., 768-dim embeddings)

If you’re unsure, start with dense vectors for semantic search, and use binary vectors for deduplication / near-duplicate detection workflows.
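
To put the storage difference in numbers, the sketch below compares raw vector payload only. It assumes dense vectors are stored as 4-byte float32 values and ignores index and metadata overhead; the function names and the collection size of one million rows are illustrative.

```python
def binary_vector_bytes(dim_bits: int) -> int:
    """Raw size of one binary vector: one bit per dimension."""
    return dim_bits // 8


def float32_vector_bytes(dim: int) -> int:
    """Raw size of one dense vector, assuming 4-byte float32 components."""
    return dim * 4


n_rows = 1_000_000  # illustrative collection size
binary_total = n_rows * binary_vector_bytes(128)
dense_total = n_rows * float32_vector_bytes(768)

print(binary_total)                 # 16000000 bytes
print(dense_total)                  # 3072000000 bytes
print(dense_total // binary_total)  # 192
```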

Troubleshooting

“dim must be a multiple of 8”: ensure your dim is divisible by 8 and you pack bits into bytes.
Empty / poor search results: confirm you created an index and chose a compatible metric (e.g., HAMMING).
Timeouts: for large collections, tune your index/search params (and ensure the collection is loaded if your cluster requires explicit load/release).
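
When results look wrong, one way to narrow the problem down is to compare them against an exact, brute-force ranking over a small local sample. The sketch below is pure Python and does not call the SDK. The helper names and the sample rows are hypothetical, and it assumes you have the packed vectors available locally, keyed by id.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bits by XOR-ing the vectors as big integers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def brute_force_top_k(query: bytes, rows: dict[int, bytes], k: int) -> list[tuple[int, int]]:
    """Return the k closest rows as (distance, id) pairs, nearest first."""
    scored = [(hamming_distance(query, vec), row_id) for row_id, vec in rows.items()]
    scored.sort()
    return scored[:k]


# Hypothetical local sample: id -> packed 16-bit vector.
sample = {1: b"\x00\x00", 2: b"\xff\x00", 3: b"\x01\x00"}

top = brute_force_top_k(b"\x00\x00", sample, k=2)
print(top)  # [(0, 1), (1, 3)]
```

If the ids returned by the index differ substantially from this exact ranking on the same sample, the packing order, dim, or metric are worth checking first.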