SCANN

SCANN is an approximate nearest neighbor (ANN) index powered by Google’s ScaNN library. In VBase, it suits workloads that need fast similarity search at scale while keeping good recall on high‑dimensional embeddings.

When to use SCANN

Use SCANN when:

You have large collections (millions+ vectors) and want low latency.
You can tolerate approximate search in exchange for speed.
You’re using metrics like COSINE, L2, or IP (inner product).

How SCANN works (in plain terms)

ScaNN typically runs vector search in three phases:

Partitioning: splits vectors into clusters so the query only checks the most relevant clusters.
Quantization: compresses vectors to speed up candidate selection (with special handling that works well for inner‑product style similarity).
Re‑ranking: rechecks a short list of candidates with more precise scoring to produce the final Top‑K.

You control the main speed/accuracy trade‑offs using:

with_raw_data (accuracy vs storage/memory)
reorder_k (accuracy vs latency)
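
The three phases can be illustrated with a small pure‑Python toy. This is only a sketch of the idea, not how ScaNN or VBase implement it: the names build_toy_index, toy_search, and sq_l2 are hypothetical, the partitioning is a few plain k‑means steps, and rounding each component to one decimal place stands in for the library's real quantizer.

```python
import random

def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_toy_index(vectors, nlist, iters=5, seed=0):
    """Toy index: k-means partitions plus a coarsely rounded copy of each vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        clusters = [[] for _ in range(nlist)]
        for v in vectors:
            nearest = min(range(nlist), key=lambda c: sq_l2(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment of every vector to its nearest (final) centroid.
    assign = [min(range(nlist), key=lambda c: sq_l2(v, centroids[c])) for v in vectors]
    # Stand-in quantizer (assumption): round each component to one decimal place.
    codes = [[round(x, 1) for x in v] for v in vectors]
    return {"centroids": centroids, "assign": assign, "codes": codes, "raw": vectors}

def toy_search(index, query, limit, nprobe, reorder_k):
    """Return the ids of up to `limit` vectors, closest first."""
    # Phase 1 (partitioning): keep only vectors in the nprobe closest clusters.
    cluster_order = sorted(
        range(len(index["centroids"])),
        key=lambda c: sq_l2(query, index["centroids"][c]),
    )
    probed = set(cluster_order[:nprobe])
    candidates = [i for i, c in enumerate(index["assign"]) if c in probed]
    # Phase 2 (quantization): rank candidates using the compressed codes only.
    candidates.sort(key=lambda i: sq_l2(query, index["codes"][i]))
    shortlist = candidates[:reorder_k]
    # Phase 3 (re-ranking): rescore the shortlist against the original vectors.
    shortlist.sort(key=lambda i: sq_l2(query, index["raw"][i]))
    return shortlist[:limit]
```

In this toy, probing every cluster and re‑ranking every candidate reproduces exact search; lowering nprobe or reorder_k trades some of that accuracy for less work per query.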

Build a SCANN index

Below is a typical flow using the Dodil SDK (Python 3.10+).


from dodil import Client
from dodil.vbase import VBaseConfig
 
# Authenticate (service account)
c = Client(
    service_account_id="...",
    service_account_secret="...",
)
 
# Connect to your VBase database
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)
 
# Create a SCANN index on a vector field
vbase.create_index(
    collection="your_collection_name",
    field="your_vector_field_name",
    index_name="vector_index",
    index_type="SCANN",
    metric_type="L2",  # or "COSINE" / "IP"
    params={
        "with_raw_data": True,
        # Optional (recommended for large datasets):
        # "nlist": 1024,
    },
)

Notes:

with_raw_data=True stores the original vectors alongside quantized data, which usually improves re‑ranking quality (at the cost of storage).
nlist controls how many clusters are created (more clusters can speed up coarse search but may hurt recall if clusters get too small).
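
To get a feel for when clusters become "too small", it can help to look at the average number of vectors per cluster for a few candidate nlist values. The helper below is plain arithmetic and not part of the Dodil SDK; avg_cluster_size is a hypothetical name, and the threshold at which small clusters start to hurt recall depends on your data.

```python
def avg_cluster_size(num_vectors: int, nlist: int) -> float:
    """Average vectors per cluster, assuming vectors spread evenly across clusters."""
    if nlist < 1:
        raise ValueError("nlist must be at least 1")
    return num_vectors / nlist

# Compare a few candidate nlist values for a collection of one million vectors.
for candidate in (256, 1024, 4096, 16384):
    print(candidate, avg_cluster_size(1_000_000, candidate))
```

Real clusters are rarely evenly sized, so treat the result as a rough guide only.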

Search with SCANN

After inserting data and building the index, you can search with SCANN‑specific parameters:


# The query vector must have the same dimension as the indexed vector field.
query_vec = [0.1, 0.2, 0.3, 0.4, 0.5]
 
results = vbase.search(
    collection="your_collection_name",
    vector_field="your_vector_field_name",
    vectors=[query_vec],
    limit=10,
    params={
        "nprobe": 8,
        "reorder_k": 40,
    },
)
 
print(results)

nprobe controls how many clusters are searched for candidates.
reorder_k controls how many candidates are re‑ranked precisely.
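
As a rough mental model of what nprobe costs, the number of vectors scanned per query grows with the share of clusters probed. The sketch below assumes evenly sized clusters, which real data rarely gives you, and estimated_candidates is a hypothetical helper rather than an SDK function.

```python
def estimated_candidates(num_vectors: int, nlist: int, nprobe: int) -> int:
    """Rough count of vectors scanned per query, assuming evenly sized clusters."""
    if num_vectors < 0 or nlist < 1 or nprobe < 1:
        raise ValueError("num_vectors must be >= 0; nlist and nprobe must be >= 1")
    probed = min(nprobe, nlist)  # cannot probe more clusters than exist
    return num_vectors * probed // nlist

# One million vectors, 1024 clusters, 8 clusters probed per query.
print(estimated_candidates(1_000_000, nlist=1024, nprobe=8))
```

Only reorder_k of those scanned candidates go on to the precise re‑ranking step.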

Parameters you’ll tune most

Index build parameters

Parameter	What it does	Typical range	Practical guidance
`nlist`	Number of cluster units (how many partitions)	1 – 65536	Increase to speed up coarse search, but don’t over‑partition (tiny clusters can reduce recall). Decrease to improve recall (at the cost of speed).
`with_raw_data`	Stores original vectors in addition to quantized representation	`true` / `false` (default `true`)	Keep `true` when accuracy matters. Use `false` when storage/memory is your bottleneck and you can accept slightly lower accuracy.

Search parameters

Parameter	What it does	Typical range	Practical guidance
`reorder_k`	Number of candidates to refine during re‑ranking	≥ 1	Higher improves recall/quality but increases latency. A good starting point is 2–5× `limit`.
`nprobe`	Number of clusters to search	1 – `nlist` (default `8`)	Higher improves recall but increases latency. Increase gradually until recall is acceptable.
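
The guidance in the table can be turned into a small helper that produces a starting params dictionary for a search call. This is a sketch, not an SDK feature: build_search_params is a hypothetical name, the default multiplier of 3 is simply a value inside the 2–5× range above, and never letting reorder_k drop below limit is an assumption of this sketch.

```python
def build_search_params(limit: int, nlist: int, nprobe: int = 8,
                        reorder_multiplier: int = 3) -> dict:
    """Starting-point search params derived from limit and the index's nlist."""
    if limit < 1 or nlist < 1:
        raise ValueError("limit and nlist must be at least 1")
    return {
        # Keep nprobe inside the documented 1..nlist range.
        "nprobe": max(1, min(nprobe, nlist)),
        # Start at a multiple of limit; never below limit (assumption of this sketch).
        "reorder_k": max(limit, reorder_multiplier * limit),
    }

params = build_search_params(limit=10, nlist=1024)
print(params)
```

The resulting dictionary has the same shape as the params argument used in the search example, so it can be passed there and then adjusted as you measure recall and latency.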

Quick tuning recipe

If results look too approximate:

Increase reorder_k (often the biggest quality win).
Increase nprobe.
If you can afford it, ensure with_raw_data=true.

If latency is too high:

Decrease reorder_k.
Decrease nprobe.
Consider increasing nlist (if you have enough data and your recall stays acceptable).
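
To make "too approximate" measurable, one option is to compare the IDs SCANN returns against exact results for a handful of sample queries and walk through settings from cheapest to most expensive. The sketch below is independent of the SDK: recall_at_k and first_setting_meeting_target are hypothetical helpers, and search_fn is a callable you would write yourself to run a search with the given params and return the matching IDs (how to extract IDs depends on the result format, which is not covered here).

```python
from typing import Callable, Optional, Sequence

def recall_at_k(approx_ids: Sequence, exact_ids: Sequence) -> float:
    """Fraction of the exact top-k IDs that also appear in the approximate results."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

def first_setting_meeting_target(
    search_fn: Callable[[dict], Sequence],
    settings: Sequence[dict],
    exact_ids: Sequence,
    target: float,
) -> Optional[dict]:
    """Return the first params dict whose recall reaches target, else None.

    Order `settings` from cheapest to most expensive so the first hit is
    also the lowest-latency candidate.
    """
    for params in settings:
        if recall_at_k(search_fn(params), exact_ids) >= target:
            return params
    return None
```

Averaging recall over many sample queries gives a steadier signal than a single query; the single‑query form is used here only to keep the sketch short.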

Related: Index types, metrics, and search parameters are shared concepts across VBase indexes. Among them, SCANN offers a strong speed/quality balance on large datasets.