SCANN
SCANN is an approximate nearest neighbor (ANN) index powered by Google’s ScaNN library. In VBase, it’s a great choice when you want fast similarity search at scale while still keeping good accuracy/recall for high‑dimensional embeddings.
When to use SCANN
Use SCANN when:
- You have large collections (millions+ vectors) and want low latency.
- You can tolerate approximate search in exchange for speed.
- You’re using metrics like COSINE, L2, or IP (inner product).
How SCANN works (in plain terms)
ScaNN typically runs vector search in three phases:
- Partitioning: splits vectors into clusters so the query only checks the most relevant clusters.
- Quantization: compresses vectors to speed up candidate selection (with special handling that works well for inner‑product style similarity).
- Re‑ranking: rechecks a short list of candidates with more precise scoring to produce the final Top‑K.
You control the main speed/accuracy trade‑offs using:
- with_raw_data (accuracy vs storage/memory)
- reorder_k (accuracy vs latency)
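To make the three phases concrete, here is a minimal pure-Python sketch of the same partition, quantize, re-rank flow. It is not ScaNN's implementation: the k-means loop is a toy, `quantize` is a crude grid-rounding stand-in for ScaNN's learned quantizer, and every function name below is hypothetical. It only illustrates where `nprobe`, `reorder_k`, and the raw vectors enter the pipeline.

```python
import random

def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))

def build_partitions(data, nlist, iters=5, seed=0):
    """Toy k-means: returns (centroids, cluster id of each vector)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, nlist)]
    for _ in range(iters):
        assign = [nearest_centroid(v, centroids) for v in data]
        for c in range(nlist):
            members = [v for v, a in zip(data, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest_centroid(v, centroids) for v in data]

def quantize(vec, step=0.25):
    """Crude stand-in for quantization: snap each component to a coarse grid."""
    return [round(x / step) * step for x in vec]

def search(query, data, centroids, assign, coarse, nprobe, reorder_k, limit):
    # 1) Partitioning: only look inside the nprobe clusters closest to the query.
    by_closeness = sorted(range(len(centroids)),
                          key=lambda c: sq_dist(query, centroids[c]))
    probed = set(by_closeness[:nprobe])
    candidates = [i for i, a in enumerate(assign) if a in probed]
    # 2) Quantization: score candidates on the compressed vectors, keep reorder_k.
    shortlist = sorted(candidates, key=lambda i: sq_dist(query, coarse[i]))[:reorder_k]
    # 3) Re-ranking: rescore the shortlist on the raw vectors, return the top `limit`.
    return sorted(shortlist, key=lambda i: sq_dist(query, data[i]))[:limit]

# Small synthetic demo (random data, not a benchmark).
rng = random.Random(42)
data = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(500)]
centroids, assign = build_partitions(data, nlist=16)
coarse = [quantize(v) for v in data]
query = [rng.uniform(-1, 1) for _ in range(8)]
top = search(query, data, centroids, assign, coarse, nprobe=4, reorder_k=40, limit=10)
print(top)
```

In this sketch, step 3 needs the raw vectors (`data`), which is why keeping them around helps re-ranking quality, and a larger `reorder_k` simply lengthens the shortlist that step 3 rescores.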
Build a SCANN index
Below is a typical flow using the Dodil SDK (Python 3.10+).
from dodil import Client
from dodil.vbase import VBaseConfig

# Authenticate (service account)
c = Client(
    service_account_id="...",
    service_account_secret="...",
)

# Connect to your VBase database
vbase = c.vbase.connect(
    VBaseConfig(
        host="vbase-db-<id>.infra.dodil.cloud",
        port=443,
        scheme="https",
        db_name="db_<id>",
    )
)

# Create a SCANN index on a vector field
vbase.create_index(
    collection="your_collection_name",
    field="your_vector_field_name",
    index_name="vector_index",
    index_type="SCANN",
    metric_type="L2",  # or "COSINE" / "IP"
    params={
        "with_raw_data": True,
        # Optional (recommended for large datasets):
        # "nlist": 1024,
    },
)

Notes:
- with_raw_data=True stores the original vectors alongside quantized data, which usually improves re‑ranking quality (at the cost of storage).
- nlist controls how many clusters are created (more clusters can speed up coarse search but may hurt recall if clusters get too small).
Search with SCANN
After inserting data and building the index, you can search with SCANN‑specific parameters:
query_vec = [0.1, 0.2, 0.3, 0.4, 0.5]
results = vbase.search(
    collection="your_collection_name",
    vector_field="your_vector_field_name",
    vectors=[query_vec],
    limit=10,
    params={
        "nprobe": 8,
        "reorder_k": 40,
    },
)
print(results)

- nprobe controls how many clusters are searched for candidates.
- reorder_k controls how many candidates are re‑ranked precisely.
Parameters you’ll tune most
Index build parameters
| Parameter | What it does | Typical range | Practical guidance |
|---|---|---|---|
| nlist | Number of cluster units (how many partitions) | 1 – 65536 | Increase to speed up coarse search, but don’t over‑partition (tiny clusters can reduce recall). Decrease to improve recall (at the cost of speed). |
| with_raw_data | Stores original vectors in addition to quantized representation | true / false (default true) | Keep true when accuracy matters. Use false when storage/memory is your bottleneck and you can accept slightly lower accuracy. |
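
To get a feel for what with_raw_data costs, a back-of-the-envelope estimate of the raw vector footprint can help. The sketch below assumes 32-bit float components (4 bytes each) and ignores any index or metadata overhead; how VBase actually lays out the data is not specified here, and the helper name is hypothetical.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_component=4):
    """Bytes needed to keep uncompressed vectors (assumes 4-byte float32 components)."""
    return num_vectors * dim * bytes_per_component

# Example: 10 million 768-dimensional embeddings.
size = raw_vector_bytes(10_000_000, 768)
print(f"{size / 1e9:.2f} GB")  # 30.72 GB
```

If that figure is more than you can spare, that is the situation where with_raw_data=false becomes worth evaluating.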
Search parameters
| Parameter | What it does | Typical range | Practical guidance |
|---|---|---|---|
| reorder_k | Number of candidates to refine during re‑ranking | ≥ 1 | Higher improves recall/quality but increases latency. A good starting point is 2–5× limit. |
| nprobe | Number of clusters to search | 1 – nlist (default 8) | Higher improves recall but increases latency. Increase gradually until recall is acceptable. |
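
The guidance in the table can be turned into a small helper that derives starting search parameters from limit and nlist. This is only a sketch: the function name and the default multiplier of 3 (a midpoint of the 2–5× range) are illustrative choices, not part of the SDK.

```python
def starting_search_params(limit, nlist, multiplier=3, nprobe=8):
    """Suggest initial SCANN search params; tune from here based on measured recall."""
    if limit < 1 or nlist < 1:
        raise ValueError("limit and nlist must be at least 1")
    return {
        # nprobe cannot usefully exceed the number of clusters in the index.
        "nprobe": min(nprobe, nlist),
        # Re-rank a few times more candidates than the number of results requested.
        "reorder_k": multiplier * limit,
    }

print(starting_search_params(limit=10, nlist=1024))
```

The returned dictionary has the same shape as the params argument used in the search example earlier on this page.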
Quick tuning recipe
If results look too approximate:
- Increase reorder_k (often the biggest quality win).
- Increase nprobe.
- If you can afford it, ensure with_raw_data=true.
If latency is too high:
- Decrease reorder_k.
- Decrease nprobe.
- Consider increasing nlist (if you have enough data and your recall stays acceptable).
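
One way to apply this recipe systematically is to measure recall against known-correct results for a handful of queries and pick the cheapest setting that meets your target. The sketch below assumes you supply a hypothetical `search_fn(query, nprobe, reorder_k, limit)` adapter that returns a list of result ids (for example, a thin wrapper around your own search call), plus ground-truth ids obtained from an exact search. Ordering the grid by `nprobe × multiplier` is a rough proxy for cost, not a measured latency.

```python
from itertools import product

def recall_at_k(found_ids, true_ids):
    """Fraction of the true top-K ids that the approximate search returned."""
    true_set = set(true_ids)
    if not true_set:
        return 1.0
    return len(true_set & set(found_ids)) / len(true_set)

def sweep(search_fn, queries, ground_truth, limit,
          nprobes=(4, 8, 16, 32), multipliers=(2, 3, 5), target=0.95):
    """Try settings from (roughly) cheapest to costliest; return the first that meets target."""
    grid = sorted(product(nprobes, multipliers), key=lambda p: p[0] * p[1])
    for nprobe, mult in grid:
        reorder_k = mult * limit
        recalls = [
            recall_at_k(search_fn(q, nprobe, reorder_k, limit), truth)
            for q, truth in zip(queries, ground_truth)
        ]
        mean_recall = sum(recalls) / len(recalls)
        if mean_recall >= target:
            return {"nprobe": nprobe, "reorder_k": reorder_k, "recall": mean_recall}
    return None  # nothing in the grid reached the target; widen the grid
```

If `sweep` returns None, widening the nprobe range or confirming with_raw_data=true are the next things to try, in line with the recipe above.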
Related: Index types, metrics, and search parameters are shared concepts across VBase indexes—SCANN just gives you an excellent speed/quality balance on large datasets.