Collections

A collection is where your vectors live.

In Dodil VBase, a collection is the logical container for:

Vector fields (embeddings)
Metadata fields (IDs, tags, timestamps, tenant IDs, etc.)
Indexes that make search fast

Think of a collection as a table: columns are fields, rows are entities (records). Each record can carry one or more vectors plus structured metadata.

VBase is Milvus-backed, so collection concepts map cleanly to the underlying database. Dodil adds a cleaner multi-tenant SDK surface and operational defaults on top.

When should I create a new collection?

A good rule: create a new collection when you need different search behavior, stronger isolation, or a different schema.

Common strategies:

Per product feature: support_articles, customer_chats, code_snippets
Per tenant (SaaS multi-tenant): org_<id>_docs
Per environment: docs_dev, docs_prod
Per embedding model (or per version): docs_e5_large_v1, docs_e5_large_v2

If you only need to separate subsets inside the same schema, prefer metadata filters (or partitions) instead of spinning up many collections.
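The naming strategies above can be kept consistent with a small helper. The sketch below is only an illustration: the function name, its parameters, and the character rule (letters, digits, and underscores, not starting with a digit) are assumptions rather than part of the VBase SDK, so check the rule against your deployment.

```python
import re
from typing import Optional


def collection_name(base: str, *, tenant: Optional[str] = None,
                    model: Optional[str] = None,
                    env: Optional[str] = None) -> str:
    """Compose a collection name such as org_42_docs or docs_prod."""
    parts = []
    if tenant:
        parts.append(f"org_{tenant}")   # per-tenant prefix
    parts.append(base)                  # product feature, e.g. "docs"
    if model:
        parts.append(model)             # embedding model or version
    if env:
        parts.append(env)               # environment suffix
    name = "_".join(parts).lower()
    # Assumed naming rule; adjust to whatever your database enforces.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        raise ValueError(f"invalid collection name: {name!r}")
    return name


print(collection_name("docs", tenant="42"))
print(collection_name("docs", model="e5_large_v2"))
print(collection_name("docs", env="prod"))
```

Centralizing the convention in one function makes it harder for two services to invent slightly different names for the same collection.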

Core building blocks

Schema and fields

A collection schema defines:

Primary key field: uniquely identifies each record
Vector field(s): one or more embeddings with a fixed dimension
Scalar fields: metadata you filter on (strings, ints, booleans, timestamps, etc.)

Typical metadata fields you’ll want from day one:

tenant_id (or org_id) for isolation
source (file, url, connector)
doc_id / chunk_id
created_at / updated_at
tags or category
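To make the schema rules concrete, here is a minimal sketch in plain Python that describes a schema and checks it before any collection is created. FieldSpec, validate_schema, the type labels, and the dimension of 1024 are all hypothetical and chosen for illustration; they are not VBase or Milvus types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FieldSpec:
    name: str
    kind: str                  # "primary", "vector", or "scalar" (illustrative labels)
    dtype: str                 # e.g. "string", "int64", "float_vector"
    dim: Optional[int] = None  # only meaningful for vector fields


def validate_schema(fields: List[FieldSpec]) -> None:
    """Raise ValueError if the schema breaks one of the rules above."""
    names = [f.name for f in fields]
    if len(names) != len(set(names)):
        raise ValueError("field names must be unique")
    if sum(f.kind == "primary" for f in fields) != 1:
        raise ValueError("exactly one primary key field is required")
    vectors = [f for f in fields if f.kind == "vector"]
    if not vectors:
        raise ValueError("at least one vector field is required")
    for f in vectors:
        if f.dim is None or f.dim <= 0:
            raise ValueError(f"vector field {f.name!r} needs a fixed positive dimension")


DOCS_SCHEMA = [
    FieldSpec("chunk_id", "primary", "string"),
    FieldSpec("embedding", "vector", "float_vector", dim=1024),  # placeholder dimension
    FieldSpec("tenant_id", "scalar", "string"),
    FieldSpec("source", "scalar", "string"),
    FieldSpec("doc_id", "scalar", "string"),
    FieldSpec("created_at", "scalar", "int64"),
]

validate_schema(DOCS_SCHEMA)  # passes silently when the schema is well formed
```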

Optional metadata (dynamic fields)

If your metadata is not fixed (different keys per document), enable dynamic fields so you can attach arbitrary attributes without redesigning the schema.

Use dynamic metadata for:

connector-specific properties (e.g., jira_issue_key, slack_channel_id)
extraction attributes (language, mime type, page number)
enrichment signals (confidence scores, classification labels)

Keep “hot” filter fields (used in most queries) as explicit scalar fields for performance and clarity.
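One way to picture the split between explicit and dynamic fields is the sketch below, which separates an incoming record into the two groups. The field names and the split_record helper are assumptions for illustration; with dynamic fields enabled, the database performs the equivalent separation for you.

```python
from typing import Any, Dict, Tuple

# Hypothetical "hot" fields declared explicitly in the schema.
EXPLICIT_FIELDS = {"chunk_id", "tenant_id", "source", "created_at"}


def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (explicit, dynamic) parts of a record."""
    missing = EXPLICIT_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing explicit fields: {sorted(missing)}")
    explicit = {k: v for k, v in record.items() if k in EXPLICIT_FIELDS}
    dynamic = {k: v for k, v in record.items() if k not in EXPLICIT_FIELDS}
    return explicit, dynamic


explicit, dynamic = split_record({
    "chunk_id": "c-1",
    "tenant_id": "acme",
    "source": "connector",
    "created_at": 1767225600,
    "jira_issue_key": "OPS-12",   # connector-specific, stays dynamic
    "language": "en",             # extraction attribute, stays dynamic
})
print(sorted(dynamic))
```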

Primary key and Auto ID

Every record should have a globally unique primary key.

You typically have two options:

Bring your own ID (recommended): use your existing chunk_id / document_id / UUID.
Auto-generated IDs: let the database assign IDs during insert.

Bring-your-own ID makes upserts, dedupe, and cross-system traceability much easier.
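A common way to bring your own ID is to derive it deterministically from values you already have, so that re-ingesting the same chunk produces the same key. The sketch below uses the standard library's uuid5; the namespace URL and the tenant/doc/chunk key layout are assumptions, not a VBase convention.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/chunks")


def chunk_pk(tenant_id: str, doc_id: str, chunk_index: int) -> str:
    """Derive a stable primary key for one chunk of one document."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{tenant_id}/{doc_id}/{chunk_index}"))


# Simulate an upsert-by-key store: ingesting the same chunk twice keeps one row.
store = {}
for _ in range(2):
    pk = chunk_pk("acme", "handbook.pdf", 0)
    store[pk] = {"text": "Welcome to the handbook"}

print(len(store))
```

Because the key depends only on the inputs, a retry or a full re-ingest overwrites existing rows instead of duplicating them.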

Indexes

Indexes are what turn “vector storage” into “vector search”.

In practice you’ll manage two kinds of indexes:

Vector indexes (mandatory for production search performance)
Scalar indexes for fast metadata filtering (tenant, time ranges, categories)
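Scalar indexes pay off when queries carry metadata conditions on tenant, time range, or category. As a rough sketch, the helper below assembles such a condition as a string in the style of Milvus boolean expressions. The build_filter name and the field names are hypothetical, and the exact expression syntax accepted by the VBase SDK should be confirmed in its reference.

```python
import json
from typing import Optional, Sequence


def build_filter(tenant_id: str, *,
                 created_after: Optional[int] = None,
                 categories: Optional[Sequence[str]] = None) -> str:
    """Combine common metadata conditions into one filter expression."""
    # json.dumps quotes and escapes string values safely.
    clauses = [f"tenant_id == {json.dumps(tenant_id)}"]
    if created_after is not None:
        clauses.append(f"created_at >= {int(created_after)}")
    if categories:
        clauses.append(f"category in {json.dumps(list(categories))}")
    return " and ".join(clauses)


print(build_filter("acme", created_after=1767225600, categories=["faq", "howto"]))
# tenant_id == "acme" and created_at >= 1767225600 and category in ["faq", "howto"]
```

Always including the tenant clause first keeps tenant isolation from depending on each caller remembering to add it.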

A healthy default workflow is:

1. Create collection + schema
2. Insert data
3. Build indexes
4. Load collection into memory (or warm it)

Load and release

Search and query are typically memory-intensive.

Load a collection to make it available for low-latency search.
Release a collection when it’s not actively used to save resources.

For production, you usually keep your “hot” collections loaded and release rarely used ones.
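The "keep hot, release cold" policy can be expressed as a small decision function. This sketch only decides which collections are candidates for release; the plan_release name, the idle threshold, and the pinned set are assumptions, and the actual release call depends on your SDK.

```python
from typing import Dict, FrozenSet, List


def plan_release(last_used: Dict[str, float], now: float, idle_seconds: float,
                 pinned: FrozenSet[str] = frozenset()) -> List[str]:
    """Return collections idle for longer than idle_seconds, excluding pinned ones."""
    return sorted(
        name for name, ts in last_used.items()
        if name not in pinned and now - ts > idle_seconds
    )


last_used = {"docs_prod": 990.0, "docs_dev": 100.0, "support_articles": 200.0}
to_release = plan_release(last_used, now=1000.0, idle_seconds=600.0,
                          pinned=frozenset({"support_articles"}))
print(to_release)
```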

Partitions

Partitions are subsets of a collection that share the same schema.

Use partitions when you need coarse segmentation such as:

large “dataset buckets” you query independently
time-based separation (e.g., 2026_01, 2026_02)
customer tiers / regions when filters become very repetitive

If your segmentation is flexible and you always rely on metadata conditions, filters are often enough.
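For time-based partitions, routing is usually a pure function of the record's timestamp. The sketch below derives names in the 2026_01 style used above and lists the partitions a date-range query would need to touch. Both helper names are hypothetical, and the choice of UTC month boundaries is an assumption.

```python
from datetime import datetime, timezone
from typing import List


def partition_for(ts: datetime) -> str:
    """Map a timezone-aware timestamp to a monthly partition name."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = ts.astimezone(timezone.utc)  # assumed convention: months are cut in UTC
    return f"{ts.year:04d}_{ts.month:02d}"


def partitions_between(start: datetime, end: datetime) -> List[str]:
    """List every monthly partition from start to end, inclusive."""
    first, last = partition_for(start), partition_for(end)
    year, month = int(first[:4]), int(first[5:])
    names = []
    while f"{year:04d}_{month:02d}" <= last:
        names.append(f"{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


print(partition_for(datetime(2026, 1, 15, tzinfo=timezone.utc)))
print(partitions_between(datetime(2025, 12, 3, tzinfo=timezone.utc),
                         datetime(2026, 2, 1, tzinfo=timezone.utc)))
```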

Shards

Shards are horizontal slices of a collection that spread incoming writes, so the shard count mainly affects write throughput.

If you expect high ingestion rates (many writers or heavy streaming), you may want more shards at collection creation time. Treat this as a capacity planning lever rather than an application-level concept.

Aliases

Aliases let you point a stable name to a collection.

This is useful for:

zero-downtime migrations (switch alias from v1 → v2)
A/B testing search configurations
model upgrades (new embedding dimension / new index type)

Example pattern:

1. Create docs_v2
2. Backfill embeddings
3. Validate search quality
4. Switch alias docs_current → docs_v2
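The alias switch in this pattern is a single pointer update, which is why readers never see a half-migrated state. The in-memory model below illustrates that idea only; AliasRegistry and its methods are hypothetical and stand in for whatever alias operations your SDK exposes.

```python
from typing import Dict, Optional, Set


class AliasRegistry:
    """Toy model of collections and the aliases that point at them."""

    def __init__(self) -> None:
        self._collections: Set[str] = set()
        self._aliases: Dict[str, str] = {}

    def create_collection(self, name: str) -> None:
        self._collections.add(name)

    def point(self, alias: str, collection: str) -> Optional[str]:
        """Point alias at collection and return the previous target, if any."""
        if collection not in self._collections:
            raise KeyError(f"unknown collection: {collection}")
        previous = self._aliases.get(alias)
        self._aliases[alias] = collection  # the whole switch is this one assignment
        return previous

    def resolve(self, name: str) -> str:
        """Translate an alias to its collection; plain names pass through."""
        return self._aliases.get(name, name)


registry = AliasRegistry()
registry.create_collection("docs_v1")
registry.point("docs_current", "docs_v1")

registry.create_collection("docs_v2")                 # create, backfill, validate
previous = registry.point("docs_current", "docs_v2")  # then switch
print(registry.resolve("docs_current"), "replaced", previous)
```

Keeping the previous target around, as point() returns it here, makes a rollback just another switch back to docs_v1.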

A practical lifecycle

Most teams end up following this lifecycle:

1. Design schema (vector dims + metadata fields)
2. Create collection
3. Ingest / upsert entities (often from VNG)
4. Build indexes
5. Load collection (warm)
6. Search / query with filters
7. Iterate (new fields, new index params, new collection version)

Best practices

Start with tenant isolation (at least a tenant_id field); it prevents data leaks later.
Keep schemas stable. If you need a major schema change (dimension change, model swap), create a new collection.
Prefer aliases for versioning and safe rollouts.
Use filters for fine-grained selection; use partitions for coarse recurring segmentation.
Track provenance metadata (source, doc_id, chunk_id) so you can debug and rebuild.

Learn how to create your first collection in VBase
Explore indexes and search configuration
Use VNG to ingest documents and stream chunks into collections