Schema & fields

A schema is the blueprint for a collection in VBase. It answers two practical questions:

  1. What do you want to store? (vectors + metadata)
  2. How do you want to query it? (similarity search + filters)

In DODIL, you define a schema by choosing a set of fields.

  • Primary key field: uniquely identifies each record.
  • Vector field(s): the embedding(s) you search by similarity.
  • Scalar fields: metadata you filter, sort, or return (strings, numbers, booleans, JSON, arrays, etc.).

Tip: Think of a collection like a database table. A schema is the table definition.

A quick mental model

When you search, VBase typically does two things:

  1. Vector similarity to find the closest embeddings.
  2. Metadata filtering to narrow results (for example: only documents from workspace="acme" or lang="en").

That’s why the schema matters: good field choices make your search faster, cheaper, and easier to reason about.
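The two steps above can be sketched in plain Python. This is a brute-force illustration only, not how VBase is implemented: the `search` helper, the record layout, and the choice to filter before ranking are assumptions made for the example, and a real engine decides the order and uses indexes.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(records: list[dict], query_vector: list[float], top_k: int, **filters) -> list[dict]:
    """Hypothetical helper: filter on metadata, then rank by similarity."""
    # Metadata filtering: keep records that match every key=value filter.
    candidates = [r for r in records if all(r.get(k) == v for k, v in filters.items())]
    # Vector similarity: most similar embeddings first.
    candidates.sort(key=lambda r: cosine_similarity(r["vector"], query_vector), reverse=True)
    return candidates[:top_k]

records = [
    {"id": 1, "vector": [1.0, 0.0], "workspace": "acme", "lang": "en"},
    {"id": 2, "vector": [0.9, 0.1], "workspace": "acme", "lang": "de"},
    {"id": 3, "vector": [0.0, 1.0], "workspace": "acme", "lang": "en"},
    {"id": 4, "vector": [1.0, 0.0], "workspace": "other", "lang": "en"},
]

hits = search(records, [1.0, 0.0], top_k=2, workspace="acme", lang="en")
print([h["id"] for h in hits])  # [1, 3]
```

Only records 1 and 3 pass both filters, so the ranking step never has to look at the other two. That is the practical payoff of putting filterable metadata in the schema.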

Field types you’ll use most

Primary key

A primary key is required. It’s the stable identifier of every row.

Common choices:

  • Integer IDs if your data is generated internally.
  • String IDs if you already have natural identifiers (document ID, UUID, etc.).

Milvus-style schemas support configuring the primary field as auto-generated (auto ID) or user-supplied.
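If you supply string IDs yourself, one option is to derive the key deterministically from identifiers you already have, so re-ingesting the same chunk yields the same key. A minimal sketch using only the standard library; `make_chunk_key` and the `doc_id#chunk_index` naming scheme are illustrative assumptions, not part of VBase or the DODIL SDK.

```python
import uuid

def make_chunk_key(doc_id: str, chunk_index: int) -> str:
    """Derive a stable string primary key from a document ID and chunk position."""
    # uuid5 is name-based: the same input name always produces the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}#{chunk_index}"))

key = make_chunk_key("doc-42", 0)
print(key)
```

A deterministic key like this pairs naturally with a user-supplied primary field; with auto ID you would let the engine assign the key instead.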

Vector fields

A vector field stores the embedding used for similarity search.

You’ll typically define:

  • vector dimension (e.g. 768, 1024, 1536)
  • metric (COSINE / IP / L2 depending on your embedding model)

Milvus-compatible schemas support multiple vector fields in one collection (useful for hybrid or multi-embedding designs).
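To make the dimension and metric choices concrete, here is a small pure-Python sketch of the three metrics named above plus a dimension check. The function names are illustrative, not SDK calls; the engine computes these for you once the field is configured.

```python
import math

def inner_product(a: list[float], b: list[float]) -> float:
    """IP: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a: list[float], b: list[float]) -> float:
    """L2 (Euclidean distance): smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a: list[float], b: list[float]) -> float:
    """COSINE: inner product of the vectors after length normalization."""
    return inner_product(a, b) / (math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))

def check_dimension(vector: list[float], dim: int) -> None:
    """Reject vectors whose length does not match the field's declared dimension."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")

a, b = [3.0, 4.0], [6.0, 8.0]
check_dimension(a, 2)
print(inner_product(a, b), l2_distance(a, b), cosine(a, b))  # 50.0 5.0 1.0
```

The two vectors point in the same direction, so COSINE reports a perfect match even though IP and L2 still reflect the difference in magnitude. This is why the metric should follow what your embedding model was trained for.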

Scalar fields (metadata)

Scalar fields hold the metadata you’ll filter by or return in results.

Common examples:

  • source (string)
  • workspace_id (string)
  • created_at (timestamp or integer)
  • lang (string)
  • is_public (boolean)

Milvus schemas treat these as scalar fields and support common categories like string, number, boolean, and composite types.

Strings, numbers, booleans

String fields

Strings are great for identifiers and small categorical metadata.

Most engines require a max length for string fields (for example, 512). Keep it reasonable to avoid wasting memory.

Number fields

Numbers are ideal for:

  • timestamps (unix seconds/ms)
  • counters
  • ranking features

Typical numeric field types include integers and floats.
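For timestamps, a common pattern is to store Unix milliseconds in an integer field and convert at the application boundary. A minimal standard-library sketch; the helper names are illustrative, and whether you store seconds or milliseconds is your choice as long as it is consistent.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_unix_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer Unix milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone first")
    # Integer division of timedeltas avoids float rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)

def from_unix_ms(ms: int) -> datetime:
    """Convert integer Unix milliseconds back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

created_at = to_unix_ms(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(created_at)  # 1704067200000
```

Storing the value as an integer keeps range filters (for example, "created after this instant") a simple numeric comparison.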

Boolean fields

Use booleans for simple on/off flags like is_public, is_archived, or has_images.

Composite fields: JSON and arrays

JSON fields

Use JSON when your metadata is flexible and not known ahead of time (for example, attributes that differ per document type).

JSON is great for storing and returning metadata, but be intentional about how you filter on it: filtering on JSON keys is usually slower than filtering on dedicated scalar fields.

Array fields

Use arrays for small lists such as tags:

  • tags: ["finance", "invoice", "q4"]

Array fields require you to define:

  • the element type (e.g. string or int)
  • the max capacity (maximum elements)
  • element constraints such as max string length when elements are strings
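The three constraints above can be checked client-side before insert, which gives clearer errors than a rejected batch. A minimal sketch: `validate_tags` and its default limits (16 elements, 64 bytes) are assumptions for illustration, and it counts UTF-8 bytes on the assumption that the engine measures string length in bytes, so use whatever limits and unit your collection actually declares.

```python
def validate_tags(tags: list, max_capacity: int = 16, max_length: int = 64) -> list:
    """Check an array-of-strings value against hypothetical field limits."""
    # Max capacity: the number of elements allowed in the array.
    if len(tags) > max_capacity:
        raise ValueError(f"too many tags: {len(tags)} > {max_capacity}")
    for tag in tags:
        # Element type: every element must be a string.
        if not isinstance(tag, str):
            raise TypeError(f"tag must be a string, got {type(tag).__name__}")
        # Element constraint: max length, counted here in UTF-8 bytes.
        if len(tag.encode("utf-8")) > max_length:
            raise ValueError(f"tag too long: {tag!r}")
    return tags

print(validate_tags(["finance", "invoice", "q4"]))  # ['finance', 'invoice', 'q4']
```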

Nullable and defaults

If a field won’t exist on every record, configure it as nullable or provide a default value. This keeps ingestion resilient when some inputs are missing optional metadata.
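The effect of nullable fields and defaults can be pictured with a client-side sketch. The engine applies these rules itself once the schema is configured; the `DEFAULTS` and `NULLABLE` names and the chosen fields here are hypothetical and only show what a record looks like after the rules are applied.

```python
# Hypothetical field rules for illustration.
DEFAULTS = {"lang": "en", "is_public": False}  # fields with a default value
NULLABLE = {"source"}                           # fields allowed to be null

def normalize(record: dict) -> dict:
    """Return a copy of the record with missing optional fields filled in."""
    out = dict(record)
    for name, value in DEFAULTS.items():
        out.setdefault(name, value)
    for name in NULLABLE:
        out.setdefault(name, None)
    return out

print(normalize({"id": 1, "lang": "de"}))
# {'id': 1, 'lang': 'de', 'is_public': False, 'source': None}
```

Values that are present are left untouched; only the gaps are filled, so a record with partial metadata still ingests cleanly.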

A practical schema example

Here’s a schema that works well for most RAG-style apps (docs, web pages, tickets, PDFs):

  • id (primary key)
  • vector (embedding)
  • doc_id (string)
  • chunk_id (string)
  • source (string)
  • workspace_id (string)
  • created_at (int64)
  • is_public (bool)
  • attributes (json)
  • tags (array of strings)

This gives you:

  • fast similarity search (vector)
  • clean filtering (workspace, source, access flags)
  • flexibility (attributes JSON)
  • lightweight tagging (tags array)
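As a sketch, the field list above can be written down as plain data and sanity-checked before you create anything. `FieldSpec`, `check_schema`, the type names, and every concrete limit (an int64 primary key, dimension 768, the string lengths, the tag capacity) are assumptions for illustration; this is not the DODIL SDK's schema API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical, engine-agnostic description of one schema field."""
    name: str
    dtype: str
    is_primary: bool = False
    dim: Optional[int] = None           # vector fields only
    max_length: Optional[int] = None    # string fields and string array elements
    element_type: Optional[str] = None  # array fields only
    max_capacity: Optional[int] = None  # array fields only

RAG_SCHEMA = [
    FieldSpec("id", "int64", is_primary=True),
    FieldSpec("vector", "float_vector", dim=768),
    FieldSpec("doc_id", "string", max_length=128),
    FieldSpec("chunk_id", "string", max_length=128),
    FieldSpec("source", "string", max_length=256),
    FieldSpec("workspace_id", "string", max_length=64),
    FieldSpec("created_at", "int64"),
    FieldSpec("is_public", "bool"),
    FieldSpec("attributes", "json"),
    FieldSpec("tags", "array", element_type="string", max_capacity=16, max_length=64),
]

def check_schema(fields: list[FieldSpec]) -> None:
    """Raise ValueError if the field list breaks the basic rules described above."""
    names = [f.name for f in fields]
    if len(names) != len(set(names)):
        raise ValueError("field names must be unique")
    if sum(f.is_primary for f in fields) != 1:
        raise ValueError("exactly one primary key field is required")
    if not any(f.dtype == "float_vector" for f in fields):
        raise ValueError("at least one vector field is required")
    for f in fields:
        if f.dtype == "float_vector" and not f.dim:
            raise ValueError(f"{f.name}: vector fields need a dimension")
        if f.dtype == "string" and not f.max_length:
            raise ValueError(f"{f.name}: string fields need a max length")
        if f.dtype == "array" and not (f.element_type and f.max_capacity):
            raise ValueError(f"{f.name}: array fields need element type and max capacity")

check_schema(RAG_SCHEMA)
print([f.name for f in RAG_SCHEMA])
```

Keeping the schema as data like this makes it easy to review in code and to translate into whatever schema-definition calls your SDK version provides.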

How this looks in the DODIL Python SDK

Below is the standard way to connect to VBase (Python 3.10+):

    from dodil import Client
    from dodil.vbase import VBaseConfig

    # Authorize with a Service Account
    c = Client(
        service_account_id="...",
        service_account_secret="...",
    )

    vbase = c.vbase.connect(
        VBaseConfig(
            host="vbase-db-<id>.infra.dodil.cloud",
            port=443,
            scheme="https",
            db_name="db_<id>",
        )
    )

    print(vbase.list_collections())

Next: once you have a schema, you can create a collection and start inserting vectors.
