RAG knowledge base

RAG (retrieval-augmented generation) grounds a model's answers in your own documents: instead of hoping a model already knows them, you retrieve the relevant passages at question time and hand them to the model as context. The loop is always the same four steps:

Ingest — collect your documents (PDFs, docs, text, HTML).
Embed — turn each chunk into a vector that captures its meaning.
Vector search — embed the question, find the nearest chunks.
LLM answer — feed those chunks to a chat model and let it write a grounded answer.
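
The four steps can be sketched end to end in plain Python. This is a toy under loud assumptions: `embed` is a word-count stand-in for a real embedding model, `build_prompt` only assembles the text a chat model would receive, and none of these function names come from Dodil.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ingest(docs: dict[str, str]) -> list[dict]:
    # Ingest + embed: one chunk per paragraph, stored next to its vector.
    index = []
    for source, body in docs.items():
        for para in body.split("\n\n"):
            para = para.strip()
            if para:
                index.append({"source": source, "text": para, "vec": embed(para)})
    return index

def search(index: list[dict], question: str, k: int = 2) -> list[dict]:
    # Vector search: embed the question, rank chunks by similarity.
    qvec = embed(question)
    return sorted(index, key=lambda row: cosine(qvec, row["vec"]), reverse=True)[:k]

def build_prompt(index: list[dict], question: str, k: int = 2) -> str:
    # LLM answer (prompt only): the retrieved chunks become the context.
    context = "\n---\n".join(row["text"] for row in search(index, question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

A real system swaps `embed` for a model call and sends the prompt to a chat model, but the shape of the loop stays the same.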

On Dodil, three products cover this end to end:

K3 — object storage + ingest pipelines + a managed vector engine. Drop documents in a bucket and K3 chunks, embeds, and indexes them for you.
Ignite Models — managed, OpenAI-compatible inference: embeddings and chat completions over a global model catalog.
VBase — a managed Milvus 2.6 vector database you drive with the standard Milvus SDK when you want to own the schema and index.

You can build RAG two ways, and this page shows both:

Humans usually start with the CLI — the K3 managed path. One bucket, one pipeline template, upload, search. K3 does the chunking and embedding.
Apps and agents use the typed API — the composable Python path. You wire Ignite Models (embeddings + chat) to VBase (vectors) yourself, for full control.

What you’ll build

A small searchable knowledge base and a simple answer function on top of it:

a corpus of documents you can grow at any time,
semantic search that returns the most relevant chunks for a question,
an answer(question) step that retrieves chunks and asks an LLM to answer using only that context.

The CLI way gets you there with managed K3 primitives. The programmatic way assembles the same thing from Ignite Models + VBase — the path that scales into production and that agents discover by reflection.

The CLI way

This is the K3 managed path. K3 owns ingest, chunking, embedding, and the vector index; you create a bucket, attach an embedding template, and upload. The full reference is the K3 RAG knowledge base recipe.

Prerequisites: the dodil CLI is installed and you have run dodil login. The commands below also use curl and jq.

1. Create the bucket and turn on the vector engine


# Storage: create the bucket
dodil k3 bucket create kb-platform -d "RAG knowledge base"
 
# Vector: configure the engine (auto mode = K3 provisions VBase for you)
dodil k3 vector store create -b kb-platform -m auto
 
# Wait for the engine to go ACTIVE (usually < 60s)
until dodil k3 vector store get -b kb-platform -o json | jq -e '.status == "ENGINE_STATUS_ACTIVE"' > /dev/null; do
  echo "  waiting for engine..."
  sleep 5
done
echo "engine ACTIVE"

2. Attach an ingest pipeline via a template

K3 ships embedding templates that bundle the chunking strategy, embedding model, and index settings. For PDFs, docx, HTML, and plain text, use text_embedding_index. Browse what’s available first:


dodil k3 vector templates -o json | jq '.templates[] | {id, modalities, acceptedExtensions}'

Create a pipeline-mode collection. In one call, K3 creates the collection, a Scriptum pipeline bound to the template, and an auto-generated ingest rule whose globs come from the template’s accepted extensions:


dodil k3 vector collection add docs -b kb-platform \
  --description "RAG corpus" \
  --template text_embedding_index
 
# Capture the bound pipeline id for watching ingest jobs later
export PIPELINE_ID=$(dodil k3 vector collection get docs -b kb-platform -o json | jq -r '.embedPipelineId')

3. Upload documents — they auto-chunk and embed

Every object that matches the rule’s globs fires an ingest job: K3 chunks it, embeds the chunks with the template’s model, and writes the vectors into the docs collection. No extra wiring.


curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-platform -k papers/attention.pdf

Watch the job land:


dodil k3 ingest jobs -b kb-platform -p "$PIPELINE_ID" -o json \
  | jq '.jobs[] | {object: .object.key, status, chunksCreated, embeddingsWritten}'

Status walks PENDING, then PROCESSING, then COMPLETED. On the happy path chunksCreated == embeddingsWritten. For bulk uploads, K3 speaks native S3, so aws s3 sync and boto3 work against it too; see Storage and S3 compatibility.

4. Search by meaning

The dodil k3 search command is the text-query happy path — quote the query, point it at the bucket and collection:


dodil k3 search "what is multi-head attention" -b kb-platform -c docs -k 5 -o json \
  | jq '.results[] | {score, object: .object.key}'

The CLI exposes only the dense text-query path. For production RAG you usually want hybrid search (dense + BM25) with reranking, plus the chunk content returned for the LLM; those live on the Search API. The next step uses that richer endpoint.

5. Answer with an LLM

Retrieve the top chunks with content via the Search API, then feed them to a chat model. First, pull the chunk text into a shell variable:


# Hybrid + rerank + chunk content, ready to feed an LLM
CONTEXT=$(curl -sS -X POST "https://k3.dev.dodil.io/kb-platform/vector/search" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-platform",
    "collectionName": "docs",
    "text": "what is multi-head attention",
    "topK": 5,
    "searchMode": "SEARCH_MODE_AUTO",
    "rerank": true,
    "includeContent": true
  }' | jq -r '[.results[].content] | join("\n\n---\n\n")')

Then ask Ignite Models to answer using only that context. The Models API is OpenAI-compatible, so the chat call is a plain POST /v1/chat/completions:


curl -sS -X POST "https://api.dev.dodil.io/v1/chat/completions" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg ctx "$CONTEXT" '{
    model: "kimi-k2.5",
    messages: [
      { role: "system", content: "Answer the question using only the provided context. If the context does not contain the answer, say so." },
      { role: "user", content: ("Context:\n" + $ctx + "\n\nQuestion: What is multi-head attention?") }
    ],
    max_tokens: 300
  }')" | jq -r '.choices[0].message.content'

The dodil ignite models chat subcommand exists for quick smoke tests, but the current CLI build sends a fixed stub prompt. To feed your own retrieved context, use the OpenAI-compatible chat endpoint (above) or an SDK (below). See the chat completions reference.

That’s the whole loop with managed primitives: upload to K3 → search K3 → answer with Ignite Models.

The programmatic way (Python)

This is the composable path — the one apps and agents use. You own each step: embeddings from Ignite Models (via the OpenAI SDK), vectors stored and searched in VBase (via the Milvus SDK), and the answer from Ignite Models chat. Nothing is hidden; you control the schema, the index, and the prompt.


pip install openai "pymilvus>=2.6,<2.7" requests

1. Get an IAM access token

Both Ignite Models and VBase authenticate with a Dodil IAM token. Apps use the OAuth 2.0 client-credentials flow against a Service Account — see Get an Access Token for the full flow in every language.


import os, requests
 
TOKEN_URL = "https://id.dev.dodil.io/realms/dodil/protocol/openid-connect/token"
 
def get_token() -> str:
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "client_credentials",
        "client_id": os.environ["DODIL_SERVICE_ACCOUNT_ID"],
        "client_secret": os.environ["DODIL_SERVICE_ACCOUNT_SECRET"],
    }, timeout=10)  # fail fast instead of hanging if the identity service is unreachable
    resp.raise_for_status()
    return resp.json()["access_token"]
 
TOKEN = get_token()
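
Access tokens from a client-credentials flow normally expire, and the snippet above fetches one only once. A long-running app may want a small cache that refetches shortly before expiry. The sketch below assumes the token endpoint returns the standard OAuth 2.0 `expires_in` field; `TokenCache`, its parameters, and the 60-second fallback are hypothetical, not part of any Dodil SDK.

```python
import time
from typing import Callable, Optional

class TokenCache:
    """Caches an access token and refetches it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], dict],                 # returns the token endpoint's JSON body
        skew_seconds: float = 30.0,                # refresh this long before expiry
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._skew = skew_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            body = self._fetch()
            self._token = body["access_token"]
            # expires_in is optional in OAuth 2.0; 60 s is an assumed fallback.
            self._expires_at = now + float(body.get("expires_in", 60))
        return self._token
```

Here `fetch` would wrap the same POST as `get_token()` but return `resp.json()` in full, so the cache can read the lifetime as well as the token. Clients that take the token at construction time would need to be rebuilt, or handed the fresh value, after a refresh.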

2. Embeddings via Ignite Models (OpenAI SDK)

Ignite Models speaks the OpenAI wire format verbatim. Point the official OpenAI client at the Ignite base URL and pass the IAM token as the api_key. We use arctic-embed-m-v2 — a 768-dimension text embedding model — so the vectors line up with the VBase schema below.


from openai import OpenAI
 
EMBED_MODEL = "arctic-embed-m-v2"   # 768-dim text embeddings; see the model catalog
EMBED_DIM = 768
 
models = OpenAI(
    base_url="https://api.dev.dodil.io/v1",
    api_key=TOKEN,                  # your IAM token IS the API key
)
 
def embed(texts: list[str]) -> list[list[float]]:
    resp = models.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]
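
Embedding endpoints commonly cap how many inputs a single request may carry. The limit for Ignite Models is not stated on this page, so the batch size of 64 below is an assumption, and `batched` and `embed_in_batches` are hypothetical helper names. The sketch takes the embedding function as a parameter, so it works with `embed` above or with any stand-in.

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    # Yield consecutive slices of at most `size` items, preserving order.
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list],
    batch_size: int = 64,          # assumed; check the provider's documented limit
) -> list:
    # Call embed_fn once per batch and concatenate the vectors in input order.
    vectors: list = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(list(batch)))
    return vectors
```

An empty input yields an empty list without calling the endpoint at all, which avoids sending an empty request when a document produces no chunks.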

3. Store and search vectors in VBase (Milvus SDK)

VBase’s data plane is Milvus 2.6, so you connect with the standard pymilvus client. The uri (endpoint + port) and db_name come from GetServiceAccess or dodil vbase db use; the token is the same IAM token. See Connecting with the Milvus SDK.


from pymilvus import MilvusClient, DataType
 
vbase = MilvusClient(
    uri="https://<endpoint>:443",   # from GetServiceAccess / `dodil vbase db use`
    token=TOKEN,                     # your IAM token is your Milvus token
    db_name="<db_name>",
)

Define a schema — a primary key, the 768-dim dense vector, and the chunk text to return — and build an HNSW / COSINE index, which is a strong default for normalized text embeddings:


schema = vbase.create_schema(auto_id=False, enable_dynamic_field=True)
schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=64)
schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=EMBED_DIM)
schema.add_field("text", DataType.VARCHAR, max_length=8192)
schema.add_field("source", DataType.VARCHAR, max_length=256)
 
index_params = vbase.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)
 
vbase.create_collection(
    collection_name="kb",
    schema=schema,
    index_params=index_params,
)
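
Why COSINE pairs well with normalized embeddings: cosine similarity ignores vector length, and on unit-length vectors it equals the plain inner product. The small check below illustrates this with pure Python. Whether a given embedding model returns unit-length vectors is an assumption to verify against its documentation, and the helper names are illustrative only.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: list[float]) -> list[float]:
    # Scale a vector to unit length.
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    # Inner product divided by the product of the two lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
same = math.isclose(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
scale_invariant = math.isclose(cosine(a, b), cosine([6.0, 8.0], b))
```

Both `same` and `scale_invariant` come out true: scaling a vector does not change its cosine score, and after normalization the inner product gives the same ranking.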

Ingest your documents: chunk them however you like, embed the chunks with Ignite Models, and insert. (Use a real chunker for production — splitting on blank lines stands in here.)


def chunk(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]
 
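from pathlib import Path

documents = {
    # read_text opens and closes the file; a bare open(...).read() leaves the handle open
    "attention.txt": Path("attention.txt").read_text(encoding="utf-8"),
    # ...add more files
}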
 
rows, next_id = [], 0
for source, body in documents.items():
    chunks = chunk(body)
    vectors = embed(chunks)
    for c, v in zip(chunks, vectors):
        rows.append({"id": f"d{next_id}", "embedding": v, "text": c, "source": source})
        next_id += 1
 
vbase.insert(collection_name="kb", data=rows)
vbase.load_collection(collection_name="kb")   # load into memory before searching
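
Splitting on blank lines can produce chunks that are far too long for the `text` field or too short to carry meaning. One common alternative is a fixed-size window with overlap, sketched below. The sizes are assumptions to tune, `chunk_with_overlap` is a hypothetical helper, and production chunkers usually also respect sentence or heading boundaries.

```python
def chunk_with_overlap(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    # Slide a window of max_chars over the text, stepping by max_chars - overlap,
    # so neighbouring chunks share `overlap` characters of context.
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    text = text.strip()
    step = max_chars - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        piece = text[start:start + max_chars].strip()
        if piece:
            chunks.append(piece)
        if start + max_chars >= len(text):
            break                      # the window already reached the end
        start += step
    return chunks
```

Keeping `max_chars` well below the schema's `max_length` leaves headroom for multi-byte characters, since Milvus documents VARCHAR limits in bytes rather than characters.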

Search: embed the question with the same model and run a nearest-neighbour query, returning the chunk text:


def retrieve(question: str, k: int = 5) -> list[dict]:
    qvec = embed([question])[0]
    hits = vbase.search(
        collection_name="kb",
        data=[qvec],
        anns_field="embedding",
        limit=k,
        search_params={"metric_type": "COSINE", "params": {"ef": 64}},
        output_fields=["text", "source"],
    )
    return [
        {"text": h["entity"]["text"], "source": h["entity"]["source"], "score": h["distance"]}
        for h in hits[0]
    ]

4. Answer with a retrieval-augmented prompt

Retrieve, build the context block, and ask an Ignite chat model to answer using only that context — reusing the same OpenAI client from step 2:


def answer(question: str) -> str:
    chunks = retrieve(question)
    context = "\n\n---\n\n".join(
        f"[{c['source']}] {c['text']}" for c in chunks
    )
 
    resp = models.chat.completions.create(
        model="kimi-k2.5",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using only the provided context. "
                    "Cite sources by their file name in square brackets. "
                    "If the context does not contain the answer, say you don't know."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        max_tokens=400,
    )
    return resp.choices[0].message.content
 
print(answer("What is multi-head attention?"))

That’s the same RAG loop, fully composed: embed with Ignite Models → store and search in VBase → answer with Ignite Models. Swap the embedding model, tune the HNSW ef, add metadata filters, or add a rerank step — every piece is under your control.
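
One piece worth controlling early is how much retrieved text reaches the prompt. The sketch below builds the context block from dicts shaped like the ones `retrieve` returns, dropping weak hits and stopping at a character budget. It assumes higher scores mean closer matches (as with a cosine metric); `build_context`, the 6000-character budget, and the `min_score` threshold are hypothetical and should be tuned to the chat model's context window.

```python
def build_context(chunks: list[dict], max_chars: int = 6000, min_score: float = 0.0) -> str:
    # Keep the best-scoring chunks first, skip weak ones, stop at the budget.
    separator = "\n\n---\n\n"
    parts: list[str] = []
    used = 0
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if c["score"] < min_score:
            continue
        entry = f"[{c['source']}] {c['text']}"
        cost = len(entry) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(entry)
        used += cost
    return separator.join(parts)
```

If every hit falls below `min_score`, the result is an empty string, which gives the "say you don't know" instruction in the system prompt a chance to apply.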