Quickstart

In about five minutes you’ll have a bucket where every uploaded PDF is automatically indexed into a vector collection and becomes searchable.

We’ll wire up the full chain: source → pipeline → rule → upload → ingest job → search.

This is the canonical RAG-ingest flow. The same shape works for warehouse pipelines (PDFs → rows) and free pipelines (PDFs → side-effects) — see Recipes when you’re done.

Prerequisites

The dodil CLI is installed and you have run dodil login (see CLI Basics).

A bucket. We’ll use kb-quickstart. If you don’t have one:


dodil k3 bucket create kb-quickstart -d "Pipelines quickstart"

A vector engine + collection for the destination. If you don’t have one yet, follow Vector — Quickstart first and come back. We’ll assume:
- vector engine: vec-kb-quickstart
- vector collection: docs-index
- store_entity_id for the collection: <COLLECTION_STORE_ENTITY_ID> (you’ll capture this in step 2)

1. Discover available Scriptum templates

K3 ships a production template catalog — 24 audited templates across analysis, embedding, vision, and ecommerce. For “PDF → vector embeddings”, the right template is text_embedding_index (it accepts text, pdf, docx, html, audio, video). List the full catalog with:


dodil k3 template list -o json

Inspect the typed contract for our chosen template:


dodil k3 template get text_embedding_index -o json

The response includes a typed ScriptContract — input fields, output schema, required tools, and labels. K3 uses this server-side to validate the inputs you pass via options.

2. Resolve the destination’s `store_entity_id`

Your vector collection has a store_entity_id you’ll bind the pipeline to. Capture it from the Vector primitive:


# List collections in your bucket — pick the one you want as the destination
dodil k3 vector list --bucket kb-quickstart -o json
# Note the `store_entity_id` for `docs-index`

Save it as an env var for the next steps:


export COLLECTION_ID="<the store_entity_id from above>"

3. Create the pipeline

A pipeline binds a Scriptum template to a destination. Create one that indexes objects into your collection:


dodil k3 pipeline create embed-docs \
  --bucket kb-quickstart \
  --scriptum text_embedding_index \
  -o json

The response is the full Pipeline proto — capture pipeline_id:


export PIPELINE_ID="<pipeline_id from the create response>"

The pipeline’s store_entity_kind will be vector (derived server-side from the destination). See Core Concepts → Pipeline for the type.

If you want to set per-pipeline options (chunk size, overlap, etc.), pass --options-json '{"chunk_size":"1000","chunk_overlap":"150"}' at create time, or update later with dodil k3 pipeline update.
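To get a feel for what chunk_size and chunk_overlap do together, the sketch below estimates how many chunks a document of a given length would produce. It assumes a plain sliding window in which each chunk starts chunk_size minus chunk_overlap units after the previous one. Whether the template counts characters or tokens, or splits on sentence boundaries, is not stated here, so treat the result as a rough approximation. The function name estimate_chunks is illustrative and not part of the CLI.

```shell
# estimate_chunks LENGTH CHUNK_SIZE CHUNK_OVERLAP
# Prints the number of windows a sliding-window chunker would emit.
# Assumption: plain sliding window; the real template may chunk differently.
estimate_chunks() {
  length="$1"; size="$2"; overlap="$3"
  if [ "$overlap" -ge "$size" ]; then
    echo "chunk_overlap must be smaller than chunk_size" >&2
    return 1
  fi
  # Anything that fits in one window is a single chunk.
  if [ "$length" -le "$size" ]; then
    echo 1
    return 0
  fi
  stride=$((size - overlap))
  # Ceiling division over the text that the windows still have to cover.
  echo $(( (length - overlap + stride - 1) / stride ))
}

estimate_chunks 10000 1000 150   # prints 12
```

With the values from the example options, each window advances by 850 units, so a larger overlap means more chunks (and more embeddings) for the same document.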

4. Find the internal source

You don’t create it — it already exists. Every bucket gets an internal S3 source the moment CreateBucket runs. You only need to look up its source_id to wire a rule against it:


# Look up the internal source and capture its ID in one step
export SOURCE_ID="$(curl -sS "https://k3.dev.dodil.io/kb-quickstart/sources" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq -r '.sources[] | select(.name == "internal") | .sourceId')"
echo "$SOURCE_ID"

The CLI doesn’t yet expose dodil k3 source list — that’s why we drop to curl here. See CLI Guide → source.
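Step 5 depends on PIPELINE_ID and SOURCE_ID holding real IDs. A small guard like the one below can catch a variable that is empty, still contains a pasted `<placeholder>`, or is the literal string null (which is what jq -r prints when a field is missing). The helper name require_ids is hypothetical; it is plain shell and not part of the dodil CLI.

```shell
# require_ids NAME...
# Fails if any named variable is empty, "null", or still a <placeholder>.
require_ids() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name.
    eval "value=\${$name:-}"
    case "$value" in
      ""|"null"|"<"*)
        echo "error: $name is not set to a real ID" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}

# Usage before creating the rule:
#   require_ids PIPELINE_ID SOURCE_ID || exit 1
```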

5. Create the rule

The rule is the trigger: “for every object in the internal source whose key matches **/*.pdf, run this pipeline.”


dodil k3 ingest add pdf-rule \
  --bucket kb-quickstart \
  --source "$SOURCE_ID" \
  --collection "$PIPELINE_ID" \
  --include "**/*.pdf"

The CLI’s --collection flag maps to the API’s pipeline_id field — historical naming. See dodil k3 ingest for the full surface.

6. Upload a PDF


# Grab any PDF, or use one of yours
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
 
dodil k3 object create ./attention.pdf -b kb-quickstart -k papers/attention.pdf

That’s it. K3 matches your rule against the upload, spawns an IngestJob, runs the Scriptum template, and lands embeddings in your vector collection.

7. Watch the ingest job


# All jobs in the bucket — the most recent is yours
dodil k3 ingest jobs --bucket kb-quickstart -o json
 
# Or filter by pipeline
dodil k3 ingest jobs --bucket kb-quickstart --pipeline-id "$PIPELINE_ID" -o json

Status progression, including the retry and failure branches:


PENDING ─► PROCESSING ─► COMPLETED
                │
                ▼  on transient error
            RETRYING ─► PROCESSING ─► COMPLETED   (K3 retries automatically)
                │
                ▼  on permanent error
              FAILED

Once COMPLETED, the job’s chunks_created and embeddings_created counters tell you how the object decomposed. See Core Concepts → Ingest job for the full status enum.
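If you would rather block until the job reaches a terminal state than re-run the listing by hand, a polling loop along these lines is one option. It is a sketch under two assumptions: that the status strings in the JSON match the names in the diagram above, and that the most recent job is at `.jobs[0].status` (that path is a guess at the output shape, so adjust it to what the CLI actually returns). wait_for_job and latest_status are illustrative names.

```shell
# wait_for_job CMD [ARGS...]
# Runs CMD repeatedly; CMD must print the current job status on stdout.
# Returns 0 on COMPLETED, 1 on FAILED, 2 if the attempt budget runs out.
wait_for_job() {
  interval="${POLL_INTERVAL:-5}"       # seconds between polls
  attempts="${POLL_MAX_ATTEMPTS:-60}"  # give up after this many polls
  n=0
  while [ "$n" -lt "$attempts" ]; do
    status="$("$@")"
    case "$status" in
      COMPLETED) echo "COMPLETED"; return 0 ;;
      FAILED)    echo "FAILED";    return 1 ;;
    esac
    # PENDING, PROCESSING and RETRYING are non-terminal: keep polling.
    n=$((n + 1))
    sleep "$interval"
  done
  echo "TIMEOUT"
  return 2
}

# Hypothetical status command; the `.jobs[0].status` path is an assumption
# about the JSON shape, not a documented contract.
latest_status() {
  dodil k3 ingest jobs --bucket kb-quickstart --pipeline-id "$PIPELINE_ID" -o json \
    | jq -r '.jobs[0].status'
}

# Usage:
#   wait_for_job latest_status
```

Because K3 retries transient errors on its own, the loop treats RETRYING as "keep waiting" and only stops on COMPLETED, FAILED, or the attempt limit.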

You can also inspect the object directly — pipeline_statuses on ObjectInfo reflects per-rule indexing state:


dodil k3 object show papers/attention.pdf -b kb-quickstart -o json
# Look for `pipelineStatuses[]` — one entry per rule that ran

8. Search the result

The embeddings are now live in your vector collection. Query them via the Vector primitive:


dodil k3 search --bucket kb-quickstart --collection docs-index \
  --query "what is multi-head attention" -o json

You should see chunks from papers/attention.pdf ranked by similarity.

What you just built

Step	Entity created	Proto type
3	Pipeline	`dodil.k3.pipeline.v1.Pipeline`
5	Rule	`dodil.k3.ingest.v1.IngestRule`
6 → 7	Discovery event → Ingest job	`IngestJob`

Every subsequent PDF you PUT to kb-quickstart automatically follows the same chain. Add more rules for different extensions, MIME types, or paths; they all dispatch in parallel, with one ingest job per rule per object.
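Adding rules for other extensions can reuse the exact command from step 5 with a different name and include pattern. The wrapper below is a minimal sketch: it uses only the flags shown in step 5, and the rule names (docx-rule, html-rule) and the DRY_RUN switch are illustrative. The patterns also assume they follow the same glob style as `**/*.pdf`.

```shell
# add_rule NAME PATTERN
# Creates one ingest rule that routes matching keys to $PIPELINE_ID.
# With DRY_RUN=1 the command is printed instead of executed.
add_rule() {
  name="$1"; pattern="$2"
  set -- dodil k3 ingest add "$name" \
    --bucket kb-quickstart \
    --source "$SOURCE_ID" \
    --collection "$PIPELINE_ID" \
    --include "$pattern"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage (preview first, then create the rules for real):
#   DRY_RUN=1 add_rule docx-rule "**/*.docx"
#   add_rule docx-rule "**/*.docx"
#   add_rule html-rule "**/*.html"
```

Since text_embedding_index also accepts docx and html, both rules can point at the same pipeline. An object that matches several rules gets one ingest job per matching rule.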

Cleanup


# Delete the rule (stops new ingests but keeps existing jobs)
dodil k3 ingest delete <rule_id> --bucket kb-quickstart
 
# Delete the pipeline (rule should already be gone or detached)
dodil k3 pipeline delete "$PIPELINE_ID" --bucket kb-quickstart
 
# Delete the object + the bucket
dodil k3 object remove papers/attention.pdf -b kb-quickstart
dodil k3 bucket delete kb-quickstart

The vector collection persists independently — drop it via the Vector primitive when you’re done.

Next steps

Recipes — full worked flows (PDF → vector, S3 → warehouse, replay & retry)
Core Concepts — every type signature, the event flow in detail
API Reference — gRPC + HTTP for Source / Pipeline / Ingest services
CLI Guide — every dodil k3 command in the pipeline domain