Quickstart
Five minutes from here you’ll have a bucket where every uploaded PDF auto-indexes into a vector collection and is searchable.
We’ll wire up the full chain: source → pipeline → rule → upload → ingest job → search.
This is the canonical RAG-ingest flow. The same shape works for warehouse pipelines (PDFs → rows) and free pipelines (PDFs → side-effects) — see Recipes when you’re done.
Prerequisites
- dodil CLI installed and dodil login done (see CLI Basics).
- A bucket. We’ll use kb-quickstart. If you don’t have one:
  dodil k3 bucket create kb-quickstart -d "Pipelines quickstart"
- A vector engine + collection for the destination. If you don’t have one yet, follow the Vector Quickstart first and come back. We’ll assume:
  - vector engine: vec-kb-quickstart
  - vector collection: docs-index
  - store_entity_id for the collection: <COLLECTION_STORE_ENTITY_ID> (you’ll capture this in step 2)
1. Discover available Scriptum templates
K3 ships a production template catalog — 24 audited templates across analysis, embedding, vision, and ecommerce. For “PDF → vector embeddings”, the right template is text_embedding_index (it accepts text, pdf, docx, html, audio, video). List the full catalog with:
dodil k3 template list -o json
Inspect the typed contract for our chosen template:
dodil k3 template get text_embedding_index -o json
The response includes a typed ScriptContract: input fields, output schema, required tools, and labels. K3 uses this server-side to validate the inputs you pass via options.
2. Resolve the destination’s store_entity_id
Your vector collection has a store_entity_id you’ll bind the pipeline to. Capture it from the Vector primitive:
# List collections in your bucket — pick the one you want as the destination
dodil k3 vector list --bucket kb-quickstart -o json
# Note the `store_entity_id` for `docs-index`
Save it as an env var for the next steps:
export COLLECTION_ID="<the store_entity_id from above>"
3. Create the pipeline
A pipeline binds a Scriptum template to a destination. Create one that indexes objects into your collection:
dodil k3 pipeline create embed-docs \
--bucket kb-quickstart \
--scriptum text_embedding_index \
-o json
The response is the full Pipeline proto. Capture pipeline_id:
export PIPELINE_ID="<pipeline_id from the create response>"
The pipeline’s store_entity_kind will be vector (derived server-side from the destination). See Core Concepts → Pipeline for the type.
If you want to set per-pipeline options (chunk size, overlap, etc.), pass --options-json '{"chunk_size":"1000","chunk_overlap":"150"}' at create time, or update later with dodil k3 pipeline update.
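Note that the option values in the example above are JSON strings, not numbers. A minimal sketch of a helper that builds that payload follows. The name build_chunk_options is hypothetical (it is not part of the dodil CLI), and the check that the overlap is smaller than the chunk size is an assumption about sensible chunking, not a documented K3 rule.

```shell
# Hypothetical helper: print an --options-json payload with string values,
# matching the shape shown in the quickstart.
build_chunk_options() {
  local size="$1" overlap="$2" value
  for value in "$size" "$overlap"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "chunk size and overlap must be non-negative integers" >&2
        return 1
        ;;
    esac
  done
  # Assumption: an overlap at least as large as the chunk is never useful.
  if [ "$overlap" -ge "$size" ]; then
    echo "chunk_overlap must be smaller than chunk_size" >&2
    return 1
  fi
  printf '{"chunk_size":"%s","chunk_overlap":"%s"}\n' "$size" "$overlap"
}
```

One way to use it: capture the output with OPTIONS_JSON="$(build_chunk_options 1000 150)" and pass --options-json "$OPTIONS_JSON" to the create command.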
4. Find the internal source
You don’t create it — it already exists. Every bucket gets an internal S3 source the moment CreateBucket runs. You only need to look up its source_id to wire a rule against it:
curl -sS "https://k3.dev.dodil.io/kb-quickstart/sources" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq -r '.sources[] | select(.name == "internal") | .sourceId'
export SOURCE_ID="<the source_id from above>"
The CLI doesn’t yet expose dodil k3 source list, which is why we drop to curl here. See CLI Guide → source.
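The remaining steps depend on the variables exported so far, and a placeholder pasted verbatim (such as a value still starting with "<") would only fail later at the API. A small guard can catch that early. This is a sketch; require_vars is a hypothetical helper name, not a dodil command.

```shell
# Hypothetical helper: fail fast if any named variable is unset, empty,
# or still holds an unfilled "<...>" placeholder from the docs.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are trusted input).
    eval "value=\${$name:-}"
    case "$value" in
      ''|'<'*)
        echo "missing or placeholder value for: $name" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}
```

For example, running require_vars COLLECTION_ID PIPELINE_ID SOURCE_ID || exit 1 before step 5 stops the script when any of the three values was never filled in.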
5. Create the rule
The rule is the trigger: “for every object in the internal source whose key matches *.pdf, run this pipeline.”
dodil k3 ingest add pdf-rule \
--bucket kb-quickstart \
--source "$SOURCE_ID" \
--collection "$PIPELINE_ID" \
--include "**/*.pdf"
The CLI’s --collection flag maps to the API’s pipeline_id field (historical naming). See dodil k3 ingest for the full surface.
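Before uploading, it can help to preview which object keys a PDF rule is likely to pick up. The sketch below is only a local approximation: it keeps keys that end in a lowercase .pdf suffix. How K3 actually evaluates the include pattern (case sensitivity, top-level keys) is not specified in this quickstart, so treat the result as a hint. filter_pdf_keys is a hypothetical name.

```shell
# Hypothetical helper: read object keys on stdin (one per line) and print
# those ending in ".pdf". In a shell case pattern, "*" also matches "/",
# so nested keys such as papers/attention.pdf are kept.
filter_pdf_keys() {
  local key
  while IFS= read -r key; do
    case "$key" in
      *.pdf) printf '%s\n' "$key" ;;
    esac
  done
}
```

Piping a list of candidate keys through filter_pdf_keys shows which uploads you would expect to produce an ingest job under this approximation.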
6. Upload a PDF
# Grab any PDF, or use one of yours
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-quickstart -k papers/attention.pdf
That’s it. K3 matches your rule against the upload, spawns an IngestJob, runs the Scriptum template, and lands embeddings in your vector collection.
7. Watch the ingest job
# All jobs in the bucket — the most recent is yours
dodil k3 ingest jobs --bucket kb-quickstart -o json
# Or filter by pipeline
dodil k3 ingest jobs --bucket kb-quickstart --pipeline-id "$PIPELINE_ID" -o json
Status progression for a successful run:
PENDING ─► PROCESSING ─► COMPLETED
               │
               ▼ on transient error
           RETRYING ─► PROCESSING ─► COMPLETED   (K3 retries automatically)
               │
               ▼ on permanent error
             FAILED
Once COMPLETED, the job’s chunks_created and embeddings_created counters tell you how the object decomposed. See Core Concepts → Ingest job for the full status enum.
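Rather than re-running the jobs command by hand, the status diagram above can be turned into a polling loop. This is a sketch under assumptions: wait_for_job is a hypothetical helper, and it expects you to supply a command that prints exactly one status word. The JSON field that carries the status is not shown in this quickstart, so inspect the real output before writing that command.

```shell
# Hypothetical helper: poll until the job reaches a terminal state.
# Usage: wait_for_job <max_attempts> <sleep_seconds> <command...>
# Returns 0 on COMPLETED, 1 on FAILED, 2 on an unexpected status or a
# failing status command, 3 on timeout.
wait_for_job() {
  local max_attempts="$1" delay="$2" attempt=1 status
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$("$@")" || return 2
    case "$status" in
      COMPLETED) echo "COMPLETED after $attempt poll(s)"; return 0 ;;
      FAILED)    echo "FAILED after $attempt poll(s)" >&2; return 1 ;;
      # Non-terminal states from the diagram: wait and poll again.
      PENDING|PROCESSING|RETRYING) sleep "$delay" ;;
      *) echo "unexpected status: $status" >&2; return 2 ;;
    esac
    attempt=$((attempt + 1))
  done
  echo "timed out after $max_attempts poll(s)" >&2
  return 3
}
```

To use it, define a small function that runs dodil k3 ingest jobs with -o json and extracts the status of your job, then call something like wait_for_job 30 2 your_status_function. Treating RETRYING as non-terminal matches the diagram, where K3 retries automatically.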
You can also inspect the object directly — pipeline_statuses on ObjectInfo reflects per-rule indexing state:
dodil k3 object show papers/attention.pdf -b kb-quickstart -o json
# Look for `pipelineStatuses[]`: one entry per rule that ran
8. Search the result
The embeddings are now live in your vector collection. Query them via the Vector primitive:
dodil k3 search --bucket kb-quickstart --collection docs-index \
--query "what is multi-head attention" -o json
You should see chunks from papers/attention.pdf ranked by similarity.
What you just built
| Step | Entity created | Proto type |
|---|---|---|
| 3 | Pipeline | dodil.k3.pipeline.v1.Pipeline |
| 5 | Rule | dodil.k3.ingest.v1.IngestRule |
| 6 → 7 | Discovery event → Ingest job | IngestJob |
Every subsequent PDF you PUT to kb-quickstart automatically follows the same chain. Add more rules for different extensions, MIME types, or paths; they all dispatch in parallel, with one ingest job per rule per object.
Cleanup
# Delete the rule (stops new ingests but keeps existing jobs)
dodil k3 ingest delete <rule_id> --bucket kb-quickstart
# Delete the pipeline (rule should already be gone or detached)
dodil k3 pipeline delete "$PIPELINE_ID" --bucket kb-quickstart
# Delete the object + the bucket
dodil k3 object remove papers/attention.pdf -b kb-quickstart
dodil k3 bucket delete kb-quickstart
The vector collection persists independently. Drop it via the Vector primitive when you’re done.
Next steps
- Recipes: full worked flows (PDF → vector, S3 → warehouse, replay & retry)
- Core Concepts: every type signature, the event flow in detail
- API Reference: gRPC + HTTP for Source / Pipeline / Ingest services
- CLI Guide: every dodil k3 command in the pipeline domain