
Quickstart

Five minutes from here you’ll have a bucket where every uploaded PDF auto-indexes into a vector collection and is searchable.

We’ll wire up the full chain: source → pipeline → rule → upload → ingest job → search.

This is the canonical RAG-ingest flow. The same shape works for warehouse pipelines (PDFs → rows) and free pipelines (PDFs → side-effects) — see Recipes when you’re done.

Prerequisites

  • dodil CLI installed and dodil login done — see CLI Basics.
  • A bucket. We’ll use kb-quickstart. If you don’t have one:
    dodil k3 bucket create kb-quickstart -d "Pipelines quickstart"
  • A vector engine + collection for the destination. If you don’t have one yet, follow Vector — Quickstart first and come back. We’ll assume:
    • vector engine: vec-kb-quickstart
    • vector collection: docs-index
    • store_entity_id for the collection: <COLLECTION_STORE_ENTITY_ID> (you’ll capture this in step 2)

1. Discover available Scriptum templates

K3 ships a production template catalog — 24 audited templates across analysis, embedding, vision, and ecommerce. For “PDF → vector embeddings”, the right template is text_embedding_index (it accepts text, pdf, docx, html, audio, video). List the full catalog with:

dodil k3 template list -o json

Inspect the typed contract for our chosen template:

dodil k3 template get text_embedding_index -o json

The response includes a typed ScriptContract — input fields, output schema, required tools, and labels. K3 uses this server-side to validate the inputs you pass via options.

2. Resolve the destination’s store_entity_id

Your vector collection has a store_entity_id you’ll bind the pipeline to. Capture it from the Vector primitive:

# List collections in your bucket and pick the one you want as the destination
dodil k3 vector list --bucket kb-quickstart -o json
# Note the `store_entity_id` for `docs-index`

Save it as an env var for the next steps:

export COLLECTION_ID="<the store_entity_id from above>"

3. Create the pipeline

A pipeline binds a Scriptum template to a destination. Create one that indexes objects into your collection:

dodil k3 pipeline create embed-docs \
  --bucket kb-quickstart \
  --scriptum text_embedding_index \
  -o json

The response is the full Pipeline proto — capture pipeline_id:

export PIPELINE_ID="<pipeline_id from the create response>"

The pipeline’s store_entity_kind will be vector (derived server-side from the destination). See Core Concepts → Pipeline for the type.

If you want to set per-pipeline options (chunk size, overlap, etc.), pass --options-json '{"chunk_size":"1000","chunk_overlap":"150"}' at create time, or update later with dodil k3 pipeline update.
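Note that the option values in the example above are JSON strings, not numbers. The following is a minimal sketch of a helper that builds that payload from shell arguments. The function name `build_chunk_options` is hypothetical, and the checks (integers only, overlap smaller than chunk size) are assumptions about sensible values rather than documented K3 validation rules.

```shell
# Hypothetical helper: prints an --options-json payload for the given
# chunk size and overlap. Values are emitted as JSON strings, matching
# the example in this guide.
build_chunk_options() {
  local size="$1" overlap="$2" v
  # Assumption: both values must be plain non-negative integers.
  for v in "$size" "$overlap"; do
    case "$v" in
      ''|*[!0-9]*)
        echo "chunk values must be non-negative integers" >&2
        return 1
        ;;
    esac
  done
  # Assumption: an overlap as large as the chunk itself makes no sense.
  if [ "$overlap" -ge "$size" ]; then
    echo "chunk_overlap must be smaller than chunk_size" >&2
    return 1
  fi
  printf '{"chunk_size":"%s","chunk_overlap":"%s"}\n' "$size" "$overlap"
}

# Usage with the create command from this step:
# dodil k3 pipeline create embed-docs \
#   --bucket kb-quickstart \
#   --scriptum text_embedding_index \
#   --options-json "$(build_chunk_options 1000 150)" \
#   -o json
```

Building the payload with `printf` keeps the quoting in one place, so a bad value fails before the CLI is called.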

4. Find the internal source

You don’t create it — it already exists. Every bucket gets an internal S3 source the moment CreateBucket runs. You only need to look up its source_id to wire a rule against it:

curl -sS "https://k3.dev.dodil.io/kb-quickstart/sources" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq -r '.sources[] | select(.name == "internal") | .sourceId'

export SOURCE_ID="<the source_id from above>"

The CLI doesn’t yet expose dodil k3 source list — that’s why we drop to curl here. See CLI Guide → source.
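If you would rather not copy the ID by hand, the lookup can be wrapped so that it fails loudly when no internal source comes back. This is a sketch only: the function name `pick_internal_source` is hypothetical, it reuses the `jq` filter from this step, and it assumes the response keeps the `sources[].name` / `sources[].sourceId` shape that the filter implies.

```shell
# Hypothetical helper: reads the /sources JSON response on stdin and
# prints the sourceId of the source named "internal".
# Returns non-zero when no such source is found or the JSON is invalid.
pick_internal_source() {
  local id
  id=$(jq -r '.sources[] | select(.name == "internal") | .sourceId' | head -n 1)
  if [ -z "$id" ] || [ "$id" = "null" ]; then
    echo "no internal source found in response" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Usage (needs network access and $DODIL_TOKEN):
# SOURCE_ID=$(curl -sS "https://k3.dev.dodil.io/kb-quickstart/sources" \
#   -H "Authorization: Bearer $DODIL_TOKEN" | pick_internal_source) \
#   && export SOURCE_ID
```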

5. Create the rule

The rule is the trigger: “for every object in the internal source whose key matches **/*.pdf, run this pipeline.”

dodil k3 ingest add pdf-rule \
  --bucket kb-quickstart \
  --source "$SOURCE_ID" \
  --collection "$PIPELINE_ID" \
  --include "**/*.pdf"

The CLI’s --collection flag maps to the API’s pipeline_id field — historical naming. See dodil k3 ingest for the full surface.
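The rule depends on two variables exported in earlier steps, and it is easy to leave one of them as a pasted `<...>` placeholder. A small preflight check can catch that before the rule is created. This is a sketch, not part of the dodil CLI; the function name `require_vars` is hypothetical.

```shell
# Hypothetical preflight: fails if any named variable is unset, empty,
# or still holds a "<...>" placeholder copied from this guide.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Only accept plain identifiers, since the name is passed to eval.
    case "$name" in
      ''|[0-9]*|*[!A-Za-z0-9_]*)
        echo "invalid variable name: $name" >&2
        return 2
        ;;
    esac
    eval "value=\${$name-}"
    case "$value" in
      ''|'<'*)
        echo "variable $name is not set to a real value" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}

# Usage before creating the rule:
# require_vars SOURCE_ID PIPELINE_ID && dodil k3 ingest add pdf-rule \
#   --bucket kb-quickstart \
#   --source "$SOURCE_ID" \
#   --collection "$PIPELINE_ID" \
#   --include "**/*.pdf"
```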

6. Upload a PDF

# Grab any PDF, or use one of yours
curl -sSL https://arxiv.org/pdf/1706.03762.pdf -o attention.pdf
dodil k3 object create ./attention.pdf -b kb-quickstart -k papers/attention.pdf

That’s it. K3 matches your rule against the upload, spawns an IngestJob, runs the Scriptum template, and lands embeddings in your vector collection.

7. Watch the ingest job

# All jobs in the bucket (the most recent is yours)
dodil k3 ingest jobs --bucket kb-quickstart -o json

# Or filter by pipeline
dodil k3 ingest jobs --bucket kb-quickstart --pipeline-id "$PIPELINE_ID" -o json

Status progression for a successful run:

PENDING ─► PROCESSING ─► COMPLETED
              ▼ on transient error
           RETRYING ─► PROCESSING ─► COMPLETED   (K3 retries automatically)
              ▼ on permanent error
           FAILED

Once COMPLETED, the job’s chunks_created and embeddings_created counters tell you how the object decomposed. See Core Concepts → Ingest job for the full status enum.
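Rather than re-running the jobs command by hand, you can poll until the job reaches a terminal state. The sketch below is generic: it repeatedly runs whatever status command you give it and stops on `COMPLETED` or `FAILED`. The function name `poll_until_terminal` is hypothetical, and the `jq` path in the usage comment is an assumption about the JSON shape of the jobs response, so check it against your actual output.

```shell
# poll_until_terminal MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it prints COMPLETED or FAILED on stdout.
# Prints the final state; returns 0 on COMPLETED, 1 on FAILED, 2 on timeout.
poll_until_terminal() {
  local max="$1" delay="$2" attempt=1 status
  shift 2
  while [ "$attempt" -le "$max" ]; do
    status=$("$@") || status=""
    case "$status" in
      COMPLETED) echo COMPLETED; return 0 ;;
      FAILED)    echo FAILED;    return 1 ;;
    esac
    # PENDING, PROCESSING, RETRYING (or a failed lookup): wait and retry.
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo TIMEOUT
  return 2
}

# Usage. ASSUMPTION: the jobs response exposes the latest job's state
# at `.jobs[0].status`; adjust the jq path to the real output.
# latest_status() {
#   dodil k3 ingest jobs --bucket kb-quickstart \
#     --pipeline-id "$PIPELINE_ID" -o json | jq -r '.jobs[0].status'
# }
# poll_until_terminal 30 2 latest_status
```

Separating the polling loop from the status lookup means only the small `latest_status` wrapper needs to change if the response shape differs.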

You can also inspect the object directly — pipeline_statuses on ObjectInfo reflects per-rule indexing state:

dodil k3 object show papers/attention.pdf -b kb-quickstart -o json
# Look for `pipelineStatuses[]`: one entry per rule that ran

8. Search the result

The embeddings are now live in your vector collection. Query them via the Vector primitive:

dodil k3 search --bucket kb-quickstart --collection docs-index \
  --query "what is multi-head attention" -o json

You should see chunks from papers/attention.pdf ranked by similarity.

What you just built

Step | Entity created | Proto type
3 | Pipeline | dodil.k3.pipeline.v1.Pipeline
5 | Rule | dodil.k3.ingest.v1.IngestRule
6 → 7 | Discovery event → Ingest job | IngestJob

Every subsequent PDF you PUT to kb-quickstart automatically follows the same chain. Add more rules for different extensions, MIME types, or paths; they all dispatch in parallel, one ingest job per rule per object.

Cleanup

# Delete the rule (stops new ingests but keeps existing jobs)
dodil k3 ingest delete <rule_id> --bucket kb-quickstart

# Delete the pipeline (rule should already be gone or detached)
dodil k3 pipeline delete "$PIPELINE_ID" --bucket kb-quickstart

# Delete the object + the bucket
dodil k3 object remove papers/attention.pdf -b kb-quickstart
dodil k3 bucket delete kb-quickstart

The vector collection persists independently — drop it via the Vector primitive when you’re done.

Next steps

  • Recipes — full worked flows (PDF → vector, S3 → warehouse, replay & retry)
  • Core Concepts — every type signature, the event flow in detail
  • API Reference — gRPC + HTTP for Source / Pipeline / Ingest services
  • CLI Guide — every dodil k3 command in the pipeline domain