
Core Concepts — Pipelines

Every type signature below is verbatim from proto/proto-k3/ (k3_source.proto, k3_pipeline.proto, k3_ingest.proto). Pillar packages are dodil.k3.source.v1, dodil.k3.pipeline.v1, dodil.k3.ingest.v1. Wire encodings: gRPC follows the proto types directly; HTTP uses pbjson (camelCase, int64 as strings, enums as wire-name strings).

The six types: Source, Credential, Pipeline, Template, Rule, IngestJob. Their lifecycle and how they compose:

bucket ──┬── Source (where to read from)
         │     │
         │     └── Credential (how to auth)
         ├── Pipeline (Scriptum template + destination)
         │     │
         │     └── Template (catalog entry)
         └── Rule (filter + pipeline binding)
               └── IngestJob (one execution per matched object)

Source

Where K3 reads objects from. Every bucket has an auto-created internal S3 source the moment you call CreateBucket — you don’t register it, name it, or authenticate it. Every direct upload to the bucket flows through your rules.

Preview — external sources (SOURCE_PROVIDER_GOOGLE_DRIVE, _SHAREPOINT, _CONFLUENCE, _GITHUB) are available in the API but are still solidifying for production use. The internal S3 path is the production-ready channel today.

{
  "sourceId": "src_a1b2…",
  "bucket": "kb-prod",
  "provider": "SOURCE_PROVIDER_GOOGLE_DRIVE",
  "name": "finance-drive",
  "description": "Q2 finance docs",
  "rootPath": "/Q2",
  "providerAccount": "drive-id-123",
  "syncIntervalSeconds": "3600",
  "enabled": true,
  "status": "SOURCE_STATUS_ACTIVE",
  "totalObjects": "1247",
  "indexedObjects": "1240",
  "totalBytes": "8923471024",
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000",
  "lastSyncAt": "1716843600000",
  "lastSuccessAt": "1716843600000"
}

Key facts:

  • provider = SOURCE_PROVIDER_INTERNAL_S3 is the auto-created per-bucket source. It fires on every direct S3 PUT to the bucket.
  • sync_interval_seconds = 0 means manual sync only — useful when you want explicit control via TriggerDiscovery.
  • provider_account is provider-shaped: for Google Drive it’s a drive ID; for GitHub a repo URL; for InternalS3 the bucket name itself.
  • total_objects / indexed_objects / total_bytes are aggregates the source maintains as it discovers and ingests.
  • status reflects the source’s most recent sync attempt — ACTIVE is the steady-state success.
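Because the HTTP encoding carries int64 values as strings, a client usually has to convert counters and timestamps before doing arithmetic on them. The sketch below is one way to do that for the Source fields shown above. It assumes the timestamps are Unix epoch milliseconds, which the example values suggest but this page does not state; the helper names (decode_source, indexed_fraction, as_utc) are illustrative and not part of any K3 SDK.

```python
import json
from datetime import datetime, timezone

# int64 fields that arrive as strings in the Source example above.
INT64_FIELDS = (
    "syncIntervalSeconds", "totalObjects", "indexedObjects", "totalBytes",
    "createdAt", "updatedAt", "lastSyncAt", "lastSuccessAt",
)

def decode_source(raw: str) -> dict:
    """Parse a Source JSON body and convert string-encoded int64 fields to int."""
    source = json.loads(raw)
    for name in INT64_FIELDS:
        if name in source:
            source[name] = int(source[name])
    return source

def indexed_fraction(source: dict) -> float:
    """Share of discovered objects that have been indexed (0.0 when nothing is discovered)."""
    total = source.get("totalObjects", 0)
    return source.get("indexedObjects", 0) / total if total else 0.0

def as_utc(millis: int) -> datetime:
    # Assumption: timestamps are Unix epoch milliseconds.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

raw = (
    '{"sourceId": "src_example", "totalObjects": "1247", "indexedObjects": "1240",'
    ' "syncIntervalSeconds": "0", "lastSyncAt": "1716843600000"}'
)
src = decode_source(raw)
print(src["totalObjects"] - src["indexedObjects"])  # objects discovered but not yet indexed
print(src["syncIntervalSeconds"] == 0)              # True means manual sync only
print(as_utc(src["lastSyncAt"]).isoformat())
```

Fields missing from the body are simply skipped, which keeps the helper tolerant of servers that omit default values.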

Credential

How K3 authenticates to an external source. Credentials are org-scoped and may optionally link to a source_id. The secret payload is a oneof keyed on credential_type.

Preview — credentials are used by external sources, which are Preview. The internal S3 source needs no credential. The shape below reflects the API as it stands today; expect refinements as external source support hardens.

{
  "info": {
    "credentialId": "cred_a1b2…",
    "sourceId": "src_a1b2…",
    "provider": "SOURCE_PROVIDER_GOOGLE_DRIVE",
    "credentialType": "CREDENTIAL_TYPE_OAUTH2",
    "displayName": "gdrive-main",
    "isPrimary": true,
    "isValid": true,
    "createdAt": "1716840000000",
    "updatedAt": "1716843600000",
    "expiresAt": "1716847200000"
  },
  "oauth2": {
    "accessToken": "ya29.…",
    "refreshToken": "1//…",
    "expiresAt": "1716847200000",
    "tokenType": "Bearer",
    "scopes": ["https://www.googleapis.com/auth/drive.readonly"]
  }
}

The HTTP request body for StoreCredential follows the same oneof shape — exactly one of accessKey / oauth2 / apiKey / serviceAccount / pat is set.

Key facts:

  • Secrets are stored in HashiCorp Vault behind K3 — never returned in ListCredentials (only CredentialInfo). GetCredential returns the secret payload.
  • GetCredential with auto_refresh = true will refresh expired OAuth tokens before returning, so you always see a usable token.
  • is_primary is a per-(source_id, provider) flag — at most one primary per pairing. Used by sync to pick which credential to authenticate with.
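Since the secret payload is a oneof, a client can catch a malformed StoreCredential body before sending it. The following is a minimal sketch of that check using the five key names listed above; the function name secret_variant is illustrative, and the server remains the authority on what it accepts.

```python
# The five oneof keys named in this section.
SECRET_KEYS = ("accessKey", "oauth2", "apiKey", "serviceAccount", "pat")

def secret_variant(body: dict) -> str:
    """Return the single secret key present on the body; raise if zero or several are set."""
    present = [key for key in SECRET_KEYS if body.get(key) is not None]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {SECRET_KEYS}, got {present}")
    return present[0]

body = {
    "info": {"provider": "SOURCE_PROVIDER_GOOGLE_DRIVE", "displayName": "gdrive-main"},
    "oauth2": {"tokenType": "Bearer"},
}
print(secret_variant(body))  # oauth2
```

A body carrying two variants, or none, raises ValueError instead of reaching the API.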

Pipeline

The recipe: a Scriptum template name + per-pipeline options + an optional destination. First-class row in the K3 DB. Rules and any other triggers reference pipelines by pipeline_id.

{
  "pipelineId": "pipe_a1b2…",
  "bucket": "kb-prod",
  "name": "embed-contracts",
  "scriptumTemplate": "text_embedding_index",
  "options": {
    "chunk_size": "1000",
    "chunk_overlap": "150"
  },
  "storeEntityId": "ent_a1b2…",
  "storeEntityKind": "vector",
  "storeEntityName": "docs-index",
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000"
}

Key facts:

  • A pipeline’s kind is derived server-side from store_entity_id: vector, warehouse, or empty/free. Clients filter by store_entity_kind on ListPipelines but route via pipeline_id.
  • Two script_source modes — reference (scriptum_template: existing Scriptum script name) or spawn (spawn_from_template: template ID, server creates a script from it and persists the resulting name on the pipeline).
  • options is an overlay on top of the Scriptum script’s defaults. Validated server-side against the template’s ScriptContract.
  • Re-bind to a different destination by calling UpdatePipeline with a new store_entity_id. The field is an optional string: passing it cleared makes the pipeline free.

Template

A row in the Scriptum template catalog. Org-scoped, not bucket-scoped. K3 ships a production catalog of 24 templates across four categories — analysis (8), embedding index/search (10), vision (4), ecommerce (2). See API Reference → Templates → The catalog for the full list with descriptions and modalities.

message Template {
  string id = 1;                    // Scriptum template ID
  string name = 2;
  string description = 3;
  repeated string tags = 4;
  repeated string tools_required = 5;
  map<string, string> labels = 6;   // structured key-value labels
  string category = 7;              // e.g. "embedding" | "extraction"

  // Typed I/O contract — proxied verbatim from Scriptum's GetTemplate.
  // UI form generators walk this; K3 reads it server-side to validate
  // caller-supplied template inputs.
  dodil.k3.common.v1.ScriptContract contract = 8;
}

Key facts:

  • category and labels are the filter axes for ListTemplates. Common categories: embedding, extraction, structuring.
  • contract is a typed schema — ScriptContract, detailed in the next section. It tells you what fields a pipeline can override via options and what shape the outputs take.
  • For full Scriptum semantics see the Scriptum docs (separate product) — K3 surfaces only the catalog.

ScriptContract

The typed I/O contract carried on Template.contract (every pillar’s Template surfaces it). It’s K3’s anti-corruption mirror of Scriptum’s contract — re-homed into dodil.k3.common.v1 so pillar protos expose a template’s schema without importing scriptum.proto. Two consumers: form generators (UI / CLI) walk inputs to render per-field controls; the server validates caller-supplied inputs against inputs at creation time (pipeline options, AddVectorPipeline.template_inputs, table-pipeline inputs).

message ScriptContract {
  repeated TypedField inputs = 1;              // what callers may set (validated server-side)
  repeated TypedField outputs = 2;             // shape of what the script emits
  repeated TypedField stream = 3;              // streamed / intermediate fields
  repeated TypedField env = 4;                 // environment inputs (keys, model handles)
  repeated EnumDef enums = 5;                  // named enums referenced by TypeRef.enum_name
  repeated string accepted_content_types = 6;  // input MIME globs (`image/*`); empty = no restriction
  repeated string accepted_extensions = 7;     // lowercase, no leading dot; empty = unrestricted
}

// One field in inputs / outputs / stream / env.
message TypedField {
  string name = 1;
  TypeRef type = 2;
  string default_value_json = 3;  // JSON literal: `"hi"`, `42`, `true`, `[1,2]`; empty = no default
  bool required = 4;              // required if `true` OR no default_value_json
  string description = 5;
}

// A field's type — oneof over four kinds, plus constraints keyed to the kind.
message TypeRef {
  oneof kind {
    TypeRefPrimitive primitive = 1;  // text|number|integer|boolean|object|binary|date|timestamp|any
    string enum_name = 2;            // names an entry in ScriptContract.enums
    ListType list = 3;               // list<T> — recursive
    ObjectShape object = 4;          // anonymous record of named TypedFields — recursive
  }
  NumericConstraints number_constraints = 10;  // numeric kinds
  StringConstraints string_constraints = 11;   // text kind
  ListConstraints list_constraints = 12;       // list kind
}

message ListType { TypeRef item = 1; }
message ObjectShape { repeated TypedField fields = 1; }

// Constraints — only the message matching the field's kind is set; bounds optional (absent = unconstrained).
message NumericConstraints {
  optional double minimum = 1;
  optional double maximum = 2;
  bool exclusive_min = 3;
  bool exclusive_max = 4;
  optional double multiple_of = 5;
}
message StringConstraints {
  optional uint32 min_length = 1;
  optional uint32 max_length = 2;
  optional string pattern = 3;  // pattern = ECMA-262 regex
}
message ListConstraints {
  optional uint32 min_items = 1;
  optional uint32 max_items = 2;
  bool unique_items = 3;
}

// Enums — a field with kind=enum_name resolves to one of these; `base` sets wire serialization.
enum EnumBase {
  ENUM_BASE_UNSPECIFIED = 0;
  ENUM_BASE_TEXT = 1;     // string
  ENUM_BASE_INTEGER = 2;  // int64
  ENUM_BASE_NUMBER = 3;   // double
}
message EnumDef {
  string name = 1;
  EnumBase base = 2;
  repeated EnumVariant variants = 3;
}
message EnumVariant {
  oneof value {  // value matches EnumDef.base
    string value_text = 1;
    int64 value_integer = 2;
    double value_number = 3;
  }
  string description = 4;
}

What the proto doesn’t say outright:

  • list and object nest — a TypeRef recurses to describe list<object{…}> and deeper.
  • Defaults are JSON-encoded strings, not native values — decode default_value_json as a JSON literal before use.
  • accepted_* is the modality gate — align an ingest rule’s MIME / extension filters with these, or the rule won’t match what the template can actually ingest.
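To make the default and required semantics concrete, here is a rough client-side sketch that overlays caller options on a contract's inputs. It assumes the contract arrives in the HTTP encoding with camelCase keys (defaultValueJson, required), and it reads the proto comment literally: a field is required when required is true or when it has no default. The function name resolve_inputs and the language field in the example are hypothetical; the real validation happens server-side.

```python
import json
from typing import Any, Dict, List, Tuple

def resolve_inputs(
    contract_inputs: List[dict], options: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str], List[str]]:
    """Return (resolved values, missing required names, option keys not in the contract)."""
    known = {f["name"] for f in contract_inputs}
    unknown = sorted(k for k in options if k not in known)
    resolved: Dict[str, Any] = {}
    missing: List[str] = []
    for f in contract_inputs:
        name = f["name"]
        default_json = f.get("defaultValueJson", "")
        # Literal reading of the proto comment: required if `true` OR no default.
        is_required = bool(f.get("required")) or not default_json
        if name in options:
            resolved[name] = options[name]            # caller override, kept as supplied
        elif is_required:
            missing.append(name)
        else:
            resolved[name] = json.loads(default_json)  # defaults are JSON literals
    return resolved, missing, unknown

# Hypothetical contract inputs; only chunk_size / chunk_overlap appear on this page.
inputs = [
    {"name": "chunk_size", "defaultValueJson": "1000"},
    {"name": "chunk_overlap", "defaultValueJson": "150"},
    {"name": "language"},  # hypothetical field with no default, so it is required
]
resolved, missing, unknown = resolve_inputs(inputs, {"chunk_size": "500", "typo_key": "x"})
print(resolved)  # {'chunk_size': '500', 'chunk_overlap': 150}
print(missing)   # ['language']
print(unknown)   # ['typo_key']
```

Note the type mismatch the example surfaces: pipeline options are strings on the wire, while decoded defaults are native JSON values.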

Reading a contract from the CLI:

dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'

Rule

The trigger: which objects in which source should fire which pipeline. Pure binding — rules don’t own destinations; they reference a pipeline that does.

{
  "ruleId": "rule_a1b2…",
  "bucket": "kb-prod",
  "sourceId": "src_a1b2…",
  "name": "pdf-contracts",
  "description": "All PDFs under contracts/",
  "includePatterns": ["contracts/**/*.pdf"],
  "excludePatterns": ["contracts/**/draft-*"],
  "includeMimeTypes": ["application/pdf"],
  "excludeMimeTypes": [],
  "minSizeBytes": "0",
  "maxSizeBytes": "0",
  "enabled": true,
  "priority": 100,
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000",
  "pipelineId": "pipe_a1b2…",
  "binding": {
    "pipelineName": "embed-contracts",
    "pipelineKind": "vector",
    "destinationId": "ent_a1b2…",
    "destinationKind": "vector",
    "destinationName": "docs-index"
  }
}

Key facts:

  • Matching is conjunctive across filter axes: an object must satisfy include_patterns AND not match exclude_patterns AND satisfy MIME include/exclude AND be within [min_size_bytes, max_size_bytes]. Each axis is OR’d within itself (any include pattern matching is enough). Excludes are checked before includes — if an exclude matches, K3 rejects the object without considering includes.
  • priority is int32, higher values evaluated first. Use it to express “specific rules before general ones.”
  • Multiple rules can match the same object — each match spawns its own IngestJob. Use enabled = false to pause a rule without deleting it.
  • binding is display-only — server-resolved from pipeline_id. Don’t make routing decisions from it; if you delete a pipeline, dependent rules still exist with pipeline_id pointing at nothing and binding empty.
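Two of the facts above lend themselves to a small client-side audit over a list of rules: ordering by priority, and spotting rules whose pipeline has been deleted. The sketch below assumes rules arrive in the HTTP encoding shown earlier and that an empty binding shows up as either a missing key or an empty object; both helper names are illustrative.

```python
from typing import List

def evaluation_order(rules: List[dict]) -> List[dict]:
    """Enabled rules only, highest priority first (ties keep their input order)."""
    enabled = [r for r in rules if r.get("enabled", False)]
    return sorted(enabled, key=lambda r: r.get("priority", 0), reverse=True)

def dangling_rule_ids(rules: List[dict]) -> List[str]:
    """IDs of rules whose display-only binding is missing or empty."""
    return [r["ruleId"] for r in rules if not r.get("binding")]

rules = [
    {"ruleId": "rule_general", "enabled": True, "priority": 10,
     "binding": {"pipelineName": "embed-all"}},
    {"ruleId": "rule_specific", "enabled": True, "priority": 100,
     "binding": {"pipelineName": "embed-contracts"}},
    {"ruleId": "rule_paused", "enabled": False, "priority": 500, "binding": {}},
]
print([r["ruleId"] for r in evaluation_order(rules)])  # ['rule_specific', 'rule_general']
print(dangling_rule_ids(rules))                        # ['rule_paused']
```

This only reports on binding; routing decisions should still go through pipeline_id.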

Glob pattern syntax

include_patterns and exclude_patterns use K3’s in-house glob matcher with these tokens:

Token   | Meaning                                       | Example
*       | any chars except / (single path segment)      | *.pdf matches report.pdf, NOT dir/report.pdf
**      | any chars including / (zero or more segments) | **/*.pdf matches both report.pdf AND a/b/report.pdf
?       | exactly one character                         | file?.txt matches file1.txt, NOT file12.txt
literal | match exactly                                 | readme.md matches the key readme.md

Behavior:

  • Case-insensitive (ASCII). **/*.pdf matches Report.PDF, **/*.jpg matches IMG_1234.JPG, and readme.md matches README.MD. The matcher lowercases both pattern and key before comparing, comparable to the optional case-insensitive modes of other glob implementations (globset, fast-glob, minimatch). Bucket keys + extensions are always ASCII so K3 uses cheap ASCII case-folding, not full Unicode.
  • ** matches zero or more path segments: **/test.txt matches both test.txt (at the root) AND a/b/c/test.txt (nested). Trailing / after ** is optional.

Not supported (yet — work around with multiple patterns):

  • POSIX character classes: [a-z], [!a] don’t expand
  • Brace expansion: {pdf,docx} is treated literally
  • Escaping — there’s no way to match a literal *, ?, or ** in the object key itself

For either-or expansion, use multiple include patterns:

dodil k3 ingest add multi-ext -b kb-prod \
  --source "$SOURCE_ID" --collection "$PIPELINE_ID" \
  --include "**/*.pdf" --include "**/*.docx" --include "**/*.txt"
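For testing patterns locally before creating a rule, the glob tokens and behavior described in this section can be approximated by translating a pattern to a regular expression. This is a sketch of the documented semantics, not K3's actual matcher: it assumes ? does not match /, and it treats brackets and braces as literals, in line with the unsupported list above.

```python
import re
import string

# ASCII-only case folding, mirroring the "cheap ASCII case-folding" described above.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a K3-style glob into a compiled regex (approximation)."""
    pattern = pattern.translate(_ASCII_LOWER)
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more whole path segments
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # any chars, including /
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # any chars within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")       # assumption: exactly one non-slash char
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # everything else is literal
            i += 1
    return re.compile("".join(parts))

def glob_match(pattern: str, key: str) -> bool:
    """True when the whole key matches the pattern, ignoring ASCII case."""
    return glob_to_regex(pattern).fullmatch(key.translate(_ASCII_LOWER)) is not None

print(glob_match("*.pdf", "dir/report.pdf"))        # False
print(glob_match("**/*.pdf", "a/b/Report.PDF"))     # True
print(glob_match("**/*.{pdf,docx}", "a/b/x.pdf"))   # False: braces are literal
```

Running candidate keys through glob_match is a cheap way to sanity-check include and exclude patterns, with the caveat that edge cases may differ from the server.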

MIME pattern syntax

include_mime_types and exclude_mime_types use a simpler matcher than globs:

Pattern shape             | Matches
Exact MIME (text/plain)   | only that exact MIME
Subtype wildcard (text/*) | any MIME with that type: text/plain, text/html, text/csv, …

Not supported: type wildcards (*/json doesn’t work), brace expansion (text/{plain,html}), or any other shape. List multiple exact / subtype-wildcard entries instead.
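The two supported MIME shapes are simple enough to mirror in a few lines. The sketch below compares strings as given; whether K3 normalizes case or strips parameters such as charset is not stated here, so treat both as open assumptions.

```python
def mime_matches(pattern: str, mime: str) -> bool:
    """Exact MIME or subtype wildcard (type/*); every other shape only matches itself."""
    if pattern.endswith("/*"):
        # "text/*" matches anything starting with "text/".
        return mime.startswith(pattern[:-1])
    return pattern == mime

print(mime_matches("text/*", "text/csv"))             # True
print(mime_matches("text/plain", "text/html"))        # False
print(mime_matches("*/json", "application/json"))     # False: type wildcards unsupported
```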

Evaluation order (worth knowing for debugging)

K3 evaluates a rule against an object in this exact order, short-circuiting on the first reject:

  1. exclude_patterns — if any match, reject (early-exit; cheap)
  2. include_patterns — if non-empty and none match, reject
  3. exclude_mime_types — if any match, reject
  4. include_mime_types — if non-empty and none match, reject
  5. max_size_bytes (when > 0) — if object exceeds, reject
  6. min_size_bytes (when > 0) — if object is smaller, reject

If all axes pass, the object matches.
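When debugging why an object did not match, it can help to know which axis rejected it. The sketch below walks the six steps in the order listed and returns the name of the first rejecting axis. The Rule dataclass and first_reject are illustrative names, and the pattern matchers are passed in as callables so the ordering logic stays independent of the glob and MIME details; the example uses exact-equality stand-ins rather than real glob matching.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Matcher = Callable[[str, str], bool]  # (pattern, value) -> bool

@dataclass
class Rule:
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)
    include_mime_types: List[str] = field(default_factory=list)
    exclude_mime_types: List[str] = field(default_factory=list)
    min_size_bytes: int = 0  # 0 = no lower bound
    max_size_bytes: int = 0  # 0 = no upper bound

def first_reject(rule: Rule, key: str, mime: str, size: int,
                 glob_match: Matcher, mime_match: Matcher) -> Optional[str]:
    """Return the first axis that rejects the object, or None when the rule matches."""
    if any(glob_match(p, key) for p in rule.exclude_patterns):
        return "exclude_patterns"
    if rule.include_patterns and not any(glob_match(p, key) for p in rule.include_patterns):
        return "include_patterns"
    if any(mime_match(p, mime) for p in rule.exclude_mime_types):
        return "exclude_mime_types"
    if rule.include_mime_types and not any(mime_match(p, mime) for p in rule.include_mime_types):
        return "include_mime_types"
    if rule.max_size_bytes > 0 and size > rule.max_size_bytes:
        return "max_size_bytes"
    if rule.min_size_bytes > 0 and size < rule.min_size_bytes:
        return "min_size_bytes"
    return None

def exact(pattern: str, value: str) -> bool:
    # Stand-in matcher for the example only: plain equality, not glob semantics.
    return pattern == value

rule = Rule(include_patterns=["a.pdf", "b.pdf"], exclude_patterns=["b.pdf"],
            include_mime_types=["application/pdf"], max_size_bytes=1000)
print(first_reject(rule, "a.pdf", "application/pdf", 10, exact, exact))    # None
print(first_reject(rule, "b.pdf", "application/pdf", 10, exact, exact))    # exclude_patterns
print(first_reject(rule, "a.pdf", "application/pdf", 5000, exact, exact))  # max_size_bytes
```

Returning the axis name rather than a boolean makes the short-circuit order visible: b.pdf is listed in both include and exclude patterns, and the exclude wins.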

IngestJob

One execution of one pipeline against one object. The unit of observability for everything that happens after a rule matches.

{
  "jobId": "job_a1b2…",
  "object": {
    "bucket": "kb-prod",
    "key": "contracts/acme-2026.pdf"
  },
  "status": "INGEST_STATUS_COMPLETED",
  "pipelineId": "pipe_a1b2…",
  "pipelineName": "embed-contracts",
  "pipelineKind": "vector",
  "ruleId": "rule_a1b2…",
  "chunksCreated": 47,
  "embeddingsCreated": 47,
  "rowsWritten": 0,
  "createdAt": "1716843600000",
  "updatedAt": "1716843680000",
  "threadId": "thread_a1b2…",
  "vectorStatus": "success",
  "outputSize": 18429,
  "batchesReceived": 5,
  "embeddingsWritten": 47
}

Key facts:

  • status transitions are not always linear — transient failures retry automatically through RETRYING → PROCESSING → COMPLETED, or land in FAILED after the final attempt.
  • rule_id is absent for jobs spawned via TriggerIngest (single-object manual trigger) without a rule_id override.
  • pipeline_kind controls which counters populate: vector → chunks_created / embeddings_created / embeddings_written; warehouse → rows_written; free → none.
  • thread_id is the Scriptum thread ID — useful for cross-referencing with Scriptum logs / observability.
  • vector_status is a fine-grained sub-status for vector pipelines specifically: did the embedding-persist phase succeed (success), fail (failed), or skip because the script produced no embeddings (skipped)?
  • batches_received vs embeddings_written lets you spot partial writes: Scriptum yielded N batches but only M embeddings landed in the collection.
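As an illustration of reading these counters, the sketch below picks the counters relevant to a job's pipeline kind and flags a possible partial write. The partial-write test (fewer embeddings written than created) is a heuristic assumed for this example, not a rule K3 defines; both function names are illustrative.

```python
from typing import Dict

# Counters that populate per pipeline kind, per the list above (HTTP camelCase names).
COUNTERS_BY_KIND = {
    "vector": ("chunksCreated", "embeddingsCreated", "embeddingsWritten"),
    "warehouse": ("rowsWritten",),
}

def job_counters(job: dict) -> Dict[str, int]:
    """Counters relevant to the job's pipeline kind; empty for free pipelines."""
    names = COUNTERS_BY_KIND.get(job.get("pipelineKind", ""), ())
    return {name: int(job.get(name, 0)) for name in names}

def looks_partial(job: dict) -> bool:
    """Heuristic (assumption): a vector job that wrote fewer embeddings than it created."""
    if job.get("pipelineKind") != "vector":
        return False
    counters = job_counters(job)
    return counters["embeddingsWritten"] < counters["embeddingsCreated"]

job = {"pipelineKind": "vector", "chunksCreated": 47, "embeddingsCreated": 47,
       "embeddingsWritten": 47, "rowsWritten": 0, "vectorStatus": "success"}
print(job_counters(job))   # {'chunksCreated': 47, 'embeddingsCreated': 47, 'embeddingsWritten': 47}
print(looks_partial(job))  # False
```

A job flagged this way is a candidate for closer inspection alongside vector_status and the Scriptum thread logs.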

When ingest runs

Three triggers can cause a pipeline to run against an object — all produce the same IngestJob shape, only rule_id differs:

Trigger          | When                                                                      | rule_id on job
Direct upload    | Successful S3 PUT (or CompleteMultipartUpload) on the bucket’s byte plane | Set: every rule that matches the object fires
Source sync      | TriggerDiscovery (manual) or the source’s sync_interval_seconds tick      | Set: same matching logic
One-shot trigger | TriggerIngest on a single ObjectRef                                       | Optional: pass rule_id / pipeline_id explicitly or let server resolve

Reliability:

  • Transient failures retry automatically up to a bounded attempt count (the INGEST_STATUS_RETRYING state).
  • After max attempts the job becomes INGEST_STATUS_FAILED with the last error in error / error_details.
  • Replay any failed or partial run by re-triggering with TriggerIngestion on the source (with retry_failed = true) or TriggerIngest on a single object.

See also

  • Quickstart — wire all six entities together in 5 minutes
  • API Reference — gRPC + HTTP for every Source / Pipeline / Ingest RPC
  • CLI Guide: dodil k3 source · pipeline · template · ingest · credential
  • Vector — destination kind for vector pipelines
  • Tables — destination kind for warehouse pipelines
  • Conventions — auth headers, error envelope, wire encoding rules