Core Concepts — Pipelines

Every type signature below is verbatim from proto/proto-k3/ (k3_source.proto, k3_pipeline.proto, k3_ingest.proto). Pillar packages are dodil.k3.source.v1, dodil.k3.pipeline.v1, dodil.k3.ingest.v1. Wire encodings: gRPC follows the proto types directly; HTTP uses pbjson (camelCase, int64 as strings, enums as wire-name strings).

The six types: Source, Credential, Pipeline, Template, Rule, IngestJob. Their lifecycle and how they compose:


   bucket ──┬── Source (where to read from)
            │     │
            │     └── Credential (how to auth)
            │
            ├── Pipeline (Scriptum template + destination)
            │     │
            │     └── Template (catalog entry)
            │
            └── Rule (filter + pipeline binding)
                  │
                  └── IngestJob (one execution per matched object)

`Source`

Where K3 reads objects from. Every bucket has an auto-created internal S3 source the moment you call CreateBucket — you don’t register it, name it, or authenticate it. Every direct upload to the bucket flows through your rules.

Preview — external sources (SOURCE_PROVIDER_GOOGLE_DRIVE, _SHAREPOINT, _CONFLUENCE, _GITHUB) are available in the API but are still solidifying for production use. The internal S3 path is the production-ready channel today.

HTTP


{
  "sourceId": "src_a1b2…",
  "bucket": "kb-prod",
  "provider": "SOURCE_PROVIDER_GOOGLE_DRIVE",
  "name": "finance-drive",
  "description": "Q2 finance docs",
  "rootPath": "/Q2",
  "providerAccount": "drive-id-123",
  "syncIntervalSeconds": "3600",
  "enabled": true,
  "status": "SOURCE_STATUS_ACTIVE",
  "totalObjects": "1247",
  "indexedObjects": "1240",
  "totalBytes": "8923471024",
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000",
  "lastSyncAt": "1716843600000",
  "lastSuccessAt": "1716843600000"
}

Key facts:

provider = SOURCE_PROVIDER_INTERNAL_S3 is the auto-created per-bucket source. It fires on every direct S3 PUT to the bucket.
sync_interval_seconds = 0 means manual sync only — useful when you want explicit control via TriggerDiscovery.
provider_account is provider-shaped: for Google Drive it’s a drive ID; for GitHub a repo URL; for InternalS3 the bucket name itself.
total_objects / indexed_objects / total_bytes are aggregates the source maintains as it discovers and ingests.
status reflects the source’s most recent sync attempt — ACTIVE is the steady-state success.
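
Because HTTP responses use pbjson, the int64 fields in the payload above arrive as JSON strings. The following is a minimal Python sketch of decoding them on the client. The helper name `decode_source` is hypothetical, and reading the timestamps as Unix epoch milliseconds is an assumption inferred from the sample values rather than something the proto excerpt states.

```python
import json
from datetime import datetime, timezone

# int64 fields that pbjson encodes as strings in the Source payload above.
INT64_FIELDS = ("syncIntervalSeconds", "totalObjects", "indexedObjects", "totalBytes")
# Assumption: timestamps are Unix epoch milliseconds (inferred from sample values).
TIMESTAMP_FIELDS = ("createdAt", "updatedAt", "lastSyncAt", "lastSuccessAt")


def decode_source(raw: str) -> dict:
    """Parse a Source JSON body and convert string-encoded int64 fields."""
    source = json.loads(raw)
    for name in INT64_FIELDS:
        if name in source:
            source[name] = int(source[name])
    for name in TIMESTAMP_FIELDS:
        if name in source:
            millis = int(source[name])
            source[name] = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return source


sample = (
    '{"totalObjects": "1247", "indexedObjects": "1240", '
    '"syncIntervalSeconds": "0", "createdAt": "1716840000000"}'
)
src = decode_source(sample)
manual_only = src["syncIntervalSeconds"] == 0  # 0 means manual sync only
backlog = src["totalObjects"] - src["indexedObjects"]  # discovered but not yet indexed
```

Converting once at the edge keeps later arithmetic (such as the backlog above) free of string handling.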

`Credential`

How K3 authenticates to an external source. Credentials are org-scoped and may optionally link to a source_id. The secret payload is a oneof keyed on credential_type.

Preview — credentials are used by external sources, which are Preview. The internal S3 source needs no credential. The shape below reflects the API as it stands today; expect refinements as external source support hardens.

HTTP


{
  "info": {
    "credentialId": "cred_a1b2…",
    "sourceId": "src_a1b2…",
    "provider": "SOURCE_PROVIDER_GOOGLE_DRIVE",
    "credentialType": "CREDENTIAL_TYPE_OAUTH2",
    "displayName": "gdrive-main",
    "isPrimary": true,
    "isValid": true,
    "createdAt": "1716840000000",
    "updatedAt": "1716843600000",
    "expiresAt": "1716847200000"
  },
  "oauth2": {
    "accessToken": "ya29.…",
    "refreshToken": "1//…",
    "expiresAt": "1716847200000",
    "tokenType": "Bearer",
    "scopes": ["https://www.googleapis.com/auth/drive.readonly"]
  }
}

The HTTP request body for StoreCredential follows the same oneof shape — exactly one of accessKey / oauth2 / apiKey / serviceAccount / pat is set.
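
The oneof rule can be checked on the client before sending a StoreCredential request. This is only a sketch of such a pre-check: the function name `secret_kind` is hypothetical, and the server remains the authority on what it accepts.

```python
# Secret payload keys named in the text above; exactly one may be present.
SECRET_KEYS = ("accessKey", "oauth2", "apiKey", "serviceAccount", "pat")


def secret_kind(body: dict) -> str:
    """Return the single secret key set on a credential body, or raise."""
    present = [key for key in SECRET_KEYS if body.get(key) is not None]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {SECRET_KEYS}, got {present}")
    return present[0]
```

For the sample payload above, `secret_kind` would report `oauth2`; a body carrying both `oauth2` and `pat`, or neither, raises before any network call is made.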

Key facts:

Secrets are stored in HashiCorp Vault behind K3 — never returned in ListCredentials (only CredentialInfo). GetCredential returns the secret payload.
GetCredential with auto_refresh = true will refresh expired OAuth tokens before returning, so you always see a usable token.
is_primary is a per-(source_id, provider) flag — at most one primary per pairing. Used by sync to pick which credential to authenticate with.

`Pipeline`

The recipe: a Scriptum template name + per-pipeline options + an optional destination. First-class row in the K3 DB. Rules and any other triggers reference pipelines by pipeline_id.

HTTP


{
  "pipelineId": "pipe_a1b2…",
  "bucket": "kb-prod",
  "name": "embed-contracts",
  "scriptumTemplate": "text_embedding_index",
  "options": {
    "chunk_size": "1000",
    "chunk_overlap": "150"
  },
  "storeEntityId": "ent_a1b2…",
  "storeEntityKind": "vector",
  "storeEntityName": "docs-index",
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000"
}

Key facts:

A pipeline’s kind is derived server-side from store_entity_id — vector, warehouse, or empty/free. Clients filter by store_entity_kind on ListPipelines but route via pipeline_id.
Two script_source modes — reference (scriptum_template: existing Scriptum script name) or spawn (spawn_from_template: template ID, server creates a script from it and persists the resulting name on the pipeline).
options is an overlay on top of the Scriptum script’s defaults. Validated server-side against the template’s ScriptContract.
Re-bind to a different destination by calling UpdatePipeline with a new store_entity_id. The field is an optional string: passing it cleared makes the pipeline free.
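
The overlay behavior of options can be pictured as a key-by-key override. The sketch below assumes exactly that (pipeline values win, untouched defaults pass through); the function name and the default values are hypothetical, and the real validation against ScriptContract happens server-side.

```python
def effective_options(script_defaults: dict[str, str],
                      pipeline_options: dict[str, str]) -> dict[str, str]:
    """Overlay pipeline options on script defaults; pipeline values win."""
    return {**script_defaults, **pipeline_options}


# Hypothetical defaults for illustration; real defaults come from the script.
defaults = {"chunk_size": "500", "chunk_overlap": "50"}
merged = effective_options(defaults, {"chunk_size": "1000"})
```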

`Template`

A row in the Scriptum template catalog. Org-scoped, not bucket-scoped. K3 ships a production catalog of 24 templates across four categories — analysis (8), embedding index/search (10), vision (4), ecommerce (2). See API Reference → Templates → The catalog for the full list with descriptions and modalities.


message Template {
  string id = 1;                              // Scriptum template ID
  string name = 2;
  string description = 3;
  repeated string tags = 4;
  repeated string tools_required = 5;
  map<string, string> labels = 6;             // structured key-value labels
  string category = 7;                        // e.g. "embedding" | "extraction"
  // Typed I/O contract — proxied verbatim from Scriptum's GetTemplate.
  // UI form generators walk this; K3 reads it server-side to validate
  // caller-supplied template inputs.
  dodil.k3.common.v1.ScriptContract contract = 8;
}

Key facts:

category and labels are the filter axes for ListTemplates. Common categories: embedding, extraction, structuring.
contract is a typed schema — ScriptContract, detailed in the next section. It tells you what fields a pipeline can override via options and what shape the outputs take.
For full Scriptum semantics see the Scriptum docs (separate product) — K3 surfaces only the catalog.

`ScriptContract`

The typed I/O contract carried on Template.contract (every pillar’s Template surfaces it). It is K3’s anti-corruption mirror of Scriptum’s contract, re-homed into dodil.k3.common.v1 so pillar protos can expose a template’s schema without importing scriptum.proto. It has two consumers: form generators (UI / CLI) walk inputs to render per-field controls, and the server validates caller-supplied values against the contract’s inputs at creation time (pipeline options, AddVectorPipeline.template_inputs, table-pipeline inputs).


message ScriptContract {
  repeated TypedField inputs  = 1;              // what callers may set (validated server-side)
  repeated TypedField outputs = 2;              // shape of what the script emits
  repeated TypedField stream  = 3;              // streamed / intermediate fields
  repeated TypedField env     = 4;              // environment inputs (keys, model handles)
  repeated EnumDef    enums   = 5;              // named enums referenced by TypeRef.enum_name
  repeated string accepted_content_types = 6;   // input MIME globs (`image/*`); empty = no restriction
  repeated string accepted_extensions    = 7;   // lowercase, no leading dot; empty = unrestricted
}
 
// One field in inputs / outputs / stream / env.
message TypedField {
  string  name               = 1;
  TypeRef type               = 2;
  string  default_value_json = 3;   // JSON literal: `"hi"`, `42`, `true`, `[1,2]`; empty = no default
  bool    required           = 4;   // required if `true` OR no default_value_json
  string  description        = 5;
}
 
// A field's type — oneof over four kinds, plus constraints keyed to the kind.
message TypeRef {
  oneof kind {
    TypeRefPrimitive primitive = 1;   // text|number|integer|boolean|object|binary|date|timestamp|any
    string           enum_name = 2;   // names an entry in ScriptContract.enums
    ListType         list      = 3;   // list<T> — recursive
    ObjectShape      object    = 4;   // anonymous record of named TypedFields — recursive
  }
  NumericConstraints number_constraints = 10;   // numeric kinds
  StringConstraints  string_constraints = 11;   // text kind
  ListConstraints    list_constraints   = 12;   // list kind
}
message ListType    { TypeRef item = 1; }
message ObjectShape { repeated TypedField fields = 1; }
 
// Constraints — only the message matching the field's kind is set; bounds optional (absent = unconstrained).
message NumericConstraints { optional double minimum = 1; optional double maximum = 2; bool exclusive_min = 3; bool exclusive_max = 4; optional double multiple_of = 5; }
message StringConstraints  { optional uint32 min_length = 1; optional uint32 max_length = 2; optional string pattern = 3; }  // pattern = ECMA-262 regex
message ListConstraints    { optional uint32 min_items = 1; optional uint32 max_items = 2; bool unique_items = 3; }
 
// Enums — a field with kind=enum_name resolves to one of these; `base` sets wire serialization.
enum EnumBase   { ENUM_BASE_UNSPECIFIED = 0; ENUM_BASE_TEXT = 1; ENUM_BASE_INTEGER = 2; ENUM_BASE_NUMBER = 3; }  // string | int64 | double
message EnumDef     { string name = 1; EnumBase base = 2; repeated EnumVariant variants = 3; }
message EnumVariant { oneof value { string value_text = 1; int64 value_integer = 2; double value_number = 3; } string description = 4; }  // value matches EnumDef.base

What the proto doesn’t say outright:

list and object nest — a TypeRef recurses to describe list<object{…}> and deeper.
Defaults are JSON-encoded strings, not native values — decode default_value_json as a JSON literal before use.
accepted_* is the modality gate — align an ingest rule’s MIME / extension filters with these, or the rule won’t match what the template can actually ingest.
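
A form generator or CLI might resolve caller input against a contract roughly as follows. This sketch assumes the contract has been fetched over HTTP, so field names appear in pbjson camelCase (`defaultValueJson`); the function names are hypothetical, and type and constraint checking is deliberately left out.

```python
import json


def is_required(field: dict) -> bool:
    """A field is required if flagged, or if it carries no default."""
    return bool(field.get("required")) or not field.get("defaultValueJson")


def resolve_inputs(inputs: list[dict], supplied: dict) -> dict:
    """Use supplied values, fall back to decoded defaults, raise on gaps."""
    resolved = {}
    for field in inputs:
        name = field["name"]
        if name in supplied:
            resolved[name] = supplied[name]
        elif is_required(field):
            raise ValueError(f"missing required input: {name}")
        else:
            # default_value_json is a JSON literal, so decode it before use.
            resolved[name] = json.loads(field["defaultValueJson"])
    return resolved


# Hypothetical contract inputs for illustration.
contract_inputs = [
    {"name": "chunk_size", "defaultValueJson": "1000"},
    {"name": "model", "required": True},
]
values = resolve_inputs(contract_inputs, {"model": "example-model"})
```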

Reading a contract from the CLI:


dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'

`Rule`

The trigger: which objects in which source should fire which pipeline. Pure binding — rules don’t own destinations; they reference a pipeline that does.

HTTP


{
  "ruleId": "rule_a1b2…",
  "bucket": "kb-prod",
  "sourceId": "src_a1b2…",
  "name": "pdf-contracts",
  "description": "All PDFs under contracts/",
  "includePatterns": ["contracts/**/*.pdf"],
  "excludePatterns": ["contracts/**/draft-*"],
  "includeMimeTypes": ["application/pdf"],
  "excludeMimeTypes": [],
  "minSizeBytes": "0",
  "maxSizeBytes": "0",
  "enabled": true,
  "priority": 100,
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000",
  "pipelineId": "pipe_a1b2…",
  "binding": {
    "pipelineName": "embed-contracts",
    "pipelineKind": "vector",
    "destinationId": "ent_a1b2…",
    "destinationKind": "vector",
    "destinationName": "docs-index"
  }
}

Key facts:

Matching is conjunctive across filter axes: an object must satisfy include_patterns AND not match exclude_patterns AND satisfy MIME include/exclude AND be within [min_size_bytes, max_size_bytes], where a bound of 0 is not enforced. Each axis is OR’d within itself (any include pattern matching is enough). Excludes are checked before includes: if an exclude matches, K3 rejects the object without considering includes.
priority is int32, higher values evaluated first. Use it to express “specific rules before general ones.”
Multiple rules can match the same object — each match spawns its own IngestJob. Use enabled = false to pause a rule without deleting it.
binding is display-only — server-resolved from pipeline_id. Don’t make routing decisions from it; if you delete a pipeline, dependent rules still exist with pipeline_id pointing at nothing and binding empty.
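
The priority and fan-out facts above can be sketched as a small selection step. The `matches` predicate is a hypothetical stand-in for the rule-matching logic, and the rule dicts use only the fields named in the text; each rule returned would correspond to one IngestJob.

```python
from typing import Callable


def matching_rules(rules: list[dict], obj: dict,
                   matches: Callable[[dict, dict], bool]) -> list[dict]:
    """Return enabled rules that match the object, highest priority first."""
    enabled = [rule for rule in rules if rule.get("enabled")]
    ordered = sorted(enabled, key=lambda rule: rule.get("priority", 0), reverse=True)
    return [rule for rule in ordered if matches(rule, obj)]


rules = [
    {"name": "general", "priority": 10, "enabled": True},
    {"name": "paused", "priority": 500, "enabled": False},
    {"name": "specific", "priority": 100, "enabled": True},
]
# Placeholder predicate: treat every rule as matching.
fired = [r["name"] for r in matching_rules(rules, {"key": "a.pdf"}, lambda r, o: True)]
```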

Glob pattern syntax

include_patterns and exclude_patterns use K3’s in-house glob matcher with these tokens:

Token	Meaning	Example
`*`	any chars except `/` (single path segment)	`*.pdf` matches `report.pdf`, NOT `dir/report.pdf`
`**`	any chars including `/` (zero or more segments)	`**/*.pdf` matches both `report.pdf` AND `a/b/report.pdf`
`?`	exactly one character	`file?.txt` matches `file1.txt`, NOT `file12.txt`
literal	match exactly	`readme.md` matches the key `readme.md`

Behavior:

Case-insensitive (ASCII). **/*.pdf matches Report.PDF, **/*.jpg matches IMG_1234.JPG, and readme.md matches README.MD. The matcher lowercases both pattern and key before comparing, which is comparable to the case-insensitive modes offered by other production glob implementations (globset, fast-glob, minimatch). Bucket keys and extensions are always ASCII, so K3 uses cheap ASCII case-folding, not full Unicode.
** matches zero or more path segments — **/test.txt matches both test.txt (at the root) AND a/b/c/test.txt (nested). Trailing / after ** is optional.

Not supported (yet — work around with multiple patterns):

POSIX character classes — [a-z], [!a] don’t expand
Brace expansion — {pdf,docx} is treated literally
Escaping — there’s no way to match a literal *, ?, or ** in the object key itself

For either-or expansion, use multiple include patterns:


dodil k3 ingest add multi-ext -b kb-prod \
  --source "$SOURCE_ID" --collection "$PIPELINE_ID" \
  --include "**/*.pdf" --include "**/*.docx" --include "**/*.txt"
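
For local testing of patterns, the token table and behavior notes above can be approximated by translating a glob to a regular expression. This is a sketch, not K3's in-house matcher: the function names are hypothetical, and whether `?` may match `/` is not stated above, so the sketch assumes it may not.

```python
import re


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z, leaving non-ASCII characters untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob using the tokens above into a compiled regex."""
    pattern = ascii_lower(pattern)
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole path segments
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")        # any characters, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # stays within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")      # assumption: '?' does not match '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))  # brackets and braces stay literal
            i += 1
    return re.compile("".join(out))


def glob_match(pattern: str, key: str) -> bool:
    """Return True if the whole key matches the pattern, ignoring ASCII case."""
    return glob_to_regex(pattern).fullmatch(ascii_lower(key)) is not None
```

Because unsupported syntax is escaped rather than expanded, a pattern such as `*.{pdf,docx}` only matches keys that literally contain the braces, mirroring the limitation described above.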

MIME pattern syntax

include_mime_types and exclude_mime_types use a simpler matcher than globs:

Pattern shape	Matches
Exact MIME (`text/plain`)	only that exact MIME
Subtype wildcard (`text/*`)	any MIME with that type — `text/plain`, `text/html`, `text/csv`, …

Not supported: type wildcards (*/json doesn’t work), brace expansion (text/{plain,html}), or any other shape. List multiple exact / subtype-wildcard entries instead.
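
The two supported shapes are simple enough to sketch directly. The function name `mime_match` is hypothetical, and the comparison here is case-sensitive, which is an assumption since the text above does not say how case is handled for MIME types.

```python
def mime_match(pattern: str, mime: str) -> bool:
    """Match an exact MIME or a `type/*` subtype wildcard; nothing else."""
    p_type, p_sep, p_sub = pattern.partition("/")
    m_type, m_sep, _m_sub = mime.partition("/")
    if not p_sep or not m_sep:
        return False  # both sides must look like type/subtype
    if p_type == "*":
        return False  # type wildcards such as */json are not supported
    if p_sub == "*":
        return p_type == m_type
    return pattern == mime
```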

Evaluation order (worth knowing for debugging)

K3 evaluates a rule against an object in this exact order, short-circuiting on the first reject:

exclude_patterns — if any match, reject (early-exit; cheap)
include_patterns — if non-empty and none match, reject
exclude_mime_types — if any match, reject
include_mime_types — if non-empty and none match, reject
max_size_bytes (when > 0) — if object exceeds, reject
min_size_bytes (when > 0) — if object is smaller, reject

If all axes pass, the object matches.
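
When debugging why a rule did not fire, it can help to reproduce the order above and report the first axis that rejects. The sketch below does that; `first_reject` is a hypothetical name, the glob and MIME matchers are passed in as parameters, and the rule dict uses the proto field names with sizes already converted to integers.

```python
from typing import Callable, Optional

Matcher = Callable[[str, str], bool]


def first_reject(rule: dict, key: str, mime: str, size: int,
                 glob_match: Matcher, mime_match: Matcher) -> Optional[str]:
    """Return the first axis that rejects the object, or None if it matches."""
    if any(glob_match(p, key) for p in rule.get("exclude_patterns", [])):
        return "exclude_patterns"
    includes = rule.get("include_patterns", [])
    if includes and not any(glob_match(p, key) for p in includes):
        return "include_patterns"
    if any(mime_match(p, mime) for p in rule.get("exclude_mime_types", [])):
        return "exclude_mime_types"
    mimes = rule.get("include_mime_types", [])
    if mimes and not any(mime_match(p, mime) for p in mimes):
        return "include_mime_types"
    max_size = int(rule.get("max_size_bytes", 0))
    if max_size > 0 and size > max_size:
        return "max_size_bytes"
    min_size = int(rule.get("min_size_bytes", 0))
    if min_size > 0 and size < min_size:
        return "min_size_bytes"
    return None


# Placeholder matcher for illustration only: exact string equality.
def exact(pattern: str, value: str) -> bool:
    return pattern == value


rule = {"include_patterns": ["a.pdf"], "exclude_mime_types": ["text/html"],
        "max_size_bytes": 100}
verdict = first_reject(rule, "b.pdf", "application/pdf", 50, exact, exact)
```

Returning the axis name rather than a boolean makes the short-circuit visible: an object that fails several axes is attributed to the earliest one in the list.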

`IngestJob`

One execution of one pipeline against one object. The unit of observability for everything that happens after a rule matches.

HTTP


{
  "jobId": "job_a1b2…",
  "object": { "bucket": "kb-prod", "key": "contracts/acme-2026.pdf" },
  "status": "INGEST_STATUS_COMPLETED",
  "pipelineId": "pipe_a1b2…",
  "pipelineName": "embed-contracts",
  "pipelineKind": "vector",
  "ruleId": "rule_a1b2…",
  "chunksCreated": 47,
  "embeddingsCreated": 47,
  "rowsWritten": 0,
  "createdAt": "1716843600000",
  "updatedAt": "1716843680000",
  "threadId": "thread_a1b2…",
  "vectorStatus": "success",
  "outputSize": 18429,
  "batchesReceived": 5,
  "embeddingsWritten": 47
}

Key facts:

status transitions are not always linear — transient failures retry automatically through RETRYING → PROCESSING → COMPLETED, or land in FAILED after the final attempt.
rule_id is absent for jobs spawned via TriggerIngest (single-object manual trigger) without a rule_id override.
pipeline_kind controls which counters populate: vector → chunks_created / embeddings_created / embeddings_written; warehouse → rows_written; free → none.
thread_id is the Scriptum thread ID — useful for cross-referencing with Scriptum logs / observability.
vector_status is a fine-grained sub-status for vector pipelines specifically: did the embedding-persist phase succeed (success), fail (failed), or skip because the script produced no embeddings (skipped)?
batches_received vs embeddings_written lets you spot partial writes: Scriptum yielded N batches but only M embeddings landed in the collection.
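
A monitoring script might use these fields as sketched below. The function names are hypothetical, and the partial-write check (fewer embeddings written than created, or a failed persist phase) is one possible heuristic chosen for illustration, not a rule K3 defines.

```python
# Counters the text above associates with each pipeline kind.
COUNTERS_BY_KIND = {
    "vector": ("chunksCreated", "embeddingsCreated", "embeddingsWritten"),
    "warehouse": ("rowsWritten",),
}


def job_summary(job: dict) -> dict:
    """Pick the counters relevant to the job's pipeline kind."""
    kind = job.get("pipelineKind", "")
    names = COUNTERS_BY_KIND.get(kind, ())
    counters = {name: int(job.get(name, 0)) for name in names}
    return {"kind": kind or "free", "counters": counters}


def looks_partial(job: dict) -> bool:
    """Heuristic (assumption): persist phase failed or wrote fewer than created."""
    if job.get("pipelineKind") != "vector":
        return False
    written = int(job.get("embeddingsWritten", 0))
    created = int(job.get("embeddingsCreated", 0))
    return job.get("vectorStatus") == "failed" or written < created


job = {"pipelineKind": "vector", "chunksCreated": 47, "embeddingsCreated": 47,
       "embeddingsWritten": 47, "rowsWritten": 0, "vectorStatus": "success"}
summary = job_summary(job)
```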

When ingest runs

Three triggers can cause a pipeline to run against an object — all produce the same IngestJob shape, only rule_id differs:

Trigger	When	`rule_id` on job
Direct upload	Successful S3 `PUT` (or `CompleteMultipartUpload`) on the bucket’s byte plane	Set — every rule that matches the object fires
Source sync	`TriggerDiscovery` (manual) or the source’s `sync_interval_seconds` tick	Set — same matching logic
One-shot trigger	`TriggerIngest` on a single `ObjectRef`	Optional — pass `rule_id` / `pipeline_id` explicitly or let server resolve

Reliability:

Transient failures retry automatically up to a bounded attempt count (the INGEST_STATUS_RETRYING state).
After max attempts the job becomes INGEST_STATUS_FAILED with the last error in error / error_details.
Replay any failed or partial run by re-triggering with TriggerIngestion on the source (with retry_failed = true) or TriggerIngest on a single object.
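
A client that polls jobs needs to know when to stop and when a replay is worth considering. The sketch below encodes only the states named on this page; the helper names are hypothetical, and other states may exist in the API that are not listed here.

```python
# States named above after which a job no longer changes on its own.
TERMINAL_STATUSES = {"INGEST_STATUS_COMPLETED", "INGEST_STATUS_FAILED"}


def is_terminal(status: str) -> bool:
    """True once retries are exhausted or the job has completed."""
    return status in TERMINAL_STATUSES


def needs_replay(job: dict) -> bool:
    """A failed job is a candidate for a manual re-trigger."""
    return job.get("status") == "INGEST_STATUS_FAILED"


jobs = [
    {"jobId": "job_1", "status": "INGEST_STATUS_COMPLETED"},
    {"jobId": "job_2", "status": "INGEST_STATUS_RETRYING"},
    {"jobId": "job_3", "status": "INGEST_STATUS_FAILED"},
]
to_replay = [job["jobId"] for job in jobs if needs_replay(job)]
still_running = [job["jobId"] for job in jobs if not is_terminal(job["status"])]
```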
