Core Concepts — Pipelines
Every type signature below is verbatim from proto/proto-k3/ (k3_source.proto, k3_pipeline.proto, k3_ingest.proto). Pillar packages are dodil.k3.source.v1, dodil.k3.pipeline.v1, dodil.k3.ingest.v1. Wire encodings: gRPC follows the proto types directly; HTTP uses pbjson (camelCase, int64 as strings, enums as wire-name strings).
The six types: Source, Credential, Pipeline, Template, Rule, IngestJob. Their lifecycle and how they compose:
bucket ──┬── Source (where to read from)
         │     │
         │     └── Credential (how to auth)
         │
         ├── Pipeline (Scriptum template + destination)
         │     │
         │     └── Template (catalog entry)
         │
         └── Rule (filter + pipeline binding)
               │
               └── IngestJob (one execution per matched object)
Source
Where K3 reads objects from. Every bucket has an auto-created internal S3 source the moment you call CreateBucket — you don’t register it, name it, or authenticate it. Every direct upload to the bucket flows through your rules.
Preview: external sources (`SOURCE_PROVIDER_GOOGLE_DRIVE`, `_SHAREPOINT`, `_CONFLUENCE`, `_GITHUB`) are available in the API but are still being stabilized for production use. The internal S3 path is the production-ready channel today.
HTTP
{
"sourceId": "src_a1b2…",
"bucket": "kb-prod",
"provider": "SOURCE_PROVIDER_GOOGLE_DRIVE",
"name": "finance-drive",
"description": "Q2 finance docs",
"rootPath": "/Q2",
"providerAccount": "drive-id-123",
"syncIntervalSeconds": "3600",
"enabled": true,
"status": "SOURCE_STATUS_ACTIVE",
"totalObjects": "1247",
"indexedObjects": "1240",
"totalBytes": "8923471024",
"createdAt": "1716840000000",
"updatedAt": "1716843600000",
"lastSyncAt": "1716843600000",
"lastSuccessAt": "1716843600000"
}
Key facts:
- `provider = SOURCE_PROVIDER_INTERNAL_S3` is the auto-created per-bucket source. It fires on every direct S3 PUT to the bucket.
- `sync_interval_seconds = 0` means manual sync only, which is useful when you want explicit control via `TriggerDiscovery`.
- `provider_account` is provider-shaped: for Google Drive it’s a drive ID; for GitHub, a repo URL; for InternalS3, the bucket name itself.
- `total_objects` / `indexed_objects` / `total_bytes` are aggregates the source maintains as it discovers and ingests.
- `status` reflects the source’s most recent sync attempt; `ACTIVE` is the steady-state success.
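Because the HTTP encoding carries int64 fields as strings, a client has to convert them before doing arithmetic on the aggregates. The following is a minimal Python sketch, not part of any K3 SDK. `decode_source` is a hypothetical helper name, and treating the timestamp fields as Unix epoch milliseconds is an assumption (the sample values are consistent with it, but this page does not state the unit).

```python
from datetime import datetime, timezone

# int64 fields arrive as JSON strings under the HTTP (pbjson) encoding.
INT64_FIELDS = ("syncIntervalSeconds", "totalObjects", "indexedObjects", "totalBytes")
# Assumption: these fields are Unix epoch milliseconds.
TIMESTAMP_FIELDS = ("createdAt", "updatedAt", "lastSyncAt", "lastSuccessAt")

def decode_source(payload):
    """Return a copy of a Source payload with numeric strings converted."""
    decoded = dict(payload)
    for name in INT64_FIELDS:
        if name in payload:
            decoded[name] = int(payload[name])
    for name in TIMESTAMP_FIELDS:
        if name in payload:
            millis = int(payload[name])
            decoded[name] = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return decoded

source = decode_source({
    "totalObjects": "1247",
    "indexedObjects": "1240",
    "syncIntervalSeconds": "3600",
    "createdAt": "1716840000000",
})
# Objects discovered but not yet indexed.
pending = source["totalObjects"] - source["indexedObjects"]
```

Fields that are absent from the payload are left absent, so the helper does not guess at defaults.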
Credential
How K3 authenticates to an external source. Credentials are org-scoped and may optionally link to a source_id. The secret payload is a oneof keyed on credential_type.
Preview — credentials are used by external sources, which are Preview. The internal S3 source needs no credential. The shape below reflects the API as it stands today; expect refinements as external source support hardens.
HTTP
{
"info": {
"credentialId": "cred_a1b2…",
"sourceId": "src_a1b2…",
"provider": "SOURCE_PROVIDER_GOOGLE_DRIVE",
"credentialType": "CREDENTIAL_TYPE_OAUTH2",
"displayName": "gdrive-main",
"isPrimary": true,
"isValid": true,
"createdAt": "1716840000000",
"updatedAt": "1716843600000",
"expiresAt": "1716847200000"
},
"oauth2": {
"accessToken": "ya29.…",
"refreshToken": "1//…",
"expiresAt": "1716847200000",
"tokenType": "Bearer",
"scopes": ["https://www.googleapis.com/auth/drive.readonly"]
}
}
The HTTP request body for `StoreCredential` follows the same oneof shape: exactly one of `accessKey` / `oauth2` / `apiKey` / `serviceAccount` / `pat` is set.
Key facts:
- Secrets are stored in HashiCorp Vault behind K3 and are never returned by `ListCredentials` (only `CredentialInfo`). `GetCredential` returns the secret payload.
- `GetCredential` with `auto_refresh = true` will refresh expired OAuth tokens before returning, so you always see a usable token.
- `is_primary` is a per-`(source_id, provider)` flag: at most one primary per pairing. Sync uses it to pick which credential to authenticate with.
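The "exactly one secret variant" rule is easy to check on the client before sending a `StoreCredential` body. A small Python sketch under that reading; `secret_variant` is a hypothetical helper, and the server remains the authority on what it accepts:

```python
# The oneof keys listed for the credential secret payload.
SECRET_KEYS = ("accessKey", "oauth2", "apiKey", "serviceAccount", "pat")

def secret_variant(body):
    """Return the single secret key present in a credential body, or raise."""
    present = [key for key in SECRET_KEYS if body.get(key) is not None]
    if len(present) != 1:
        raise ValueError(f"expected exactly one secret variant, found {present}")
    return present[0]

variant = secret_variant({
    "info": {"displayName": "gdrive-main"},
    "oauth2": {"tokenType": "Bearer"},
})
```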
Pipeline
The recipe: a Scriptum template name + per-pipeline options + an optional destination. First-class row in the K3 DB. Rules and any other triggers reference pipelines by pipeline_id.
HTTP
{
"pipelineId": "pipe_a1b2…",
"bucket": "kb-prod",
"name": "embed-contracts",
"scriptumTemplate": "text_embedding_index",
"options": {
"chunk_size": "1000",
"chunk_overlap": "150"
},
"storeEntityId": "ent_a1b2…",
"storeEntityKind": "vector",
"storeEntityName": "docs-index",
"createdAt": "1716840000000",
"updatedAt": "1716843600000"
}
Key facts:
- A pipeline’s kind is derived server-side from `store_entity_id`: `vector`, `warehouse`, or empty/free. Clients filter by `store_entity_kind` on `ListPipelines` but route via `pipeline_id`.
- Two `script_source` modes: reference (`scriptum_template`: existing Scriptum script name) or spawn (`spawn_from_template`: template ID; the server creates a script from it and persists the resulting name on the pipeline).
- `options` is an overlay on top of the Scriptum script’s defaults, validated server-side against the template’s `ScriptContract`.
- Re-bind to a different destination by calling `UpdatePipeline` with a new `store_entity_id`. The field is an `optional string`; passing it cleared makes the pipeline free.
Template
A row in the Scriptum template catalog. Org-scoped, not bucket-scoped. K3 ships a production catalog of 24 templates across four categories — analysis (8), embedding index/search (10), vision (4), ecommerce (2). See API Reference → Templates → The catalog for the full list with descriptions and modalities.
message Template {
string id = 1; // Scriptum template ID
string name = 2;
string description = 3;
repeated string tags = 4;
repeated string tools_required = 5;
map<string, string> labels = 6; // structured key-value labels
string category = 7; // e.g. "embedding" | "extraction"
// Typed I/O contract — proxied verbatim from Scriptum's GetTemplate.
// UI form generators walk this; K3 reads it server-side to validate
// caller-supplied template inputs.
dodil.k3.common.v1.ScriptContract contract = 8;
}
Key facts:
- `category` and `labels` are the filter axes for `ListTemplates`. Common categories: `embedding`, `extraction`, `structuring`.
- `contract` is a typed schema, `ScriptContract`, detailed in the next section. It tells you what fields a pipeline can override via `options` and what shape the outputs take.
- For full Scriptum semantics see the Scriptum docs (separate product); K3 surfaces only the catalog.
ScriptContract
The typed I/O contract carried on Template.contract (every pillar’s Template surfaces it). It’s K3’s anti-corruption mirror of Scriptum’s contract — re-homed into dodil.k3.common.v1 so pillar protos expose a template’s schema without importing scriptum.proto. Two consumers: form generators (UI / CLI) walk inputs to render per-field controls; the server validates caller-supplied inputs against inputs at creation time (pipeline options, AddVectorPipeline.template_inputs, table-pipeline inputs).
message ScriptContract {
repeated TypedField inputs = 1; // what callers may set (validated server-side)
repeated TypedField outputs = 2; // shape of what the script emits
repeated TypedField stream = 3; // streamed / intermediate fields
repeated TypedField env = 4; // environment inputs (keys, model handles)
repeated EnumDef enums = 5; // named enums referenced by TypeRef.enum_name
repeated string accepted_content_types = 6; // input MIME globs (`image/*`); empty = no restriction
repeated string accepted_extensions = 7; // lowercase, no leading dot; empty = unrestricted
}
// One field in inputs / outputs / stream / env.
message TypedField {
string name = 1;
TypeRef type = 2;
string default_value_json = 3; // JSON literal: `"hi"`, `42`, `true`, `[1,2]`; empty = no default
bool required = 4; // required if `true` OR no default_value_json
string description = 5;
}
// A field's type — oneof over four kinds, plus constraints keyed to the kind.
message TypeRef {
oneof kind {
TypeRefPrimitive primitive = 1; // text|number|integer|boolean|object|binary|date|timestamp|any
string enum_name = 2; // names an entry in ScriptContract.enums
ListType list = 3; // list<T> — recursive
ObjectShape object = 4; // anonymous record of named TypedFields — recursive
}
NumericConstraints number_constraints = 10; // numeric kinds
StringConstraints string_constraints = 11; // text kind
ListConstraints list_constraints = 12; // list kind
}
message ListType { TypeRef item = 1; }
message ObjectShape { repeated TypedField fields = 1; }
// Constraints — only the message matching the field's kind is set; bounds optional (absent = unconstrained).
message NumericConstraints { optional double minimum = 1; optional double maximum = 2; bool exclusive_min = 3; bool exclusive_max = 4; optional double multiple_of = 5; }
message StringConstraints { optional uint32 min_length = 1; optional uint32 max_length = 2; optional string pattern = 3; } // pattern = ECMA-262 regex
message ListConstraints { optional uint32 min_items = 1; optional uint32 max_items = 2; bool unique_items = 3; }
// Enums — a field with kind=enum_name resolves to one of these; `base` sets wire serialization.
enum EnumBase { ENUM_BASE_UNSPECIFIED = 0; ENUM_BASE_TEXT = 1; ENUM_BASE_INTEGER = 2; ENUM_BASE_NUMBER = 3; } // string | int64 | double
message EnumDef { string name = 1; EnumBase base = 2; repeated EnumVariant variants = 3; }
message EnumVariant { oneof value { string value_text = 1; int64 value_integer = 2; double value_number = 3; } string description = 4; } // value matches EnumDef.base
What the proto doesn’t say outright:
- `list` and `object` nest: a `TypeRef` recurses to describe `list<object{…}>` and deeper.
- Defaults are JSON-encoded strings, not native values; decode `default_value_json` as a JSON literal before use.
- `accepted_*` is the modality gate: align an ingest rule’s MIME / extension filters with these, or the rule won’t match what the template can actually ingest.
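To make the default and `required` semantics concrete, here is a minimal Python sketch of how a form generator might resolve effective inputs from a contract. It assumes the HTTP (camelCase) field names `defaultValueJson` and `required`, treats a field as required when `required` is true or no default is present (per the proto comment), and raises when a required field is missing. `resolve_inputs` and the `language` field are hypothetical; this is not K3’s server-side validator.

```python
import json

def resolve_inputs(contract_inputs, options):
    """Overlay caller options on contract defaults; raise on missing required fields."""
    resolved = {}
    for field in contract_inputs:
        name = field["name"]
        default_json = field.get("defaultValueJson", "")
        # Proto comment: required if `required` is true OR there is no default.
        required = field.get("required", False) or default_json == ""
        if name in options:
            resolved[name] = options[name]
        elif required:
            raise ValueError(f"missing required input: {name}")
        else:
            # Defaults are JSON literals encoded as strings.
            resolved[name] = json.loads(default_json)
    return resolved

inputs = [
    {"name": "chunk_size", "defaultValueJson": "1000"},
    {"name": "chunk_overlap", "defaultValueJson": "150"},
    {"name": "language", "required": True},  # hypothetical field for illustration
]
effective = resolve_inputs(inputs, {"language": "en", "chunk_overlap": 200})
```

Type and constraint checks (`TypeRef`, `NumericConstraints`, and so on) are deliberately left out of this sketch.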
Reading a contract from the CLI:
dodil k3 template get text_embedding_index -o json | jq '.contract.inputs'
Rule
The trigger: which objects in which source should fire which pipeline. Pure binding — rules don’t own destinations; they reference a pipeline that does.
HTTP
{
"ruleId": "rule_a1b2…",
"bucket": "kb-prod",
"sourceId": "src_a1b2…",
"name": "pdf-contracts",
"description": "All PDFs under contracts/",
"includePatterns": ["contracts/**/*.pdf"],
"excludePatterns": ["contracts/**/draft-*"],
"includeMimeTypes": ["application/pdf"],
"excludeMimeTypes": [],
"minSizeBytes": "0",
"maxSizeBytes": "0",
"enabled": true,
"priority": 100,
"createdAt": "1716840000000",
"updatedAt": "1716843600000",
"pipelineId": "pipe_a1b2…",
"binding": {
"pipelineName": "embed-contracts",
"pipelineKind": "vector",
"destinationId": "ent_a1b2…",
"destinationKind": "vector",
"destinationName": "docs-index"
}
}
Key facts:
- Matching is conjunctive across filter axes: an object must satisfy `include_patterns` AND not match `exclude_patterns` AND satisfy MIME include/exclude AND be within `[min_size_bytes, max_size_bytes]`. Each axis is OR’d within itself (any include pattern matching is enough). Excludes are checked before includes: if an exclude matches, K3 rejects the object without considering includes.
- `priority` is `int32`; higher values are evaluated first. Use it to express “specific rules before general ones.”
- Multiple rules can match the same object, and each match spawns its own `IngestJob`. Use `enabled = false` to pause a rule without deleting it.
- `binding` is display-only, server-resolved from `pipeline_id`. Don’t make routing decisions from it; if you delete a pipeline, dependent rules still exist with `pipeline_id` pointing at nothing and `binding` empty.
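The priority and `enabled` semantics can be pictured with a short Python sketch. It assumes rules are plain dicts in the HTTP shape and that a missing `enabled` means false (proto3 JSON may omit default values). How K3 breaks ties between equal priorities is not stated here, so the sketch simply keeps input order. The rule IDs are hypothetical.

```python
def rules_in_evaluation_order(rules):
    """Enabled rules only, highest priority first (ties keep input order)."""
    enabled = [rule for rule in rules if rule.get("enabled", False)]
    return sorted(enabled, key=lambda rule: rule.get("priority", 0), reverse=True)

rules = [
    {"ruleId": "rule_general", "priority": 10, "enabled": True},
    {"ruleId": "rule_specific", "priority": 100, "enabled": True},
    {"ruleId": "rule_paused", "priority": 500, "enabled": False},
]
ordered = [rule["ruleId"] for rule in rules_in_evaluation_order(rules)]
```

Every rule in the ordered list that matches an object would yield its own job; ordering affects evaluation sequence, not whether later rules run.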
Glob pattern syntax
include_patterns and exclude_patterns use K3’s in-house glob matcher with these tokens:
| Token | Meaning | Example |
|---|---|---|
| `*` | any chars except `/` (single path segment) | `*.pdf` matches `report.pdf`, NOT `dir/report.pdf` |
| `**` | any chars including `/` (zero or more segments) | `**/*.pdf` matches both `report.pdf` AND `a/b/report.pdf` |
| `?` | exactly one character | `file?.txt` matches `file1.txt`, NOT `file12.txt` |
| literal | match exactly | `readme.md` matches the key `readme.md` |
Behavior:
- Case-insensitive (ASCII). `**/*.pdf` matches `Report.PDF`, `**/*.jpg` matches `IMG_1234.JPG`, and `readme.md` matches `README.MD`. The matcher lowercases both pattern and key before comparing, comparable to the case-insensitive modes of other production glob implementations (globset, fast-glob, minimatch). Bucket keys + extensions are always ASCII so K3 uses cheap ASCII case-folding, not full Unicode.
- `**` matches zero or more path segments: `**/test.txt` matches both `test.txt` (at the root) AND `a/b/c/test.txt` (nested). The trailing `/` after `**` is optional.
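For testing patterns locally before creating a rule, the token table and behavior notes can be approximated with a regex translation. This Python sketch is an illustration of the documented semantics, not K3’s in-house matcher. It assumes `?` does not match `/` (the table does not say) and treats every other character, including brackets and braces, literally.

```python
import re
import string

# ASCII-only case folding, mirroring the documented "cheap ASCII case-folding".
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def glob_to_regex(pattern):
    """Translate a glob using *, **, and ? into a compiled regex."""
    pattern = pattern.translate(_ASCII_LOWER)
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more whole path segments
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")       # assumption: "?" never matches "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # everything else is literal
            i += 1
    return re.compile("".join(parts))

def glob_match(pattern, key):
    """True when the whole object key matches the pattern, ignoring ASCII case."""
    return glob_to_regex(pattern).fullmatch(key.translate(_ASCII_LOWER)) is not None
```

Edge cases the page does not cover (for example `**` adjacent to other characters within a segment) may behave differently in K3, so treat a local match as a hint rather than a guarantee.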
Not supported yet (work around with multiple patterns):
- POSIX character classes: `[a-z]`, `[!a]` don’t expand
- Brace expansion: `{pdf,docx}` is treated literally
- Escaping: there’s no way to match a literal `*`, `?`, or `**` in the object key itself
For either-or expansion, use multiple include patterns:
dodil k3 ingest add multi-ext -b kb-prod \
--source "$SOURCE_ID" --collection "$PIPELINE_ID" \
--include "**/*.pdf" --include "**/*.docx" --include "**/*.txt"
MIME pattern syntax
include_mime_types and exclude_mime_types use a simpler matcher than globs:
| Pattern shape | Matches |
|---|---|
| Exact MIME (`text/plain`) | only that exact MIME |
| Subtype wildcard (`text/*`) | any MIME with that type: `text/plain`, `text/html`, `text/csv`, … |
Not supported: type wildcards (*/json doesn’t work), brace expansion (text/{plain,html}), or any other shape. List multiple exact / subtype-wildcard entries instead.
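The two supported shapes reduce to an exact comparison and a prefix check. A minimal Python sketch of that reading; `mime_matches` is a hypothetical helper, and whether K3 compares MIME types case-sensitively is not stated here, so the sketch compares them as given.

```python
def mime_matches(pattern, mime):
    """Exact match, or subtype wildcard such as "text/*"."""
    if pattern.endswith("/*"):
        # "text/*" becomes the prefix "text/".
        return mime.startswith(pattern[:-1])
    return pattern == mime

def any_mime_matches(patterns, mime):
    """List several exact or subtype-wildcard entries instead of using braces."""
    return any(mime_matches(pattern, mime) for pattern in patterns)
```

Unsupported shapes fall through to the exact comparison, so `*/json` or `text/{plain,html}` simply never match a real MIME type in this sketch.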
Evaluation order (worth knowing for debugging)
K3 evaluates a rule against an object in this exact order, short-circuiting on the first reject:
- `exclude_patterns`: if any match, reject (early-exit; cheap)
- `include_patterns`: if non-empty and none match, reject
- `exclude_mime_types`: if any match, reject
- `include_mime_types`: if non-empty and none match, reject
- `max_size_bytes` (when `> 0`): if the object exceeds it, reject
- `min_size_bytes` (when `> 0`): if the object is smaller, reject
If all axes pass, the object matches.
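When debugging why an object did not match, it helps to know which axis rejected it. The Python sketch below walks the six steps in the listed order and returns the first rejecting axis. It is an approximation: `first_reject` is a hypothetical helper, and the default glob matcher is a stand-in built on the standard library’s `fnmatch`, whose `*` also crosses `/`, unlike K3’s matcher. Pass a stricter matcher via `glob_match` if you have one.

```python
import fnmatch

def _glob_stand_in(pattern, key):
    # Stand-in only: fnmatch's "*" also matches "/", unlike K3's "*".
    return fnmatch.fnmatchcase(key.lower(), pattern.lower())

def _mime_match(pattern, mime):
    if pattern.endswith("/*"):
        return mime.startswith(pattern[:-1])
    return pattern == mime

def first_reject(rule, key, mime, size, glob_match=_glob_stand_in):
    """Return the name of the first axis that rejects the object, or None on a match."""
    if any(glob_match(p, key) for p in rule.get("exclude_patterns", [])):
        return "exclude_patterns"
    includes = rule.get("include_patterns", [])
    if includes and not any(glob_match(p, key) for p in includes):
        return "include_patterns"
    if any(_mime_match(p, mime) for p in rule.get("exclude_mime_types", [])):
        return "exclude_mime_types"
    mime_includes = rule.get("include_mime_types", [])
    if mime_includes and not any(_mime_match(p, mime) for p in mime_includes):
        return "include_mime_types"
    max_size = int(rule.get("max_size_bytes", 0))  # 0 means no upper bound
    if max_size > 0 and size > max_size:
        return "max_size_bytes"
    min_size = int(rule.get("min_size_bytes", 0))  # 0 means no lower bound
    if min_size > 0 and size < min_size:
        return "min_size_bytes"
    return None

rule = {
    "include_patterns": ["contracts/**/*.pdf"],
    "exclude_patterns": ["contracts/**/draft-*"],
    "include_mime_types": ["application/pdf"],
    "max_size_bytes": 1000,
}
```

Calling `first_reject(rule, key, mime, size)` with an object’s key, MIME type, and size returns `None` for a match or the axis name that rejected it.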
IngestJob
One execution of one pipeline against one object. The unit of observability for everything that happens after a rule matches.
HTTP
{
"jobId": "job_a1b2…",
"object": { "bucket": "kb-prod", "key": "contracts/acme-2026.pdf" },
"status": "INGEST_STATUS_COMPLETED",
"pipelineId": "pipe_a1b2…",
"pipelineName": "embed-contracts",
"pipelineKind": "vector",
"ruleId": "rule_a1b2…",
"chunksCreated": 47,
"embeddingsCreated": 47,
"rowsWritten": 0,
"createdAt": "1716843600000",
"updatedAt": "1716843680000",
"threadId": "thread_a1b2…",
"vectorStatus": "success",
"outputSize": 18429,
"batchesReceived": 5,
"embeddingsWritten": 47
}
Key facts:
- `status` transitions are not always linear: transient failures retry automatically through `RETRYING → PROCESSING → COMPLETED`, or land in `FAILED` after the final attempt.
- `rule_id` is absent for jobs spawned via `TriggerIngest` (single-object manual trigger) without a `rule_id` override.
- `pipeline_kind` controls which counters populate: `vector` → `chunks_created` / `embeddings_created` / `embeddings_written`; `warehouse` → `rows_written`; `free` → none.
- `thread_id` is the Scriptum thread ID, useful for cross-referencing with Scriptum logs / observability.
- `vector_status` is a fine-grained sub-status for vector pipelines specifically: did the embedding-persist phase succeed (`success`), fail (`failed`), or skip because the script produced no embeddings (`skipped`)?
- `batches_received` vs `embeddings_written` lets you spot partial writes: Scriptum yielded N batches but only M embeddings landed in the collection.
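Since the populated counters depend on the pipeline kind, a dashboard or script can pick the relevant ones instead of reporting zeros. A minimal Python sketch over the HTTP (camelCase) job shape; `job_summary` is a hypothetical helper, and treating an empty or missing `pipelineKind` as `free` is an assumption based on the "empty/free" wording in the Pipeline section.

```python
# Counters that populate for each pipeline kind, per the key facts above.
COUNTERS_BY_KIND = {
    "vector": ("chunksCreated", "embeddingsCreated", "embeddingsWritten"),
    "warehouse": ("rowsWritten",),
    "free": (),
}

def job_summary(job):
    """Keep only the counters that are meaningful for the job's pipeline kind."""
    kind = job.get("pipelineKind") or "free"  # assumption: empty means free
    names = COUNTERS_BY_KIND.get(kind, ())
    # int() accepts both JSON numbers and int64-as-string values.
    counters = {name: int(job.get(name, 0)) for name in names}
    summary = {"jobId": job.get("jobId"), "kind": kind, "counters": counters}
    if kind == "vector":
        summary["vectorStatus"] = job.get("vectorStatus")
    return summary

summary = job_summary({
    "jobId": "job_example",  # placeholder ID for illustration
    "pipelineKind": "vector",
    "chunksCreated": 47,
    "embeddingsCreated": 47,
    "rowsWritten": 0,
    "vectorStatus": "success",
    "embeddingsWritten": 47,
})
```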
When ingest runs
Three triggers can cause a pipeline to run against an object — all produce the same IngestJob shape, only rule_id differs:
| Trigger | When | rule_id on job |
|---|---|---|
| Direct upload | Successful S3 PUT (or CompleteMultipartUpload) on the bucket’s byte plane | Set — every rule that matches the object fires |
| Source sync | TriggerDiscovery (manual) or the source’s sync_interval_seconds tick | Set — same matching logic |
| One-shot trigger | TriggerIngest on a single ObjectRef | Optional — pass rule_id / pipeline_id explicitly or let server resolve |
Reliability:
- Transient failures retry automatically up to a bounded attempt count (the `INGEST_STATUS_RETRYING` state).
- After max attempts the job becomes `INGEST_STATUS_FAILED` with the last error in `error` / `error_details`.
- Replay any failed or partial run by re-triggering with `TriggerIngestion` on the source (with `retry_failed = true`) or `TriggerIngest` on a single object.
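Because retries happen server-side, a client that wants a final outcome only needs to poll until the job leaves its in-flight states. The Python sketch below assumes `INGEST_STATUS_COMPLETED` and `INGEST_STATUS_FAILED` are the terminal states (others may exist) and takes a caller-supplied `fetch_job` callable, since the RPC used to read a job is not named on this page. `wait_for_job` and its parameters are hypothetical.

```python
import time

# Assumption: these are the only terminal states; extend if the API defines more.
TERMINAL_STATUSES = {"INGEST_STATUS_COMPLETED", "INGEST_STATUS_FAILED"}

def wait_for_job(fetch_job, job_id, interval_s=2.0, timeout_s=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job(job_id) until the job reaches a terminal status or times out."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)
```

A job that comes back as `INGEST_STATUS_FAILED` is the point at which replaying with one of the triggers above makes sense; `RETRYING` is treated as still in flight.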
See also
- Quickstart: wire all six entities together in 5 minutes
- API Reference: gRPC + HTTP for every Source / Pipeline / Ingest RPC
- CLI Guide: `dodil k3 source·pipeline·template·ingest·credential`
- Vector: destination kind for `vector` pipelines
- Tables: destination kind for `warehouse` pipelines
- Conventions: auth headers, error envelope, wire encoding rules