Core Concepts — Tables

Every type signature below is verbatim from proto/proto-k3/k3_tables.proto. Package: dodil.k3.tables.v1. Wire encodings: gRPC follows proto types directly; HTTP uses pbjson (camelCase, int64 as JSON strings, enums as wire-name strings).

The six types: Engine, Table, Column, QueryStrategy / Freshness / ServedBy (read-side), WriteStrategy (write-side), HistoryEntry (time travel). How they relate:


   bucket ──┬── Engine (auto-enabled per bucket; 1:1)
            │
            └── Tables ─── Columns (incl. JSON, partitioning, PK)
                  │
                  ├── Writes ─► WriteStrategy (KeyedInsert / Merge / NonKeyed…)
                  │
                  ├── Reads  ─► QueryStrategy (UnaryWarehouse / UnaryMergedStrong / Federated…)
                  │             + Freshness (Eventual / Strong / OltpOnly)
                  │
                  └── History ── time-travel via Restore

The two-layer HTAP model in plain terms: writes append to a transactional write log inside the bucket; an asynchronous compactor drains the log into the table’s Delta Lake commit history. Reads can scan Delta only (eventual), merge the log with Delta (strong), or hit the log only (rare). Maintenance ops (Optimize / Vacuum / Compact / Restore / History) operate on the Delta side directly.

`Engine`

The per-bucket table backend. Auto-enabled on CreateBucket; you usually don’t touch it.

HTTP


{
  "engineId": "eng_a1b2…",
  "bucket": "kb-prod",
  "status": "ENGINE_STATUS_ACTIVE",
  "settings": "{}",
  "tableCount": 4,
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000"
}

Key facts:

One engine per bucket. engine_id is internal; APIs that touch the engine take bucket and look it up.
Status only flips when you explicitly call DisableEngine / EnableEngine. The default is ACTIVE from bucket creation.
settings is a JSON string for forward-compatible advanced knobs. You almost never set it.

`Table`

A structured dataset in a bucket. Has columns, partition columns, merge keys, and one of two sources — manual (you defined the schema) or pipeline-generated (a Scriptum template owns the schema).

HTTP


{
  "tableId": "tbl_a1b2…",
  "bucket": "kb-prod",
  "engineId": "eng_a1b2…",
  "name": "events",
  "description": "Click + purchase events",
  "columns": [
    { "name": "id",          "type": "COLUMN_TYPE_LONG",      "nullable": false },
    { "name": "user_id",     "type": "COLUMN_TYPE_STRING",    "nullable": false },
    { "name": "occurred_at", "type": "COLUMN_TYPE_TIMESTAMP", "nullable": false },
    { "name": "event_type",  "type": "COLUMN_TYPE_STRING",    "nullable": false },
    { "name": "payload",     "type": "COLUMN_TYPE_JSON",      "nullable": true  }
  ],
  "partitionColumns": ["event_type"],
  "mergeKeys": ["id", "user_id"],
  "source": "TABLE_SOURCE_MANUAL",
  "deltaVersion": "42",
  "rowCount": "1247",
  "sizeBytes": "8923471",
  "status": "TABLE_STATUS_ACTIVE",
  "createdAt": "1716840000000",
  "updatedAt": "1716843600000"
}

Two creation modes

	Manual	Pipeline-generated
RPC	`CreateTable`	`CreateTablePipeline`
Who owns the schema	You — pass columns explicitly	The Scriptum template — schema materializes lazily on first ingest
`source`	`TABLE_SOURCE_MANUAL`	`TABLE_SOURCE_PIPELINE_GENERATED`
`pipeline_id` populated	no	yes — auto-bound to the template’s pipeline
Auto-generated rule	no	yes — globs derived from the template’s `accepted_extensions`
When to pick	Structured data + SQL-first workflow	Unstructured documents → auto-extracted rows

Other key facts:

merge_keys is the table’s primary key (composite supported). Drives the write planner — writes with predicates that match merge_keys route through the keyed strategies (WAL → drain MERGE); other predicates route through non_keyed strategies (warehouse-only — see WriteStrategy).
partition_columns controls physical layout in Delta. Reads that filter on a partition column can prune entire directories.
delta_version advances on every commit (writes, optimize, vacuum, restore). Use it as the cursor for History pagination.
row_count / size_bytes are sidecar stats — accurate after the last drain, slightly stale during heavy writes.

`Column`


message Column {
  string     name     = 1;
  ColumnType type     = 2;
  bool       nullable = 3;
  // BigQuery-aligned. Empty = no default.
  // Examples: "GENERATE_UUID()", "CURRENT_TIMESTAMP()", "ULID()", "42", "'pending'"
  string default_expression = 4;
}
 
enum ColumnType {
  COLUMN_TYPE_UNSPECIFIED = 0;
  COLUMN_TYPE_STRING      = 1;   // utf-8 string
  COLUMN_TYPE_INT         = 2;   // int32
  COLUMN_TYPE_LONG        = 3;   // int64
  COLUMN_TYPE_SHORT       = 4;   // int16
  COLUMN_TYPE_FLOAT       = 5;   // float32
  COLUMN_TYPE_DOUBLE      = 6;   // float64
  COLUMN_TYPE_BOOLEAN     = 7;
  COLUMN_TYPE_DATE        = 8;   // YYYY-MM-DD
  COLUMN_TYPE_TIMESTAMP   = 9;   // microseconds since epoch UTC
  COLUMN_TYPE_BINARY      = 10;  // bytes
  COLUMN_TYPE_JSON        = 11;  // arbitrary JSON, DuckDB-queryable
}

Key facts:

JSON columns are first-class — DuckDB’s native JSON ops (->, ->>, json_extract, json_each, from_json) all work directly on them. For high-volume read paths, promote hot fields to typed columns; reserve json for long-tail / metadata.
default_expression is BigQuery-aligned SQL text applied at row commit when the column is absent from the input row. Validated against the column type at table creation.
Empty default_expression means the column is required at write time unless nullable = true (in which case absence yields SQL NULL).

Read-side observability — `QueryStrategy` / `Freshness` / `ServedBy`

When you run a SELECT through Execute, the response tells you exactly which strategy the planner picked and which tier(s) answered:


// Freshness selector on the request — picks the read path.
enum Freshness {
  FRESHNESS_UNSPECIFIED  = 0;   // treated as EVENTUAL
  FRESHNESS_EVENTUAL     = 1;   // Delta only (default — OLAP-optimized)
  FRESHNESS_STRONG       = 2;   // write log + Delta merged (read-your-writes)
  FRESHNESS_OLTP_ONLY    = 3;   // write log only (niche; point lookups during heavy backlog)
}
 
// Strategy chosen by the planner (returned on the response).
enum QueryStrategy {
  QUERY_STRATEGY_UNSPECIFIED         = 0;
  QUERY_STRATEGY_UNARY_WAREHOUSE     = 1;   // single Delta scan
  QUERY_STRATEGY_UNARY_MERGED_STRONG = 2;   // single-pod write-log + Delta merge
  QUERY_STRATEGY_FEDERATED_AGGREGATE = 3;   // partitions fanned out, results reduced
  QUERY_STRATEGY_FEDERATED_SCAN      = 4;   // partitions fanned out, results concatenated
  QUERY_STRATEGY_FEDERATED_TO_SINGLE = 5;   // planner downgraded to single (small table, no partitions)
}
 
// Which tier(s) actually answered.
enum ServedBy {
  SERVED_BY_UNSPECIFIED = 0;
  SERVED_BY_WAREHOUSE   = 1;   // Delta only
  SERVED_BY_MERGED      = 2;   // write log + Delta
  SERVED_BY_OLTP        = 3;   // write log only
}

How they interact:

You request…	Planner picks…	Tier…
Default (`EVENTUAL`)	`UNARY_WAREHOUSE` or `FEDERATED_*` depending on partitioning + size	`WAREHOUSE`
`STRONG`	`UNARY_MERGED_STRONG`	`MERGED`
`OLTP_ONLY`	(single-pod write-log scan)	`OLTP`
Large partitioned table + aggregate query	`FEDERATED_AGGREGATE`	`WAREHOUSE` (fanned-out)
Large partitioned table + non-aggregate SELECT	`FEDERATED_SCAN`	`WAREHOUSE` (fanned-out + concatenated)
Federated plan but table too small / unpartitioned	`FEDERATED_TO_SINGLE` (`strategy_reason` explains)	`WAREHOUSE`

The strategy + tier come back on ExecuteQueryResult.strategy / served_by for every read, alongside an optional human-readable strategy_reason for downgrades.

Write-side observability — `WriteStrategy`

The mirror enum for writes — every Insert / Merge / Update / DeleteRows / Execute(SQL) write returns the strategy the planner picked:


enum WriteStrategy {
  WRITE_STRATEGY_UNSPECIFIED = 0;
 
  // ── PK present + WHERE matches the PK ── (write-log → drain MERGE)
  WRITE_STRATEGY_KEYED_INSERT_SINGLE        = 1;
  WRITE_STRATEGY_KEYED_INSERT_BULK          = 2;
  WRITE_STRATEGY_KEYED_INSERT_FROM_SELECT   = 3;
  WRITE_STRATEGY_KEYED_UPDATE               = 4;
  WRITE_STRATEGY_KEYED_RANGE                = 5;
  WRITE_STRATEGY_KEYED_FROM_SUBQUERY        = 6;
  WRITE_STRATEGY_KEYED_DELETE               = 7;
  WRITE_STRATEGY_KEYED_RANGE_DELETE         = 8;
  WRITE_STRATEGY_KEYED_DELETE_FROM_SUBQUERY = 9;
 
  // ── No PK or WHERE doesn't match PK ── (Delta-only — ⚠️ WAL-bypass)
  WRITE_STRATEGY_NON_KEYED_UPDATE             = 10;
  WRITE_STRATEGY_NON_KEYED_DELETE             = 11;
  WRITE_STRATEGY_NON_KEYED_INSERT             = 15;
  WRITE_STRATEGY_NON_KEYED_INSERT_FROM_SELECT = 16;
 
  // ── MERGE ──
  WRITE_STRATEGY_MERGE_ROWS  = 12;
  WRITE_STRATEGY_MERGE_QUERY = 13;
  WRITE_STRATEGY_MERGE_TABLE = 14;
}

Key facts:

Keyed strategies route through the write log → the compactor drains them into Delta on the next tick. Predicates must match the table’s merge_keys.
Non-keyed strategies bypass the write log and write directly to Delta. Faster for one-shot writes but carry a risk: if pending log entries for the same rows haven’t compacted yet, the next drain may overwrite the Delta-side change. Use after Compact, or restrict to tables / paths where you control timing.
MERGE strategies distinguish source shape: explicit rows (MERGE_ROWS), source query (MERGE_QUERY), or source table (MERGE_TABLE).
The strategy is returned on every Execute(write SQL) response — useful for confirming the planner picked the cheap path.

For the full SQL-shape → strategy mapping, see Execute → Update (keyed/non-keyed UPDATE) or any of the Execute sub-pages.

`HistoryEntry`

The Delta Lake commit log. Every write, optimize, vacuum, restore, schema change produces one entry. Pagination cursor is version — strictly increasing.


message HistoryEntry {
  int64 version = 5;
  string timestamp = 1;                       // epoch-milliseconds (string)
  string operation = 2;                       // "WRITE" | "MERGE" | "UPDATE" | "DELETE" |
                                              // "OPTIMIZE" | "VACUUM START" | "VACUUM END" |
                                              // "RESTORE" | "CREATE TABLE" | …
  google.protobuf.Struct parameters = 3;      // op-specific (e.g. {"predicate":"id = 4"})
  google.protobuf.Struct operation_metrics = 6;
  google.protobuf.Struct user_metadata = 7;
  string engine_info = 8;
  int64 read_version = 4;                     // version this op read from; 0 on initial CREATE
}

Key facts:

version is the cursor — use HistoryRequest.before_version to page backwards.
timestamp is epoch-ms-as-string; order by version, not timestamp — clock skew across writers can cause tiny non-monotonicity.
operation_metrics carries the Delta-native counters per op (numTargetRowsInserted/Updated/Deleted for MERGE/UPDATE; numFilesAdded/Removed for OPTIMIZE; …).
user_metadata is whatever you attached at commit time via commitInfo.userMetadata. Empty when not set.
Time travel via Restore takes a version from this log — see Recipes → Time travel + restore.

When ingest runs (pipeline-bound tables)

Pipeline-generated tables (CreateTablePipeline) ingest automatically — same trigger model as Pipelines:

Direct upload of a matching object via S3 → Scriptum template runs → rows land in the table
Source sync (Preview) → discovered objects → template runs → rows land
One-shot manual via TriggerIngest → forces a single object through the bound pipeline

The auto-generated ingest rule (created with CreateTablePipeline) is a normal IngestRule — list / inspect / edit it via the Pipelines APIs.

Core Concepts — Tables

Engine

HTTP

gRPC

Table

HTTP

gRPC

Two creation modes

Column

Read-side observability — QueryStrategy / Freshness / ServedBy

Write-side observability — WriteStrategy

HistoryEntry

When ingest runs (pipeline-bound tables)

See also

`Engine`

`Table`

`Column`

Read-side observability — `QueryStrategy` / `Freshness` / `ServedBy`

Write-side observability — `WriteStrategy`

`HistoryEntry`