CTAS & Materialize

Goal: build a new table from a SQL query over an existing one. This is K3’s in-database ETL — denormalize, sessionize, summarize, snapshot — without leaving the bucket.

Two ways to do it, same dispatcher underneath:

Approach	API	When to use
CTAS (`CREATE TABLE … AS SELECT …`)	`Execute` with raw SQL	You’re already in a SQL workflow; the source + select + target all fit in one statement
Materialize	`Materialize` (structured RPC)	Programmatic callers / UIs with separate inputs for source / SQL / target / partitioning / mode

Both run K3’s planner against the source SELECT, materialize the rows, and write to the target.

Shape:


  Source table   ──►   SELECT (joins, aggregates, windows, JSON ops, …)
                                │
                                ▼
                       New / appended / replaced
                              target table
                       (own Delta history)

Prerequisites

A bucket with at least one populated table. We’ll use the events table from Manual Table.
dodil CLI + dodil login.
The CLI doesn’t yet expose Execute for write SQL (DDL/CTAS) or Materialize — we’ll use curl against the API for both.

1. Simplest CTAS — summarize a source

The classic denormalization pattern. Derive a click_summary table from events:


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/_execute" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sql": "CREATE TABLE click_summary AS SELECT user_id, COUNT(*) AS n, MAX(occurred_at) AS last_click FROM events WHERE event_type = '\''click'\'' GROUP BY user_id"
  }'

Sample response:


{
  "ddl": {
    "op": "create_table_as_select",
    "targetTable": "click_summary",
    "version": "0"
  }
}

K3’s dispatcher ran:

Query against events (planner picked the right read strategy).
Inferred the target schema from the result columns.
CreateTable for click_summary.
Insert of the materialized rows.

Type inference rules

When you don’t specify column types explicitly, K3 infers from the SELECT result:

SELECT result	Inferred K3 type
Integer numbers (`COUNT(*)`, `SUM(int)`, …)	`long`
Fractional numbers (`AVG`, `SUM(double)`, ratio expressions)	`double`
Strings	`string`
Booleans	`boolean`
Timestamp / date literals	`timestamp` / `date`

Watch out: integer-shaped JSON values become long, even when the source column was int. For typed-stable migrations, use the structured CreateTable RPC with explicit columns instead.

2. CTAS with explicit column types

When you don’t want type-inference surprises, declare columns explicitly. K3’s CTAS syntax accepts the same CREATE TABLE ... AS SELECT ... shape; the planner respects the column types you write:


CREATE TABLE click_summary (
  user_id    VARCHAR NOT NULL,
  n          INTEGER NOT NULL,
  last_click TIMESTAMP NOT NULL,
  PRIMARY KEY (user_id)
)
AS
SELECT user_id, COUNT(*)::INTEGER AS n, MAX(occurred_at) AS last_click
FROM events
WHERE event_type = 'click'
GROUP BY user_id;

Send via Execute:


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/_execute" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sql": "CREATE TABLE click_summary (user_id VARCHAR NOT NULL, n INTEGER NOT NULL, last_click TIMESTAMP NOT NULL, PRIMARY KEY (user_id)) AS SELECT user_id, COUNT(*)::INTEGER AS n, MAX(occurred_at) AS last_click FROM events WHERE event_type = '\''click'\'' GROUP BY user_id"
  }'

PRIMARY KEY on the new table flows to the planner; subsequent UPDATE / DELETE against click_summary will route through keyed strategies for free.

3. Structured CTAS via `Materialize`

Materialize is the structured-RPC cousin of CTAS — same dispatcher, different wire shape. Useful when your caller has separate input fields and doesn’t want to template SQL.


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sourceTable": "events",
    "sql": "SELECT user_id, COUNT(*) AS n, MAX(occurred_at) AS last_click FROM {table} WHERE event_type = '\''click'\'' GROUP BY user_id",
    "targetTable": "click_summary",
    "partitionColumns": [],
    "mode": "create"
  }'

Sample response:


{
  "targetTable": "click_summary",
  "targetLocation": "s3://kb-prod/.../click_summary",
  "rowsMaterialized": "342",
  "version": "0"
}

{table} placeholder in the SQL refers to sourceTable. The dispatcher rewrites it before the underlying SELECT runs.

Modes

`mode`	Behavior
`create` (default)	Fails with `FAILED_PRECONDITION` if `targetTable` already exists
`replace`	Drops target if it exists, then recreates from the SELECT
`append`	Requires existing target with matching schema; appends the SELECT rows

Partitioning the target


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sourceTable": "events",
    "sql": "SELECT user_id, event_type, occurred_at, payload FROM {table}",
    "targetTable": "events_repartitioned",
    "partitionColumns": ["event_type", "user_id"],
    "mode": "create"
  }'

Reads that filter on (event_type, user_id) will prune both partitions — useful for shifting hot read shapes without rewriting the source.

4. Append mode — incremental materialization

Build a daily summary table that grows over time:


# Day 1 — create with mode=create
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sourceTable": "events",
    "sql": "SELECT DATE_TRUNC('\''day'\'', occurred_at) AS day, event_type, COUNT(*) AS n FROM {table} WHERE occurred_at >= TIMESTAMP '\''2026-05-26'\'' AND occurred_at < TIMESTAMP '\''2026-05-27'\'' GROUP BY 1, 2",
    "targetTable": "daily_event_counts",
    "partitionColumns": ["day"],
    "mode": "create"
  }'
 
# Day 2+ — mode=append, same target
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sourceTable": "events",
    "sql": "SELECT DATE_TRUNC('\''day'\'', occurred_at) AS day, event_type, COUNT(*) AS n FROM {table} WHERE occurred_at >= TIMESTAMP '\''2026-05-27'\'' AND occurred_at < TIMESTAMP '\''2026-05-28'\'' GROUP BY 1, 2",
    "targetTable": "daily_event_counts",
    "partitionColumns": ["day"],
    "mode": "append"
  }'

Run the append-day job nightly from a cron / scheduler. Each run adds one new day’s partitions to daily_event_counts.

5. Replace mode — snapshot refreshes

For dashboards / views that should always reflect the current state of the source (not a growing log), use mode: replace:


# Refresh the top-100 active users — runs hourly
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sourceTable": "events",
    "sql": "SELECT user_id, COUNT(*) AS events_24h FROM {table} WHERE occurred_at > NOW() - INTERVAL 1 DAY GROUP BY user_id ORDER BY events_24h DESC LIMIT 100",
    "targetTable": "top_users_24h",
    "mode": "replace"
  }'

replace drops the target + recreates from the SELECT. Faster than DELETE FROM ... + INSERT ... because it’s a single Delta commit and avoids the write-log roundtrip.

6. Broadcast JOIN with `lookup_tables`

To JOIN against a second table in the source SELECT, pass an alias map. K3 registers the lookup as a DuckDB view scoped to the materialize call:


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sourceTable": "events",
    "sql": "SELECT e.user_id, u.email, COUNT(*) AS clicks FROM {table} e JOIN users u ON e.user_id = u.id WHERE e.event_type = '\''click'\'' GROUP BY e.user_id, u.email",
    "targetTable": "user_click_counts",
    "lookupTables": { "users": "users" },
    "mode": "replace"
  }'

lookupTables is alias → table-name. Use the alias in the SQL (JOIN users u), and the dispatcher resolves it to the actual table name on the bucket.

7. Verify the new table


# Schema + Delta location of the new table
dodil k3 table describe click_summary --bucket kb-prod -o json \
  | jq '{tableName, location, version, columns: [.columns[] | {name, type}]}'
 
# Spot-check the rows
dodil k3 table query \
  "SELECT * FROM click_summary ORDER BY n DESC LIMIT 5" \
  --bucket kb-prod
 
# History — should have one CREATE TABLE commit
curl -sS "https://k3.dev.dodil.io/kb-prod/tables/click_summary/history?limit=5" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  | jq '.entries[] | {version, operation}'

8. Sessionization — a worked window-function example

Use materialize to compute session boundaries — a common preparation for analytics dashboards. We use LAG() (DuckDB window function) to find the previous event time per user, then label sessions as gaps > 30 min:


curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
  -H "Authorization: Bearer $DODIL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "kb-prod",
    "sourceTable": "events",
    "sql": "WITH gaps AS (SELECT user_id, occurred_at, event_type, payload, occurred_at - LAG(occurred_at) OVER (PARTITION BY user_id ORDER BY occurred_at) AS gap_us FROM {table}) SELECT user_id, occurred_at, event_type, payload, gap_us, SUM(CASE WHEN gap_us IS NULL OR gap_us > 30 * 60 * 1000000 THEN 1 ELSE 0 END) OVER (PARTITION BY user_id ORDER BY occurred_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_number FROM gaps",
    "targetTable": "events_sessionized",
    "partitionColumns": ["user_id"],
    "mode": "replace"
  }'

The resulting events_sessionized table has the original event rows plus a session_number column per user — partitioned by user_id so per-user session queries prune efficiently.

Common gotchas

Symptom	Cause	Fix
CTAS succeeds but `click_summary.n` is wider than expected	Type inference promoted `COUNT(*)` to `long` (int64) when you expected `int`	Use explicit `CREATE TABLE click_summary (n INTEGER NOT NULL, ...) AS SELECT ...` with the type you want
Materialize `mode: create` fails with `FAILED_PRECONDITION`	Target table already exists	Switch to `mode: replace` if you want to overwrite, or `mode: append` to add to existing
`mode: append` fails with schema mismatch	The SELECT produced columns that don’t match the existing target schema	Ensure SELECT columns + types align exactly with the target’s existing schema; use `DescribeTable` on the target to confirm
CTAS with JOIN fails — “table not found”	The second table needs `lookupTables` (Materialize) or must be referenced by literal name in `Execute` SQL	Switch to `Materialize` with `lookupTables`, or write the literal table name in `Execute`’s SQL
Materialize succeeds but `rowsMaterialized` is 0	SELECT returned no rows (filters too strict, empty source, etc.)	Run the SELECT alone via `dodil k3 table query` to confirm what it returns
`version` is 0 on the response	Write routed through the write log; Delta hasn’t committed yet	Run `compact` to drain — version advances on the next commit
Repeated `mode: replace` builds up Delta versions	Each `replace` is a DROP + CREATE = 2+ Delta commits. Over time, history grows	Periodic `Vacuum` (after retention window) reclaims old versions. See Time Travel & Restore for the safety pattern.

Comparison — CTAS vs Materialize at a glance

	`CREATE TABLE AS SELECT` (CTAS)	`Materialize` RPC
Wire shape	Single SQL string sent to `Execute`	Structured fields (source / SQL / target / mode / partitions / lookups)
Mode selector	Implicit (always “create”); use `DROP TABLE IF EXISTS` + CTAS for replace	Explicit `mode` field (`create` / `replace` / `append`)
Partitioning	Inside the SQL (`PARTITIONED BY ...`)	Top-level `partition_columns` field
Broadcast JOINs	Literal table names	`lookup_tables` alias map
Type control	Inference (or full-explicit `CREATE TABLE (cols) AS SELECT`)	Same inference
Best for	Adhoc CLI / SQL scripts	Programmatic / UI-driven materialization