CTAS & Materialize
Goal: build a new table from a SQL query over an existing one. This is K3’s in-database ETL — denormalize, sessionize, summarize, snapshot — without leaving the bucket.
Two ways to do it, same dispatcher underneath:
| Approach | API | When to use |
|---|---|---|
CTAS (CREATE TABLE … AS SELECT …) | Execute with raw SQL | You’re already in a SQL workflow; the source + select + target all fit in one statement |
| Materialize | Materialize (structured RPC) | Programmatic callers / UIs with separate inputs for source / SQL / target / partitioning / mode |
Both run K3’s planner against the source SELECT, materialize the rows, and write to the target.
Shape:
Source table ──► SELECT (joins, aggregates, windows, JSON ops, …)
│
▼
New / appended / replaced
target table
(own Delta history)Prerequisites
- A bucket with at least one populated table. We’ll use the
eventstable from Manual Table. dodilCLI +dodil login.- The CLI doesn’t yet expose
Executefor write SQL (DDL/CTAS) orMaterialize— we’ll usecurlagainst the API for both.
1. Simplest CTAS — summarize a source
The classic denormalization pattern. Derive a click_summary table from events:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/_execute" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sql": "CREATE TABLE click_summary AS SELECT user_id, COUNT(*) AS n, MAX(occurred_at) AS last_click FROM events WHERE event_type = '\''click'\'' GROUP BY user_id"
}'Sample response:
{
"ddl": {
"op": "create_table_as_select",
"targetTable": "click_summary",
"version": "0"
}
}K3’s dispatcher ran:
Queryagainstevents(planner picked the right read strategy).- Inferred the target schema from the result columns.
CreateTableforclick_summary.Insertof the materialized rows.
Type inference rules
When you don’t specify column types explicitly, K3 infers from the SELECT result:
| SELECT result | Inferred K3 type |
|---|---|
Integer numbers (COUNT(*), SUM(int), …) | long |
Fractional numbers (AVG, SUM(double), ratio expressions) | double |
| Strings | string |
| Booleans | boolean |
| Timestamp / date literals | timestamp / date |
Watch out: integer-shaped JSON values become long, even when the source column was int. For typed-stable migrations, use the structured CreateTable RPC with explicit columns instead.
2. CTAS with explicit column types
When you don’t want type-inference surprises, declare columns explicitly. K3’s CTAS syntax accepts the same CREATE TABLE ... AS SELECT ... shape; the planner respects the column types you write:
CREATE TABLE click_summary (
user_id VARCHAR NOT NULL,
n INTEGER NOT NULL,
last_click TIMESTAMP NOT NULL,
PRIMARY KEY (user_id)
)
AS
SELECT user_id, COUNT(*)::INTEGER AS n, MAX(occurred_at) AS last_click
FROM events
WHERE event_type = 'click'
GROUP BY user_id;Send via Execute:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/_execute" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sql": "CREATE TABLE click_summary (user_id VARCHAR NOT NULL, n INTEGER NOT NULL, last_click TIMESTAMP NOT NULL, PRIMARY KEY (user_id)) AS SELECT user_id, COUNT(*)::INTEGER AS n, MAX(occurred_at) AS last_click FROM events WHERE event_type = '\''click'\'' GROUP BY user_id"
}'PRIMARY KEY on the new table flows to the planner; subsequent UPDATE / DELETE against click_summary will route through keyed strategies for free.
3. Structured CTAS via Materialize
Materialize is the structured-RPC cousin of CTAS — same dispatcher, different wire shape. Useful when your caller has separate input fields and doesn’t want to template SQL.
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sourceTable": "events",
"sql": "SELECT user_id, COUNT(*) AS n, MAX(occurred_at) AS last_click FROM {table} WHERE event_type = '\''click'\'' GROUP BY user_id",
"targetTable": "click_summary",
"partitionColumns": [],
"mode": "create"
}'Sample response:
{
"targetTable": "click_summary",
"targetLocation": "s3://kb-prod/.../click_summary",
"rowsMaterialized": "342",
"version": "0"
}{table} placeholder in the SQL refers to sourceTable. The dispatcher rewrites it before the underlying SELECT runs.
Modes
mode | Behavior |
|---|---|
create (default) | Fails with FAILED_PRECONDITION if targetTable already exists |
replace | Drops target if it exists, then recreates from the SELECT |
append | Requires existing target with matching schema; appends the SELECT rows |
Partitioning the target
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sourceTable": "events",
"sql": "SELECT user_id, event_type, occurred_at, payload FROM {table}",
"targetTable": "events_repartitioned",
"partitionColumns": ["event_type", "user_id"],
"mode": "create"
}'Reads that filter on (event_type, user_id) will prune both partitions — useful for shifting hot read shapes without rewriting the source.
4. Append mode — incremental materialization
Build a daily summary table that grows over time:
# Day 1 — create with mode=create
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sourceTable": "events",
"sql": "SELECT DATE_TRUNC('\''day'\'', occurred_at) AS day, event_type, COUNT(*) AS n FROM {table} WHERE occurred_at >= TIMESTAMP '\''2026-05-26'\'' AND occurred_at < TIMESTAMP '\''2026-05-27'\'' GROUP BY 1, 2",
"targetTable": "daily_event_counts",
"partitionColumns": ["day"],
"mode": "create"
}'
# Day 2+ — mode=append, same target
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sourceTable": "events",
"sql": "SELECT DATE_TRUNC('\''day'\'', occurred_at) AS day, event_type, COUNT(*) AS n FROM {table} WHERE occurred_at >= TIMESTAMP '\''2026-05-27'\'' AND occurred_at < TIMESTAMP '\''2026-05-28'\'' GROUP BY 1, 2",
"targetTable": "daily_event_counts",
"partitionColumns": ["day"],
"mode": "append"
}'Run the append-day job nightly from a cron / scheduler. Each run adds one new day’s partitions to daily_event_counts.
5. Replace mode — snapshot refreshes
For dashboards / views that should always reflect the current state of the source (not a growing log), use mode: replace:
# Refresh the top-100 active users — runs hourly
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sourceTable": "events",
"sql": "SELECT user_id, COUNT(*) AS events_24h FROM {table} WHERE occurred_at > NOW() - INTERVAL 1 DAY GROUP BY user_id ORDER BY events_24h DESC LIMIT 100",
"targetTable": "top_users_24h",
"mode": "replace"
}'replace drops the target + recreates from the SELECT. Faster than DELETE FROM ... + INSERT ... because it’s a single Delta commit and avoids the write-log roundtrip.
6. Broadcast JOIN with lookup_tables
To JOIN against a second table in the source SELECT, pass an alias map. K3 registers the lookup as a DuckDB view scoped to the materialize call:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sourceTable": "events",
"sql": "SELECT e.user_id, u.email, COUNT(*) AS clicks FROM {table} e JOIN users u ON e.user_id = u.id WHERE e.event_type = '\''click'\'' GROUP BY e.user_id, u.email",
"targetTable": "user_click_counts",
"lookupTables": { "users": "users" },
"mode": "replace"
}'lookupTables is alias → table-name. Use the alias in the SQL (JOIN users u), and the dispatcher resolves it to the actual table name on the bucket.
7. Verify the new table
# Schema + Delta location of the new table
dodil k3 table describe click_summary --bucket kb-prod -o json \
| jq '{tableName, location, version, columns: [.columns[] | {name, type}]}'
# Spot-check the rows
dodil k3 table query \
"SELECT * FROM click_summary ORDER BY n DESC LIMIT 5" \
--bucket kb-prod
# History — should have one CREATE TABLE commit
curl -sS "https://k3.dev.dodil.io/kb-prod/tables/click_summary/history?limit=5" \
-H "Authorization: Bearer $DODIL_TOKEN" \
| jq '.entries[] | {version, operation}'8. Sessionization — a worked window-function example
Use materialize to compute session boundaries — a common preparation for analytics dashboards. We use LAG() (DuckDB window function) to find the previous event time per user, then label sessions as gaps > 30 min:
curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \
-H "Authorization: Bearer $DODIL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"bucket": "kb-prod",
"sourceTable": "events",
"sql": "WITH gaps AS (SELECT user_id, occurred_at, event_type, payload, occurred_at - LAG(occurred_at) OVER (PARTITION BY user_id ORDER BY occurred_at) AS gap_us FROM {table}) SELECT user_id, occurred_at, event_type, payload, gap_us, SUM(CASE WHEN gap_us IS NULL OR gap_us > 30 * 60 * 1000000 THEN 1 ELSE 0 END) OVER (PARTITION BY user_id ORDER BY occurred_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_number FROM gaps",
"targetTable": "events_sessionized",
"partitionColumns": ["user_id"],
"mode": "replace"
}'The resulting events_sessionized table has the original event rows plus a session_number column per user — partitioned by user_id so per-user session queries prune efficiently.
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
CTAS succeeds but click_summary.n is wider than expected | Type inference promoted COUNT(*) to long (int64) when you expected int | Use explicit CREATE TABLE click_summary (n INTEGER NOT NULL, ...) AS SELECT ... with the type you want |
Materialize mode: create fails with FAILED_PRECONDITION | Target table already exists | Switch to mode: replace if you want to overwrite, or mode: append to add to existing |
mode: append fails with schema mismatch | The SELECT produced columns that don’t match the existing target schema | Ensure SELECT columns + types align exactly with the target’s existing schema; use DescribeTable on the target to confirm |
| CTAS with JOIN fails — “table not found” | The second table needs lookupTables (Materialize) or must be referenced by literal name in Execute SQL | Switch to Materialize with lookupTables, or write the literal table name in Execute’s SQL |
Materialize succeeds but rowsMaterialized is 0 | SELECT returned no rows (filters too strict, empty source, etc.) | Run the SELECT alone via dodil k3 table query to confirm what it returns |
version is 0 on the response | Write routed through the write log; Delta hasn’t committed yet | Run compact to drain — version advances on the next commit |
Repeated mode: replace builds up Delta versions | Each replace is a DROP + CREATE = 2+ Delta commits. Over time, history grows | Periodic Vacuum (after retention window) reclaims old versions. See Time Travel & Restore for the safety pattern. |
Comparison — CTAS vs Materialize at a glance
CREATE TABLE AS SELECT (CTAS) | Materialize RPC | |
|---|---|---|
| Wire shape | Single SQL string sent to Execute | Structured fields (source / SQL / target / mode / partitions / lookups) |
| Mode selector | Implicit (always “create”); use DROP TABLE IF EXISTS + CTAS for replace | Explicit mode field (create / replace / append) |
| Partitioning | Inside the SQL (PARTITIONED BY ...) | Top-level partition_columns field |
| Broadcast JOINs | Literal table names | lookup_tables alias map |
| Type control | Inference (or full-explicit CREATE TABLE (cols) AS SELECT) | Same inference |
| Best for | Adhoc CLI / SQL scripts | Programmatic / UI-driven materialization |
See also
- Execute → DDL — CREATE TABLE + CTAS spec
- Execute → Materialize — structured RPC spec
- Time Travel & Restore — pin a restore point by materializing a snapshot
- Manual Table — the source
eventstable this recipe runs against - SQL Compatibility — window functions / CTEs / JOIN reference