Skip to Content
We are live but in Staging 🎉
TablesRecipesCTAS & Materialize

CTAS & Materialize

Goal: build a new table from a SQL query over an existing one. This is K3’s in-database ETL — denormalize, sessionize, summarize, snapshot — without leaving the bucket.

Two ways to do it, same dispatcher underneath:

ApproachAPIWhen to use
CTAS (CREATE TABLE … AS SELECT …)Execute with raw SQLYou’re already in a SQL workflow; the source + select + target all fit in one statement
MaterializeMaterialize (structured RPC)Programmatic callers / UIs with separate inputs for source / SQL / target / partitioning / mode

Both run K3’s planner against the source SELECT, materialize the rows, and write to the target.

Shape:

Source table ──► SELECT (joins, aggregates, windows, JSON ops, …) New / appended / replaced target table (own Delta history)

Prerequisites

  • A bucket with at least one populated table. We’ll use the events table from Manual Table.
  • dodil CLI + dodil login.
  • The CLI doesn’t yet expose Execute for write SQL (DDL/CTAS) or Materialize — we’ll use curl against the API for both.

1. Simplest CTAS — summarize a source

The classic denormalization pattern. Derive a click_summary table from events:

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/_execute" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sql": "CREATE TABLE click_summary AS SELECT user_id, COUNT(*) AS n, MAX(occurred_at) AS last_click FROM events WHERE event_type = '\''click'\'' GROUP BY user_id" }'

Sample response:

{ "ddl": { "op": "create_table_as_select", "targetTable": "click_summary", "version": "0" } }

K3’s dispatcher ran:

  1. Query against events (planner picked the right read strategy).
  2. Inferred the target schema from the result columns.
  3. CreateTable for click_summary.
  4. Insert of the materialized rows.

Type inference rules

When you don’t specify column types explicitly, K3 infers from the SELECT result:

SELECT resultInferred K3 type
Integer numbers (COUNT(*), SUM(int), …)long
Fractional numbers (AVG, SUM(double), ratio expressions)double
Stringsstring
Booleansboolean
Timestamp / date literalstimestamp / date

Watch out: integer-shaped JSON values become long, even when the source column was int. For typed-stable migrations, use the structured CreateTable RPC with explicit columns instead.

2. CTAS with explicit column types

When you don’t want type-inference surprises, declare columns explicitly. K3’s CTAS syntax accepts the same CREATE TABLE ... AS SELECT ... shape; the planner respects the column types you write:

CREATE TABLE click_summary ( user_id VARCHAR NOT NULL, n INTEGER NOT NULL, last_click TIMESTAMP NOT NULL, PRIMARY KEY (user_id) ) AS SELECT user_id, COUNT(*)::INTEGER AS n, MAX(occurred_at) AS last_click FROM events WHERE event_type = 'click' GROUP BY user_id;

Send via Execute:

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/_execute" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sql": "CREATE TABLE click_summary (user_id VARCHAR NOT NULL, n INTEGER NOT NULL, last_click TIMESTAMP NOT NULL, PRIMARY KEY (user_id)) AS SELECT user_id, COUNT(*)::INTEGER AS n, MAX(occurred_at) AS last_click FROM events WHERE event_type = '\''click'\'' GROUP BY user_id" }'

PRIMARY KEY on the new table flows to the planner; subsequent UPDATE / DELETE against click_summary will route through keyed strategies for free.

3. Structured CTAS via Materialize

Materialize is the structured-RPC cousin of CTAS — same dispatcher, different wire shape. Useful when your caller has separate input fields and doesn’t want to template SQL.

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sourceTable": "events", "sql": "SELECT user_id, COUNT(*) AS n, MAX(occurred_at) AS last_click FROM {table} WHERE event_type = '\''click'\'' GROUP BY user_id", "targetTable": "click_summary", "partitionColumns": [], "mode": "create" }'

Sample response:

{ "targetTable": "click_summary", "targetLocation": "s3://kb-prod/.../click_summary", "rowsMaterialized": "342", "version": "0" }

{table} placeholder in the SQL refers to sourceTable. The dispatcher rewrites it before the underlying SELECT runs.

Modes

modeBehavior
create (default)Fails with FAILED_PRECONDITION if targetTable already exists
replaceDrops target if it exists, then recreates from the SELECT
appendRequires existing target with matching schema; appends the SELECT rows

Partitioning the target

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sourceTable": "events", "sql": "SELECT user_id, event_type, occurred_at, payload FROM {table}", "targetTable": "events_repartitioned", "partitionColumns": ["event_type", "user_id"], "mode": "create" }'

Reads that filter on (event_type, user_id) will prune both partitions — useful for shifting hot read shapes without rewriting the source.

4. Append mode — incremental materialization

Build a daily summary table that grows over time:

# Day 1 — create with mode=create curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sourceTable": "events", "sql": "SELECT DATE_TRUNC('\''day'\'', occurred_at) AS day, event_type, COUNT(*) AS n FROM {table} WHERE occurred_at >= TIMESTAMP '\''2026-05-26'\'' AND occurred_at < TIMESTAMP '\''2026-05-27'\'' GROUP BY 1, 2", "targetTable": "daily_event_counts", "partitionColumns": ["day"], "mode": "create" }' # Day 2+ — mode=append, same target curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sourceTable": "events", "sql": "SELECT DATE_TRUNC('\''day'\'', occurred_at) AS day, event_type, COUNT(*) AS n FROM {table} WHERE occurred_at >= TIMESTAMP '\''2026-05-27'\'' AND occurred_at < TIMESTAMP '\''2026-05-28'\'' GROUP BY 1, 2", "targetTable": "daily_event_counts", "partitionColumns": ["day"], "mode": "append" }'

Run the append-day job nightly from a cron / scheduler. Each run adds one new day’s partitions to daily_event_counts.

5. Replace mode — snapshot refreshes

For dashboards / views that should always reflect the current state of the source (not a growing log), use mode: replace:

# Refresh the top-100 active users — runs hourly curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sourceTable": "events", "sql": "SELECT user_id, COUNT(*) AS events_24h FROM {table} WHERE occurred_at > NOW() - INTERVAL 1 DAY GROUP BY user_id ORDER BY events_24h DESC LIMIT 100", "targetTable": "top_users_24h", "mode": "replace" }'

replace drops the target + recreates from the SELECT. Faster than DELETE FROM ... + INSERT ... because it’s a single Delta commit and avoids the write-log roundtrip.

6. Broadcast JOIN with lookup_tables

To JOIN against a second table in the source SELECT, pass an alias map. K3 registers the lookup as a DuckDB view scoped to the materialize call:

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sourceTable": "events", "sql": "SELECT e.user_id, u.email, COUNT(*) AS clicks FROM {table} e JOIN users u ON e.user_id = u.id WHERE e.event_type = '\''click'\'' GROUP BY e.user_id, u.email", "targetTable": "user_click_counts", "lookupTables": { "users": "users" }, "mode": "replace" }'

lookupTables is alias → table-name. Use the alias in the SQL (JOIN users u), and the dispatcher resolves it to the actual table name on the bucket.

7. Verify the new table

# Schema + Delta location of the new table dodil k3 table describe click_summary --bucket kb-prod -o json \ | jq '{tableName, location, version, columns: [.columns[] | {name, type}]}' # Spot-check the rows dodil k3 table query \ "SELECT * FROM click_summary ORDER BY n DESC LIMIT 5" \ --bucket kb-prod # History — should have one CREATE TABLE commit curl -sS "https://k3.dev.dodil.io/kb-prod/tables/click_summary/history?limit=5" \ -H "Authorization: Bearer $DODIL_TOKEN" \ | jq '.entries[] | {version, operation}'

8. Sessionization — a worked window-function example

Use materialize to compute session boundaries — a common preparation for analytics dashboards. We use LAG() (DuckDB window function) to find the previous event time per user, then label sessions as gaps > 30 min:

curl -sS -X POST "https://k3.dev.dodil.io/kb-prod/tables/events/materialize" \ -H "Authorization: Bearer $DODIL_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "bucket": "kb-prod", "sourceTable": "events", "sql": "WITH gaps AS (SELECT user_id, occurred_at, event_type, payload, occurred_at - LAG(occurred_at) OVER (PARTITION BY user_id ORDER BY occurred_at) AS gap_us FROM {table}) SELECT user_id, occurred_at, event_type, payload, gap_us, SUM(CASE WHEN gap_us IS NULL OR gap_us > 30 * 60 * 1000000 THEN 1 ELSE 0 END) OVER (PARTITION BY user_id ORDER BY occurred_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_number FROM gaps", "targetTable": "events_sessionized", "partitionColumns": ["user_id"], "mode": "replace" }'

The resulting events_sessionized table has the original event rows plus a session_number column per user — partitioned by user_id so per-user session queries prune efficiently.

Common gotchas

SymptomCauseFix
CTAS succeeds but click_summary.n is wider than expectedType inference promoted COUNT(*) to long (int64) when you expected intUse explicit CREATE TABLE click_summary (n INTEGER NOT NULL, ...) AS SELECT ... with the type you want
Materialize mode: create fails with FAILED_PRECONDITIONTarget table already existsSwitch to mode: replace if you want to overwrite, or mode: append to add to existing
mode: append fails with schema mismatchThe SELECT produced columns that don’t match the existing target schemaEnsure SELECT columns + types align exactly with the target’s existing schema; use DescribeTable on the target to confirm
CTAS with JOIN fails — “table not found”The second table needs lookupTables (Materialize) or must be referenced by literal name in Execute SQLSwitch to Materialize with lookupTables, or write the literal table name in Execute’s SQL
Materialize succeeds but rowsMaterialized is 0SELECT returned no rows (filters too strict, empty source, etc.)Run the SELECT alone via dodil k3 table query to confirm what it returns
version is 0 on the responseWrite routed through the write log; Delta hasn’t committed yetRun compact to drain — version advances on the next commit
Repeated mode: replace builds up Delta versionsEach replace is a DROP + CREATE = 2+ Delta commits. Over time, history growsPeriodic Vacuum (after retention window) reclaims old versions. See Time Travel & Restore for the safety pattern.

Comparison — CTAS vs Materialize at a glance

CREATE TABLE AS SELECT (CTAS)Materialize RPC
Wire shapeSingle SQL string sent to ExecuteStructured fields (source / SQL / target / mode / partitions / lookups)
Mode selectorImplicit (always “create”); use DROP TABLE IF EXISTS + CTAS for replaceExplicit mode field (create / replace / append)
PartitioningInside the SQL (PARTITIONED BY ...)Top-level partition_columns field
Broadcast JOINsLiteral table nameslookup_tables alias map
Type controlInference (or full-explicit CREATE TABLE (cols) AS SELECT)Same inference
Best forAdhoc CLI / SQL scriptsProgrammatic / UI-driven materialization

See also