Learn Duckle

An interactive walkthrough of the whole machine

skip intro
Duckle
Get Duckle

Interactive walkthrough

Learn Duckle. The whole machine.

A hands-on tour of the entire product and codebase - one idea per screen. Step through with the arrows. No scrolling.

What you'll learn

Ten chapters, start to finish.

The mental modelAll 359 componentsEvery transform, explained The engine internalsThe 7-crate codebaseReal pipelines Use casesShippingThe trust suiteQuizzes along the way

Press the right arrow or Next to begin.

01   Local-first

Your data never leaves the machine in front of you.

Duckle runs DuckDB in-process. There is no server to upload rows to, and no warehouse bill for a preview. The compute is the chip already on your desk.

  • Read from files, databases and APIs straight into memory on your own CPU.
  • Transform with DuckDB's vectorized engine - a billion rows on a laptop, no cluster.
  • Write results wherever you choose. The bytes only travel if you send them.

01   The canvas is the program

Boxes and wires. That's a real program.

You build a pipeline by dragging nodes onto a canvas and wiring them together.

  • A node is one step - read a table, filter it, write it.
  • A wire is data flowing from one step to the next.
  • The graph is the execution order. Duckle reads it top to bottom.

01   It compiles to SQL

source
Orders
src.csv
transform
Keep paid
xf.filter
sink
Paid orders
snk.parquet

No magic. Just SQL you can see.

compiled.sql
CREATE VIEW orders AS  SELECT * FROM read_csv_auto('orders.csv'); CREATE VIEW keep_paid AS  SELECT * FROM orders WHERE status='paid'; COPY keep_paid TO 'paid.parquet' (FORMAT parquet);
Quick check

With Duckle, where does your data get processed?

Right - Duckle embeds DuckDB in-process. Data is read, transformed and written on your CPU; nothing is uploaded unless a sink sends it.

02   The parts

359

components ship today · 341 available now, plus preview and planned

Sources 109 Transforms 133 Sinks 66 Quality 25 Control 19 Code 7

Five families do the work; Code is the escape hatch. Learn the five verbs and you can read any pipeline.

Family 1 of 5 · read data in

Sources

A source pulls rows into the pipeline. Files, warehouses, app databases, object stores, streams, REST APIs and SaaS tools - 109 of them.

The next pages list every source. Green dot = available now, grey = planned, purple = preview. Step through them.

Family 2 of 5 · reshape data

Transforms

Transforms change data. This is where most of the work lives - 133 of them. First, let's actually watch the important ones work.

Every transform is one promise: a table in, a table out. Watch the rows.

Filter keeps only the rows that match a rule.
in · orders
idnameamtstatus
1Ada120paid
2Lin45pending
3Omar300paid
4Zoe90refunded
xf.filter
out
idnameamtstatus
1Ada120paid
3Omar300paid
SELECT * FROM orders WHERE status = 'paid' AND amt >= 100;
Group by & aggregate collapses rows that share a key into one summary row.
in · sales
regionproductunits
WestWidget10
WestGadget5
EastWidget8
WestWidget3
EastGadget12
xf.groupby
out · one row per region
regiontotal_unitsorders
West183
East202
SELECT region, SUM(units) AS total_units, COUNT(*) AS orders FROM sales GROUP BY region;
Join (and the visual Mapper) matches two tables on a key and stitches their columns.
in · orders (left)
ordercustamt
1011120
102245
1034300
in · customers (right)
custname
1Ada
2Lin
xf.join
out · inner join on cust
ordercustamtname
1011120Ada
102245Lin

Order 103 has no customer 4, so it drops.

SELECT o.order, o.cust, o.amt, c.name FROM orders o JOIN customers c ON o.cust = c.cust;
Window functions compute across rows without collapsing them: same row count, new columns.
in · daily
dayrev
1100
2140
390
4200
xf.window
out · same 4 rows, 3 new columns
dayrevrunningprevrank
1100100-3
21402401002
3903301404
4200530901
SELECT day, rev, SUM(rev) OVER (ORDER BY day) AS running, LAG(rev) OVER (ORDER BY day) AS prev, RANK() OVER (ORDER BY rev DESC) AS rank FROM daily;
Pivot turns row values into columns - long data becomes wide.
in · long
regionquartersales
WestQ1100
WestQ2150
EastQ180
EastQ2120
xf.pivot
out · wide
regionQ1Q2
West100150
East80120

Reverse with xf.unpivot.

PIVOT sales ON quarter USING SUM(sales) GROUP BY region;

04   The same pattern, everywhere

Five more, at a glance.

Cast
text to a real type
"19.99" → 19.99
xf.cast
Sort & Top-N
order rows; keep first N
ORDER BY rev DESC LIMIT 3
xf.topn
Distinct / Dedupe
drop dupes; one per key
1 per email, keep latest
qa.dedupe
Incremental
only rows past a watermark
WHERE updated_at > mark
xf.incremental
CDC / SCD2
close old row, add new; history kept
valid_to set + insert
xf.cdc.scd2
Row hash / audit
fingerprint + run metadata
md5(row) + loaded_at
xf.row_hash

04   AI transforms, local

Prepare text and vectors, in the flow.

Chunk
split long text into overlapping pieces
1 row → many chunks
xf.ai.chunk
PII redact
detect and mask personal data
555-0142 → [PHONE]
xf.ai.pii
Embed
turn text into a vector
text → [0.02, -0.88, ...]
xf.ai.embed
Classify / LLM
label or transform per row with a model
note → category
xf.ai.classify
Quick check

Group by turns many input rows into...

Correct - group by collapses rows sharing a key into one aggregated row. (Adding columns without collapsing is what a window function does.)
Every transform

All 133 transforms

You've seen the pattern. Here is the full set - reshape, combine, enrich, quality-fix and AI, page by page.

Family 3 of 5 · write data out

Sinks

A sink lands the result - Parquet, Iceberg, a database table, an object store, MotherDuck, even an email. 66 of them.

Every sink supports append; most support upsert and delete propagation.

Family 4 of 5 · guard data

Quality

Assert what must be true before bad data spreads: expectations, uniqueness, referential integrity, profiling, reconciliation. 25 of them.

A failing check can stop the run, or route the bad rows aside to a dead-letter path.

Family 5 of 5 · orchestrate steps

Control

Wire logic between steps: loop over items, call a child pipeline, run branches in parallel, log, warn, or stop the run on a condition. 19 of them.

Control is what turns a pipeline into a program.

The escape hatch · drop to code

Code

When a box will not do, drop to code inline: raw SQL, Python, JavaScript, a shell command, WASM, DuckDB. 7 of them - so you never hit a wall.

05   Under the hood

The canvas is not a metaphor. It is a plan.

Press Run and Duckle topologically sorts your graph into an ordered plan of DuckDB SQL stages. Five ideas keep it fast and honest.

  • Plan of stages. The graph becomes a numbered 1 to N order. Each node is one SQL step.
  • Lazy views. Most stages compile to a CREATE VIEW, not a table. Nothing computes until a sink pulls.
  • ATTACH, not copy. A database source runs ATTACH ... AS duckle_src and queries in place, then detaches.
  • Batched vs per-stage. One fast pass by default; flip materialization on a node for a safe on-disk checkpoint.
  • Reject ports. Quality nodes have a second output - failing rows fall to a dead-letter path.
Quick check

Most pipeline stages compile to a...

Right - stages are lazy views by default, so work only runs when a sink pulls from it. You opt into materializing a table only where you want a checkpoint.

06   The codebase

Seven pieces. One engine they share.

Duckle is a Rust workspace. Every surface calls into the same engine, so a pipeline runs identically everywhere. Click a piece to explore.

07   Complex pipelines

The ones that used to need a team.

Join four systems

Blend a warehouse, an app database, a CSV and an API - no staging, no copies.

01
4 sources
postgres · snowflake · csv · rest
Map + join
xf.map · xf.join
One table
snk.parquet

A billion rows, locally

Roll up a billion-row fact table on your laptop; ship only the small result to the cloud.

01
Big parquet
src.parquet 1e9
Aggregate
xf.groupby
Tiny result
snk.motherduck

Lakehouse time-travel

Read a DuckLake table as of two moments and diff them to see exactly what changed.

01
As-of reads
src.ducklake.diff
Summarize
xf.diffsummary
Report
snk.csv

07   What people build

Eight jobs, one tool.

01

Warehouse cost cut

Pre-aggregate locally; send only the rollup to Snowflake or BigQuery.

02

Database migration

Move Oracle or SQL Server to Postgres with upsert and CDC, verified to the row.

03

Lakehouse ingestion

Land files into Iceberg or DuckLake with schema control and time travel.

04

Reverse ETL

Push modeled data back out to a SaaS app - Slack, HubSpot, a webhook.

05

Data quality gate

Block a load when expectations, uniqueness or referential integrity fail.

06

AI / RAG prep

Chunk, redact PII and embed documents into a vector store for retrieval.

07

Observability rollups

Turn raw logs and metrics into SLOs, p95s and forecasts for dashboards.

08

Scheduled reports

Run nightly on a cron and drop a fresh Parquet or emailed CSV.

08   Ship it

Build it once. Run it five ways.

The same pipeline file runs from the studio, a terminal, a browser, a single binary, or an AI agent. Same engine, same result.

  • Runner. duckle run pipeline.json - a lean headless CLI for cron, systemd or CI.
  • Serve. duckle serve hosts a web console with runs history and a cron scheduler.
  • Build a binary. Export a self-contained executable with secrets bundled.
  • MCP. An LLM can list components, generate a validated pipeline, run it, read the logs.
  • Duckie. A local copilot (Qwen via llama.cpp) that builds pipelines for you, on your machine.

09   Know what changed

Data you can trust, and prove.

Drag the timeline. The result rewrites itself to the table's state as of that moment - real time travel, no backup to restore.

revenue by city · as of 2024-04-05
cityrevenue
JanFebMarApr

09   The trust suite

Three more guarantees.

  • Column lineage. Trace any output column backward through every node to the exact source table and column.
  • .ducklock. A signed lockfile pins the schema and shape a run expects, so drift is caught, not discovered.
  • Contracts & review. Gate a pipeline on data contracts and run duckle review before you promote a change.
Last check

What lets Duckle show a table exactly as it was last month?

Correct - Duckle reads the table as of a past snapshot directly, and can diff two moments. No restore, no re-fetch.

10   You've seen the whole machine

Now godraw the boxes.

Sources read, transforms reshape, sinks write, quality guards, control orchestrates - and it all compiles to SQL you can see.

01 / 1
Use to step