Interactive walkthrough

Learn Duckle. The whole machine.

A hands-on tour of the entire product and codebase - one idea per screen. Step through with the arrows. No scrolling.

What you'll learn

Ten chapters, start to finish.

The mental modelAll 359 componentsEvery transform, explained The engine internalsThe 7-crate codebaseReal pipelines Use casesShippingThe trust suiteQuizzes along the way

Press the right arrow or Next to begin.

01 Local-first

Your data never leaves the machine in front of you.

Duckle runs DuckDB in-process. There is no server to upload rows to, and no warehouse bill for a preview. The compute is the chip already on your desk.

Read from files, databases and APIs straight into memory on your own CPU.
Transform with DuckDB's vectorized engine - a billion rows on a laptop, no cluster.
Write results wherever you choose. The bytes only travel if you send them.

01 The canvas is the program

Boxes and wires. That's a real program.

You build a pipeline by dragging nodes onto a canvas and wiring them together.

A node is one step - read a table, filter it, write it.
A wire is data flowing from one step to the next.
The graph is the execution order. Duckle reads it top to bottom.

01 It compiles to SQL

source

Orders

src.csv

transform

Keep paid

xf.filter

sink

Paid orders

snk.parquet

No magic. Just SQL you can see.

compiled.sql

CREATE VIEW orders AS  SELECT * FROM read_csv_auto('orders.csv'); CREATE VIEW keep_paid AS  SELECT * FROM orders WHERE status='paid'; COPY keep_paid TO 'paid.parquet' (FORMAT parquet);

Quick check

With Duckle, where does your data get processed?

Right - Duckle embeds DuckDB in-process. Data is read, transformed and written on your CPU; nothing is uploaded unless a sink sends it.

02 The parts

359

components ship today · 341 available now, plus preview and planned

Sources 109 Transforms 133 Sinks 66 Quality 25 Control 19 Code 7

Five families do the work; Code is the escape hatch. Learn the five verbs and you can read any pipeline.

Family 1 of 5 · read data in

Sources

A source pulls rows into the pipeline. Files, warehouses, app databases, object stores, streams, REST APIs and SaaS tools - 109 of them.

The next pages list every source. Green dot = available now, grey = planned, purple = preview. Step through them.

Family 2 of 5 · reshape data

Transforms

Transforms change data. This is where most of the work lives - 133 of them. First, let's actually watch the important ones work.

Every transform is one promise: a table in, a table out. Watch the rows.

Filter keeps only the rows that match a rule.

in · orders

id	name	amt	status
1	Ada	120	paid
2	Lin	45	pending
3	Omar	300	paid
4	Zoe	90	refunded

xf.filter

out

id	name	amt	status
1	Ada	120	paid
3	Omar	300	paid

SELECT * FROM orders WHERE status = 'paid' AND amt >= 100;

Group by & aggregate collapses rows that share a key into one summary row.

in · sales

region	product	units
West	Widget	10
West	Gadget	5
East	Widget	8
West	Widget	3
East	Gadget	12

xf.groupby

out · one row per region

region	total_units	orders
West	18	3
East	20	2

SELECT region, SUM(units) AS total_units, COUNT(*) AS orders FROM sales GROUP BY region;

Join (and the visual Mapper) matches two tables on a key and stitches their columns.

in · orders (left)

order	cust	amt
101	1	120
102	2	45
103	4	300

in · customers (right)

cust	name
1	Ada
2	Lin

xf.join

out · inner join on cust

order	cust	amt	name
101	1	120	Ada
102	2	45	Lin

Order 103 has no customer 4, so it drops.

SELECT o.order, o.cust, o.amt, c.name FROM orders o JOIN customers c ON o.cust = c.cust;

Window functions compute across rows without collapsing them: same row count, new columns.

in · daily

day	rev
1	100
2	140
3	90
4	200

xf.window

out · same 4 rows, 3 new columns

day	rev	running	prev	rank
1	100	100	-	3
2	140	240	100	2
3	90	330	140	4
4	200	530	90	1

SELECT day, rev, SUM(rev) OVER (ORDER BY day) AS running, LAG(rev) OVER (ORDER BY day) AS prev, RANK() OVER (ORDER BY rev DESC) AS rank FROM daily;

Pivot turns row values into columns - long data becomes wide.

in · long

region	quarter	sales
West	Q1	100
West	Q2	150
East	Q1	80
East	Q2	120

xf.pivot

out · wide

region	Q1	Q2
West	100	150
East	80	120

Reverse with xf.unpivot.

PIVOT sales ON quarter USING SUM(sales) GROUP BY region;

04 The same pattern, everywhere

Five more, at a glance.

Cast

text to a real type

"19.99" → 19.99

xf.cast

Sort & Top-N

order rows; keep first N

ORDER BY rev DESC LIMIT 3

xf.topn

Distinct / Dedupe

drop dupes; one per key

1 per email, keep latest

qa.dedupe

Incremental

only rows past a watermark

WHERE updated_at > mark

xf.incremental

CDC / SCD2

close old row, add new; history kept

valid_to set + insert

xf.cdc.scd2

Row hash / audit

fingerprint + run metadata

md5(row) + loaded_at

xf.row_hash

04 AI transforms, local

Prepare text and vectors, in the flow.

Chunk

split long text into overlapping pieces

1 row → many chunks

xf.ai.chunk

PII redact

detect and mask personal data

555-0142 → [PHONE]

xf.ai.pii

Embed

turn text into a vector

text → [0.02, -0.88, ...]

xf.ai.embed

Classify / LLM

label or transform per row with a model

note → category

xf.ai.classify

Quick check

Group by turns many input rows into...

Correct - group by collapses rows sharing a key into one aggregated row. (Adding columns without collapsing is what a window function does.)

Every transform

All 133 transforms

You've seen the pattern. Here is the full set - reshape, combine, enrich, quality-fix and AI, page by page.

Family 3 of 5 · write data out

Sinks

A sink lands the result - Parquet, Iceberg, a database table, an object store, MotherDuck, even an email. 66 of them.

Every sink supports append; most support upsert and delete propagation.

Family 4 of 5 · guard data

Quality

Assert what must be true before bad data spreads: expectations, uniqueness, referential integrity, profiling, reconciliation. 25 of them.

A failing check can stop the run, or route the bad rows aside to a dead-letter path.

Family 5 of 5 · orchestrate steps

Control

Wire logic between steps: loop over items, call a child pipeline, run branches in parallel, log, warn, or stop the run on a condition. 19 of them.

Control is what turns a pipeline into a program.

The escape hatch · drop to code

Code

When a box will not do, drop to code inline: raw SQL, Python, JavaScript, a shell command, WASM, DuckDB. 7 of them - so you never hit a wall.

05 Under the hood

The canvas is not a metaphor. It is a plan.

Press Run and Duckle topologically sorts your graph into an ordered plan of DuckDB SQL stages. Five ideas keep it fast and honest.

Plan of stages. The graph becomes a numbered 1 to N order. Each node is one SQL step.
Lazy views. Most stages compile to a CREATE VIEW, not a table. Nothing computes until a sink pulls.
ATTACH, not copy. A database source runs ATTACH ... AS duckle_src and queries in place, then detaches.
Batched vs per-stage. One fast pass by default; flip materialization on a node for a safe on-disk checkpoint.
Reject ports. Quality nodes have a second output - failing rows fall to a dead-letter path.

Quick check

Most pipeline stages compile to a...

Right - stages are lazy views by default, so work only runs when a sink pulls from it. You opt into materializing a table only where you want a checkpoint.

06 The codebase

Seven pieces. One engine they share.

Duckle is a Rust workspace. Every surface calls into the same engine, so a pipeline runs identically everywhere. Click a piece to explore.

07 Complex pipelines

The ones that used to need a team.

Join four systems

Blend a warehouse, an app database, a CSV and an API - no staging, no copies.

01

4 sources

postgres · snowflake · csv · rest

Map + join

xf.map · xf.join

One table

snk.parquet

A billion rows, locally

Roll up a billion-row fact table on your laptop; ship only the small result to the cloud.

01

Big parquet

src.parquet 1e9

Aggregate

xf.groupby

Tiny result

snk.motherduck

Lakehouse time-travel

Read a DuckLake table as of two moments and diff them to see exactly what changed.

01

As-of reads

src.ducklake.diff

Summarize

xf.diffsummary

Report

snk.csv

07 What people build

Eight jobs, one tool.

01

Warehouse cost cut

Pre-aggregate locally; send only the rollup to Snowflake or BigQuery.

02

Database migration

Move Oracle or SQL Server to Postgres with upsert and CDC, verified to the row.

03

Lakehouse ingestion

Land files into Iceberg or DuckLake with schema control and time travel.

04

Reverse ETL

Push modeled data back out to a SaaS app - Slack, HubSpot, a webhook.

05

Data quality gate

Block a load when expectations, uniqueness or referential integrity fail.

06

AI / RAG prep

Chunk, redact PII and embed documents into a vector store for retrieval.

07

Observability rollups

Turn raw logs and metrics into SLOs, p95s and forecasts for dashboards.

08

Scheduled reports

Run nightly on a cron and drop a fresh Parquet or emailed CSV.

08 Ship it

Build it once. Run it five ways.

The same pipeline file runs from the studio, a terminal, a browser, a single binary, or an AI agent. Same engine, same result.

Runner. duckle run pipeline.json - a lean headless CLI for cron, systemd or CI.
Serve. duckle serve hosts a web console with runs history and a cron scheduler.
Build a binary. Export a self-contained executable with secrets bundled.
MCP. An LLM can list components, generate a validated pipeline, run it, read the logs.
Duckie. A local copilot (Qwen via llama.cpp) that builds pipelines for you, on your machine.

09 Know what changed

Data you can trust, and prove.

Drag the timeline. The result rewrites itself to the table's state as of that moment - real time travel, no backup to restore.

revenue by city · as of 2024-04-05

city	revenue

JanFebMarApr

09 The trust suite

Three more guarantees.

Column lineage. Trace any output column backward through every node to the exact source table and column.
.ducklock. A signed lockfile pins the schema and shape a run expects, so drift is caught, not discovered.
Contracts & review. Gate a pipeline on data contracts and run duckle review before you promote a change.

Last check

What lets Duckle show a table exactly as it was last month?

Correct - Duckle reads the table as of a past snapshot directly, and can diff two moments. No restore, no re-fetch.

10 You've seen the whole machine

Now godraw the boxes.

Sources read, transforms reshape, sinks write, quality guards, control orchestrates - and it all compiles to SQL you can see.

Download Duckle Read the docs Star on GitHub