Learn hub

Local-first data engineering, explained

Concepts and practical guides for building pipelines with Duckle and DuckDB. Short, opinionated, and grounded in how the product actually works.

ETL vs ELT

ETL (extract, transform, load) transforms data before it lands in the destination, usually on a separate processing tier. ELT (extract, load, transform) loads raw data first and transforms it inside the destination warehouse with SQL. ELT won as warehouses got cheap and fast, but it has a catch: every transformation, including the cheap ones, bills warehouse compute.

Duckle takes a third position. It runs a real columnar engine (DuckDB) on your machine, so you get ELT-style SQL transforms without renting a warehouse to run them. Extract from anywhere, transform locally at native speed, and load only the finished result into the warehouse, the lake, or a file. You read the generated SQL on every node, so it is never a black box.

Rule of thumb. Do the heavy joins, filtering and aggregation locally in Duckle; push only the rows you actually need into the expensive system downstream.

Change data capture on DuckDB

Reprocessing an entire table every run is slow and expensive. Change data capture (CDC) and incremental patterns move only what changed. Duckle ships several:

Watermark incremental load (xf.incremental): pick a monotonic column (an id or a timestamp). Duckle stores the high-water mark in the workspace and, on the next run, reads only rows past it. The watermark advances only on a fully successful run, so a failure or a partial preview never skips data.
DuckLake change feed (src.ducklake.changes): read the committed changes since the last consumed snapshot of a DuckLake table.
SCD Type 2: keep history with valid_from / valid_to / is_current columns.
Universal upsert / MERGE with optional delete propagation: mirror a source into a destination, applying inserts, updates and deletes driven by a change-type column - across embedded, network, warehouse and Mongo sinks.

See it end to end in the DuckLake CDC mirror use case (100k changes mirrored with upsert + delete in about 1.7 seconds).

Why local-first data tooling

Most data tools assume a cluster and a control plane. For a huge amount of real work - dev and debug loops, small-team analytics, one-off cleanups, data prep for AI - that is overkill. Local-first flips three things:

Cost. Your laptop is already paid for. No always-on warehouse, no per-row ingestion pricing.
Privacy. The data never leaves your machine. No telemetry, no third-party processor, works offline.
Speed. No network round-trips between extract, transform and load. A vectorized engine on local NVMe is astonishingly fast: Duckle incremental-loads 5 million rows in about a second.

When you outgrow one machine, point Duckle's output at the system that scales (a warehouse, an object store, a lakehouse). It will not pretend to be a cluster, and that honesty is the point.

Data prep for AI and RAG

Models inherit the quality of their inputs. RAG indexes quietly fill with duplicates, nulls, mixed encodings and PII. Duckle is built to scrub data before it reaches a vector store, on device:

Chunk long text (xf.ai.chunk), redact PII (xf.ai.pii - emails, phones, SSNs, cards), embed (xf.ai.embed, any OpenAI-compatible endpoint), then semantic-dedupe (xf.ai.dedupe).
Hybrid retrieval, locally: Vector Similarity Search (cosine / L2 / inner product via vss) and Full-Text Search (BM25 via fts) - no model API required.
Land it in pgvector, Pinecone, Qdrant, Weaviate or Milvus.

The three pure-local transforms (chunk, PII redact, dedupe) need no API at all. Full guide: Duckie AI & AI transforms.

Cut warehouse spend with local compute

Warehouses bill for compute and storage. If you scan a billion rows to produce a small daily rollup, you pay for the billion-row scan every time. Duckle lets you move that scan off the warehouse:

Read the raw data where it lives (object storage, a database export, a Parquet dump).
Join, filter and aggregate locally on DuckDB.
Upsert only the aggregate into the warehouse.

In the Snowflake cost-save use case, a 1-billion-row Parquet file plus SQLite and ADBC dimensions are crunched locally and only the revenue summary is written to Snowflake. The warehouse stores kilobytes, not terabytes.

Migrating from Talend, Fivetran or Airbyte

If you know those tools, Duckle maps cleanly:

You used	In Duckle
Talend jobs / tMap	The visual canvas and the Map node (main + up to 3 lookups, typed expressions, inline filter)
Fivetran / Airbyte connectors	360+ built-in components; connections are reusable, encrypted credential sets
dbt models	Run dbt directly with the `xf.dbt` node, or use native transforms
A managed control plane	Plain files in a folder you pick - diff, branch and review in Git; schedule locally or run headless
Per-row / per-connector pricing	Free and open source; you run it on hardware you already own

Start with the quickstart, then browse the use cases for end-to-end examples.

More guides are on the way. Want a specific topic covered? Open an issue on GitHub.