Local-first data engineering, explained
Concepts and practical guides for building pipelines with Duckle and DuckDB. Short, opinionated, and grounded in how the product actually works.
ETL vs ELT
ETL (extract, transform, load) transforms data before it lands in the destination, usually on a separate processing tier. ELT (extract, load, transform) loads raw data first and transforms it inside the destination warehouse with SQL. ELT won as warehouses got cheap and fast, but it has a catch: every transformation, including the cheap ones, bills warehouse compute.
Duckle takes a third position. It runs a real columnar engine (DuckDB) on your machine, so you get ELT-style SQL transforms without renting a warehouse to run them. Extract from anywhere, transform locally at native speed, and load only the finished result into the warehouse, the lake, or a file. You read the generated SQL on every node, so it is never a black box.
Change data capture on DuckDB
Reprocessing an entire table every run is slow and expensive. Change data capture (CDC) and incremental patterns move only what changed. Duckle ships several:
- Watermark incremental load (
xf.incremental): pick a monotonic column (an id or a timestamp). Duckle stores the high-water mark in the workspace and, on the next run, reads only rows past it. The watermark advances only on a fully successful run, so a failure or a partial preview never skips data. - DuckLake change feed (
src.ducklake.changes): read the committed changes since the last consumed snapshot of a DuckLake table. - SCD Type 2: keep history with
valid_from/valid_to/is_currentcolumns. - Universal upsert / MERGE with optional delete propagation: mirror a source into a destination, applying inserts, updates and deletes driven by a change-type column - across embedded, network, warehouse and Mongo sinks.
See it end to end in the DuckLake CDC mirror use case (100k changes mirrored with upsert + delete in about 1.7 seconds).
Why local-first data tooling
Most data tools assume a cluster and a control plane. For a huge amount of real work - dev and debug loops, small-team analytics, one-off cleanups, data prep for AI - that is overkill. Local-first flips three things:
- Cost. Your laptop is already paid for. No always-on warehouse, no per-row ingestion pricing.
- Privacy. The data never leaves your machine. No telemetry, no third-party processor, works offline.
- Speed. No network round-trips between extract, transform and load. A vectorized engine on local NVMe is astonishingly fast: Duckle incremental-loads 5 million rows in about a second.
When you outgrow one machine, point Duckle's output at the system that scales (a warehouse, an object store, a lakehouse). It will not pretend to be a cluster, and that honesty is the point.
Data prep for AI and RAG
Models inherit the quality of their inputs. RAG indexes quietly fill with duplicates, nulls, mixed encodings and PII. Duckle is built to scrub data before it reaches a vector store, on device:
- Chunk long text (
xf.ai.chunk), redact PII (xf.ai.pii- emails, phones, SSNs, cards), embed (xf.ai.embed, any OpenAI-compatible endpoint), then semantic-dedupe (xf.ai.dedupe). - Hybrid retrieval, locally: Vector Similarity Search (cosine / L2 / inner product via
vss) and Full-Text Search (BM25 viafts) - no model API required. - Land it in pgvector, Pinecone, Qdrant, Weaviate or Milvus.
The three pure-local transforms (chunk, PII redact, dedupe) need no API at all. Full guide: Duckie AI & AI transforms.
Cut warehouse spend with local compute
Warehouses bill for compute and storage. If you scan a billion rows to produce a small daily rollup, you pay for the billion-row scan every time. Duckle lets you move that scan off the warehouse:
- Read the raw data where it lives (object storage, a database export, a Parquet dump).
- Join, filter and aggregate locally on DuckDB.
- Upsert only the aggregate into the warehouse.
In the Snowflake cost-save use case, a 1-billion-row Parquet file plus SQLite and ADBC dimensions are crunched locally and only the revenue summary is written to Snowflake. The warehouse stores kilobytes, not terabytes.
Migrating from Talend, Fivetran or Airbyte
If you know those tools, Duckle maps cleanly:
| You used | In Duckle |
|---|---|
| Talend jobs / tMap | The visual canvas and the Map node (main + up to 3 lookups, typed expressions, inline filter) |
| Fivetran / Airbyte connectors | 329 built-in components; connections are reusable, encrypted credential sets |
| dbt models | Run dbt directly with the xf.dbt node, or use native transforms |
| A managed control plane | Plain files in a folder you pick - diff, branch and review in Git; schedule locally or run headless |
| Per-row / per-connector pricing | Free and open source; you run it on hardware you already own |
Start with the quickstart, then browse the use cases for end-to-end examples.