Why use DuckDB in your data pipelines ft. Niels Claeys
Talk by Niels Claeys (@Dataminded) at one of the MotherDuck/DuckDB meetups held in September in Leuven, Belgium. Thanks to @dataroots for hosting!

☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck: https://hubs.la/Q02QnFR40

Niels's links: LinkedIn: / nielsclaeys, Medium: / niels.claeys

➡️ Follow us: LinkedIn: / 8192. . Twitter: / motherduck Blog: https://motherduck.com/blog/

#duckdb #motherduckdb #motherduckduckdb #dataengineering

This presentation explores using DuckDB in your data pipelines as a modern, efficient alternative to traditional data engineering tools. We begin by outlining a typical batch data platform architecture, where data is ingested into a data lake (such as S3 or Azure Blob Storage) and Spark handles all ETL tasks. This sets the stage for a common challenge: a distributed engine like Spark is inefficient for the 80-90% of use cases that involve small or medium-sized data, defined here as up to 100GB.

We propose a more efficient tech stack that combines dbt and DuckDB to replace Spark for the majority of data processing workloads. This approach lets data analysts, who are proficient in SQL, build their own data pipelines, freeing up specialized data engineering teams. dbt brings software engineering best practices such as modularization and documentation to SQL, while DuckDB acts as a high-performance, in-process SQL engine that excels at querying data directly from Parquet files in your data lake, which makes it a good fit for analytics engineering.

Learn about the practical implementation using the `dbt-duckdb` adapter. A key feature is its ability to read from external storage, which makes dbt with DuckDB a drop-in replacement for existing Spark jobs without altering their input or output interfaces. This interoperability allows a gradual, use-case-by-use-case migration and offers a similar development experience locally and on remote environments like Kubernetes.
Finally, we dive into a detailed performance benchmark comparing DuckDB vs Spark vs Trino on the 100GB TPC-DS dataset. The results show that for medium-sized data, DuckDB is significantly faster and more cost-efficient: thanks to its lower overhead, it finishes over half its queries before Spark completes its first one. This analysis validates dbt with DuckDB as a performant solution for many data pipelines, reserving Spark for truly large-scale or complex processing needs.

Watch with full transcript & resources: https://motherduck.com/videos/why-use...
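
The use-case-by-use-case migration described above amounts to a simple routing rule, sketched here in plain Python. The 100GB limit comes from the description's definition of medium-sized data; the `UseCase` class, the `needs_complex_processing` flag, the example pipeline names, and the decimal definition of a gigabyte are all assumptions made for illustration.

```python
from dataclasses import dataclass

GB = 10**9  # assumption: decimal gigabytes
MEDIUM_DATA_LIMIT_BYTES = 100 * GB  # "up to 100GB" from the talk description


@dataclass(frozen=True)
class UseCase:
    """Hypothetical description of one pipeline considered for migration."""
    name: str
    input_bytes: int
    needs_complex_processing: bool = False  # hypothetical flag


def choose_engine(use_case: UseCase) -> str:
    """Keep Spark for large or complex jobs, use dbt with DuckDB otherwise."""
    too_large = use_case.input_bytes > MEDIUM_DATA_LIMIT_BYTES
    if too_large or use_case.needs_complex_processing:
        return "spark"
    return "dbt-duckdb"


def plan_migration(use_cases: list) -> dict:
    """Map each use case name to the engine it would run on."""
    return {uc.name: choose_engine(uc) for uc in use_cases}


if __name__ == "__main__":
    # Made-up pipelines, not figures from the benchmark.
    candidates = [
        UseCase("daily_sales_report", 5 * GB),
        UseCase("clickstream_sessionization", 2_000 * GB),
        UseCase("ml_feature_backfill", 40 * GB, needs_complex_processing=True),
    ]
    for name, engine in plan_migration(candidates).items():
        print(f"{name}: {engine}")
```

Because each pipeline is routed independently, jobs can move off Spark one at a time while the rest keep running unchanged.
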


