Why use DuckDB in your data pipelines ft. Niels Claeys

Talk from Niels Claeys (@Dataminded) at one of the MotherDuck/DuckDB meetups held in September in Leuven, Belgium. Thanks @dataroots for hosting this!

☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck: https://hubs.la/Q02QnFR40

Niels's Links:
LinkedIn: / nielsclaeys
Medium: / niels.claeys

➡️ Follow Us
LinkedIn: / 8192.
Twitter: / motherduck
Blog: https://motherduck.com/blog/

#duckdb #motherduckdb #motherduckduckdb #dataengineering

This presentation explores using DuckDB in your data pipelines as a modern, efficient alternative to traditional data engineering tools. We begin by outlining a typical batch data platform architecture, where data is ingested into a data lake (like S3 or Azure Blob Storage) and processed using Spark for all ETL tasks. This sets the stage for a common challenge: a distributed engine like Spark is inefficient for the 80-90% of use cases that involve small or medium-sized data, defined here as up to 100 GB.

We propose a more efficient tech stack that combines dbt and DuckDB to replace Spark for the majority of data processing workloads. This approach lets data analysts, who are proficient in SQL, build their own data pipelines, which frees up specialized data engineering teams. dbt brings software engineering best practices like modularization and documentation to SQL, while DuckDB acts as a high-performance, in-process SQL engine that excels at querying data directly from Parquet files in your data lake, making it a good fit for analytics engineering.

Learn about the practical implementation using the `dbt-duckdb` adapter. A key feature is its ability to read from external storage, which lets dbt with DuckDB serve as a drop-in replacement for existing Spark jobs without altering their input or output interfaces.
Because dbt with DuckDB keeps the same input and output interfaces as the Spark jobs it replaces, migration can happen gradually, one use case at a time, with a similar development experience locally and on remote environments like Kubernetes.

Finally, we dive into a detailed performance benchmark comparing DuckDB, Spark, and Trino on the 100 GB TPC-DS dataset. The results show that for medium-sized data, DuckDB is significantly faster and more cost-efficient: thanks to its lower overhead, it finishes over half of its queries before Spark completes its first one. This analysis supports using dbt with DuckDB as a performant solution for many data pipelines, reserving Spark for truly large-scale or complex processing needs.

Watch with full transcript & resources: https://motherduck.com/videos/why-use...