Scalable Data Ingestion Architecture Using Airflow and Spark | Komodo Health
Get the slides: https://www.datacouncil.ai/talks/scal...

ABOUT THE TALK:
This is an experience report on implementing and moving to a scalable data ingestion architecture. The requirements were to process tens of terabytes of data coming from several sources, with data refresh cadences varying from daily to annual. The main challenge is that each provider has its own quirks in schemas and delivery processes.

To achieve this, we use Apache Airflow to organize the workflows and schedule their execution, including developing custom Airflow hooks and operators to handle similar tasks in different pipelines. We are running on AWS, using Apache Spark to horizontally scale the data processing and Kubernetes for container management.

We will explain the reasons for this architecture and share the pros and cons we have observed when working with these technologies. We will also explain how this approach has simplified the process of bringing in new data sources and considerably reduced the maintenance and operation overhead, and describe the challenges we have had during this transition.

ABOUT THE SPEAKERS:
Dr. Johannes Leppä is a Data Engineer building scalable solutions for ingesting complex data sets at Komodo Health. Johannes is interested in the design of distributed systems and the intricacies of the interactions between different technologies. He claims not to be lazy, but gets most excited about automating his work. Prior to data engineering, he conducted research in the field of aerosol physics at the California Institute of Technology, and he holds a PhD in physics from the University of Helsinki. Johannes is passionate about metal: wielding it, forging it and, especially, listening to it.

ABOUT DATA COUNCIL:
Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.


