Building Robust ETL Pipelines with Apache Spark - Xiao Li
Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources, must handle incorrect, incomplete, or inconsistent records, and must produce curated, consistent data for consumption by downstream applications. In this talk, we'll take a deep dive into the technical details of how Apache Spark "reads" data and discuss how Spark 2.2's flexible APIs, support for a wide variety of data sources, state-of-the-art Tungsten execution engine, and ability to provide diagnostic feedback to users make it a robust framework for building end-to-end ETL pipelines.

Overview:
1) What’s an ETL Pipeline?
2) Using Spark SQL for ETL
   Extract: Dealing with Dirty Data (Bad Records or Files)
   Extract: Multi-line JSON/CSV Support
   Transformation: Higher-order functions in SQL
   Load: Unified write paths and interfaces
3) New Features in Spark 2.3
   Performance (Data Source API v2, Python UDF)

View slides: https://www.slideshare.net/databricks...

Related articles:
Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark https://databricks.com/blog/2016/12/0...
Writing Data Engineering Pipelines in Apache Spark on Databricks https://databricks.com/blog/2016/09/0...
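The "Dealing with Dirty Data" item in the overview can be made concrete with a small sketch. The abstract does not show how the talk handles bad records, so the following is only an illustration in plain Python, not Spark code: it mimics the three parse modes that Spark's JSON and CSV readers accept through the `mode` option (`PERMISSIVE`, `DROPMALFORMED`, `FAILFAST`) and borrows Spark's default corrupt-record column name, `_corrupt_record`. The function `parse_records` and the sample lines are hypothetical.

```python
import json

# Mode names mirror the values accepted by Spark's reader `mode` option.
MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def parse_records(lines, mode="PERMISSIVE"):
    """Parse JSON lines, treating malformed input according to `mode`.

    Hypothetical helper for illustration only; Spark's own readers are
    schema-aware and considerably more involved.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "FAILFAST":
                raise  # abort the whole load on the first bad record
            if mode == "DROPMALFORMED":
                continue  # silently skip the bad record
            # PERMISSIVE: keep the row, preserving the raw text for diagnosis
            rows.append({"_corrupt_record": line})
        else:
            rows.append({**record, "_corrupt_record": None})
    return rows


# Hypothetical input: the second line is truncated and cannot be parsed.
sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": ',
    '{"id": 3, "name": "c"}',
]

kept = parse_records(sample, mode="PERMISSIVE")
dropped = parse_records(sample, mode="DROPMALFORMED")
```

In this sketch the permissive mode keeps all three rows and stores the raw text of the bad line, while the drop mode returns only the two well-formed rows. Which behavior is appropriate depends on whether the pipeline should favor completeness, cleanliness, or early failure.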
