Building Robust ETL Pipelines with Apache Spark - Xiao Li

Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources, must handle incorrect, incomplete, or inconsistent records, and must produce curated, consistent data for consumption by downstream applications. In this talk, we'll take a deep dive into the technical details of how Apache Spark "reads" data and discuss how Spark 2.2's flexible APIs, support for a wide variety of data sources, state-of-the-art Tungsten execution engine, and ability to provide diagnostic feedback to users make it a robust framework for building end-to-end ETL pipelines.

Overview:
1) What’s an ETL Pipeline?
2) Using Spark SQL for ETL
   Extract: Dealing with Dirty Data (Bad Records or Files)
   Extract: Multi-line JSON/CSV Support
   Transformation: Higher-order functions in SQL
   Load: Unified write paths and interfaces
3) New Features in Spark 2.3
   Performance (Data Source API v2, Python UDF)

View slides: https://www.slideshare.net/databricks...

Related articles:
Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark https://databricks.com/blog/2016/12/0...
Writing Data Engineering Pipelines in Apache Spark on Databricks https://databricks.com/blog/2016/09/0...
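The "Dealing with Dirty Data (Bad Records or Files)" item in the overview concerns what a reader does with records it cannot parse. Spark's JSON and CSV readers expose a `mode` option with the values PERMISSIVE, DROPMALFORMED, and FAILFAST, and in PERMISSIVE mode the raw text of a bad record lands in a column named `_corrupt_record` by default. The sketch below is not Spark code and is not taken from the talk: it is a plain-Python imitation of those three behaviors, with a hypothetical helper `read_json_lines` and made-up sample data, and it simplifies PERMISSIVE by keeping only the raw text of a bad record rather than null-filling the other columns.

```python
import json

VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def read_json_lines(lines, mode="PERMISSIVE"):
    """Parse one JSON object per line, handling bad lines according to `mode`.

    Hypothetical helper that mimics the semantics of Spark's reader modes:
      PERMISSIVE    -> keep the bad line under the key "_corrupt_record"
      DROPMALFORMED -> silently skip the bad line
      FAILFAST      -> raise on the first bad line
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")

    rows = []
    for line in lines:
        try:
            record = json.loads(line)  # json.JSONDecodeError subclasses ValueError
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            if mode == "FAILFAST":
                raise
            if mode == "DROPMALFORMED":
                continue
            # PERMISSIVE: preserve the raw text so it can be inspected later.
            rows.append({"_corrupt_record": line})
        else:
            rows.append(record)
    return rows


# Made-up sample input: the second line is truncated and cannot be parsed.
raw_lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": ',
    '{"id": 3, "name": "c"}',
]

kept = read_json_lines(raw_lines, mode="PERMISSIVE")
dropped = read_json_lines(raw_lines, mode="DROPMALFORMED")
```

Keeping the bad record (PERMISSIVE) suits pipelines that want to quarantine and inspect dirty input, dropping it suits best-effort loads, and failing fast suits sources where any malformed record signals an upstream bug.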
