Recent Parquet Improvements in Apache Spark

Apache Parquet is a very popular columnar file format supported by Apache Spark. In a typical Spark job, scanning Parquet files can be one of the most time-consuming steps, as it incurs high CPU and IO overhead. Optimizing Parquet scan performance is therefore crucial to job latency and cost efficiency.

Spark currently has two Parquet reader implementations: a vectorized one and a non-vectorized one. The former was implemented from scratch and offers much better performance than the latter. However, it does not yet support complex types (e.g., array, struct, map) and falls back to the non-vectorized reader when it encounters them.

In addition to the reader implementation, predicate pushdown is also crucial to Parquet scan performance, as it enables Spark to skip data that cannot satisfy the predicates before that data is scanned. Currently, Spark constructs the predicates itself and relies on Parquet-MR to do the heavy lifting: Parquet-MR performs the filtering based on information such as statistics, dictionaries, bloom filters, or the column index.

This talk will go through two recent improvements to Parquet scan performance: 1) vectorized read support for complex types, which allows Spark to achieve a 10x+ improvement when reading Parquet data of complex types, and 2) Parquet column index support, which enables Spark to leverage the Parquet column index feature during predicate pushdown. Last but not least, Chao will go over some future work items that can further enhance Parquet read performance.
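To make the column index idea concrete, here is a minimal sketch in plain Python of how per-page min/max statistics let a reader skip pages for a range predicate. It is a simplified model, not Parquet-MR's or Spark's implementation: the `PageStats` class, the `pages_matching_range` and `row_ranges` helpers, and the sample statistics are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    """Hypothetical per-page entry, loosely modeled on a column index."""
    first_row: int   # index of the first row stored in this page
    row_count: int   # number of rows in this page
    min_value: int   # smallest value in this page
    max_value: int   # largest value in this page


def pages_matching_range(pages: List[PageStats], lo: int, hi: int) -> List[PageStats]:
    """Keep only pages whose [min, max] interval overlaps [lo, hi].

    A page outside the interval cannot contain a matching row, so it can
    be skipped without being read or decoded.
    """
    return [p for p in pages if p.max_value >= lo and p.min_value <= hi]


def row_ranges(pages: List[PageStats]) -> List[Tuple[int, int]]:
    """Translate surviving pages into half-open row ranges [start, end)."""
    return [(p.first_row, p.first_row + p.row_count) for p in pages]


# Illustrative statistics for one column split into three pages.
column_index = [
    PageStats(first_row=0, row_count=100, min_value=1, max_value=50),
    PageStats(first_row=100, row_count=100, min_value=51, max_value=120),
    PageStats(first_row=200, row_count=100, min_value=121, max_value=300),
]

# Predicate: 60 <= value <= 100. Only the middle page can hold matches.
selected = pages_matching_range(column_index, 60, 100)
print(row_ranges(selected))  # [(100, 200)]
```

The surviving row ranges are what allows the matching pages of the other columns to be located as well, so whole pages can be skipped across the row group rather than only in the filtered column. Rows inside a surviving page may still fail the predicate, so the filter has to be re-applied to the rows that are actually read.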
