The columnar roadmap: Apache Parquet and Apache Arrow
The Hadoop ecosystem has standardized on columnar formats: Apache Parquet for on-disk storage and Apache Arrow for in-memory data. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves data access latency by pushing projections and filters down to the storage layer, which reduces both the time spent in IO reading from disk and the CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable, as data can now cross system boundaries without incurring costly translation. Cross-system programming with Spark, Python, or SQL can become as fast as native internal performance. In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and we’ll cover several planned improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate pushdown, which will greatly simplify data access optimizations across the board.

Speaker: Julien Le Dem, Principal Engineer, WeWork
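To make the pushdown idea in the abstract concrete, here is a minimal sketch in plain Python. It is a toy model, not the Parquet or Arrow API: the `RowGroup` class, the `scan` function, and the sample data are all hypothetical names invented for illustration. It assumes data is stored as row groups of column vectors with min/max statistics available per column, and shows how a reader could skip whole row groups (predicate pushdown) and materialize only the requested columns (projection pushdown).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Toy row group (hypothetical): one equal-length list of values per column."""
    columns: Dict[str, List[int]]

    def stats(self, column: str) -> Tuple[int, int]:
        # Assumption: a real columnar file records min/max in its metadata at
        # write time. Here they are computed on demand to keep the sketch short.
        values = self.columns[column]
        return min(values), max(values)


def scan(
    row_groups: List[RowGroup],
    projection: List[str],
    filter_column: str,
    lower_bound: int,
) -> Tuple[List[Tuple[int, ...]], int]:
    """Return projected rows where filter_column >= lower_bound,
    together with the number of row groups that had to be read."""
    rows: List[Tuple[int, ...]] = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group.stats(filter_column)
        if col_max < lower_bound:
            # Predicate pushdown: the statistics prove no row can match,
            # so the whole row group is skipped without touching its values.
            continue
        groups_read += 1
        filter_values = group.columns[filter_column]
        # Projection pushdown: only the requested columns are accessed.
        projected = [group.columns[name] for name in projection]
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                rows.append(tuple(col[i] for col in projected))
    return rows, groups_read


# Hypothetical sample data: two row groups with three columns each.
row_groups = [
    RowGroup({"id": [1, 2, 3], "price": [5, 7, 9], "qty": [1, 1, 2]}),
    RowGroup({"id": [4, 5, 6], "price": [20, 35, 15], "qty": [3, 1, 4]}),
]

rows, groups_read = scan(row_groups, ["id", "price"], "price", 18)
print(rows)         # [(4, 20), (5, 35)]
print(groups_read)  # 1
```

In this sketch the first row group is never read because its maximum `price` is below the bound, and the `qty` column is never accessed because it is not in the projection. A real engine would apply the same two ideas to compressed, encoded pages on disk, which is where the IO and CPU savings described above would come from.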


