Accelerating Apache Parquet with metadata stores and specialized indexes using Apache DataFusion
It is a common misconception that querying Apache Parquet data is constrained to the basic metadata built into the format itself, and is therefore slower than querying proprietary formats. Parquet itself contains standard Min/Max metadata, a "Page Index", and Bloom filters. In addition, using open source composable systems such as Apache DataFusion, it is possible to build sophisticated caches and specialized, system-specific indexes while retaining broad ecosystem compatibility. In this talk I review the structures built into Parquet for query acceleration and demonstrate how to use a cache for parsed metadata, push row group and page pruning into a metadata store, and build a specialized index for multi-column primary keys.

Speaker Bio: Andrew Lamb is a Staff Engineer at InfluxData, working in Rust on InfluxDB 3.0, focused on query processing, the Apache DataFusion query engine, and the Apache Arrow ecosystem. He serves on the Apache DataFusion PMC (current Chair) and on the Apache Arrow PMC, and actively contributes to DataFusion and the Arrow Rust implementations. He earned a BS and MEng in Course VI from MIT. More details are available at http://andrew.nerdnetworks.org/

Presentation Slides: https://docs.google.com/presentation/...

Links to examples I refer to in the video:
https://github.com/apache/datafusion/...
https://github.com/apache/datafusion/...
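
As a rough illustration of the row group pruning described in the abstract, the sketch below keeps per-row-group Min/Max statistics in a small in-memory store and uses them to skip row groups that cannot match a range predicate. It is not taken from the talk or from DataFusion: the `RowGroupStats` and `MetadataStore` names, the integer column, and the file name are all hypothetical, and a real system would populate the store from Parquet footers rather than by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Min/Max statistics for one column of one row group (hypothetical layout)."""
    row_group: int
    min_value: int
    max_value: int


class MetadataStore:
    """Hypothetical in-memory store: (file, column) -> per-row-group statistics."""

    def __init__(self):
        self._stats = {}

    def put(self, file, column, stats):
        self._stats[(file, column)] = list(stats)

    def prune(self, file, column, lo, hi):
        """Return the row groups whose [min, max] range overlaps [lo, hi].

        Returns None when no statistics are stored, meaning the caller
        cannot prune and must scan every row group.
        """
        stats = self._stats.get((file, column))
        if stats is None:
            return None
        return [s.row_group for s in stats
                if s.max_value >= lo and s.min_value <= hi]


store = MetadataStore()
store.put("data.parquet", "ts", [
    RowGroupStats(0, 0, 99),
    RowGroupStats(1, 100, 199),
    RowGroupStats(2, 200, 299),
])

# Predicate: ts BETWEEN 150 AND 210. Row group 0 cannot match and is skipped.
candidates = store.prune("data.parquet", "ts", 150, 210)
print(candidates)  # [1, 2]
```

Pruning with Min/Max statistics is conservative: a surviving row group may still contain no matching rows, so the reader still has to apply the predicate to the data it does read.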


