Eng & Kwon - Scaling data workloads using the best of both worlds: pandas and Spark

www.pydata.org

Pandas is often the keystone of data wrangling and analysis workloads. The challenge is that pandas is not meant for big data processing. This presents data practitioners with a dilemma: should we downsample the data and lose information, or should we adopt a distributed processing framework to scale out our workloads? Apache Spark is a mainstream example of such a framework, but using it means data practitioners have to learn a new API, PySpark.

Not all is bleak, though: pandas API on Spark provides pandas-equivalent APIs in PySpark. It lets pandas users move from a single-node to a distributed environment by simply swapping the pandas package for pyspark.pandas. On the other hand, existing PySpark users may wish to write their own custom user-defined functions (UDFs) that are not included in the existing PySpark API. Pandas Function APIs, included in Spark 3.0+, allow users to apply arbitrary native Python functions, with pandas instances as the input and output, against a PySpark DataFrame. For instance, data scientists could use the pandas function API to train an ML model on each group of data using a single line of code.

Co-presented by a top open-source Apache Spark committer and a hands-on data science consultant, this talk equips data analysts and scientists with the knowledge to scale their data analysis workloads, with implementation details and best-practice guidance. Working knowledge of pandas, basic Spark, and machine learning is helpful.

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVi...
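The two ideas in the abstract, swapping pandas for pyspark.pandas and training one model per group with the pandas function API, can be sketched roughly as follows. This is an illustrative sketch and not the speakers' code: the toy data and the helper names (fit_group, make_sample, fit_locally, fit_on_spark, mean_with_pandas_on_spark) are hypothetical, and a straight-line fit with NumPy stands in for "an ML model". The Spark functions assume pyspark (3.2 or later for pyspark.pandas), pyarrow, and a Java runtime are installed, so their imports are kept local and only the pandas baseline runs without Spark.

```python
import numpy as np
import pandas as pd


def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit y = slope * x + intercept on one group: pandas in, pandas out."""
    slope, intercept = np.polyfit(pdf["x"], pdf["y"], deg=1)
    return pd.DataFrame(
        {"group": [pdf["group"].iloc[0]], "slope": [slope], "intercept": [intercept]}
    )


def make_sample() -> pd.DataFrame:
    """Hypothetical toy data: two groups that each follow an exact straight line."""
    x = np.arange(5, dtype=float)
    return pd.DataFrame(
        {
            "group": ["a"] * 5 + ["b"] * 5,
            "x": np.tile(x, 2),
            "y": np.concatenate([2 * x + 1, -3 * x + 4]),
        }
    )


def fit_locally(pdf: pd.DataFrame) -> pd.DataFrame:
    """Single-node baseline: run fit_group on every group with pandas alone."""
    parts = [fit_group(group_pdf) for _, group_pdf in pdf.groupby("group")]
    return pd.concat(parts, ignore_index=True)


def fit_on_spark(pdf: pd.DataFrame) -> pd.DataFrame:
    """Same per-group fit, distributed with the pandas function API (Spark 3.0+)."""
    from pyspark.sql import SparkSession  # local import: needs pyspark and Java

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    sdf = spark.createDataFrame(pdf)
    # The "single line": Spark passes each group to fit_group as a pandas DataFrame
    # and assembles the returned pandas DataFrames using the declared schema.
    result = sdf.groupBy("group").applyInPandas(
        fit_group, schema="group string, slope double, intercept double"
    )
    return result.toPandas()


def mean_with_pandas_on_spark(pdf: pd.DataFrame) -> pd.Series:
    """pandas API on Spark: pandas-style calls that Spark executes."""
    import pyspark.pandas as ps  # local import: needs pyspark 3.2+ and Java

    psdf = ps.from_pandas(pdf)
    return psdf.groupby("group")["y"].mean().to_pandas()


if __name__ == "__main__":
    print(fit_locally(make_sample()))
```

Because fit_group only ever sees an ordinary pandas DataFrame, it can be developed and checked on a single machine with fit_locally and then handed unchanged to applyInPandas once the data outgrows one node.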
