GPU Pipeline Optimization Explained | Async UDFs, CUDA Streams & Pinned Memory

🖥️ Whiteboard Deep Dive into GPU Pipeline Optimization

In this deep dive, Srinu Lade / srinivas-lade (Software Engineer working on Daft’s execution engine) breaks down how to optimize GPU pipelines for ML and multimodal data processing. Using architectural diagrams, he explains why sequential CPU→GPU execution creates bottlenecks and how techniques like async UDFs, CUDA streams, and pinned memory unlock parallelism.

What you’ll learn:
How GPU workloads flow: host↔device transfers, VRAM, kernel execution
Why Python UDFs are a bottleneck, and how async execution improves throughput
Using CUDA streams to overlap transfers and compute for better utilization
How GPU internals (H2D/D2H engines + compute units) enable pipeline parallelism
Reducing OS overhead with pinned memory reuse in PyTorch workflows
How Daft abstracts these optimizations into a high-level API for data/ML engineers

Our aim is to abstract away these low-level complexities and provide a high-level API in Daft that delivers optimized GPU execution out-of-the-box for ML workloads.

Daft. Simple and reliable data processing for any modality and scale.
Explore → https://daft.ai/
Build → https://docs.daft.ai/
Connect → https://www.daft.ai/slack
Contribute → https://github.com/Eventual-Inc/Daft
Learn → https://daft.ai/blog

pip install daft
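To make the sequential-versus-overlapped idea from the description concrete, here is a minimal sketch that uses only Python's standard library. It is a simulation, not real GPU code and not Daft's implementation: the three stages (`h2d_copy`, `kernel`, `d2h_copy`) are hypothetical stand-ins whose cost is faked with `asyncio.sleep`, and `STAGE_SECONDS` is an arbitrary number chosen for illustration. Each stage gets one worker, loosely mirroring a GPU with one H2D engine, one compute unit, and one D2H engine, so different batches can occupy different stages at the same time.

```python
import asyncio
import time

STAGE_SECONDS = 0.02  # hypothetical per-stage cost, chosen only for illustration


async def h2d_copy(batch):
    """Stand-in for a host-to-device transfer."""
    await asyncio.sleep(STAGE_SECONDS)
    return batch


async def kernel(batch):
    """Stand-in for kernel execution: doubles every element."""
    await asyncio.sleep(STAGE_SECONDS)
    return [x * 2 for x in batch]


async def d2h_copy(batch):
    """Stand-in for a device-to-host transfer."""
    await asyncio.sleep(STAGE_SECONDS)
    return batch


async def run_sequential(batches):
    """Each batch finishes all three stages before the next batch starts."""
    results = []
    for batch in batches:
        on_device = await h2d_copy(batch)
        computed = await kernel(on_device)
        results.append(await d2h_copy(computed))
    return results


async def stage_worker(fn, inbox, outbox):
    """One worker per stage; None is the end-of-stream sentinel."""
    while True:
        item = await inbox.get()
        if item is None:
            await outbox.put(None)
            return
        await outbox.put(await fn(item))


async def run_pipelined(batches):
    """Stages run concurrently, connected by FIFO queues (order is preserved)."""
    q_in, q_dev, q_out, q_done = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(stage_worker(h2d_copy, q_in, q_dev)),
        asyncio.create_task(stage_worker(kernel, q_dev, q_out)),
        asyncio.create_task(stage_worker(d2h_copy, q_out, q_done)),
    ]
    for batch in batches:
        await q_in.put(batch)
    await q_in.put(None)

    results = []
    while (item := await q_done.get()) is not None:
        results.append(item)
    await asyncio.gather(*workers)
    return results


def timed(coro_fn, batches):
    """Run a coroutine function to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(batches))
    return results, time.perf_counter() - start


batches = [[i, i + 1] for i in range(8)]
seq_results, seq_time = timed(run_sequential, batches)
pipe_results, pipe_time = timed(run_pipelined, batches)
print(f"sequential: {seq_time:.3f}s  pipelined: {pipe_time:.3f}s")
```

Under these assumptions the sequential run pays for all three stages on every batch, while the pipelined run pays roughly one stage per batch once the pipeline is full. On a real GPU the same overlap depends on issuing copies and kernels on separate CUDA streams and on using pinned host memory so the copies can run asynchronously; this sketch models only the scheduling shape.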
