CUDA Crash Course: GPU Performance Optimizations Part 1

In this video, we look at a step-by-step performance optimization of matrix multiplication in CUDA!

Spreadsheet: https://docs.google.com/spreadsheets/...
For code samples: http://github.com/coffeebeforearch
For live content: / coffeebeforearch

Getting Started with CUDA and Parallel Programming | NVIDIA GTC 2025 Session

CUDA Crash Course: Handling Non-Perfect Input Sizes

CUDA Crash Course: Tiled 1-D Convolution

Modern GPU Architecture | GPU Programming

Raph Levien: A Taste of GPU Compute

02 CUDA Shared Memory

CUDA Crash Course: 2-D Convolution

CUDA Crash Course (v2): Vector Addition

Simple Code, High Performance

CUDA Part A: GPU Architecture Overview and CUDA Basics; Peter Messmer (NVIDIA)

GPU Pipeline Optimization Explained | Async UDFs, CUDA Streams & Pinned Memory

From Scratch: Shared Memory Atomics and Dynamic Allocation in CUDA

CUDA: New Features and Beyond | NVIDIA GTC 2024

Analyzing Deepseek's "undefined" NVIDIA PTX optimizations (with benchmarks!)

CUDA Crash Course: Cache Tiled Matrix Multiplication

14 GPU Architecture 1

Fundamentals of GPU Architecture: Introduction

From Scratch: Cache Tiled Matrix Multiplication in CUDA

Co-Creator of Haskell: Useless vs Useful Languages, Rust vs C, Functional Programming | Simon Jones

Finding and Fixing Slow Code // Ray Tracing series
