CUDA Crash Course: GPU Performance Optimizations Part 1
In this video we look at a step-by-step performance optimization of matrix multiplication in CUDA!

Spreadsheet: https://docs.google.com/spreadsheets/...
For code samples: http://github.com/coffeebeforearch
For live content: coffeebeforearch
