Implementing New Algorithm with CUDA Kernels | CUDA C++ Class Part 3

Welcome to NVIDIA’s Modern CUDA C++ Programming Class. In this part, you will learn how to implement new algorithms on the GPU using CUDA kernels.

This series is for C++ developers who want to use the GPU effectively, whether you’re new to CUDA and want the fastest path from “hello world” to real acceleration, or you’re an experienced CUDA programmer ready to modernize your code with the latest best practices. If you already know C++ and want to write clean, efficient, idiomatic GPU code, this course is for you.

This video is part of a playlist of three videos. We advise you to start with the first video.

📝 Part 1: Accelerating Applications with Parallel Al...
📝 Part 2: Asynchrony and CUDA Streams | CUDA C++ Cla...
📝 Full Playlist: Modern CUDA C++ Programming Class

➡️ Link to the slides and the Google Colab notebook, where you can run the exercises for free on a GPU: https://github.com/NVIDIA/accelerated...

For the DLI version, please visit: https://learn.nvidia.com/courses/cour...

Chapters:
00:00:00 Introduction
00:00:22 CUDA Kernels
00:17:30 Exercise Symmetry
00:18:32 Solution Symmetry
00:19:20 Exercise Row Symmetry
00:19:38 Solution Row Symmetry
00:21:38 Debugging Tools and Atomic Operations
00:36:38 Exercise Fix Histogram
00:36:59 Solution Fix Histogram
00:38:18 Privatized Histogram and Thread Scope
00:47:36 Exercise Fix Histogram 2
00:48:08 Solution Fix Histogram 2
00:49:35 SM and Shared Memory
00:55:45 Exercise Optimize Histogram
00:56:05 Solution Optimize Histogram
00:57:58 CUB
01:05:05 Exercise Cooperative Histogram
01:05:24 Solution Cooperative Histogram
01:06:06 Takeaways
01:08:17 Final Review
01:10:43 Final Assessment
