Parallel histogram computation on GPUs in CUDA (part 2)
In this video we continue our discussion of parallel histogram computation, a commonly used parallel programming pattern. We discuss:
1) Using per-thread registers to reduce the number of atomic updates the kernel has to perform (an application of register tiling)
2) How to use registers in CUDA kernels efficiently, so that no register bits are wasted
3) How to use warp-level primitives to perform a block-level reduction
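To make the three ideas concrete, here is a minimal sketch of how they might fit together in one kernel. It is not the code from the video: the 4-bin layout (bin = byte value / 64), the names `histogram_kernel` and `histogram_on_gpu`, the `CHUNK` size and the launch configuration are all assumptions chosen for illustration, and CUDA error checking is left out for brevity. Each thread counts into four 8-bit counters packed in a single 32-bit register, flushes them into full-width private counters before they can overflow, and the warp then combines its counts with `__shfl_down_sync` so that only one lane per warp issues an atomic.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 4;      // assumption: 4 bins, bin = byte value / 64
constexpr int BLOCK_SIZE = 256;  // must be a multiple of the warp size (32)
constexpr int CHUNK = 255;       // most items an 8-bit packed counter can absorb

__global__ void histogram_kernel(const unsigned char* data, size_t n,
                                 unsigned int* hist) {
    __shared__ unsigned int block_hist[NUM_BINS];
    if (threadIdx.x < NUM_BINS) block_hist[threadIdx.x] = 0;
    __syncthreads();

    // Private full-width counters. After unrolling they are only indexed
    // with constants, so the compiler can typically keep them in registers.
    unsigned int counts[NUM_BINS] = {0, 0, 0, 0};
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    while (i < n) {
        unsigned int packed = 0;  // four 8-bit counters in one 32-bit register
        for (int k = 0; k < CHUNK && i < n; ++k, i += stride)
            packed += 1u << (8 * (data[i] >> 6));
        #pragma unroll
        for (int b = 0; b < NUM_BINS; ++b)  // unpack before overflow is possible
            counts[b] += (packed >> (8 * b)) & 0xFFu;
    }

    // Warp-level reduction: lane 0 ends up holding the warp's total per bin.
    #pragma unroll
    for (int b = 0; b < NUM_BINS; ++b) {
        unsigned int v = counts[b];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xFFFFFFFFu, v, offset);
        if ((threadIdx.x & 31) == 0) atomicAdd(&block_hist[b], v);
    }
    __syncthreads();

    // One global atomic per bin per block.
    if (threadIdx.x < NUM_BINS)
        atomicAdd(&hist[threadIdx.x], block_hist[threadIdx.x]);
}

// Hypothetical host helper: copies the input over, launches the kernel,
// and returns the 4-bin histogram.
std::vector<unsigned int> histogram_on_gpu(const std::vector<unsigned char>& input) {
    unsigned char* d_data = nullptr;
    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_data, input.size());
    cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), input.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));
    histogram_kernel<<<64, BLOCK_SIZE>>>(d_data, input.size(), d_hist);
    std::vector<unsigned int> hist(NUM_BINS);
    cudaMemcpy(hist.data(), d_hist, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hist);
    return hist;
}

int main() {
    // Large enough that each thread sees more than CHUNK items.
    std::vector<unsigned char> input(1u << 23);
    std::vector<unsigned int> expected(NUM_BINS, 0);
    for (size_t i = 0; i < input.size(); ++i) {
        input[i] = (unsigned char)((i * 31 + 7) & 0xFF);
        ++expected[input[i] >> 6];  // CPU reference
    }
    std::vector<unsigned int> hist = histogram_on_gpu(input);
    bool ok = true;
    for (int b = 0; b < NUM_BINS; ++b) {
        printf("bin %d: gpu=%u cpu=%u\n", b, hist[b], expected[b]);
        ok = ok && (hist[b] == expected[b]);
    }
    printf(ok ? "match\n" : "mismatch\n");
    return ok ? 0 : 1;
}
```

Packing only pays off when the counters are narrow enough to share a register. With more bins or wider counts, the same structure would need several packed registers or a different flush interval.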
