Gradient Descent Optimizers: from Momentum to AdamW

A silent, animated walkthrough of the optimizers that train modern neural networks, built up one idea at a time, from plain gradient descent to AdamW.

Covered:
• Why plain SGD stalls and oscillates in ravines
• Momentum: accumulating velocity to power through
• RMSProp: per-parameter adaptive step sizes
• Adam: momentum + adaptive scaling combined
• AdamW: decoupled weight decay, and why it beats plain Adam

Built with Manim. No narration or music; everything is explained on screen.
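
For anyone who wants to follow along in code, here is a minimal sketch of the update rules behind the optimizers covered above. It is not taken from the video. The ravine function, the hyperparameter values, and all function names (grad, loss, run, and the step functions) are assumptions chosen for illustration, and the momentum rule uses the common "v = beta * v + g" form, which is only one of several conventions.

```python
import math

def loss(p):
    # Hypothetical ravine: shallow along x, steep along y.
    x, y = p
    return 0.5 * (x * x + 25.0 * y * y)

def grad(p):
    x, y = p
    return [x, 25.0 * y]

def sgd(p, g, state, lr=0.03):
    # Plain gradient descent: one global step size for every parameter.
    return [pi - lr * gi for pi, gi in zip(p, g)]

def momentum(p, g, state, lr=0.03, beta=0.9):
    # Accumulate a velocity so consistent gradient directions build up speed.
    v = state.setdefault("v", [0.0] * len(p))
    for i, gi in enumerate(g):
        v[i] = beta * v[i] + gi
    return [pi - lr * vi for pi, vi in zip(p, v)]

def rmsprop(p, g, state, lr=0.05, rho=0.9, eps=1e-8):
    # Divide each step by a running RMS of that parameter's own gradients.
    s = state.setdefault("s", [0.0] * len(p))
    for i, gi in enumerate(g):
        s[i] = rho * s[i] + (1.0 - rho) * gi * gi
    return [pi - lr * gi / (math.sqrt(si) + eps) for pi, gi, si in zip(p, g, s)]

def adam(p, g, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, weight_decay=0.0):
    # Momentum-style first moment plus RMSProp-style second moment,
    # both bias-corrected because they start at zero.
    m = state.setdefault("m", [0.0] * len(p))
    s = state.setdefault("s", [0.0] * len(p))
    t = state["t"] = state.get("t", 0) + 1
    out = []
    for i, gi in enumerate(g):
        m[i] = b1 * m[i] + (1.0 - b1) * gi
        s[i] = b2 * s[i] + (1.0 - b2) * gi * gi
        m_hat = m[i] / (1.0 - b1 ** t)
        s_hat = s[i] / (1.0 - b2 ** t)
        # Decoupled decay (AdamW): shrink the weight directly. Adam with L2
        # regularization would instead add weight_decay * p[i] to gi above,
        # so the decay would also get rescaled by sqrt(s_hat).
        out.append(p[i] - lr * (m_hat / (math.sqrt(s_hat) + eps) + weight_decay * p[i]))
    return out

def adamw(p, g, state, lr=0.05, weight_decay=0.01):
    return adam(p, g, state, lr=lr, weight_decay=weight_decay)

def run(step, steps=200, start=(10.0, 1.0)):
    # Apply one optimizer for a fixed number of steps from the same start.
    p, state = list(start), {}
    for _ in range(steps):
        p = step(p, grad(p), state)
    return p

if __name__ == "__main__":
    for step in (sgd, momentum, rmsprop, adam, adamw):
        print(step.__name__, loss(run(step)))
```

Running the file prints the final loss for each optimizer from the same starting point. The numbers depend heavily on the assumed learning rates, so treat them as a way to poke at the update rules rather than as a ranking of the methods.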