Policy Gradient Methods: from REINFORCE to PPO

How do you do gradient ascent on a reward you can only sample, a black box you can't differentiate? A silent, animated explainer on policy-gradient methods in reinforcement learning.

Covered:
• The puzzle: optimizing an objective you can't differentiate
• REINFORCE and the score-function estimator
• The variance problem, and baselines as the cure
• Actor-critic methods
• Trust regions and PPO's clipped objective
• Continuous control

Built with Manim. No narration or music; everything is explained on screen.