Beyond Empirical Risk Minimization: the lessons of deep learning
Mikhail Belkin, Professor, The Ohio State University - Department of Computer Science and Engineering, Department of Statistics, Center for Cognitive Science Abstract: "A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with near perfect fit on training data still show excellent test performance. This apparent contradiction points to troubling cracks in the conceptual foundations of machine learning. While classical analyses of Empirical Risk Minimization rely on balancing the complexity of predictors with training error, modern models are best described by interpolation. In that paradigm a predictor is chosen by minimizing (explicitly or implicitly) a norm corresponding to a certain inductive bias over a space of functions that fit the training data exactly. I will discuss the nature of the challenge to our understanding of machine learning and point the way forward to first analyses that account for the empirically observed phenomena. Furthermore, I will show how classical and modern models can be unified within a single "double descent" risk curve, which subsumes the classical U-shaped bias-variance trade-off. Finally, as an example of a particularly interesting inductive bias, I will show evidence that deep over-parametrized auto-encoders networks, trained with SGD, implement a form of associative memory with training examples as attractor states.

Shawe-Taylor and Rivasplata: Statistical Learning Theory - a Hitchhiker's Guide (NeurIPS 2018)

From Classical Statistics to Modern ML: the Lessons of Deep Learning - Mikhail Belkin

Panel Discussion: Open Questions in Theory of Learning

Machine Learning Lecture 16 "Empirical Risk Minimization" -Cornell CS4780 SP17

An Introduction to Graph Neural Networks: Models and Applications

AI Was Never About Helping You | Cory Doctorow

Language Models as World Models

Naftali Tishby - The Information Bottleneck View of Deep Learning: Why do we need it?

From Classical Statistics to Modern Machine Learning

Geoffrey Hinton: "Introduction to Deep Learning & Deep Belief Nets"

An Introduction to LSTMs in Tensorflow

ML Tutorial: Gaussian Processes (Richard Turner)

Representation Learning with Gaussian Processes

Lecture 3 | Learning, Empirical Risk Minimization, and Optimization

Visualizing transformers and attention | Talk for TNG Big Tech Day '24

Sanjeev Arora: Why do deep nets generalize, that is, predict well on unseen data

Scott Aaronson - The TRUTH About Quantum Computing

Towards Understanding Deep Classifiers though Neural Collapse

ICLR 2021 Keynote - "Geometric Deep Learning: The Erlangen Programme of ML" - M Bronstein

