The Statistics Seminar speaker for Wednesday, October 23, 2019, is Mikhail Belkin, a professor in the Department of Computer Science and Engineering and the Department of Statistics at The Ohio State University. He received his PhD in Mathematics from the University of Chicago in 2003. His research focuses on understanding the fundamental structures in data, the principles for recovering those structures, and their computational, mathematical, and statistical properties. This understanding, in turn, leads to algorithms for dealing with real-world data. His work includes algorithms such as Laplacian Eigenmaps and Manifold Regularization, which use ideas from classical differential geometry to analyze non-linear high-dimensional data and have been widely used in applications. Prof. Belkin is a recipient of an NSF CAREER Award and a number of best paper and other awards. He has served on the editorial boards of the Journal of Machine Learning Research and IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
Talk: Beyond Empirical Risk Minimization: the lessons of deep learning
Abstract: “A model with zero training error is overfit to the training data and will typically generalize poorly,” goes statistical textbook wisdom. Yet, in modern practice, over-parameterized deep networks with a near-perfect fit on the training data still show excellent test performance. This apparent contradiction points to troubling cracks in the conceptual foundations of machine learning. While classical analyses of Empirical Risk Minimization rely on balancing the complexity of predictors against training error, modern models are best described by interpolation. In that paradigm, a predictor is chosen by minimizing (explicitly or implicitly) a norm corresponding to a certain inductive bias over a space of functions that fit the training data exactly. I will discuss the nature of this challenge to our understanding of machine learning and point the way forward to first analyses that account for the empirically observed phenomena. Furthermore, I will show how classical and modern models can be unified within a single “double descent” risk curve, which subsumes the classical U-shaped bias-variance trade-off.
Finally, I will discuss important implications for optimization, showing, in particular, how the lessons of deep learning can be used to accelerate kernel machines.