A big open question in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? In this talk we argue that the answer to both questions lies in the interaction of the gradients of different examples during training, and present a new theory based on this idea. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization.
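To make the final point concrete, here is a hypothetical sketch of one such modification, under my own assumptions rather than the talk's actual algorithm: instead of averaging per-example gradients (as plain GD does), take their component-wise median, so that update directions on which most examples' gradients agree survive while directions driven by a few idiosyncratic (e.g. mislabeled) examples are attenuated.

```python
import numpy as np

# Illustrative sketch only (NOT necessarily the speaker's method): robustify GD
# by replacing the mean of per-example gradients with a component-wise median.

def per_example_grads(w, X, y):
    """Per-example gradients of 0.5 * (x.w - y)^2 for a linear model."""
    residuals = X @ w - y            # shape (n,)
    return residuals[:, None] * X    # shape (n, d): one gradient row per example

def gd_step(w, X, y, lr=0.1, robust=False):
    g = per_example_grads(w, X, y)
    # Mean recovers ordinary full-batch GD; median keeps only directions
    # on which the bulk of the examples' gradients agree.
    update = np.median(g, axis=0) if robust else np.mean(g, axis=0)
    return w - lr * update

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.ones(5)
y = X @ w_true                       # clean labels for a toy linear problem
y[:20] += 50.0                       # grossly corrupt 10% of the labels

w_mean = np.zeros(5)
w_med = np.zeros(5)
for _ in range(300):
    w_mean = gd_step(w_mean, X, y, robust=False)
    w_med = gd_step(w_med, X, y, robust=True)

err_mean = np.linalg.norm(w_mean - w_true)  # pulled off course by noisy labels
err_med = np.linalg.norm(w_med - w_true)    # typically much closer to w_true
```

On this toy problem the median-based update largely ignores the mislabeled examples, because their gradients point in directions the majority does not share; this is one simple instantiation of "attenuating memorization" via gradient interaction.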
Towards Understanding the Generalization Mystery in Deep Learning