To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets
Abstract: Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know whether the network has memorized a particular set of examples or understood the underlying rule (or both). Motivated by this challenge, we study an interpretable model where generalizing representations are understood analytically, and are easily distinguishable from the memorizing ones. Namely, we consider multi-layer perceptron (MLP) and Transformer architectures trained on modular arithmetic tasks, where a fraction $\xi$ of the labels are corrupted (\emph{i.e.} some results of the modular operations in the training set are incorrect). We show that (i) it is possible for the network to memorize the corrupted labels \emph{and} achieve $100\%$ generalization at the same time; (ii) the memorizing neurons can be identified and pruned, lowering the accuracy on corrupted data and improving the accuracy on uncorrupted data; (iii) regularization methods such as weight decay, dropout and BatchNorm force the network to ignore the corrupted data during optimization, and achieve $100\%$ accuracy on the uncorrupted dataset; and (iv) the effect of these regularization methods is (``mechanistically'') interpretable: weight decay and dropout force all the neurons to learn generalizing representations, while BatchNorm de-amplifies the output of memorizing neurons and amplifies the output of the generalizing ones. Finally, we show that in the presence of regularization, the training dynamics involves two consecutive stages: first, the network undergoes \emph{grokking} dynamics, reaching high train \emph{and} test accuracy; second, it unlearns the memorizing representations, and the train accuracy suddenly drops from $100\%$ to $100 (1-\xi)\%$.
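To make the setup concrete, below is a minimal sketch (not the authors' exact configuration) of a corrupted modular-addition dataset and a small MLP trained with strong weight decay via AdamW. The modulus, network width, learning rate, train fraction and corruption fraction $\xi$ are illustrative assumptions. Under sufficient weight decay one would expect the two-stage dynamics described in the abstract: grokking to high test accuracy, followed by the train accuracy settling near $100(1-\xi)\%$ as the corrupted labels are unlearned.

```python
import torch
import torch.nn as nn

# Illustrative hyper-parameters (assumptions, not taken from the paper).
p = 97            # modulus of the arithmetic task
xi = 0.2          # fraction of corrupted training labels
train_frac = 0.5  # fraction of all (a, b) pairs used for training
width = 512       # hidden width of the MLP
weight_decay = 1.0

torch.manual_seed(0)

# Full dataset for the task (a, b) -> (a + b) mod p.
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
inputs = torch.stack([a.flatten(), b.flatten()], dim=1)   # shape (p*p, 2)
labels = (inputs[:, 0] + inputs[:, 1]) % p

# Random train/test split.
perm = torch.randperm(p * p)
n_train = int(train_frac * p * p)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Corrupt a fraction xi of the *training* labels with random (wrong) answers.
train_labels = labels[train_idx].clone()
n_corrupt = int(xi * n_train)
corrupt_idx = torch.randperm(n_train)[:n_corrupt]
train_labels[corrupt_idx] = torch.randint(0, p, (n_corrupt,))

def encode(x):
    # One-hot encode the two operands and concatenate them.
    return torch.cat([nn.functional.one_hot(x[:, 0], p),
                      nn.functional.one_hot(x[:, 1], p)], dim=1).float()

# Minimal two-layer MLP; the paper also studies Transformers.
model = nn.Sequential(
    nn.Linear(2 * p, width),
    nn.ReLU(),
    nn.Linear(width, p),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
loss_fn = nn.CrossEntropyLoss()

x_train, x_test = encode(inputs[train_idx]), encode(inputs[test_idx])
y_test = labels[test_idx]

for step in range(10_000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), train_labels)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            # Train accuracy is measured against the (partly corrupted) labels,
            # so a network that learns only the true rule plateaus near 1 - xi.
            train_acc = (model(x_train).argmax(1) == train_labels).float().mean()
            test_acc = (model(x_test).argmax(1) == y_test).float().mean()
        print(f"step {step}: train_acc={train_acc:.3f} test_acc={test_acc:.3f}")
```

With regularization switched off (weight_decay=0), this kind of setup instead lets the network fit the corrupted labels as well, which is the regime where memorizing neurons can be identified and pruned.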