The mechanistic basis of data dependence and abrupt learning in an in-context classification task
Abstract: Transformer models exhibit in-context learning (ICL): the ability to accurately predict the response to a novel query based on illustrative examples in the input sequence. This contrasts with traditional in-weights learning, in which query-output relationships are stored in the network's weights. What aspects of the training data distribution and of the architecture favor in-context versus in-weights learning? Recent work has shown that distributional properties inherent in language, such as burstiness, large dictionaries, and skewed rank-frequency distributions, control the trade-off between, or simultaneous appearance of, these two forms of learning. We first show that these results are recapitulated in a minimal attention-only network trained on a simplified dataset. In this setting, ICL is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning. Using progress measures that precede ICL, together with targeted experiments, we construct a two-parameter model of an induction head that emulates the full range of data-distributional dependencies displayed by the attention-based network. A phenomenological model of induction-head formation traces its abrupt emergence to the sequential learning of three nested logits, enabled by an intrinsic curriculum. We propose that the sharp transitions in attention-based networks arise from a specific chain of multi-layer operations necessary to achieve ICL, implemented by nested nonlinearities that are learned sequentially during training.
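For intuition, the induction-head mechanism the abstract refers to can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the two attention patterns are hand-set rather than learned, and the vocabulary size, embedding dimension, example sequence, and sharpness factor `beta` are assumptions made for the demo. Layer 1 acts as a "previous-token" head that writes each token's predecessor into the residual stream; layer 2 matches the final query token against those copied predecessors and copies forward the token that followed the query's earlier occurrence, so the model predicts B for a sequence of the form "A B ... A ?".

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab, d, beta = 8, 64, 8.0                    # beta: assumed attention sharpness
E = rng.normal(size=(vocab, d)) / np.sqrt(d)   # random token embeddings

tokens = np.array([3, 5, 1, 6, 3])             # "A B . . A": target for the final query is B = 5
x = E[tokens]                                  # (seq, d) residual stream
n = len(tokens)

# Layer 1: previous-token head. Hard-coded pattern: position i attends to i-1,
# writing each token's predecessor into the residual stream.
attn1 = np.zeros((n, n))
for i in range(1, n):
    attn1[i, i - 1] = 1.0
prev_info = attn1 @ x

# Layer 2: induction head. The final token's query matches keys built from
# prev_info, so it attends to positions whose *predecessor* equals the query
# token, i.e. positions right after earlier occurrences of "A".
scores = beta * (x[-1] @ prev_info.T)
scores[0] = -np.inf                            # position 0 has no predecessor
attn2 = softmax(scores)
out = attn2 @ x                                # OV "copy": forward the attended token

logits = out @ E.T                             # tied unembedding for the readout
print("attention over positions:", np.round(attn2, 3))
print("predicted token:", int(logits.argmax()), "(target: 5)")
```

Run as written, the second layer's attention concentrates on the position following the query token's earlier occurrence and the readout predicts token 5, mirroring the match-then-copy chain of operations whose sequential learning the paper identifies as the source of ICL's abrupt emergence.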