Exploring the LLM Journey from Cognition to Expression with Linear Representations
Abstract: This paper presents an in-depth examination of the evolution and interplay of cognitive and expressive capabilities in large language models (LLMs), with a specific focus on Baichuan-7B and Baichuan-33B, an advanced bilingual (Chinese and English) LLM series. We define and explore the model's cognitive and expressive capabilities through linear representations across three critical phases: Pretraining, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF). Cognitive capability is defined as the quantity and quality of information conveyed by the neuron output vectors within the network, analogous to neural signal processing in human cognition. Expressive capability is defined as the model's ability to produce word-level output. Our findings unveil a sequential development pattern: cognitive abilities are largely established during Pretraining, whereas expressive abilities predominantly advance during SFT and RLHF. Statistical analyses confirm a significant correlation between the two capabilities, suggesting that cognitive capacity may limit expressive potential. The paper also explores the theoretical underpinnings of these divergent developmental trajectories and their connection to the LLMs' architectural design. Moreover, we evaluate various optimization-independent strategies, such as few-shot learning and repeated sampling, that bridge the gap between cognitive and expressive capabilities. This research reveals a potential connection between the hidden space and the output space, contributing valuable insights into the interpretability and controllability of LLM training processes.