Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models

Published 31 May 2025 in cs.LG, cs.AI, and cs.CL | (2506.00653v3)

Abstract: It has been hypothesized that neural networks with similar architectures trained on similar data learn shared representations relevant to the learning task. We build on this idea by extending the conceptual framework where representations learned across models trained on the same data can be expressed as linear combinations of a \emph{universal} set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale. From this framework, we propose the \textbf{Linear Representation Transferability (LRT)} Hypothesis -- that there exists an affine transformation between the representation spaces of different models. To test this hypothesis, we learn affine mappings between the hidden states of models of different sizes and evaluate whether steering vectors -- directions in hidden state space associated with specific model behaviors -- retain their semantic effect when transferred from small to LLMs using the learned mappings. We find strong empirical evidence that such affine mappings can preserve steering behaviors. These findings suggest that representations learned by small models can be used to guide the behavior of large models, and that the LRT hypothesis may be a promising direction on understanding representation alignment across model scales.

Abstract PDF Upgrade to Chat

Summary

The paper presents the LRT Hypothesis, showing that neural networks learn universal basis features expressible as linear combinations across different scales.
It empirically validates that affine mappings can effectively transfer steering vectors from small models to large ones, preserving semantic behavior.
The study offers practical benefits in reducing computational costs and theoretical insights into the shared representational structures of neural networks.

Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models

Abstract Overview

The paper introduces the Linear Representation Transferability (LRT) Hypothesis, proposing that neural networks of varying scales, when trained on similar datasets, can express learned representations as linear combinations of universal basis features consistent across architectures. An affine transformation is hypothesized to exist between the representational spaces of different models, preserving semantic effects of steering vectors transferred from smaller to larger models.

The LRT Hypothesis: A Conceptual Framework

The LRT Hypothesis posits that neural networks trained on similar data learn similar basis features, leading to the possibility of an affine transformation between their representational spaces. This hypothesis is rooted in the notion that shared fundamental structures (e.g., syntax, semantics) are captured by models, irrespective of their scale, thereby creating a universal representation space.

The framework suggests that the hidden states of models, regardless of size, are linear combinations of these universal basis features. This shared space allows the transfer of steering vectors derived from smaller models to larger ones through learned affine mappings. The conceptual framework supports the idea that representational subspaces can be mapped across models, enabling the reuse of steering vectors and effective steering of larger models via smaller ones.

Evaluation and Empirical Validation

The hypothesis is tested by learning affine mappings between hidden states of different model architectures and evaluating steering tasks. The methodology involves training linear mappings that capture the transformation between the representation spaces of models of varying sizes. Steering vectors, or directions associated with model behaviors, are derived from smaller models and mapped to larger ones using these transformations.

Empirical validation indicates that these mappings preserve the steering vectors’ semantic effects, supporting the viability of the LRT hypothesis. The findings suggest that smaller models serve as efficient sources of steering behaviors applicable to larger models, streamlining the process of understanding model behavior across scales.

Figure 1: Steering Gemma-9B directly vs. steering Gemma-9B with vectors learned with Gemma-2B, highlighting the correlation and MSE across various tasks.

Practical and Theoretical Implications

The LRT hypothesis has significant implications for both practical applications and theoretical understanding of neural networks:

Practical Implications: The ability to transfer steering vectors from small to large models could streamline development processes, reduce computational costs, and enhance efficiency in training large-scale models. This capability can significantly impact areas like model tuning and optimization.
Theoretical Implications: The hypothesis enhances our understanding of the universal structures underlying neural network representations. It opens avenues for further research on the geometry and linearity of model spaces, advancing the theoretical foundation of neural networks.

Future Research Directions

The study encourages exploration in several promising directions:

Scalability Limits: Understanding the boundaries of linear transferability across model sizes, particularly how effectiveness varies with increasing model scale differences.
Cross-architecture Transferability: Investigating whether linear representational transfers can be extended to different model architectures, providing a deeper understanding of shared model behaviors across diverse architectures.
Universal Representation Space: Further research into the existence and properties of a universal representation space could provide insights into model generalization and representational alignment.

Conclusion

The research presents concrete evidence for the LRT hypothesis, showing its efficacy in transferring steering vectors across model scales via affine mappings. The findings underscore the potential of small models in serving as templates or benchmarks for larger models, facilitating a deeper understanding of neural model behavior and an efficient approach to model steering. The implications for model training, inference, and architectural design are vast, pointing to a future where insights developed from small-scale models can be seamlessly integrated and applied to larger-scale neural networks.

Markdown Report Issue