- The paper introduces GeoNorm, a geodesic optimization technique that reinterprets Transformer normalization as spherical manifold updates.
- It replaces conventional Pre-Norm and Post-Norm with geodesic updates, ensuring smoother, more stable training dynamics.
- Empirical results show GeoNorm outperforming existing normalization schemes, with lower loss and more stable convergence across diverse model configurations and training lengths.
GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization
Introduction
The paper "GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization" (2601.22095) addresses a long-standing design question in Transformer architectures: where to place normalization layers, i.e., the choice between Pre-Norm and Post-Norm. The authors reinterpret these normalizations from an optimization perspective on a manifold, viewing the outputs of the FFN and attention layers as update directions. This perspective gives rise to GeoNorm, a method that replaces conventional normalization with geodesic optimization and integrates seamlessly with standard Transformer designs.
Theoretical Framework
The authors introduce a theoretical framework that interprets the operations within Transformer layers as optimization steps on a spherical manifold. In this framework, the FFN and Attention outputs serve as pseudo-gradient directions, and normalization is treated as projecting the updated state back onto the sphere defined by the normalization radius. This perspective casts Transformer layers as iterative manifold optimization steps and motivates replacing conventional projections with geodesic updates, which promise smoother optimization trajectories and better convergence properties due to their intrinsic manifold consistency.
GeoNorm Normalization
GeoNorm replaces the projection step of traditional normalization with geodesic updates along the sphere. This is augmented by a layer-wise update decay mechanism analogous to a learning rate schedule, intended to stabilize and enhance the training dynamics of Transformer models. The geodesic updates keep each step coherent with the manifold's geometry, preserving the structural consistency of sequential update directions across layers.
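As a rough illustration of the layer-wise decay idea, the sketch below shrinks the per-layer step size geometrically with depth and applies one geodesic step per layer on the unit sphere. The geometric schedule, the `base_step` and `decay` parameters, and the fixed unit radius are all assumptions for illustration; the paper's actual schedule may differ:

```python
import math

def decayed_steps(num_layers, base_step=0.5, decay=0.8):
    # Per-layer step sizes shrink with depth, mirroring a learning-rate
    # schedule (hypothetical geometric decay).
    return [base_step * decay ** i for i in range(num_layers)]

def run_stack(h, directions, base_step=0.5, decay=0.8):
    """Apply one decayed geodesic step per layer on the unit sphere.

    h is the hidden state (assumed unit-norm); directions holds one
    pseudo-gradient per layer, standing in for FFN/attention outputs.
    """
    steps = decayed_steps(len(directions), base_step, decay)
    for step, g in zip(steps, directions):
        # Tangential component of the layer's update direction.
        dot = sum(a * b for a, b in zip(h, g))
        tan = [b - dot * a for a, b in zip(h, g)]
        tn = math.sqrt(sum(x * x for x in tan))
        if tn == 0.0:
            continue  # purely radial update; skip this layer's step
        theta = step * tn
        # Exponential map: h stays exactly on the unit sphere.
        h = [math.cos(theta) * a + math.sin(theta) * t / tn
             for a, t in zip(h, tan)]
    return h
```

Because each layer's step is an exponential map, the hidden state remains on the sphere after every layer without any renormalization, while the decaying step sizes damp later layers' updates, the stabilizing effect the schedule is meant to provide.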
Empirical Validation
Experimental evaluations underscore the effectiveness of GeoNorm across diverse datasets and Transformer configurations. GeoNorm consistently outperforms existing methods such as Pre-Norm, Post-Norm, DeepNorm, and SandwichNorm. Its advantage holds across model sizes from 125M and 350M up to 1.3B parameters, with lower loss and more stable convergence. The method is computationally efficient, integrating into existing architectures without significant overhead.
The paper reports GeoNorm's consistent superiority across various training lengths and datasets. With training extended to lengths of 1024 and beyond, GeoNorm maintains its performance edge, with better convergence rates and stability. These results indicate that GeoNorm not only achieves initial performance gains but sustains them as training progresses, highlighting its capacity for stable training dynamics free from the loss spikes commonly seen with other normalization methods.
Implications and Future Perspectives
The application of geodesic optimization to normalization in Transformers marks a shift towards exploiting manifold geometry to enhance neural network training. This approach paves the way for further investigations into Riemannian methods across broader machine learning contexts, potentially leading to novel optimization techniques beyond the immediate scope of Transformers. Looking forward, GeoNorm might inspire adaptations in other architectures, promising advancements in model stability and efficiency.
Conclusion
"GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization" advances our understanding of normalization within Transformers from a manifold optimization perspective. By introducing geodesic updates, it achieves significant empirical success and stability improvements across diverse datasets and model configurations, providing a robust framework for normalization in large-scale neural networks. This theoretical and practical enhancement holds implications for future research directions in optimization on manifolds and neural network architectures.