mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations

Published 9 Jan 2026 in cs.LG and cs.AI | (2601.05732v1)

Abstract: Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn–Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff–von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.


Summary

The paper introduces mHC-lite, which directly parameterizes residual matrices as convex combinations of permutation matrices to achieve exact doubly stochasticity.
It eliminates the need for iterative Sinkhorn–Knopp normalization, thereby improving training stability and computational efficiency in deep neural networks.
Empirical evaluations show that mHC-lite matches or exceeds mHC in performance, with more stable gradients, across the tested datasets and model scales.

mHC-lite: Reparameterizing Residual Connections with Exact Doubly Stochasticity

Introduction and Motivation

Residual architectures have underpinned the scalable training of deep neural networks, enabling advances in Transformer-based LLMs and other modern architectures. Hyper-Connections (HC) extend classical residual mappings by dynamically mixing multiple streams through learnable residual matrices, accelerating convergence and potentially improving expressivity. Manifold-Constrained Hyper-Connections (mHC) sought to enhance stability by projecting these residual matrices onto the Birkhoff polytope (the set of doubly stochastic matrices) via iterative Sinkhorn–Knopp (SK) normalization.

However, a finite number of SK iterations does not ensure exact doubly stochasticity. The resulting approximation gap can propagate instability, especially as model depth increases. mHC-lite uses the Birkhoff–von Neumann theorem to parameterize these matrices directly as convex combinations of permutation matrices, which guarantees exact doubly stochasticity without iterative projection or reliance on specialized kernels.

Figure 1: Schematic contrasting SK-based construction of residual matrices in mHC versus exact convex combination in mHC-lite.

Methodology

mHC-lite replaces the iterative SK normalization of the dynamic residual matrix with a direct construction that is exact by design. By the Birkhoff–von Neumann theorem, for $n$ residual streams every doubly stochastic matrix (every point of the Birkhoff polytope) can be written as a convex combination of the $n!$ permutation matrices of size $n \times n$. The weights of this combination are produced by a trainable softmax layer conditioned on the layer inputs, and the matrix itself is assembled with native matrix multiplications.
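
The construction can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the logits are passed in directly rather than produced by a learned layer, and explicit loops stand in for the matrix multiplications a real implementation would use.

```python
import itertools
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def doubly_stochastic_from_logits(logits, n):
    """Build an n x n matrix as a convex combination of all n! permutation
    matrices, weighted by softmax(logits). Hypothetical helper for illustration."""
    perms = list(itertools.permutations(range(n)))
    if len(logits) != len(perms):
        raise ValueError("expected one logit per permutation (n! in total)")
    weights = softmax(logits)
    H = [[0.0] * n for _ in range(n)]
    for w, perm in zip(weights, perms):
        # The permutation matrix for `perm` has a single 1 in each row,
        # at column perm[row]; adding w there accumulates w * P.
        for row, col in enumerate(perm):
            H[row][col] += w
    return H


def row_sums(H):
    return [sum(row) for row in H]


def col_sums(H):
    return [sum(col) for col in zip(*H)]


# Example with n = 4 streams, i.e. 24 permutation weights.
H = doubly_stochastic_from_logits([0.1 * i - 1.0 for i in range(24)], 4)
print(row_sums(H))
print(col_sums(H))
```

Every row and column sum equals one up to floating-point rounding, for any choice of logits, because each permutation matrix contributes its weight exactly once to every row and every column.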

This reparameterization enforces row and column sums of exactly one (doubly stochasticity), which controls the spectral norm and keeps repeated matrix composition across layers stable. Because it also avoids specialized CUDA kernels and the need to store or recompute intermediate iteration results, the method is portable and computationally cheap.
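
The stability claim follows from standard facts about doubly stochastic matrices, sketched here. The symbols $w_k$ (convex weights), $\boldsymbol{P}_k$ (permutation matrices), and $\mathbf{1}$ (the all-ones vector) are notation introduced for this sketch and may differ from the paper's.

$$
\boldsymbol{H} = \sum_{k=1}^{n!} w_k \boldsymbol{P}_k,\quad w_k \ge 0,\quad \sum_{k} w_k = 1
\;\Longrightarrow\;
\boldsymbol{H}\mathbf{1} = \sum_{k} w_k \boldsymbol{P}_k \mathbf{1} = \mathbf{1},
\qquad
\mathbf{1}^\top \boldsymbol{H} = \sum_{k} w_k \mathbf{1}^\top \boldsymbol{P}_k = \mathbf{1}^\top
$$

$$
(\boldsymbol{H}_1 \boldsymbol{H}_2)\mathbf{1} = \boldsymbol{H}_1 \mathbf{1} = \mathbf{1},
\qquad
\mathbf{1}^\top (\boldsymbol{H}_1 \boldsymbol{H}_2) = \mathbf{1}^\top \boldsymbol{H}_2 = \mathbf{1}^\top
$$

$$
\|\boldsymbol{H}\|_2 \le \sum_{k} w_k \|\boldsymbol{P}_k\|_2 = 1,
\qquad
\|\boldsymbol{H}\|_2 \ge \frac{\|\boldsymbol{H}\mathbf{1}\|_2}{\|\mathbf{1}\|_2} = 1
$$

The first line shows that the convex combination has unit row and column sums. The second shows that a product of doubly stochastic matrices is again doubly stochastic. The third shows that the spectral norm is exactly one, so stacking any number of such layers cannot amplify the residual stream in norm.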

Theoretical and Empirical Analysis

The convergence of the SK algorithm is not uniform across inputs: for ill-conditioned matrices it can require up to $O\left(\frac{n^2\log (n/\nu)}{\epsilon^2}\right)$ iterations to reach $\ell_1$-error at most $\epsilon$, where $\nu$ is the relative entry range. Empirical evaluation shows that a substantial fraction of the matrices fed to SK in the mHC block have an extremely large relative range ($1/\nu \ge 10^{13}$), so they still deviate from the doubly stochastic constraint after the standard 20 iterations.
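
The effect can be reproduced on a toy input. The sketch below is an illustrative assumption rather than the mHC kernel: it exponentiates a logit matrix and alternates column and row normalization, ending each iteration on the row step so that the residual error appears in the column sums. The function names and the 2×2 example are hypothetical.

```python
import math


def sinkhorn_knopp(logits, num_iters):
    """Exponentiate `logits`, then alternate column and row normalization.

    Each iteration ends with the row step, so rows sum to one and any
    remaining error shows up in the column sums (an assumed ordering)."""
    n = len(logits)
    m = max(max(row) for row in logits)  # shift for numerical stability
    M = [[math.exp(x - m) for x in row] for row in logits]
    for _ in range(num_iters):
        col = [sum(M[i][j] for i in range(n)) for j in range(n)]
        M = [[M[i][j] / col[j] for j in range(n)] for i in range(n)]
        row = [sum(r) for r in M]
        M = [[M[i][j] / row[i] for j in range(n)] for i in range(n)]
    return M


def max_col_deviation(M):
    """Largest absolute gap between a column sum and one."""
    return max(abs(sum(col) - 1.0) for col in zip(*M))


# One entry is exp(-30) times the others, so 1/nu is about 1e13.
ill_conditioned = [[0.0, -30.0], [0.0, 0.0]]
print(max_col_deviation(sinkhorn_knopp(ill_conditioned, 20)))
print(max_col_deviation(sinkhorn_knopp(ill_conditioned, 200)))
```

On this input the column-sum error after $k$ iterations decays only like $1/(2k+1)$, so 20 iterations leave a gap of roughly 0.024 and a tenfold increase in iterations shrinks it only about tenfold. A well-conditioned input such as an all-zero logit matrix converges after a single iteration.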

Figure 2: Distribution of $\log(1/\nu)$ for SK input matrices, indicating slow convergence potential for high relative range.

This deviation shows up as column sums that drift away from one, both in per-layer residual matrices and in their stacked products. The drift becomes more pronounced with depth, threatening gradient stability and learning dynamics.

Figure 3: Boxplots of column sums for single-layer and multi-layer (stacked product) $\boldsymbol{H}^{\text{res}}$ matrices; mHC-lite maintains exact unity.

mHC-lite, by construction, eliminates these approximation gaps. Empirical results demonstrate that mHC-lite matches or exceeds mHC in training and validation loss across all tested datasets and model scales, with superior gradient norm stability.

Figure 4: Gradient norm trajectories during training; mHC-lite exhibits lower mean and fluctuation compared to mHC.

Computational Efficiency

The computational efficiency of mHC-lite is a direct consequence of its implementation with standard matrix operations. Unlike mHC, whose efficiency depends on bespoke kernel optimization, mHC-lite achieves higher throughput even in a naive implementation. System-level benchmarks show a token throughput advantage for mHC-lite over both HC and mHC (the latter implemented without kernel fusion).

Figure 5: Training token throughput comparisons on A100 GPUs; mHC-lite exceeds HC and PyTorch-based mHC implementations.

The construction remains tractable for practical $n$ (e.g., $n=4$ streams gives $n!=24$ permutations); for larger $n$, alternative strategies such as sampling a subset of permutations are possible.
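
To make the growth of $n!$ and the subset idea concrete, here is a small sketch. The sampling scheme (uniform random distinct permutations with a fixed seed) and the helper names are assumptions chosen for illustration; the summary does not specify how subsets would be selected.

```python
import math
import random


def sample_permutations(n, k, seed=0):
    """Draw k distinct permutations of range(n) without enumerating all n!.

    Hypothetical sampling scheme, used only to illustrate the idea."""
    if k > math.factorial(n):
        raise ValueError("k cannot exceed n!")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < k:
        perm = list(range(n))
        rng.shuffle(perm)
        seen.add(tuple(perm))
    return sorted(seen)


def combine(perms, weights):
    """Convex combination of the permutation matrices given by `perms`."""
    n = len(perms[0])
    H = [[0.0] * n for _ in range(n)]
    for w, perm in zip(weights, perms):
        for row, col in enumerate(perm):
            H[row][col] += w
    return H


# Number of permutation weights needed for the full construction.
for n in (2, 4, 6, 8):
    print(n, math.factorial(n))

# With n = 8 streams, use 16 sampled permutations instead of all 40320.
perms = sample_permutations(8, 16)
H = combine(perms, [1.0 / 16] * 16)
print([sum(row) for row in H])
print([sum(col) for col in zip(*H)])
```

Any convex combination of a subset of permutation matrices is still exactly doubly stochastic, so the guarantee is preserved. The trade-off is expressivity: a subset reaches only the convex hull of the chosen permutations, not the whole Birkhoff polytope.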

Practical and Theoretical Implications

mHC-lite provides a rigorous mechanism for stability in deep architectures, relevant for highly scaled models (e.g., recent 1,000-layer networks in RL). The paradigm shift from approximate projection to exact reparameterization of constrained matrices sets a template for other constraint enforcement problems in neural network design, removing sources of instability and engineering overhead.

The guaranteed stability of mHC-lite enables further architectural exploration: deeper models, more expressive multi-stream residual structures, and simplified deployment in diverse training frameworks. For adaptive or heterogeneous architectures, the convex combination mechanism may also facilitate regularization or interpretability of routing through permutations.

Conclusion

mHC-lite constitutes a robust alternative to iterative projection-based manifold constraint, enforcing exact doubly stochasticity in residual connectivity with minimal computational and engineering cost. The design addresses lingering instabilities in prior architectures, maintains or improves model performance, and enhances computational efficiency and portability. The reparameterization strategy, grounded in foundational matrix theory, invites broader applications and paves the way for stable, expressive residual designs in future AI systems.
