Value Residual Connections
- Value residual connections are mechanisms that propagate early-layer value streams across deeper layers, preserving token-level information.
- Architectures like ResFormer and SVFormer leverage these connections to achieve efficiency in parameters and data while mitigating over-smoothing.
- They improve optimization stability and offer adaptability across models including Transformers, vision networks, and graph neural networks.
A value residual connection is a neural architecture mechanism that delivers information from one (often early) set of value representations to deeper layers, typically in addition to or as a structured extension of the classic hidden-state or identity skip connection. In contrast to standard residuals—which add a layer’s output back to its input to preserve representational fidelity and gradient stability—value residual connections propagate value streams, such as the value projections used in self-attention, from a designated layer (often the first) into the attention computation at subsequent layers. This approach aims to address information bottlenecks, over-smoothing, and the loss of token-level or local features in deep architectures, with empirical and theoretical backing in Transformers, vision networks, and graph neural networks.
1. Concept and Formal Distinction from Classic Residuals
In standard Transformers, each layer computes projected queries $Q_n$, keys $K_n$, and values $V_n$ from the hidden state $H_n$, computes an attention output $U_n = \mathrm{softmax}\!\big(Q_n K_n^\top / \sqrt{d}\big)\, V_n$, and then updates the hidden state using a hidden-state residual:

$$H_{n+1} = \mathrm{LayerNorm}\big(H_n + U_n + \mathrm{MLP}(U_n)\big)$$
A value residual connection, in the sense of ResFormer (Zhou et al., 2024), augments this by modifying the values fed into attention:
- Instead of using $V_n$ alone, the values are enriched at every layer with a persistent stream from the first layer: $V_n' = V_n + V_1$.
Only the attention-value pathway is affected; the skip connection on $H_n$ persists as in the original.
In SVFormer, a more extreme variant, once $V_1$ is computed, all subsequent layers entirely reuse $V_1$ in place of new value projections: $V_n = V_1$ for all $n \ge 2$.
Unlike hidden-state residuals, value residuals introduce cross-layer “shortcuts” specifically for the values used in self-attention, thereby addressing the degradation of per-token information with depth.
2. Implementation: Forward Pass and Schematics
The ResFormer and SVFormer architectures can be specified concisely:
ResFormer (for $n \ge 2$):
```
Qn, Kn, Vn = LinearProj(Hn)
An = softmax(Qn @ Kn.T / sqrt(d))
Vres = Vn + V1
Un = An @ Vres
Hn_plus_1 = LayerNorm(Hn + Un + MLP(Un))
```
SVFormer (for $n \ge 2$):
```
Qn, Kn = LinearProj(Hn)  # Vn not computed
An = softmax(Qn @ Kn.T / sqrt(d))
Un = An @ V1
Hn_plus_1 = LayerNorm(Hn + Un + MLP(Un))
```
These changes require minimal architectural modification and no extra matrix multiplications: the value residual is a simple addition, and in SVFormer the value-projection layers are omitted entirely after the first layer.
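As a concrete illustration, the ResFormer-style forward pass above can be sketched in plain NumPy. This is a minimal single-head sketch with hypothetical shapes, omitting LayerNorm, the MLP, and multi-head splitting; it is not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resformer_attention(H, Wq, Wk, Wv, V1=None):
    """Single-head attention with a ResFormer-style value residual.

    H: (seq, d) hidden states; Wq, Wk, Wv: (d, d) projections.
    V1: first-layer value stream, or None when this is the first layer.
    Returns the attention output and the value stream to pass onward.
    """
    d = H.shape[-1]
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    if V1 is not None:
        V = V + V1                        # value residual: V_n + V_1
    A = softmax(Q @ K.T / np.sqrt(d))
    return A @ V, (V if V1 is None else V1)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
Wq, Wk, Wv = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))
U1, V1 = resformer_attention(H, Wq, Wk, Wv)          # layer 1 establishes V1
U2, _ = resformer_attention(H + U1, Wq, Wk, Wv, V1)  # deeper layers reuse V1
```

Only the value pathway changes; the queries, keys, and hidden-state skip connection are computed exactly as in a vanilla layer.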
3. Theoretical and Empirical Impact
Parameter and Data Efficiency
- ResFormer attains equivalent validation loss with ~10–14% fewer parameters and ~13.6–20.3% less data compared to baseline Transformers at the same scale (Zhou et al., 2024).
Information Flow and Representational Capacity
- Value residuals preserve token-level information through depth and mitigate the “attention concentration ↔ value-drain” feedback loop, thereby countering over-smoothing effects observed when only hidden-state residuals are used (Zhou et al., 2024).
- MUDDFormer extends the concept to multiway and position-wise dynamic value residuals, allowing dynamic dense aggregation of all previous layer outputs for each of the query, key, value, and residual streams, which improves cross-layer signal propagation and prevents representation collapse (Xiao et al., 13 Feb 2025).
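A minimal, hypothetical reading of the dynamic dense aggregation idea can be sketched as follows. The shapes and the rule for generating mixing weights from the current hidden state are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_dense_aggregate(past_H, W_dyn):
    """Mix the outputs of all previous layers with per-position weights.

    past_H: list of (seq, d) hidden states from layers 0..l.
    W_dyn: (d, max_layers) matrix mapping the current hidden state to
    per-position mixing logits over past layers (hypothetical minimal form).
    """
    L = len(past_H)
    weights = softmax(past_H[-1] @ W_dyn[:, :L])  # (seq, L): one mix per position
    stacked = np.stack(past_H, axis=1)            # (seq, L, d)
    return np.einsum('sl,sld->sd', weights, stacked)

rng = np.random.default_rng(0)
past = [rng.normal(size=(4, 8)) for _ in range(3)]  # outputs of layers 0..2
W_dyn = 0.1 * rng.normal(size=(8, 6))
mixed = dynamic_dense_aggregate(past, W_dyn)
```

The key difference from a static dense connection is that the mixing weights depend on the current hidden state, so each position can aggregate past layers differently.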
Training and Inference
- The computational cost of ResFormer matches that of the vanilla Transformer, while SVFormer can halve the required key-value cache at inference, enabling significant memory savings for long-sequence contexts (Zhou et al., 2024).
- Because value residuals are largely orthogonal to other KV-efficiency techniques, such methods can be integrated on top, providing further savings in memory and runtime.
- In MUDDFormer, dynamic value–residual connections add <0.5% parameter/FLOP overhead, but yield loss curves and downstream accuracies equivalent to or better than models trained with 1.8–2.4× more compute (Xiao et al., 13 Feb 2025).
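The memory claim for SVFormer follows from a back-of-the-envelope cache calculation: a vanilla decoder caches one K and one V tensor per layer, while value sharing keeps only a single V tensor. The sizes below are hypothetical and the accounting ignores heads and batch dimensions:

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_elem=2, share_values=False):
    """Rough decoder KV-cache size in bytes.

    Vanilla attention caches K and V per layer; an SVFormer-style model
    reuses the first layer's values everywhere, so only one V tensor is
    stored. Illustrative accounting only.
    """
    per_tensor = seq_len * d_model * bytes_per_elem
    n_value_tensors = 1 if share_values else n_layers
    return (n_layers + n_value_tensors) * per_tensor

vanilla = kv_cache_bytes(32, 4096, 4096)
shared = kv_cache_bytes(32, 4096, 4096, share_values=True)
ratio = shared / vanilla  # approaches 1/2 as the layer count grows
```

With 32 layers the cache holds 33 tensors instead of 64, i.e. roughly half, and the ratio tends to 1/2 as depth increases.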
Optimization Stability
- Value residuals provably improve the conditioning of the self-attention output, preventing rank collapse (i.e., ensuring a positive spectral gap) and guaranteeing a linear convergence rate for gradient descent in both single- and multi-layer Transformers (Qin et al., 5 Jun 2025).
- Without value residuals, softmax attention matrix outputs can be nearly rank-one at high dimension, stalling optimization.
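The collapse effect is easy to reproduce numerically. In the toy check below, perfectly uniform attention weights stand in for the over-smoothing regime: repeated application of the plain value pathway drives the representation to numerical rank one, while re-injecting the first-layer values keeps the rank from collapsing. This is an illustrative construction, not the cited paper's setting:

```python
import numpy as np

def numerical_rank(M, tol=1e-6):
    # Count singular values above a relative threshold.
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > tol * s[0]).sum())

rng = np.random.default_rng(1)
seq, d = 16, 16
V1 = rng.normal(size=(seq, d))
A = np.full((seq, seq), 1.0 / seq)  # uniform attention: every row averages all tokens

X = V1.copy()
Y = V1.copy()
for _ in range(8):
    X = A @ X        # plain value pathway: rows average together and collapse
    Y = A @ Y + V1   # value residual re-injects the layer-1 value stream
```

After the loop, `X` has all rows identical (rank one), whereas `Y` retains a full-rank component contributed by `V1` at every step.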
Robustness to Deep Network Pathologies
- Residual/skip connections prevent “dissipating input” phenomena, where repeated nonlinear layers cause the input to degrade to random noise, as shown by lower bounds on the fraction of surviving neurons and preserved features over depth (Zhang et al., 2024).
- These properties generalize from classic hidden-state residuals to any cross-layer path, including value residuals (see PNNH, the Plain Neural Net Hypothesis).
4. Value Residual Connections Beyond Transformers
In graph neural networks (GNNs), “adaptive initial residual connections” constitute a value-residual mechanism, where each node’s initial embedding is injected into every layer and is modulated by a learned (or heuristic) strength per node (Shirzadi et al., 10 Nov 2025):

$$H^{(\ell+1)} = \sigma\!\Big(\big(I - \mathrm{diag}(\alpha)\big)\,\hat{A} H^{(\ell)} W^{(\ell)} + \mathrm{diag}(\alpha)\, H^{(0)}\Big)$$

Here $\alpha = (\alpha_v)_{v \in V}$, with node-specific $\alpha_v \in (0,1)$, ensuring that oversmoothing is theoretically precluded by keeping the Dirichlet energy bounded away from zero. This mechanism enables deep GNNs to retain expressiveness even on heterophilic graphs.
5. Variants and Trade-offs: Decayed and Weighted Residuals
Instead of a strict addition or replacement, variants such as value-weighted (decayed) residuals modulate the strength of the identity or value skip connection as a function of layer depth:

$$H_{l+1} = \lambda_l\, H_l + F_l(H_l), \qquad \lambda_l \in (0,1)$$

Decreasing $\lambda_l$ with depth forces the model to develop more abstract and higher-level features, as the contribution from low-level identity paths shrinks exponentially. Empirically, such decayed residuals elevate linear-probe accuracy in generative settings (e.g., from 67.3% to 72.3% in ViT-B/16 MAE) and are correlated with compression of feature rank (Zhang et al., 2024).
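The exponential shrinkage of the identity path follows directly from multiplying the per-layer weights. The geometric schedule $\lambda_l = \gamma^l$ below is an assumed example, not the paper's exact choice:

```python
def identity_path_weight(n_layers, gamma=0.9):
    """Cumulative weight of the layer-0 identity path under decayed
    residuals H_{l+1} = lambda_l * H_l + F_l(H_l), with lambda_l = gamma**l.

    The running product of the lambdas shrinks exponentially in depth,
    so deep layers must rely on transformed (more abstract) features.
    """
    weight = 1.0
    trajectory = []
    for l in range(n_layers):
        weight *= gamma ** l
        trajectory.append(weight)
    return trajectory

traj = identity_path_weight(5)  # cumulative identity weight after each layer
```

After five layers the identity path carries weight $\gamma^{0+1+2+3+4} = \gamma^{10}$, about 0.35 for $\gamma = 0.9$, and the decay accelerates with depth.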
6. Extensions, Limitations, and Future Research
Extensions
- Value residuals have been proposed for Vision Transformers, speech and audio sequence models, large-context retrieval systems, and mixture-of-experts or multitask architectures (Zhou et al., 2024).
- Multiway dynamic dense value-residual connections (MUDD) generalize this approach by enabling per-position, per-stream adaptive aggregation for queries, keys, values, and hidden states (Xiao et al., 13 Feb 2025).
Limitations
- Value residual mechanisms may require additional hyperparameter tuning (e.g., residual strength decay rate) and may underperform if not properly modulated for the specific task or depth (Zhang et al., 2024).
- In GNNs, adaptive value residuals introduce additional parameters (learnable per-node strengths) or require heuristic assignments (e.g., via PageRank), with positive theoretical guarantees contingent on alignment and regular graph properties (Shirzadi et al., 10 Nov 2025).
Ongoing Research
- Theoretical generalization to more flexible residual schemes and further empirical exploration of value-residual mechanisms in dense and multiway settings are active directions.
- The relationship between cross-layer value pathways, feature abstraction, and robustness (e.g., to over-smoothing or collapse) continues to be examined across domains.
7. Summary of Core Insights
- Value residual connections provide a route for token-level or node-level information to propagate deeply without degradation.
- Architectures such as ResFormer and SVFormer utilize this to achieve parameter efficiency, data efficiency, and improved performance in both training and inference (Zhou et al., 2024).
- Dynamic multiway residual architectures (MUDDFormer) further enhance model capacity and information flow with minimal overhead (Xiao et al., 13 Feb 2025).
- Decayed or value-weighted skips promote abstraction by controlling the depth-wise contribution of identities, crucial for high-quality feature learning in generative models (Zhang et al., 2024).
- In GNNs, value residuals (adaptive initial residual connections) offer the first rigorous guarantee against oversmoothing in the presence of nonlinear activation functions (Shirzadi et al., 10 Nov 2025).
- Robust theoretical and empirical evidence supports value residual connections as a critical tool for overcoming cross-layer information bottlenecks in deep and scalable neural networks.