Residual Block Formulation
- Residual block formulation is a design that adds the output of a mapping F(x) to the block's own input x, enabling iterative refinement and efficient convergence.
- In deep networks, residual blocks mitigate vanishing gradients by incorporating identity skip connections alongside normalization and activation layers.
- In numerical linear algebra, block residuals drive methods like Krylov subspace techniques and block Kaczmarz iterations to enforce orthogonality and accelerate convergence.
A residual block formulation defines an architectural or algorithmic primitive in which outputs of a (possibly nonlinear, learned, or iterative) mapping are combined with their own inputs via addition or projection. This operator is central both in modern deep networks, as exemplified by ResNets, and in block-structured iterative linear algebra, such as block Krylov subspace methods and block Kaczmarz iterations. While the semantics and purpose of the "residual block" differ by context, its generic mathematical form is an update $x_{l+1} = x_l + F(x_l)$ or a projection of a block residual $R_k = B - A X_k$, with $F$ or the projection structured to enable efficient learning, iterative refinement, orthogonalization, or multi-vector acceleration.
1. General Formulation of Residual Blocks
The canonical expression for a residual block in feedforward architectures is

$$x_{l+1} = x_l + F(x_l),$$

where $F$ is a nonlinear operator, typically parameterized by weight matrices and incorporating batch normalization and activation functions. In Krylov methods and iterative block solvers, the block residual at iteration $k$ is usually defined as

$$R_k = B - A X_k,$$

with $X_k$ a block of approximate solutions and $B$ the block of right-hand sides. Block residuals serve both as a direction for further refinement and as an object for enforcing block-orthogonality, block-minimization, or block-projection (Jastrzębski et al., 2017, Soodhalter, 2013, Gu et al., 2016, Sun et al., 2024, Massei et al., 7 Apr 2025).
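The two canonical forms above can be illustrated numerically. A minimal sketch, assuming a toy `tanh` branch for $F$ and a small random system for the linear-algebra case (all names and sizes here are illustrative, not from any cited method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deep-learning form: x_{l+1} = x_l + F(x_l), with F a toy nonlinear map.
def residual_block(x, W):
    return x + np.tanh(W @ x)          # y = x + F(x)

# Linear-algebra form: block residual R_k = B - A X_k for a block of
# right-hand sides B and a block of approximate solutions X_k.
def block_residual(A, X, B):
    return B - A @ X

x = rng.standard_normal(4)
W = 0.1 * rng.standard_normal((4, 4))
y = residual_block(x, W)               # same shape as the input

A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 3))        # 3 right-hand sides
X_exact = np.linalg.solve(A, B)
R = block_residual(A, X_exact, B)      # vanishes at the exact block solution
```

The residual vanishes exactly when the block of solutions is exact, which is what makes $R_k$ usable both as a stopping criterion and as a search direction.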
2. Residual Block Design in Deep Networks
In deep learning, residual blocks enable identity skip connections, directly mitigating vanishing gradient phenomena and enabling iterative feature refinement. The general two-layer residual unit is

$$x_{l+1} = x_l + F(x_l; W_l),$$

with $F$ typically a sequence of two convolution–BN–ReLU (or similar) layers. Systematic investigations reveal multiple implementation alternatives, differing in the placement of batch normalization (BN) and activation (ReLU) with respect to the addition:
| Variant | Main Branch $F(x)$ | Residual Merge $y$ |
|---|---|---|
| RB1 | BN(Conv2(ReLU(Conv1(x)))) | ReLU(x + F(x)) |
| RB2 | Conv2(ReLU(Conv1(x))) | ReLU(BN(x + F(x))) |
| RB3 | BN(Conv2(ReLU(Conv1(x)))) | ReLU(x + F(x)) |
| RB4 | BN(Conv2(Conv1(ReLU(x)))) | x + F(x) |
| RB5 | Conv2(ReLU(BN(Conv1(ReLU(BN(x)))))) | x + F(x) |
| RB6 | BN(Conv2(ReLU(Conv1(x)))) | ReLU(BN(x + F(x))) |
These alternatives significantly affect end-to-end accuracy and optimization stability, with the best-performing variant depending on input normalization and domain (Naranjo-Alcazar et al., 2019).
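Two of the variants can be sketched concretely. This is a minimal illustration, assuming dense layers stand in for the convolutions and a toy, parameter-free batch normalization; `rb1` and `rb5` are hypothetical names for a post-activation-merge variant and a pre-activation variant, in the spirit of the table above:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bn(x):
    # Toy batch norm: standardize over the batch axis (no learned scale/shift).
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)

# Dense layers stand in for Conv1/Conv2 so the sketch stays self-contained.
rng = np.random.default_rng(1)
W1, W2 = 0.1 * rng.standard_normal((2, 8, 8))

def conv1(x): return x @ W1.T
def conv2(x): return x @ W2.T

def rb1(x):
    # Post-activation merge: ReLU(x + BN(Conv2(ReLU(Conv1(x)))))
    return relu(x + bn(conv2(relu(conv1(x)))))

def rb5(x):
    # Pre-activation main branch with a plain additive merge.
    return x + conv2(relu(bn(conv1(relu(bn(x))))))

x = rng.standard_normal((32, 8))   # batch of 32 feature vectors
```

Note the practical difference: a ReLU after the merge (as in `rb1`) constrains block outputs to be nonnegative, while a plain additive merge (as in `rb5`) preserves a clean identity path through the network.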
Analytically, the residual block structure induces an update in feature space that approximates gradient descent on the layerwise loss:

$$x_{l+1} = x_l + F(x_l), \qquad F(x_l) \approx -\eta \, \nabla_{x_l} \mathcal{L}.$$

Empirically, $F(x_l)$ aligns negatively with the loss gradient $\nabla_{x_l} \mathcal{L}$, especially in higher network layers, confirming the iterative refinement interpretation (Jastrzębski et al., 2017).
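The gradient-descent reading of the residual update can be made exact in a toy setting. The sketch below assumes a quadratic layerwise loss and a residual branch defined to be a gradient step (an idealization of the empirical alignment, not the learned $F$ of an actual network):

```python
import numpy as np

# Quadratic layerwise loss L(x) = 0.5 * ||x - t||^2, with gradient x - t.
t = np.array([1.0, -2.0, 0.5])

def grad(x):
    return x - t

# A residual branch that implements a gradient step: F(x) = -eta * grad(x).
eta = 0.3

def F(x):
    return -eta * grad(x)

x = np.array([5.0, 5.0, 5.0])
losses = []
for _ in range(50):
    losses.append(0.5 * np.sum((x - t) ** 2))
    x = x + F(x)                      # x_{l+1} = x_l + F(x_l)
```

Here the cosine between $F(x)$ and the gradient is exactly $-1$ by construction, and stacking the update drives $x$ toward the minimizer $t$; real networks only exhibit this alignment approximately and layer-dependently.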
3. Block Residuals in Krylov and Subspace Methods
Block Krylov subspace methods generalize single-vector approaches by propagating and updating blocks of vectors simultaneously. The block Arnoldi or Lanczos process produces an orthonormal basis spanning a block Krylov subspace, with each iteration enforcing residual orthogonality conditions of the form

$$R_k = B - A X_k \perp \mathcal{L}_k.$$

Here, $R_k$ is a block residual and the constraint subspace $\mathcal{L}_k$ is constructed to enforce either orthogonality or conjugate orthogonality of the residual blocks (Gu et al., 2016). In block MINRES based on the banded Lanczos method, the block residual at iteration $k$,

$$R_k = B - A X_k,$$

is minimized in Frobenius norm over a block Krylov space, with the minimization reducible to a small block least-squares system (Soodhalter, 2013).
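The Frobenius-norm minimization over a block Krylov space can be sketched directly. This is a naive illustration, not the banded-Lanczos algorithm of block MINRES: it builds an explicit (unstable for large powers) block Krylov basis, orthonormalizes it with a QR factorization, and solves the resulting small least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, m = 30, 2, 10                   # matrix size, block width, Krylov steps
G = rng.standard_normal((n, n))
A = G + G.T + n * np.eye(n)           # symmetric, well-conditioned test matrix
B = rng.standard_normal((n, s))       # block of right-hand sides

# Orthonormal basis Q of K_m(A, B) = span{B, AB, ..., A^{m-1} B}.
blocks = [B]
for _ in range(m - 1):
    blocks.append(A @ blocks[-1])
Q, _ = np.linalg.qr(np.hstack(blocks))

# Minimize ||B - A Q Y||_F over Y: a small least-squares problem standing in
# for the banded block least-squares system of block MINRES.
Y, *_ = np.linalg.lstsq(A @ Q, B, rcond=None)
X = Q @ Y

res0 = np.linalg.norm(B)              # initial block residual norm (X_0 = 0)
res = np.linalg.norm(B - A @ X)       # minimized block residual norm
```

In practice the basis is generated by a (banded) Lanczos recurrence rather than explicit powers of $A$, precisely so that the least-squares system stays small and the basis stays numerically orthonormal.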
In block rational Krylov approximations of matrix functions, the residual is further characterized by a block generalization of characteristic polynomials: the block residual satisfies a collinearity relation with the value of a block characteristic polynomial associated with the projected problem, enabling a hierarchy of error formulas and a posteriori norm bounds (Massei et al., 7 Apr 2025).
4. Recycling and Augmented Block Arnoldi Residuals
In recycled Krylov and augmented Arnoldi methods, the block residual formulation is central for integrating a recycled subspace $U$ with new Krylov bases $V_m$. Decomposing the search space over the augmented basis $[U, V_m]$ yields a residual expressed jointly in the recycled and Krylov components. A block lower-triangular correction is included to orthogonalize the Krylov block against $U$, followed by an inverse compact WY modified Gram–Schmidt step for efficient and robust inter-block orthogonalization. To further accelerate convergence, a weighted oblique projection step is applied to the residual, where the weight matrix $W$ reflects the alignment between the residual and the recycled subspace (Thomas et al., 2023).
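The core of such a projection step can be sketched in the unweighted case. This is a simplified illustration with the weight matrix taken as the identity (reducing the oblique projection to a standard Galerkin deflation step against the recycled subspace), not the weighted scheme of the cited method:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 4
G = rng.standard_normal((n, n))
A = G + n * np.eye(n)                             # well-conditioned test matrix
U = np.linalg.qr(rng.standard_normal((n, k)))[0]  # recycled subspace basis
R = rng.standard_normal((n, 2))                   # current block residual

# Galerkin-type projection of the residual against A U. With a weight
# matrix W this becomes oblique: replace U.T below by (W @ U).T.
AU = A @ U
coeff = np.linalg.solve(U.T @ AU, U.T @ R)
R_new = R - AU @ coeff
```

After the step, the corrected residual is orthogonal to the recycled subspace, so subsequent Krylov iterations no longer have to resolve components already captured by $U$.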
5. Residual Block Methods in Stochastic Iterative Linear Solvers
Block partitioning and residual updates underpin the design of block Kaczmarz-type methods. For a row partition $\{(A_{(i)}, b_{(i)})\}_{i=1}^{p}$, the block residuals at iterate $x_k$ are

$$r_k^{(i)} = b_{(i)} - A_{(i)} x_k, \qquad i = 1, \dots, p.$$

The maximum-residual block Kaczmarz method deterministically selects the block $i_k = \arg\max_i \|r_k^{(i)}\|_2$ with the maximal residual norm and projects orthogonally via the pseudoinverse:

$$x_{k+1} = x_k + A_{(i_k)}^{\dagger} \bigl( b_{(i_k)} - A_{(i_k)} x_k \bigr).$$
A relaxation-based version (MRABK) computes a tailored step-size, replaces the full pseudoinverse with row-averaged updates, and provably achieves faster linear convergence rates than randomized block Kaczmarz (Sun et al., 2024).
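The maximum-residual selection and pseudoinverse projection translate directly into code. A minimal sketch on a small consistent system, assuming a uniform row partition and the deterministic max-residual rule described above (the relaxed MRABK step-size is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 60, 12, 6                     # rows, columns, number of row blocks
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                          # consistent system
blocks = np.array_split(np.arange(m), p)

x = np.zeros(n)
for _ in range(100):
    # Deterministically select the block with maximal residual norm.
    norms = [np.linalg.norm(b[idx] - A[idx] @ x) for idx in blocks]
    idx = blocks[int(np.argmax(norms))]
    # Orthogonal projection onto the solution set of the selected block,
    # via the pseudoinverse of the block row submatrix.
    x = x + np.linalg.pinv(A[idx]) @ (b[idx] - A[idx] @ x)
```

Each projection zeroes the selected block's residual exactly, so the greedy rule always moves to a different block on the next step; the relaxed variant replaces the full pseudoinverse with cheaper row-averaged updates.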
6. Invertible Residual Block Flows in Generative Modeling
Residual block composition is also fundamental in the construction of invertible normalizing flows. A residual block on $\mathbb{R}^d$,

$$f(x) = x + g(x), \qquad \mathrm{Lip}(g) < 1,$$

ensures invertibility by Banach's fixed-point theorem. Stacking such blocks yields a flow

$$F = f_L \circ \cdots \circ f_1.$$

Universal approximation in maximum mean discrepancy (MMD) can be achieved by stacking such blocks, with explicit first- and second-order bounds on MMD reduction rates (Kong et al., 2021).
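The fixed-point inversion behind this construction is easy to demonstrate. A minimal sketch, assuming a `tanh` branch whose weight matrix is rescaled so that $\mathrm{Lip}(g) \le 0.5 < 1$ (illustrative choices, not the parameterization of any specific flow):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)        # spectral norm 0.5, so Lip(g) <= 0.5

def g(x):
    return np.tanh(W @ x)              # contractive residual branch

def f(x):
    return x + g(x)                    # invertible residual block

def f_inv(y, iters=60):
    # Banach fixed-point iteration for x = y - g(x); the map is a
    # contraction with factor Lip(g) < 1, so it converges geometrically.
    x = y.copy()
    for _ in range(iters):
        x = y - g(x)
    return x

x = rng.standard_normal(d)
y = f(x)
x_rec = f_inv(y)                       # recovers x to high precision
```

The same iteration inverts a stacked flow block by block, applied in reverse order; the contraction factor also controls how many fixed-point steps are needed per block.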
7. Analysis of Residual Block Effectiveness and Practical Considerations
Residual block formulations, whether in neural architectures or iterative block methods, share the objective of enabling efficient, stable, and scalable progression toward solution or representation refinement. Empirical studies confirm that fine details within block structure (e.g., nonlinearity placement, normalization order, block-relative orthogonality) can control convergence and generalization in numerical and learning contexts. The adoption of residual block orthogonalization, block correction, and weighted projections in large-scale, multi-right-hand-side, or recycling contexts further improves efficiency, with measurable reductions in iteration counts and computational cost (Thomas et al., 2023, Massei et al., 7 Apr 2025).
A comprehensive view reveals that block residual formulations are not only a structural convenience but a mathematically expressive and algorithmically pivotal ingredient across numerical linear algebra, optimization, and modern machine learning.