Maximum Residual Aggregator
- Maximum Residual Aggregator is a method that selects row or block updates based on the largest residual, enhancing convergence in iterative solvers and neural architectures.
- In iterative linear solvers like MRBK and MEMRK, focusing on maximal residual blocks leads to faster linear convergence and reduced iterations.
- In neural networks, replacing additive residuals with max-based aggregation (MaxProp or LeakyMax) yields improved gradient flow and early learning dynamics.
A maximum residual aggregator is an element-wise or block-wise selection rule designed to prioritize updates to the components or blocks corresponding to the largest residuals in iterative numerical algorithms or neural network architectures. The concept surfaces in two distinct but mathematically analogous contexts: (1) iterative solvers for large linear systems, and (2) neural network block aggregation—most notably as an alternative to additive residual connections.
1. Maximum Residual Aggregator in Iterative Linear Solvers
The maximum residual aggregator is fundamental in modern variants of the Kaczmarz method for solving both consistent and inconsistent linear systems. It dictates the selection of the row or block with the maximum residual to focus the iterative update, thereby accelerating convergence.
Residual and Block Definitions
Given a linear system $Ax = b$, with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$, the residual vector at iteration $k$ is $r^{(k)} = b - Ax^{(k)}$.
Partitioning the rows into $q$ disjoint blocks $V_1, \dots, V_q$, the block residuals are $r_i^{(k)} = b_{V_i} - A_{V_i} x^{(k)}$, with $\lVert r_i^{(k)} \rVert_2$ the block's residual norm.
Maximum Residual Aggregation Rule
At each iteration, the block or row index $i_k$ is chosen according to the largest residual norm:
$$i_k = \arg\max_{1 \le i \le q} \lVert r_i^{(k)} \rVert_2.$$
This greedy selection focuses updates on the components (rows or blocks) that are maximally inconsistent with the current iterate.
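The selection rule can be sketched in NumPy (function and variable names here are illustrative, not taken from the cited papers):

```python
import numpy as np

def max_residual_block(A, b, x, blocks):
    """Return the index of the block with the largest residual norm.

    `blocks` is a list of row-index arrays partitioning range(A.shape[0]).
    """
    norms = [np.linalg.norm(b[V] - A[V] @ x) for V in blocks]
    return int(np.argmax(norms))
```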
Example: MRBK and MEMRK Algorithms
- Maximum Residual Block Kaczmarz (MRBK): At each step, the method projects onto the solution space of the block with maximal residual norm using the Moore–Penrose pseudo-inverse of that block (Sun et al., 2024).
- Multi-step Extended Maximum Residual Kaczmarz (MEMRK): For inconsistent systems, an auxiliary variable $z^{(k)}$ is iteratively projected toward the orthogonal complement of the column space of $A$ over multiple inner iterations, followed by a row update using the maximum residual component (Xiao et al., 2023).
2. Maximum Residual Aggregator in Neural Network Architectures
In deep learning, the maximum residual aggregator refers to modifications of standard residual connections in block architectures such as ResNets. Here, instead of additive fusion, the block output is the element-wise maximum (MaxProp), or a convex combination (LeakyMax), of the input and the block transformation (Fuhl, 2021).
Definitions and Mathematical Formulation
- Standard Residual Block: $y = x + F(x)$
- Maximum Propagation (MaxProp) Block: $y = \max(x, F(x))$ (elementwise maximum)
- Leaky Maximum Block: $y = \alpha \max(x, F(x)) + (1 - \alpha) \min(x, F(x))$
with fixed $\alpha \in (0.5, 1]$; $\alpha = 1$ recovers MaxProp.
This approach is motivated by maxout networks and is essentially a feature-wise hard selection or gating mechanism.
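As a concrete illustration, both aggregators reduce to one-line element-wise operations. The LeakyMax weighting below (element-wise winner weighted by a fixed alpha) is an assumed form consistent with the convex-combination description above, not necessarily the exact parameterization of Fuhl (2021):

```python
import numpy as np

def maxprop(x, fx):
    # element-wise maximum of block input and block transformation
    return np.maximum(x, fx)

def leakymax(x, fx, alpha=0.8):
    # convex combination: the element-wise winner gets weight alpha
    return alpha * np.maximum(x, fx) + (1 - alpha) * np.minimum(x, fx)
```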
3. Algorithmic Frameworks and Pseudocode
Maximum Residual Kaczmarz Variants
| Algorithm | Aggregation Rule | Update Type |
|---|---|---|
| MRBK | Max block residual norm ($\max_i \lVert r_i^{(k)} \rVert_2$) | Projection via block pseudo-inverse $A_{V_{i_k}}^{\dagger}$ (Sun et al., 2024) |
| MEMRK | Max row absolute residual ($\max_i \lvert r_i^{(k)} \rvert$) | Greedy row update with multi-step auxiliary projections (Xiao et al., 2023) |
MRBK Pseudocode
```python
for k in range(ell):
    # residual norm of each of the q blocks at the current iterate
    block_norms = [np.linalg.norm(b[V[i]] - A[V[i]] @ x) for i in range(q)]
    i_k = int(np.argmax(block_norms))
    # project onto the solution space of the maximal-residual block
    x = x + np.linalg.pinv(A[V[i_k]]) @ (b[V[i_k]] - A[V[i_k]] @ x)
```
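The loop above can be made self-contained and checked on a small consistent system (matrix sizes, partition, and iteration count below are illustrative):

```python
import numpy as np

def mrbk(A, b, blocks, n_iter=2000):
    """Maximum residual block Kaczmarz: a sketch, not the reference implementation."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # greedy selection of the maximal-residual block
        i = int(np.argmax([np.linalg.norm(b[V] - A[V] @ x) for V in blocks]))
        V = blocks[i]
        # projection onto the block's solution space via the pseudo-inverse
        x = x + np.linalg.pinv(A[V]) @ (b[V] - A[V] @ x)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
b = A @ x_true                              # consistent by construction
blocks = np.array_split(np.arange(20), 10)  # 10 blocks of 2 rows each
x = mrbk(A, b, blocks)                      # converges to x_true
```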
MEMRK Main Loop
- Update the auxiliary variable $z^{(k)}$ by Kaczmarz-style column projections, $z \leftarrow z - \dfrac{A_{:,j}^{\top} z}{\lVert A_{:,j} \rVert_2^2} A_{:,j}$, over multiple inner steps.
- Compute the residual $r^{(k)} = b - z^{(k+1)} - A x^{(k)}$.
- Select $i_k = \arg\max_i \lvert r_i^{(k)} \rvert$.
- Update $x^{(k+1)} = x^{(k)} + \dfrac{r_{i_k}^{(k)}}{\lVert A_{i_k,:} \rVert_2^2} A_{i_k,:}^{\top}$.
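A minimal sketch of one outer iteration, assuming the standard extended-Kaczmarz form with randomized column selection for the auxiliary projections (the paper's exact inner-step schedule may differ):

```python
import numpy as np

def memrk_step(A, b, x, z, inner_steps=5, rng=None):
    """One outer MEMRK-style iteration: multi-step z-projection, then a greedy row update."""
    if rng is None:
        rng = np.random.default_rng()
    n = A.shape[1]
    # drive z toward the component of b orthogonal to the column space of A
    for _ in range(inner_steps):
        j = rng.integers(n)
        col = A[:, j]
        z = z - (col @ z) / (col @ col) * col
    # greedy row update on the corrected residual b - z - A x
    r = b - z - A @ x
    i = int(np.argmax(np.abs(r)))
    row = A[i]
    x = x + r[i] / (row @ row) * row
    return x, z
```

Initializing with $x^{(0)} = 0$ and $z^{(0)} = b$ and iterating drives $x$ toward the least-squares solution of the inconsistent system.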
Neural Block Aggregation
Replacement in deep ResNet-style architectures can be performed by substituting the addition $x + F(x)$ with the element-wise maximum $\max(x, F(x))$ (or LeakyMax as above).
4. Convergence Theory and Empirical Performance
Linear Systems
Maximum residual aggregation methods achieve provably faster linear convergence than cyclic and randomized selection strategies, provided the block singular values are suitably bounded. For MRBK (Sun et al., 2024):
$$\lVert x^{(k+1)} - x_\star \rVert_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{q\,\beta^2}\right) \lVert x^{(k)} - x_\star \rVert_2^2,$$
where $\beta$ is an upper bound on the block singular values ($\sigma_{\max}(A_{V_i}) \le \beta$ for all $i$) and $q$ is the block count.
For inconsistent systems, the MEMRK method yields linear convergence with a nonzero tolerance dominated by the consistency error, further suppressed by multi-step auxiliary iterations (Xiao et al., 2023).
Neural Architectures
Empirical findings (Fuhl, 2021) indicate:
- MaxProp and LeakyMax blocks can provide faster initial learning (steeper loss drop) than addition.
- Generalization matches or exceeds standard addition when batch normalization (BN) is fixed or absent.
- LeakyMax blocks mitigate "dead" layers that can arise from hard max gating.
Limitations include a slight increase in per-block computational overhead and, for very deep single models, marginally lower ultimate accuracy than standard additive blocks.
5. Practical Deployment and Parameter Choices
Blocked Iterative Methods
- Block partitioning ($V_1, \dots, V_q$): uniform random partitions into blocks of roughly equal size are typical.
- MRBK vs. MRABK: MRBK requires block pseudo-inverse solves; MRABK replaces this by averaging weighted single-row projections, reducing computational expense.
- Relaxation parameter $\alpha$ in MRABK: $\alpha \in (0, 2)$; $\alpha = 1$ is a typical default.
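The MRABK-style averaged update can be sketched as follows; the squared-row-norm weighting, which collapses the weighted average of row projections to $A_V^{\top} r / \lVert A_V \rVert_F^2$, is an assumption based on common average-block-Kaczmarz practice rather than the paper's exact scheme:

```python
import numpy as np

def mrabk_update(A, b, x, V, alpha=1.0):
    """One averaged-block step on block V: no pseudo-inverse solve required."""
    r = b[V] - A[V] @ x
    # weighted average of single-row projections, weights ||a_i||^2 / ||A_V||_F^2
    step = A[V].T @ r / np.sum(A[V] ** 2)
    return x + alpha * step
```

Combined with the greedy block selection, repeated updates converge on consistent systems without any pseudo-inverse computation.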
Deep Networks
- Replacement: Substitute $x + F(x)$ by $\max(x, F(x))$ (MaxProp) or the LeakyMax variant.
- Parameters: A fixed mixing weight $\alpha$ is effective for LeakyMax, though tuning across $(0.5, 1]$ is possible.
- BN/Activation: MaxProp/LeakyMax can generalize without additional nonlinearity or standard BN; addition requires ReLU for stable training.
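In code, the substitution amounts to making the aggregation function a parameter of the block; the two-layer transformation below is a hypothetical stand-in for an actual ResNet block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def res_block(x, W1, W2, aggregate=np.add):
    """Residual-style block; pass np.maximum as `aggregate` for MaxProp fusion."""
    fx = W2 @ relu(W1 @ x)   # block transformation F(x)
    return aggregate(x, fx)  # addition, maximum, or a LeakyMax callable
```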
6. Comparative Analysis and Application Domains
Kaczmarz-Style Solvers
Maximum residual aggregation substantially accelerates convergence compared to cyclic, randomized, or greedy-randomized Kaczmarz variants, especially on large, well-paved systems. For inconsistent or tomographic systems, MEMRK methods achieved roughly threefold or greater reductions in iteration count and CPU time compared to standard and randomized extended Kaczmarz, accompanied by improved PSNR in image reconstruction tasks (Xiao et al., 2023).
Neural Networks
MaxProp and LeakyMax aggregators are especially effective in small and medium-sized architectures or ensemble settings. Mixed-aggregator ensembles outperform homogeneous ensembles. MaxProp/LeakyMax demonstrate robust generalization even with fixed or no batch normalization, suggesting suitability in adversarial, low-precision, or hardware-constrained deployments (Fuhl, 2021).
7. Advantages, Limitations, and Recommendations
Advantages:
- Focused projection onto maximally inconsistent components accelerates convergence in iterative solvers.
- Dynamic path selection in deep networks, enabling implicit ensembles and gradient flow enhancements.
- Improved generalization and robust training under constrained BN/activation setups in neural networks.
Limitations:
- Slight increased per-iteration or per-block computational complexity.
- Gating in MaxProp can cause zero-gradient "dead" blocks; LeakyMax mitigates but requires parameter tuning.
- Superiority over addition not established for very deep or wide nets in single-network regimes; all neural network guarantees are empirical.
Recommended Contexts:
- Large-scale linear system solvers benefiting from greedy update schedules.
- Deep architectures where robustness, gradient flow, or initial convergence are critical, especially in ensemble training or deployment environments with limited computational resources or BN constraints.
Papers for further reference: "On multi-step extended maximum residual Kaczmarz method for solving large inconsistent linear systems" (Xiao et al., 2023), "Maximum and Leaky Maximum Propagation" (Fuhl, 2021), "On maximum residual block Kaczmarz method for solving large consistent linear systems" (Sun et al., 2024).