
Maximum Residual Aggregator

Updated 4 February 2026
  • Maximum Residual Aggregator is a method that selects row or block updates based on the largest residual, enhancing convergence in iterative solvers and neural architectures.
  • In iterative linear solvers like MRBK and MEMRK, focusing on maximal residual blocks leads to faster linear convergence and reduced iterations.
  • In neural networks, replacing additive residuals with max-based aggregation (MaxProp or LeakyMax) yields improved gradient flow and early learning dynamics.

A maximum residual aggregator is an element-wise or block-wise selection rule designed to prioritize updates to the components or blocks corresponding to the largest residuals in iterative numerical algorithms or neural network architectures. The concept surfaces in two distinct but mathematically analogous contexts: (1) iterative solvers for large linear systems, and (2) neural network block aggregation—most notably as an alternative to additive residual connections.

1. Maximum Residual Aggregator in Iterative Linear Solvers

The maximum residual aggregator is fundamental in modern variants of the Kaczmarz method for solving both consistent and inconsistent linear systems. It selects the row or block with the largest residual as the target of each iterative update, thereby accelerating convergence.

Residual and Block Definitions

Given a linear system $Ax = b$, with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, the residual vector at iteration $k$ is

$$r_k = b - A x_k.$$

Partitioning the $m$ rows into $q$ disjoint blocks $V_1, \ldots, V_q$, the block residuals are $r_k^{(i)} = b_{V_i} - A_{V_i} x_k$, with $\|r_k^{(i)}\|_2$ the block's residual norm.

Maximum Residual Aggregation Rule

At each iteration, the block or row index is chosen according to the largest residual norm:

$$I_k = \arg\max_{1 \le i \le q} \|r_k^{(i)}\|_2 \qquad \text{or} \qquad i_k = \arg\max_{1 \le i \le m} |r_k^{(i)}|.$$

This greedy selection focuses updates on the components (rows or blocks) that are maximally inconsistent with the current iterate.
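As a minimal sketch of the row-wise rule (with illustrative data, not taken from the cited papers), the selection amounts to a single argmax over residual magnitudes:

```python
import numpy as np

# Illustrative 3x2 system
A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([4.0, 1.0, 2.0])
x = np.zeros(2)                      # current iterate

r = b - A @ x                        # residual vector r_k
i_k = int(np.argmax(np.abs(r)))      # row with maximal absolute residual
```

Here `r` is `[4, 1, 2]`, so the first row is the maximally inconsistent one and is selected for the next update.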

Example: MRBK and MEMRK Algorithms

  • Maximum Residual Block Kaczmarz (MRBK): At each step, the method projects onto the solution space of the block with maximal residual norm using the Moore–Penrose pseudo-inverse of that block (Sun et al., 2024).
  • Multi-step Extended Maximum Residual Kaczmarz (MEMRK): For inconsistent systems, an auxiliary variable $z$ is iteratively projected toward the orthogonal complement of $\mathcal{R}(A)$ over multiple inner iterations, followed by a row update using the maximum residual component (Xiao et al., 2023).

2. Maximum Residual Aggregator in Neural Network Architectures

In deep learning, the maximum residual aggregator refers to modifications of standard residual connections in block architectures such as ResNets. Here, instead of additive fusion, the block output is either the element-wise maximum of the input and the block transformation (MaxProp) or a convex combination weighted toward the larger of the two (LeakyMax) (Fuhl, 2021).

Definitions and Mathematical Formulation

  • Standard Residual Block: $\mathrm{Residual}(x) = x + f(x)$
  • Maximum Propagation (MaxProp) Block: $\mathrm{MaxProp}(x) = \max(x, f(x))$ (element-wise maximum)
  • Leaky Maximum Block:

$$\mathrm{LeakyMax}(x) = \begin{cases} \alpha f(x) + \beta x, & \text{if } f(x) \geq x \\ \alpha x + \beta f(x), & \text{otherwise} \end{cases}$$

with fixed $\alpha, \beta \geq 0$, $\alpha + \beta = 1$.

This approach is motivated by maxout networks and is essentially a feature-wise hard selection or gating mechanism.
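The three aggregators can be sketched element-wise in NumPy; the function names below are illustrative, not taken from Fuhl (2021):

```python
import numpy as np

def residual_agg(x, fx):
    """Standard additive residual aggregation."""
    return x + fx

def maxprop_agg(x, fx):
    """Element-wise maximum of input and block transformation."""
    return np.maximum(x, fx)

def leakymax_agg(x, fx, alpha=0.9, beta=0.1):
    """Convex combination weighted toward the larger of the two values."""
    return np.where(fx >= x, alpha * fx + beta * x, alpha * x + beta * fx)

x = np.array([1.0, -2.0, 0.5])
fx = np.array([0.0, 3.0, 0.5])
out_max = maxprop_agg(x, fx)      # [1.0, 3.0, 0.5]
out_leaky = leakymax_agg(x, fx)   # [0.9, 2.5, 0.5]
```

Note that where `x` and `f(x)` agree (last component), all three aggregators except addition return the shared value, and LeakyMax reduces to MaxProp when $\alpha = 1$.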

3. Algorithmic Frameworks and Pseudocode

Maximum Residual Kaczmarz Variants

| Algorithm | Aggregation Rule | Update Type |
|---|---|---|
| MRBK | Max block residual norm ($\ell_2$) | Projection using $A_{V_{I_k}}^\dagger$ (Sun et al., 2024) |
| MEMRK | Max row absolute residual ($\ell_\infty$) | Greedy row update with multi-step auxiliary projections (Xiao et al., 2023) |

MRBK Pseudocode

```python
# Assumes A, b, row-index blocks V[0..q-1], initial iterate x_k,
# and iteration count ell are already defined.
import numpy as np

for k in range(ell):
    # Block residuals for the current iterate
    res = [b[V[i]] - A[V[i]] @ x_k for i in range(q)]
    # Index of the block with maximal residual norm
    I_k = int(np.argmax([np.linalg.norm(r) for r in res]))
    # Project onto the solution space of that block via its pseudo-inverse
    x_k = x_k + np.linalg.pinv(A[V[I_k]]) @ res[I_k]
```
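A self-contained version of the loop above, on a synthetic consistent system with an illustrative uniform partition (a sketch, not the authors' reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 12, 4, 4
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                          # consistent by construction

# Uniform partition of the m rows into q disjoint blocks
blocks = np.array_split(np.arange(m), q)

x = np.zeros(n)
for k in range(200):
    # Block residuals and greedy selection of the maximal-norm block
    res = [b[V] - A[V] @ x for V in blocks]
    I_k = int(np.argmax([np.linalg.norm(r) for r in res]))
    # Projection onto the solution space of the selected block
    x = x + np.linalg.pinv(A[blocks[I_k]]) @ res[I_k]

err = np.linalg.norm(x - x_star)        # should be near zero
```

Because the system is consistent and $A$ has full column rank, the iterates converge to the unique solution $x_\star$.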

MEMRK Main Loop

  1. Update $z_{k+1}$ by $\omega$ Kaczmarz-style column projections.
  2. Compute the residual $r_k = b - A x_k - z_{k+1}$.
  3. Select $i_k = \arg\max_i |r_k^{(i)}|$.
  4. Update $x_{k+1} = x_k + \dfrac{r_k^{(i_k)}}{\|A^{(i_k)}\|_2^2} (A^{(i_k)})^T$.
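The four steps above can be sketched as follows. The inner $z$-update here uses greedy column projections, and all sizes, seeds, and iteration counts are illustrative; this is a sketch of the scheme, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)              # generically inconsistent
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]  # least-squares reference

x = np.zeros(n)
z = b.copy()                            # z_0 = b, driven toward R(A)^perp
omega = 5                               # inner column-projection steps

for k in range(400):
    # Step 1: omega Kaczmarz-style column projections of z
    for _ in range(omega):
        j = int(np.argmax(np.abs(A.T @ z)))
        col = A[:, j]
        z = z - (col @ z / (col @ col)) * col
    # Step 2: residual against the consistent part of b
    r = b - A @ x - z
    # Step 3: maximal absolute residual row
    i_k = int(np.argmax(np.abs(r)))
    # Step 4: row update
    row = A[i_k]
    x = x + (r[i_k] / (row @ row)) * row

err = np.linalg.norm(x - x_ls)          # approaches the least-squares solution
```

As $z$ converges to the component of $b$ orthogonal to $\mathcal{R}(A)$, the effective system becomes consistent and the max-residual row updates drive $x$ to the least-squares solution.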

Neural Block Aggregation

Replacement in deep ResNet-style architectures can be performed by substituting addition with element-wise max\max (or LeakyMax as above).

4. Convergence Theory and Empirical Performance

Linear Systems

Maximum residual aggregation methods achieve provably faster linear convergence than both cyclic and randomized selection strategies, provided bounds on the block singular values. For MRBK (Sun et al., 2024):

$$\|x_k - x_\star\|_2^2 \leq \left(1 - \frac{\sigma_{\min}^2(A)}{\beta (q-1)}\right)^k \|x_0 - x_\star\|_2^2,$$

where $\beta$ is an upper bound on the block singular values and $q$ is the block count.

For inconsistent systems, the MEMRK method yields linear convergence with a nonzero tolerance dominated by the consistency error, further suppressed by multi-step auxiliary iterations (Xiao et al., 2023).

Neural Architectures

Empirical findings (Fuhl, 2021) indicate:

  • MaxProp and LeakyMax blocks can provide faster initial learning (steeper loss drop) than addition.
  • Generalization matches or exceeds standard addition when batch normalization (BN) is fixed or absent.
  • LeakyMax blocks mitigate "dead" layers that can arise from hard max gating.

Limitations include a slight increase in per-block computational overhead and, for very deep single models, marginally lower ultimate accuracy than standard additive blocks.

5. Practical Deployment and Parameter Choices

Blocked Iterative Methods

  • Block partitioning ($q$): Uniform random partitions with $q \sim \|A\|_2^2$ are typical.
  • MRBK vs. MRABK: MRBK requires block pseudo-inverse solves; MRABK replaces these with averaged, weighted single-row projections, reducing computational expense.
  • Relaxation parameter in MRABK: $\omega \in (0, 2)$; typically $\omega = 1$.
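A uniform random row partition of the kind mentioned above can be built in a few lines (a sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, q = 10, 3
perm = rng.permutation(m)            # shuffle the row indices
blocks = np.array_split(perm, q)     # q disjoint, nearly equal-sized blocks
```

Each row index lands in exactly one block, so the blocks form a valid partition for MRBK-style methods.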

Deep Networks

  • Replacement: Substitute $x + f(x)$ with $\max(x, f(x))$ or LeakyMax.
  • Parameters: Fixed $\alpha = 0.9$, $\beta = 0.1$ is effective for LeakyMax, though tuning over $\alpha \in [0.5, 1]$ is possible.
  • BN/Activation: MaxProp/LeakyMax can generalize without additional nonlinearity or standard BN; addition requires ReLU for stable training.

6. Comparative Analysis and Application Domains

Kaczmarz-Style Solvers

Maximum residual aggregation substantially accelerates convergence compared to cyclic, randomized, or greedy-randomized Kaczmarz variants, especially on large, well-paved systems. For inconsistent or tomographic systems, MEMRK methods realized up to 3–5$\times$ fewer iterations and reduced CPU time compared to standard and randomized extended Kaczmarz, accompanied by improved PSNR in image reconstruction tasks (Xiao et al., 2023).

Neural Networks

MaxProp and LeakyMax aggregators are especially effective in small and medium-sized architectures or ensemble settings. Mixed-aggregator ensembles outperform homogeneous ensembles. MaxProp/LeakyMax demonstrate robust generalization even with fixed or no batch normalization, suggesting suitability in adversarial, low-precision, or hardware-constrained deployments (Fuhl, 2021).

7. Advantages, Limitations, and Recommendations

Advantages:

  • Focused projection onto maximally inconsistent components accelerates convergence in iterative solvers.
  • Dynamic path selection in deep networks, enabling implicit ensembles and gradient flow enhancements.
  • Improved generalization and robust training under constrained BN/activation setups in neural networks.

Limitations:

  • Slight increased per-iteration or per-block computational complexity.
  • Gating in MaxProp can cause zero-gradient "dead" blocks; LeakyMax mitigates but requires parameter tuning.
  • Superiority over addition not established for very deep or wide nets in single-network regimes; all neural network guarantees are empirical.

Recommended Contexts:

  • Large-scale linear system solvers benefiting from greedy update schedules.
  • Deep architectures where robustness, gradient flow, or initial convergence are critical, especially in ensemble training or deployment environments with limited computational resources or BN constraints.

Papers for further reference: "On multi-step extended maximum residual Kaczmarz method for solving large inconsistent linear systems" (Xiao et al., 2023), "Maximum and Leaky Maximum Propagation" (Fuhl, 2021), "On maximum residual block Kaczmarz method for solving large consistent linear systems" (Sun et al., 2024).
