
Fusedmax: Continuous Sparsity and Transformer Acceleration

Updated 9 February 2026
  • Fusedmax names two related developments: a framework that applies fused-lasso regularization to produce sparse continuous probability densities, and a separately developed design (FuseMax) that accelerates transformer attention in hardware.
  • It uses total variation and Sobolev regularizers to produce interpretable, contiguous support in attention densities, improving model performance.
  • Fusedmax accelerates transformer inference by mapping extended Einsum computations onto spatial arrays, reducing memory buffers and energy consumption.

Fusedmax refers to two technically distinct but thematically related concepts in the machine learning literature: (1) a sparse mapping from scores to continuous probability densities that extends fused-lasso-like regularization to infinite-dimensional domains, and (2) a mapping of fused attention computation onto spatial hardware arrays for transformer acceleration. These advances leverage structured sparsity and efficient composition of linear and nonlinear operations, resulting in interpretable attention, improved computational efficiency, and buffer/memory reductions in both continuous optimization and hardware realization contexts (Martins et al., 2021, Nayak et al., 2024).

1. Continuous Fusedmax: Problem Definition and Regularization

Continuous fusedmax, as introduced in "Sparse Continuous Distributions and Fenchel-Young Losses," generalizes the fused-lasso principle to infinite-dimensional probability densities. For a domain $S \subset \mathbb{R}$, let $\mathcal{Y} = \{p : S \to \mathbb{R}_+ \mid \int_S p(t)\,dt = 1,\ p \in H^1(S)\}$ denote the set of densities with integrable weak derivative $p'$. Given a real-valued score or energy function $\eta : S \to \mathbb{R}$, fusedmax is formulated as the $\Omega$-regularized prediction map:

$$\operatorname{fusedmax}(\eta) = \underset{p \in \mathcal{Y}}{\arg\min} \left[ -\int_S \eta(t)\,p(t)\,dt + \Omega(p) \right]$$

Two major regularizers $\Omega$ are considered:

  • Total Variation (TV) / Rudin–Osher–Fatemi (ROF)-style: $\Omega_{TV}(p) = \gamma \int_S |p'(t)|\,dt$
  • Sobolev (quadratic gradient): $\Omega_{S}(p) = \gamma \int_S |p'(t)|^2\,dt$

Here $\gamma > 0$ is the smoothing parameter. The TV regularizer encourages piecewise-constant solutions, while the Sobolev regularizer induces smooth densities (Martins et al., 2021).
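
To make the contrast concrete, the following sketch (my own illustration, not code from the paper) evaluates discretized versions of both penalties on a piecewise-constant density and on a smooth "tent" density. The TV penalty charges the flat density only for its two jumps, while the quadratic Sobolev penalty of a jump grows without bound as the grid is refined.

```python
# Discretized TV and Sobolev penalties on a regular grid (an illustrative
# sketch; function and variable names are mine, not from the paper).

def tv_penalty(p, h, gamma):
    # gamma * sum_i |p_i - p_{i-1}|  ~  gamma * \int |p'(t)| dt
    return gamma * sum(abs(p[i] - p[i - 1]) for i in range(1, len(p)))

def sobolev_penalty(p, h, gamma):
    # gamma * sum_i ((p_i - p_{i-1}) / h)^2 * h  ~  gamma * \int |p'(t)|^2 dt
    return gamma * sum(((p[i] - p[i - 1]) / h) ** 2 * h for i in range(1, len(p)))

h = 0.01
grid = [i * h for i in range(100)]
# A piecewise-constant density and a smooth "tent" density on [0, 1).
flat = [2.5 if 0.3 <= t < 0.7 else 0.0 for t in grid]
tent = [max(0.0, 1.0 - abs(t - 0.5) / 0.5) * 2.0 for t in grid]

# TV charges the flat density only for its two jumps (2 * 2.5 = 5.0),
# while the Sobolev penalty of a jump blows up as the grid is refined.
print(tv_penalty(flat, h, 1.0))       # 5.0
print(sobolev_penalty(flat, h, 1.0))  # ≈ 1250
print(tv_penalty(tent, h, 1.0))       # ≈ 3.96
print(sobolev_penalty(tent, h, 1.0))  # ≈ 15.8
```

At a fixed height, halving the grid spacing leaves the TV penalty of a jump unchanged but doubles its Sobolev penalty, which is why TV tolerates sharp edges while the Sobolev term smooths them away.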

2. Analytical Structure and Solution Properties

The continuous fusedmax solution, when equipped with the TV regularizer, is characterized by the Euler–Lagrange or subgradient conditions, incorporating a Lagrange multiplier to enforce $\int_S p(t)\,dt = 1$. The unconstrained ROF-type problem

$$\min_p \tfrac{1}{2}\|p - \eta\|^2_{L_2} + \gamma\,\mathrm{TV}(p)$$

possesses a "taut-string" solution $u^*(t)$; imposing normalization amounts to shifting $u^*$ by the multiplier $\tau$ and thresholding for nonnegativity:

$$p^*(t) = [u^*(t) - \tau]_+$$

For even, unimodal $\eta$, $u^*$ is constant ("clipped") on an interval $|t| \leq a$ and equals $\eta$ elsewhere. The parameters $a, \tau$ solve scalar equations that relate the "fused" (flat) interval to the TV budget $\gamma$. For Sobolev regularization, the solution reduces to solving the linear ODE $p - \gamma p'' = \eta - \tau$ under nonnegativity and normalization constraints, which yields closed-form solutions in terms of hyperbolic functions for symmetric $\eta$ (Martins et al., 2021).

Typical closed-form instances include:

  • For $\eta(t) = -|t|/\sigma$: $a = \sqrt{2\sigma\gamma}$, $\tau = -\sqrt{(1+2\gamma)/\sigma}$, support $= [-\sqrt{\sigma(1+2\gamma)},\, \sqrt{\sigma(1+2\gamma)}]$ (the multiplier equals the score at the support boundary, so the density vanishes exactly at the stated endpoints)
  • For $\eta(t) = -t^2/(2\sigma^2)$: $a = (3\sigma^2\gamma)^{1/3}$, $\tau = -\tfrac{1}{2}\left(3(1+2\gamma)/(2\sigma)\right)^{2/3}$
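
As a sanity check, the sketch below numerically verifies that the first closed-form density integrates to one, assuming the piecewise shape described above: a triangle of height $(b-|t|)/\sigma$ clipped to a flat top of half-width $a$. The names `a`, `b`, and `density` are my own.

```python
import math

# Numerical sanity check of the first closed-form instance, eta(t) = -|t|/sigma
# (a sketch; the formulas for a and the support endpoints are taken from above).
sigma, gamma = 1.0, 0.5

a = math.sqrt(2 * sigma * gamma)        # half-width of the fused (flat) top
b = math.sqrt(sigma * (1 + 2 * gamma))  # support endpoint

def density(t):
    # p*(t) = [u*(t) - tau]_+ : triangle (b - |t|)/sigma, clipped to (b - a)/sigma.
    return max(0.0, min(b - abs(t), b - a) / sigma)

# Grid sum over the support (the endpoints vanish, so a plain sum suffices).
n = 20000
h = 2 * b / n
mass = sum(density(-b + i * h) for i in range(n + 1)) * h
print(round(mass, 6))  # ≈ 1.0
```

The total mass works out analytically to $(b^2 - a^2)/\sigma = (\sigma(1+2\gamma) - 2\sigma\gamma)/\sigma = 1$, which the numerical sum reproduces.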

This construction yields sparse, contiguous support for the resulting density, in contrast to the diffuse support of softmax.

3. Fenchel–Young Loss for Fusedmax

For any convex regularizer $\Omega$, the Fenchel–Young loss is

$$L_\Omega(\eta; q) = \Omega^*(\eta) - \int_S \eta\,q + \Omega(q)$$

where $\Omega^*$ denotes the convex conjugate; $L_\Omega(\eta; p^*) = 0$ if and only if $p^* = \operatorname{fusedmax}(\eta)$. For the TV case,

$$\Omega^*_{TV}(\eta) = \tfrac{1}{2}\|u^*(\eta) - \eta\|_2^2 + \gamma\,\mathrm{TV}(u^*(\eta))$$

where $u^*(\eta)$ is the pre-rectification ROF solution. For the Sobolev regularizer,

$$\Omega_S^*(\eta) = \int_S \eta\,u^*(\eta) - \tfrac{1}{2}\|u^*(\eta) - \eta\|_2^2 - \tfrac{\gamma}{2}\|u^{*\prime}(\eta)\|_2^2$$

Thus, the Fenchel–Young loss embodies the regression residual plus a TV or Sobolev penalty (Martins et al., 2021).

4. Efficient Computation and Differentiation

Numerical implementation of fusedmax proceeds by discretizing $S$ on a regular grid. The discrete TV-regularized problem becomes

$$\min_{p \geq 0,\ \sum_i p_i = 1/h}\ \tfrac{1}{2}\|p - f\|^2 + h\gamma \sum_i |p_i - p_{i-1}|$$

where $h$ is the grid spacing and $f$ collects the grid samples of the score $\eta$,

matching Euler's finite-difference discretization of ROF denoising (see Prop. C.1). $O(n)$ complexity is achieved via fused-lasso solvers, notably the taut-string algorithm. The Lagrange multiplier $\tau$ is recovered by one-dimensional root finding on the normalization constraint.
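
The root-finding step can be sketched as follows. The vector `u` would in practice come from a fused-lasso/taut-string solver; here a toy triangular score stands in, and `normalize_clip` is my own name for the sketch.

```python
# Recover the multiplier tau by bisection so that the discretized density
# h * sum_i [u_i - tau]_+ has unit mass (an illustrative sketch; in practice
# u is the output of a fused-lasso / taut-string solver).

def normalize_clip(u, h, tol=1e-12):
    def mass(tau):
        return h * sum(max(ui - tau, 0.0) for ui in u)
    lo = min(u) - 1.0 / (h * len(u))  # guarantees mass(lo) >= 1
    hi = max(u)                       # guarantees mass(hi) == 0
    while hi - lo > tol:              # mass is monotone decreasing in tau
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    return [max(ui - tau, 0.0) for ui in u], tau

h = 0.1
u = [-abs(0.1 * i - 2.0) for i in range(41)]  # toy triangular score on [0, 4]
p, tau = normalize_clip(u, h)
print(round(h * sum(p), 6))  # ≈ 1.0
print(round(tau, 6))         # ≈ -1.0
```

Because the mass is continuous and monotone in $\tau$, bisection converges unconditionally; the clipped result is nonnegative and normalized by construction.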

Gradient propagation is straightforward: if $y = \operatorname{fusedmax}(\eta)$, then $\frac{\partial L_\Omega}{\partial \eta} = y - q$; one simply differentiates through the clip operation. Differentiation through the ROF solver can leverage implicit differentiation of the KKT system (as in [28] of (Martins et al., 2021)) or a truncated unrolled primal-dual algorithm. In the Sobolev case, an additional linear ODE is solved during gradient computation (Martins et al., 2021).
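
In the $\gamma \to 0$ limit the TV term vanishes and the map reduces to a quadratically regularized (sparsemax-style) projection, which obeys the same Fenchel–Young gradient identity. The discrete sketch below (my own illustration, not the paper's code) checks $\partial L_\Omega / \partial \eta = y - q$ against central finite differences using that stand-in.

```python
# Finite-difference check of the Fenchel-Young gradient identity dL/d(eta) = y - q,
# using the gamma -> 0 (sparsemax) case as a discrete stand-in for fusedmax.

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex.
    zs = sorted(z, reverse=True)
    csum, tau = 0.0, 0.0
    for i, zi in enumerate(zs, start=1):
        csum += zi
        t = (csum - 1.0) / i
        if zi - t > 0:       # support condition holds for a contiguous prefix
            tau = t
    return [max(zi - tau, 0.0) for zi in z]

def fy_loss(eta, q):
    # L(eta; q) = Omega*(eta) - <eta, q> + Omega(q) with Omega(p) = 0.5 ||p||^2
    # on the simplex, so Omega*(eta) = <eta, y> - 0.5 ||y||^2 at y = sparsemax(eta).
    y = sparsemax(eta)
    conj = sum(e * yi for e, yi in zip(eta, y)) - 0.5 * sum(yi * yi for yi in y)
    return conj - sum(e * qi for e, qi in zip(eta, q)) + 0.5 * sum(qi * qi for qi in q)

eta = [1.2, 0.3, -0.5, 0.9]
q = [0.7, 0.1, 0.0, 0.2]
y = sparsemax(eta)

eps = 1e-6
diffs = []
for i in range(len(eta)):
    ep, em = eta[:], eta[:]
    ep[i] += eps
    em[i] -= eps
    fd = (fy_loss(ep, q) - fy_loss(em, q)) / (2 * eps)
    diffs.append(abs(fd - (y[i] - q[i])))
print(max(diffs) < 1e-5)  # True
```

Note that the loss gradient is available without differentiating through the inner solver at all; the solver's Jacobian is only needed when $\eta$ itself is produced by upstream parameters, which is where implicit differentiation or unrolling comes in.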

5. Applications and Empirical Results

Continuous fusedmax yields interpretable and parsimonious attention densities:

  • Audio classification (UrbanSound8K): Substituting standard softmax attention with continuous fusedmax attention (with $\eta$ as a learnable Gaussian score) increases accuracy by approximately 3 percentage points. The resulting attention identifies contiguous bursts of audio, suppressing isolated frames and noise.
  • Visual question answering (VQA-v2): Replacing the discrete $14 \times 14$ attention grid with a single continuous fusedmax density over $[0,1]^2$ (using either TV or Sobolev regularization) achieves comparable or superior accuracy with significantly fewer parameters. Fusedmax regions are observed to be compact ellipses matching the queried objects.

An important advantage is enhanced interpretability: the attention density is forced to concentrate on a small number of contiguous support blocks, making the locus of "attention" in input space explicit (Martins et al., 2021).

6. FuseMax for Transformer Attention Acceleration

A separate but similarly named line of work, "FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design" (Nayak et al., 2024), proposes a hardware accelerator for transformer attention built on the "cascade of Einsums" abstraction.

Key aspects include:

  • Modeling attention as a cascade of extended Einsums, allowing fine-grained analysis of data access patterns and computational dependencies.
  • Taxonomy via passes over input fibers: Traditional stable softmax implementations require three passes; optimizations reduce this to two or (in the case of FlashAttention-2 and FuseMax) one pass.
  • Spatial array design: FuseMax maps the 1-pass softmax cascade onto a spatial array with partitioned sequence and projection dimensions, achieving buffer requirements independent of full sequence length. Only blockwise tiles need to reside on-chip.
  • High utilization: Both 2D and 1D processing elements perform fused mixed operations (MACC, EXP); pipelining and tiling deliver sustained >95% utilization on both arrays for all sequence lengths.
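
The single-pass evaluation underlying this design can be illustrated in scalar form with the online-softmax recurrence: a running max and normalizer are updated per key and the output accumulator is rescaled, so the scores are never revisited. This is a minimal sketch of the recurrence only, not the spatial-array mapping.

```python
import math

# Scalar sketch of the single-pass ("online") softmax recurrence that one-pass
# attention designs such as FuseMax and FlashAttention-2 build on.

def attention_one_pass(scores, values):
    m, s, acc = float("-inf"), 0.0, 0.0
    for x, v in zip(scores, values):
        m_new = max(m, x)
        # Rescale previous partial sums when the running max changes.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(x - m_new)
        s = s * scale + w          # running softmax normalizer
        acc = acc * scale + w * v  # running weighted sum of values
        m = m_new
    return acc / s

def attention_three_pass(scores, values):
    # Reference: pass 1 finds the max, pass 2 the normalizer, pass 3 the output.
    m = max(scores)
    s = sum(math.exp(x - m) for x in scores)
    return sum(math.exp(x - m) * v for x, v in zip(scores, values)) / s

scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, -2.0, 0.5, 4.0, 3.0]
one, three = attention_one_pass(scores, values), attention_three_pass(scores, values)
print(abs(one - three) < 1e-12)  # True
```

Because only the running triple (max, normalizer, accumulator) must persist between keys, on-chip state is independent of sequence length, which is the property the spatial-array mapping exploits.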

In cycle-accurate simulation on workloads including BERT-Base and T5-small, FuseMax achieves an average attention-only speedup of 6.7× over FLAT (the prior state of the art), using only 79% of the energy. End-to-end, it yields an average 5.3× transformer inference speedup with 83% of the energy (Nayak et al., 2024).

7. Significance and Connections

Fusedmax provides a principled framework for structured sparsity and spatial localization within both continuous probability mapping and hardware acceleration. In the continuous case, it generalizes fused-lasso denoising to infinite-dimensional domains, retaining a well-behaved convex loss and efficient solvers, and admits interpretable sparsity for tasks such as attention-based sequence modeling. For hardware, FuseMax formalizes and fuses the core computation stages, enabling near-optimal utilization and on-chip memory requirements that no longer grow with sequence length.

These two threads illustrate the power of leveraging fusion—either in the functional or hardware domain—for both statistical expressivity and computational efficiency (Martins et al., 2021, Nayak et al., 2024).

References

  • Martins et al. (2021). Sparse Continuous Distributions and Fenchel-Young Losses.
  • Nayak et al. (2024). FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design.
