
On the Computational Hardness of Transformers

Published 11 Mar 2026 in cs.CC and cs.LG (2603.11332v1)

Abstract: The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism involves multiplying three $N \times m$ matrices, applying softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention. Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of "direct sum" problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design. Thus, we ask whether transformers can be computed more efficiently than $LH$ independent evaluations of attention. In this paper, we resolve this question in the negative, and give the first non-trivial computational lower bounds for multi-head multi-layer transformers. In the small embedding regime ($m = N^{o(1)}$), computing $LH$ attention heads separately takes $LHN^{2+o(1)}$ time. We establish that this is essentially optimal under SETH. In the large embedding regime ($m = N$), one can compute $LH$ attention heads separately using $LHN^{\omega+o(1)}$ arithmetic operations (plus exponentiations), where $\omega$ is the matrix multiplication exponent. We establish that this is optimal, by showing that $LHN^{\omega-o(1)}$ arithmetic operations are necessary when $\omega > 2$. Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm.

Summary

  • The paper shows that under SETH and k-OV assumptions, computing all attention heads independently is essentially optimal.
  • It leverages reductions from orthogonal vectors and the Baur-Strassen theorem to establish tight lower bounds in both small and large embedding regimes.
  • The findings imply that accelerating transformer computations requires approximations, hardware innovations, or architecture changes rather than new exact algorithms.


Problem Overview and Motivation

The transformer architecture underpins state-of-the-art systems in NLP and CV, driven by multi-layer, multi-head attention mechanisms. Despite this empirical success, fundamental algorithmic questions about the computational efficiency of transformers remain unresolved. Specifically, can multiple attention heads and layers be computed substantially faster than by naive independent evaluation? This question aligns with the classic direct sum problem in theoretical computer science: whether multiple instances of a problem can be solved more efficiently together than separately.

The paper "On the Computational Hardness of Transformers" (2603.11332) provides the first non-trivial lower bounds for the computational complexity of transformers. It establishes, under SETH and related fine-grained complexity assumptions, that the standard method of independently computing all attention heads in all layers is essentially optimal. Notably, the work precisely matches upper and lower bounds in both the small and large embedding regimes, using tools from fine-grained complexity and circuit complexity.

Main Results and Contributions

1. Lower Bounds—Small Embedding Regime:

When the embedding dimension $m$ is subpolynomial in the sequence length $N$ (specifically $m = N^{o(1)}$), the naive algorithm for an $L$-layer, $H$-head transformer computes all attention heads independently in $O(LHN^{2+o(1)})$ time. Assuming the Strong Exponential Time Hypothesis (SETH) and the orthogonal vectors hypothesis (specifically for $k=3$), and using reductions from unbalanced $k$-OV, the authors show that any algorithm (not necessarily exploiting the transformer structure) must use at least $LHN^{2-o(1)}$ time. Thus, amortized or parallel computation yields no asymptotic speedup over the naive baseline.
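Concretely, the quadratic term comes from the $N \times N$ score matrix that every exact attention head must form. A minimal NumPy sketch (the notation is assumed for illustration, not taken from the paper):

```python
import numpy as np

def attention_head(Q, K, V):
    """One exact attention head: softmax(Q K^T) V.

    Q, K, V are N x m matrices. The N x N score matrix S is what makes
    the cost grow like N^2 when the embedding dimension m is small.
    """
    S = Q @ K.T                            # N x N scores: Theta(N^2 m) work
    S = S - S.max(axis=1, keepdims=True)   # shift rows for a stable softmax
    W = np.exp(S)
    W = W / W.sum(axis=1, keepdims=True)   # row-wise softmax weights
    return W @ V                           # weighted values: N x m output

rng = np.random.default_rng(0)
N, m = 8, 4
Q, K, V = (rng.standard_normal((N, m)) for _ in range(3))
out = attention_head(Q, K, V)
```

Because each softmax row sums to 1, feeding an all-ones value matrix returns all ones, which is a quick sanity check on the weights.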

2. Lower Bounds—Large Embedding Regime:

When the embedding dimension $m = N$, multi-head attention can leverage fast matrix multiplication to improve upon the naive cubic time per head. The best-known algorithm achieves $LHN^{\omega+o(1)}$, where $\omega$ is the matrix multiplication exponent. The paper proves this is also essentially optimal in the extended arithmetic circuit model, even when exponentiation and logarithm gates are available, using a novel reduction via the Baur-Strassen theorem. The reduction shows that an extended arithmetic circuit computing such a transformer must be large enough to compute $LH$ independent instances of matrix multiplication, incurring $LHN^{\omega-o(1)}$ gate complexity whenever $\omega > 2$.
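For intuition only, a rough operation count contrasting the naive per-head cost with the fast-matrix-multiplication route in the $m = N$ regime (the default exponent value reflects the current best known bound and is an assumption of this sketch):

```python
def attention_head_ops(N, m, omega=2.372):
    """Rough multiplication counts for one exact attention head, constants dropped.

    naive: the three dense products each cost about N^2 * m multiplications.
    fast:  when m = N, all products are square N x N matmuls, costing about
           N^omega each with fast matrix multiplication (omega ~ 2.372 today).
    """
    naive = 3 * N * N * m
    fast = 3 * N ** omega if m == N else None
    return naive, fast

naive, fast = attention_head_ops(1 << 12, 1 << 12)  # N = m = 4096
```

As long as $\omega < 3$, the fast-matmul route wins asymptotically in this regime, which is exactly the upper bound the lower bound is shown to match.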

3. Extended Circuit Lower Bounds:

A key technical contribution is the extension of classical arithmetic circuit lower bounds to circuits augmented with non-algebraic gates (such as $\exp$ and $\ln$), motivated by the structure of the attention operation. The work proves that these gates offer no advantage for computing low-degree functions (such as matrix multiplication outputs and derivatives), by analyzing their formal power series and leveraging extended versions of Strassen’s results.

4. Tightness and Matching Upper Bounds:

For both the small and large embedding regimes, the presented lower bounds closely match existing upper bounds, up to small subpolynomial factors. Thus, conditional on the stated complexity assumptions, asymptotic improvements are ruled out even for sophisticated circuit models with access to fast algebraic operations.

Notable Technical and Conceptual Innovations

  • Reduction from Orthogonal Vectors (OV) and $k$-OV: The lower bounds for the small embedding regime exploit reductions from unbalanced $k$-OV instances, showing that solving all attention heads "in parallel" cannot break the quadratic-in-$N$ time barrier without refuting SETH or the $k$-OV hypothesis.
  • Use of the Baur-Strassen Theorem: For the large embedding regime, the authors generalize Baur-Strassen to the extended arithmetic circuit model to extract enough information (partial derivatives) from a “compressed” computation. They show that, to produce all outputs of $LH$ matrix multiplications, the circuit must have enough gates to match the trivial approach, even allowing rich gate sets.
  • Treatment of Practical MLPs and Activation Functions: The lower bounds are robust to the inclusion of standard MLP layers (including GLUs, ReLU, and sigmoid), provided the computation remains of low circuit complexity per layer, further emphasizing the universality of the hardness results.
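The Orthogonal Vectors problem behind these reductions is simple to state, and its quadratic brute force is, under SETH, essentially unimprovable for polylogarithmic dimension. A self-contained illustration (not the paper's construction):

```python
from itertools import product

def has_orthogonal_pair(A, B):
    """OV: given sets A, B of d-dimensional 0/1 vectors, is there some
    a in A and b in B with <a, b> = 0?

    Brute force costs |A| * |B| * d time; SETH implies no O(n^{2-eps})
    algorithm exists when d = polylog(n).
    """
    return any(all(x * y == 0 for x, y in zip(a, b)) for a, b in product(A, B))
```

The reductions embed such instances into attention inputs, so a too-fast transformer evaluator would decide OV too quickly.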

Implications

Theoretical Implications:

The results indicate that the quadratic (small $m$) or $N^\omega$ (large $m$) scaling of transformer computation is not an artifact of implementation, but a fundamental computational barrier, conditional on SETH and matrix multiplication lower bounds. Furthermore, this places transformer computation solidly among the growing class of problems for which fine-grained complexity precludes radical algorithmic improvements unless major complexity-theoretic conjectures are refuted.

Practical Implications:

Efforts to accelerate transformer inference or training are compelled to rely on either hardware specialization, approximation, or architectural changes (e.g., sparse or low-rank attention), as there are no unconditional algorithmic savings to be had under standard complexity assumptions. The work directly rules out amortized improvements for large batch inference or speculative “multi-head fusion” optimizations presently pursued in systems and compiler communities.

Directions for Future Work:

The paper closes the door on algorithmic advances for exact transformer computation under the model and assumptions considered but leaves open alternative directions:

  • Randomized/Approximate Computation: Can efficient approximations circumvent these barriers without incurring significant accuracy degradation?
  • Other Computational Models: Extending unconditional hardness to the Word-RAM model or to practical hardware-focused models is left for future investigation.
  • Distributional and Structural Assumptions: Might real-world data distributions admit faster (even average-case) transformer algorithms? Structural restrictions on weights or inputs may open the door for more efficient computation.

Conclusion

"On the Computational Hardness of Transformers" rigorously characterizes the computational limits of exact transformer evaluation, grounding the anecdotal inefficiency of large-scale attention computation in fine-grained complexity assumptions and non-trivial lower bounds for extended arithmetic circuits. The paper closes off the possibility of subquadratic or sub-$N^\omega$ evaluation algorithms for multi-layer, multi-head transformers under accepted complexity hypotheses, steering future work toward architectural, hardware, or approximate solutions.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple-sounding but important question: when you run a Transformer (the kind of AI model used in chatbots and image tools), can you compute all its attention heads and layers much faster by doing them “together,” instead of one by one?

Transformers use “self‑attention” to let each word (or image patch) look at all the others and decide what matters. That step is powerful—but expensive. If your input has $N$ tokens (like words), one attention head roughly costs time that grows like $N^2$. A full Transformer stacks $L$ layers and uses $H$ heads per layer, so a straightforward way takes about $L \times H \times N^2$ work (ignoring some details).
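That back-of-the-envelope scaling can be written as a one-line function (illustrative only, not a calibrated cost model):

```python
def attention_work(L, H, N, m=1):
    """Rough work for exact multi-head attention: L layers x H heads,
    each forming an N x N score matrix, so about L * H * N^2 * m
    operations (constants and MLPs ignored)."""
    return L * H * N * N * m

# Doubling the number of tokens N roughly quadruples the total work.
small = attention_work(L=12, H=12, N=1024, m=64)
large = attention_work(L=12, H=12, N=2048, m=64)
```

The paper's result says this linear-in-$LH$, quadratic-in-$N$ shape is not just how we happen to compute attention, but (conditionally) how fast it can be computed at all.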

This paper shows that, in general, you cannot do much better than that. In other words, there’s no big shortcut to compute many heads and layers at once—at least not without changing what attention is or relying on breakthroughs in other hard problems.

What questions do the authors ask?

  • Can we compute many attention heads (and layers) faster than computing each one separately?
  • Is there a “bulk discount” when you have lots of identical computations inside a Transformer?
  • If not, can we prove that the straightforward approach is basically the best we can do?

These are examples of a “direct sum” question: if you have many copies of the same task, can doing them together be significantly cheaper than doing them one at a time?

How did they study the problem?

To keep things concrete, the authors consider two settings for the “embedding dimension” $m$ (the size of each token’s vector):

  • Small embedding: $m$ grows slowly with $N$ (think “around $\log N$”).
  • Large embedding: $m$ is about the same size as $N$.

They use ideas from theoretical computer science that let you argue about limits on speed—kind of like saying “if you could do this fast, then you could also solve another notoriously hard problem fast,” which most experts believe is unlikely.

Key ideas explained in everyday terms

  • Attention as “matching and weighing”: Each token creates “queries” and compares them to “keys” of other tokens. The better the match, the more weight it assigns to those tokens’ “values.” A softmax turns the match scores into smooth weights that sum to 1.
  • Why attention is costly: Every token compares with every other token, which is why the cost grows like $N^2$.
  • Conditional lower bounds: They assume widely believed ideas like SETH (the Strong Exponential Time Hypothesis), which basically says some hard problems can’t be solved super quickly. If a fast Transformer algorithm would break those beliefs, that’s strong evidence such an algorithm is unlikely to exist.
  • Matrix multiplication exponent $\omega$: This is a number that measures how fast the best-known algorithms can multiply big matrices. Today, $\omega$ is a bit above 2 (around 2.37). If you can’t beat matrix multiplication for certain tasks, that sets a lower bound on time.
  • Circuits with exp/log: Attention uses exponentials (in softmax), so the authors study a realistic “recipe book” of basic operations that includes +, −, ×, ÷, exp, and log. They prove limits in this model.
  • Baur–Strassen theorem: A classic result that says if you can compute a function with a certain number of steps, you can also compute all its partial derivatives with only a constant-factor extra cost. Intuitively, it lets them extract a lot of hidden information from a computation, which helps prove lower bounds.
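The softmax point above hides a fact the reductions exploit: scaling the scores up makes softmax behave like a hardmax, selecting the maximum almost exclusively. A quick check (the scaling constants here are illustrative, not the paper's choice):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

scores = [1.0, 3.0, 2.0]
# As the scale grows, nearly all weight concentrates on the maximum score,
# which is how the reductions simulate a hard selection inside attention.
sharp = softmax([100.0 * x for x in scores])
```

At scale 1 the weights are spread out; at scale 100 the middle entry (the maximum) receives essentially all the weight.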

What did they find, and why is it important?

Here are the main takeaways.

  • Small embedding dimension (roughly $m \approx \log N$):
    • Result: Any algorithm needs about $L \times H \times N^2$ time (up to tiny factors), even if you remove the MLPs.
    • Why: If you could do much better, you could also solve another famous hard problem much faster than believed, contradicting SETH.
    • Meaning: Computing each head separately—the “naive” method—is basically optimal.
  • Large embedding dimension ($m$ about $N$):
    • Result: In a realistic model of computation (allowing +, −, ×, ÷, exp, log), any algorithm needs about $L \times H \times N^\omega$ arithmetic steps (again up to tiny factors), assuming $\omega > 2$.
    • How they argue this: They show that if you could compute a Transformer much faster, you could also multiply many pairs of matrices faster than known limits. Using the Baur–Strassen theorem, they turn a fast Transformer computation into a way to recover many matrix products, implying a lower bound.
    • Meaning: Even when each token’s vector is large, the obvious approach (which relies on fast matrix multiplication) is basically as fast as possible.

Overall importance:

  • These results say there’s no big general-purpose shortcut for exact, full attention across many heads and layers. The cost really does add up linearly in the number of heads and layers.
  • This helps explain why practical speedups focus on approximations, sparsity, clever memory layouts (like FlashAttention), or hardware—rather than hoping for a big algorithmic speedup that keeps exact attention.

Why this matters and what it means

  • For researchers and engineers: Don’t expect dramatic universal speedups for exact multi-head, multi-layer attention just by “sharing work” across heads or layers. The straightforward method is close to the best you can do in the worst case.
  • For practice: Real-world systems will likely keep relying on:
    • Approximations to attention (which trade some accuracy for speed),
    • Architectural tweaks (changing how models use attention),
    • Hardware and software optimizations (better kernels, memory layouts),
    • Or smaller input lengths via chunking or hierarchical processing.
  • For theory: It’s a clean “no” to the direct-sum hope for Transformers: computing many attention heads doesn’t magically get cheaper than doing them one by one. The paper also introduces a neat use of the Baur–Strassen theorem to reason about Transformers, which could inspire new theoretical tools.

In short, the paper clarifies the limits of speeding up Transformers: exact attention is inherently costly, and that cost scales with the number of heads and layers.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of unresolved issues and concrete directions that the paper leaves open for future work:

  • Unconditional lower bounds at ω = 2: The large-embedding lower bound is unconditional only when the matrix multiplication exponent ω > 2; when ω = 2 it falls back to a SETH-based bound. Can one prove unconditional LHN^{2−o(1)} lower bounds for transformers when ω = 2 (or rule them out under standard circuit lower bound barriers)?
  • Beyond the extended arithmetic circuit (eAC) model: The large-embedding result is proved in an eAC model (allowing +, −, ×, /, exp, ln). Are there matching lower bounds in other standard models (word-RAM with bit complexity, Turing machine, I/O/communication models, PRAM), and do they require new techniques beyond circuit arguments?
  • Precision, dynamic range, and numerical stability: The reductions use hardmax via softmax with large scaling (Θ(log NHL)) and exact reals. What lower bounds hold under finite precision, bounded dynamic range, and numerically stable implementations (e.g., log-sum-exp tricks, fused kernels), including explicit dependence on bit precision?
  • Randomized and approximate algorithms: The small-embedding lower bound is conditional and worst-case; the large-embedding bound is circuit-based. Do analogous lower bounds hold for randomized (Monte Carlo/Las Vegas) algorithms and for approximate inference (additive/multiplicative error in the final transformer output), with explicit error–time trade-offs?
  • Average-case and distributional hardness: Results are worst-case over weights and inputs. Can one show lower bounds for natural distributions (e.g., typical trained weight distributions, natural language token statistics) or under structural priors (e.g., low-rank/sparse Q/K/V, shared or tied weights), and characterize when amortization across heads/layers becomes possible?
  • Causal masking and decoding-time settings: The paper studies self-attention without masks. Do the lower bounds extend to causal/triangular masking and incremental (autoregressive) decoding with KV caching, where amortization across time steps is common?
  • Cross-attention, encoder–decoder, and multi-query/grouped-query variants: The results focus on standard self-attention. Are there analogous (tight) lower bounds for cross-attention, encoder–decoder transformers, MQA/GQA, and hybrid attention mechanisms used in practice?
  • Aggregation mechanisms and head concatenation: While the paper claims a reduction between sum and concatenation aggregation (appendix), a fully explicit, numerically stable, and low-overhead construction is not detailed in the main text. Can one give tight, model-robust lower bounds for the concatenation+output-projection scheme and other head-aggregation variants (e.g., gated or weighted sums)?
  • Impact of standard architectural components: LayerNorm/RMSNorm, positional encodings (sinusoidal, RoPE), biases, scaling by 1/√m, dropout, and residual scaling are not modeled in the lower bounds. Which of these components preserve the hardness results, and do any enable provable amortization across heads/layers?
  • MLPs and nonlinearities: The main results remove or idealize MLPs and focus on exp/ln gates. Do the lower bounds extend to transformers with practical MLPs and nonlinearities such as ReLU/GELU/swish, LayerNorm (involving sqrt), and other non-analytic operations, including quantized or piecewise-linear implementations?
  • Stronger fine-grained assumptions: The small-embedding lower bound relies on O3 (and thus SETH). Can it be based on weaker or different hypotheses (e.g., OVH, 3SUM, APSP) or shown in unconditional restricted models?
  • Approximation vs expressivity: For entire multi-layer multi-head transformers, the paper does not give complexity lower bounds for approximate outputs (beyond the hardmax-softmax approximation used inside the reduction). Can we quantify tight time–accuracy trade-offs for approximating full transformer inference?
  • Degree and eAC expressiveness: The large-embedding result uses a simulation argument that eACs provide no advantage for low-degree outputs (to import matrix multiplication lower bounds). Can this be generalized to a broader class of analytic gates/nonlinearities, or to higher-degree but structured computations relevant to attention?
  • Explicit normalized softmax construction: The sketch uses “denormalized” attention (exp-only) and claims a modification for normalized softmax. A detailed, explicit, and tight construction (including the cost of normalization and stability) is missing from the main text; making this fully explicit would strengthen the applicability of the lower bound.
  • Memory/I/O lower bounds: The results count arithmetic operations/gates. Are there matching lower bounds on memory footprint, data movement, and bandwidth (I/O complexity), especially given practical acceleration (e.g., FlashAttention-style tiling)?
  • Parallel and distributed settings: The paper does not provide lower bounds for parallel time or communication in distributed systems. Can one prove tight PRAM lower bounds and distributed/communication lower bounds for computing multi-head multi-layer transformers?
  • Training-time (backprop) complexity: While the Baur–Strassen theorem underlies the lower bound (and backprop in practice), the paper only analyzes inference. Are there corresponding lower bounds for computing gradients/backprop through multi-head multi-layer transformers (both per-step and amortized across layers/heads)?
  • Parameter sharing and reuse across layers/heads: The worst-case constructions assign distinct C-vectors (or matrices) per head/layer. Do the lower bounds persist under common forms of weight sharing (e.g., ALiBi, shared projections, tied layers), or can sharing provably enable faster-than-direct-sum computation?
  • Regimes beyond m = O(N): The results assume m ≤ O(N) or m ≈ N. What are the optimal bounds when m ≫ N or m ≪ N under realistic scaling laws, and can one derive tight upper/lower bounds as functions of (N, m, L, H) simultaneously?
  • Specialized structures enabling speedups: Can one characterize structural conditions (low rank, block-sparsity, Toeplitz/diagonal-plus-low-rank in Q/K/V or in attention patterns) under which multi-head/multi-layer amortization is algorithmically possible, and quantify the achievable savings?
  • Robustness to masking/constraints from tasks: Many practical tasks impose fixed masks (local, block-sparse, routing). Do similar lower bounds hold under these constraints, or can they be exploited for faster computation without accuracy loss?
  • Sum-of-products vs independent products: The reduction for the large-embedding regime converts transformers to many independent matrix products. Are there tighter lower bounds for the multi-output setting that directly capture the algebra of attention (e.g., softmax-normalized products), beyond leveraging generic matrix multiplication hardness?
  • Empirical thresholds and constants: The lower bounds are asymptotic. What are the concrete constant factors and finite-N regimes where direct-sum optimality manifests in practice, and how do they compare to state-of-the-art kernels and hardware?
  • Extensions to other similarity kernels: The analysis targets dot-product attention. Do analogous tight lower bounds hold for cosine-similarity, additive attention, or kernelized attention (e.g., with fixed feature maps), including when kernels admit fast transforms?
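On the “explicit normalized softmax construction” gap: the relation between the exp-only “denormalized” attention in the paper's sketch and the normalized version is a per-row rescaling, which any explicit construction must account for. A NumPy illustration (small inputs to avoid overflow; this is a sketch of the relationship, not the paper's construction):

```python
import numpy as np

def denormalized_attention(Q, K, V):
    """exp(Q K^T) V with no row normalization, as in the paper's sketch."""
    return np.exp(Q @ K.T) @ V

def normalized_attention(Q, K, V):
    """Standard softmax attention: rows of exp(Q K^T) rescaled to sum to 1."""
    E = np.exp(Q @ K.T)
    return (E / E.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(1)
N, m = 6, 3
Q, K, V = (0.1 * rng.standard_normal((N, m)) for _ in range(3))

# The row sums are themselves a denormalized attention with V = all-ones,
# so normalization costs only one extra output column per head.
row_sums = denormalized_attention(Q, K, np.ones((N, 1)))
recovered = denormalized_attention(Q, K, V) / row_sums
```

Here `recovered` matches the normalized output exactly, showing why the two variants are computationally interchangeable up to low-order terms.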

Practical Applications

Overview

This paper proves tight lower bounds on the computational cost of exact transformer inference. It shows that computing multiple attention heads and layers cannot be amortized beyond evaluating them independently:

  • Small embedding regime ($m = N^{o(1)}$): any algorithm needs $L \cdot H \cdot N^{2-o(1)}$ time (conditional on SETH/3OV).
  • Large embedding regime ($m = \Theta(N)$): any extended arithmetic circuit (allowing +, −, ×, ÷, exp, ln) needs $L \cdot H \cdot N^{\omega-o(1)}$ operations when $\omega > 2$ (unconditional), matching fast-matrix-multiplication-based implementations.
  • The result leverages a novel application of the Baur–Strassen theorem and establishes a “direct-sum” style hardness: many heads/layers are, in the worst case, as hard as evaluating each separately.

Below are practical applications and implications grouped by immediacy.

Immediate Applications

The following items can be incorporated into current tools, workflows, and decision-making.

  • Compute planning and cost forecasting for AI workloads
    • Sectors: software/cloud, finance, energy, enterprise IT.
    • What to do: adopt complexity-based calculators and dashboards that estimate training/inference cost using the tight scaling laws (e.g., small-$m$ bound: cost ∝ $L \cdot H \cdot N^2$; large-$m$ bound: cost ∝ $L \cdot H \cdot N^\omega$ with current $\omega \approx 2.37$).
    • Tools/products/workflows: capacity planning tools for clusters; FinOps reporting that ties token length ($N$), depth ($L$), and heads ($H$) to latency and CO₂; SLA predictors for prompt-length vs. response-time.
    • Assumptions/dependencies: exact attention; SETH-conditional for small mm; unconditional for large mm in the extended arithmetic circuit model; constants and memory/bandwidth still matter in practice.
  • Architecture and hyperparameter selection under tight latency/cost budgets
    • Sectors: model engineering (all domains using LLMs/ViTs), robotics/edge, mobile.
    • What to do: prefer smaller $N$ (context/windowed attention), carefully cap $H$ and $L$, and control $m$ to meet latency targets; adopt retrieval/chunking to keep effective $N$ small; pick sum vs. concat aggregation only for implementation reasons (the bounds apply to both).
    • Tools/products/workflows: design-time rules of thumb (e.g., doubling $N$ quadruples compute for small $m$); automated hyperparameter tuners constrained by $L \cdot H \cdot N^{p}$ budgets; windowed/sliding attention and chunk-level routing.
    • Assumptions/dependencies: applies to exact attention; approximate mechanisms can reduce cost but may degrade accuracy.
  • Compiler and kernel engineering priorities
    • Sectors: AI systems, hardware vendors, library developers.
    • What to do: focus on I/O and memory optimizations (e.g., FlashAttention), kernel fusion, and parallel scheduling across heads/layers to remove overheads—while recognizing arithmetic lower bounds prevent asymptotic speedups for exact attention.
    • Tools/products/workflows: graph compilers that parallelize heads (not amortize them); per-head/layer sharding strategies; operator fusion that reduces memory traffic.
    • Assumptions/dependencies: bounds target arithmetic complexity; practical gains still achievable via better memory locality/parallelism.
  • Serving policies and prompt management
    • Sectors: product teams deploying LLM features; consumer apps; security.
    • What to do: implement token-length caps and dynamic truncation policies; expose UI/UX feedback on “long prompt → higher latency/cost”; rate-limit or charge for very long contexts.
    • Tools/products/workflows: server schedulers that price/queue by estimated $L \cdot H \cdot N^2$ or $L \cdot H \cdot N^\omega$ cost; prompt-length governance; autoscaling rules.
    • Assumptions/dependencies: impacts are strongest for exact attention; approximate or sparse attention changes constants/behavior.
  • Verification of “faster exact attention” claims
    • Sectors: industry R&D, academia, procurement.
    • What to do: benchmark and audit claims of subquadratic (small $m$) or sub-$N^\omega$ (large $m$) exact attention; require stated assumptions (approximations, restricted inputs, or hardware/IO improvements) for any claimed asymptotic gains.
    • Tools/products/workflows: reproducible benchmarks; checklists for reviewers and buyers (is the method approximate? does it use structure?).
    • Assumptions/dependencies: worst-case lower bounds; structure-dependent speedups may be valid but are not general.
  • Domain workflows to keep effective context small
    • Sectors: healthcare (EHR summarization), legal, finance (report analysis), education (long documents), media.
    • What to do: prefer retrieval-augmented generation (RAG), chunk-and-summarize pipelines, hierarchical encoders, and streaming to control $N$.
    • Tools/products/workflows: document pre-segmentation, hierarchical summaries, adaptive windowing for time series/sensor data, KV-cache compression.
    • Assumptions/dependencies: maintains downstream task quality; approximate across-chunk dependencies may lose fidelity versus full attention.
  • Parallel resource scheduling across heads and layers
    • Sectors: cloud providers, MLOps, HPC.
    • What to do: map heads/layers to parallel devices rather than trying to amortize their arithmetic; adopt pipeline and tensor parallelism with per-head sharding.
    • Tools/products/workflows: head-wise GPU partitioning; layer-wise pipelining; distributed scheduling that respects non-amortizable compute.
    • Assumptions/dependencies: network bandwidth and synchronization can dominate; memory constraints may require checkpointing.
  • Research prioritization toward approximate/structured methods
    • Sectors: academia, applied research labs.
    • What to do: focus on approximate attention (kernelized, sparse, low-rank), problem-structure exploitation (e.g., sparsity, locality), and distributional assumptions to escape worst-case bounds; quantify accuracy–efficiency trade-offs.
    • Tools/products/workflows: benchmarks that report both quality loss and speedup; evaluators for task- and distribution-specific gains.
    • Assumptions/dependencies: approximate methods can degrade accuracy as scale grows; guarantees are task- and data-dependent.
  • Sustainability and policy communication
    • Sectors: policy, ESG, datacenter operations.
    • What to do: report and plan energy usage using tight scaling laws; set realistic policy targets for efficiency; justify investments in memory- and matmul-optimized hardware.
    • Tools/products/workflows: sustainability dashboards that forecast emissions with $L \cdot H \cdot N^{p}$; procurement standards requesting complexity disclosures.
    • Assumptions/dependencies: real-world efficiency influenced by utilization and power management; lower bounds guide, but don’t fix, implementation inefficiencies.
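The capacity-planning and serving-policy items above can be folded into one regime-aware estimator following the paper's tight scaling laws (abstract operation counts; the default $\omega$ and the regime cutoff are assumptions to calibrate against measured kernels):

```python
def transformer_cost(L, H, N, m, omega=2.372):
    """Asymptotic operation estimate for exact multi-head attention.

    Small-embedding regime (m well below N): about L * H * N^2 per pass.
    Large-embedding regime (m comparable to N): about L * H * N^omega
    via fast matrix multiplication.
    Constants, MLPs, and memory traffic are ignored; calibrate before use.
    """
    if m >= N:
        return L * H * N ** omega
    return L * H * N * N

# Forecast how latency/cost scales with context length for a fixed model.
costs = {n: transformer_cost(L=32, H=32, N=n, m=128) for n in (1024, 2048, 4096)}
```

Such a function could back the SLA predictors and prompt-length pricing policies sketched above: in the small-$m$ regime, doubling the context quadruples the estimate.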

Long-Term Applications

These directions require further research, infrastructure, or ecosystem evolution.

  • Hardware and algorithm roadmaps tied to matrix multiplication exponent
    • Sectors: semiconductor, AI accelerators, HPC.
    • Opportunity: reducing $\omega$ (via algorithms or hardware primitives) gives proportional gains for large-$m$ transformers (exact attention ≈ $L \cdot H \cdot N^\omega$).
    • Tools/products/workflows: matmul-centric architectures; compiler support for novel fast-matmul schemes; co-design of numerical formats.
    • Assumptions/dependencies: fast-matmul constant factors and numerical stability; integration into training/inference stacks.
  • Architectures that circumvent worst-case lower bounds
    • Sectors: ML research, foundation model labs.
    • Opportunity: alternative attention mechanisms, external memory/indexing, routing, or modular architectures that reduce dependence on all-pairs interactions or exact softmax.
    • Tools/products/workflows: hybrid search+attention, learned retrieval, adaptive sparsity, hierarchical tokenization, compressed state machines.
    • Assumptions/dependencies: may trade exactness for approximation or impose structure; theoretical escape from bounds depends on changing the problem/model class.
  • Complexity-aware neural compilers and auto-diff systems
    • Sectors: ML compilers, frameworks.
    • Opportunity: leverage the paper’s extended Baur–Strassen perspective to design compilers and AD passes that reason about gradient cost and information extraction; establish lower bounds for training-time computations in transformer-like graphs.
    • Tools/products/workflows: IR passes that detect non-amortizable subgraphs; gradient reuse schedules guided by circuit-theoretic limits.
    • Assumptions/dependencies: mapping between circuit lower bounds and practical graph transformations requires careful engineering.
  • Complexity-informed Neural Architecture Search (NAS) and AutoML
    • Sectors: AutoML platforms, enterprise ML.
    • Opportunity: incorporate tight compute budgets using L·H·N^p constraints in objective functions for multi-objective NAS (latency, energy, cost, quality).
    • Tools/products/workflows: NAS controllers with explicit asymptotic penalties; workload-aware hyperparameter tuners.
    • Assumptions/dependencies: surrogate latency models must be calibrated to hardware; may need task-specific constraints.
  • Standards and auditing for claimed efficiency improvements
    • Sectors: standards bodies, regulators, industry consortia.
    • Opportunity: develop benchmarks/criteria distinguishing exact vs. approximate attention and worst-case vs. structured inputs, aligned with the proven lower bounds.
    • Tools/products/workflows: certification suites; disclosure templates (compute complexity class, approximation regime).
    • Assumptions/dependencies: adoption depends on ecosystem consensus and vendor cooperation.
  • Tokenization and data governance to reduce effective context length
    • Sectors: data engineering, content platforms.
    • Opportunity: evolve tokenization standards that yield a smaller N without harming utility; pipeline-level summarization and compression policies.
    • Tools/products/workflows: semantic tokenizers, chunk-level deduplication, multilingual compression strategies.
    • Assumptions/dependencies: downstream task performance and fairness require careful validation.
  • Domain-tailored subquadratic guarantees
    • Sectors: finance (time series), healthcare (structured EHR), scientific computing (sparse signals).
    • Opportunity: prove and exploit subquadratic algorithms under domain structure assumptions (sparsity, bounded alphabet, locality).
    • Tools/products/workflows: structured kernels, block-sparse attention with theoretical guarantees, problem-specific caching.
    • Assumptions/dependencies: gains are not worst-case; must validate structure holds in production.
  • Education and workforce development
    • Sectors: academia, training programs.
    • Opportunity: integrate direct-sum principles and transformer hardness into curricula to set realistic expectations for acceleration opportunities.
    • Tools/products/workflows: course modules, lab exercises using the L·H·N^p scaling and Baur–Strassen insights.
    • Assumptions/dependencies: materials must bridge theory and systems practice to be effective.
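As one concrete instance of the complexity-informed NAS direction above, a compute-budget penalty built from the L·H·N^p operation count might look as follows; the budget value, weighting, and function names are illustrative assumptions:

```python
import math

def complexity_penalty(L, H, N, p=2.0, budget=1e15):
    """Log-scale overshoot of the L * H * N^p operation count
    relative to a compute budget; zero when within budget."""
    ops = L * H * N ** p
    return max(0.0, math.log10(ops / budget))

def nas_score(accuracy, L, H, N, lam=0.1):
    """Toy multi-objective NAS score: task accuracy minus a
    weighted asymptotic-complexity penalty."""
    return accuracy - lam * complexity_penalty(L, H, N)
```

A real NAS controller would calibrate this surrogate against measured hardware latency, per the assumptions noted above.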

Notes on Assumptions and Scope

  • The small-m lower bound relies on SETH/3-OV; the large-m lower bound is unconditional in the extended arithmetic circuit model (with exp/ln gates) when ω > 2. If ω = 2, the small-m bound gives the matching L·H·N^{2−o(1)} rate.
  • Results target exact attention; approximate or structured attention can beat these bounds at potential accuracy trade-offs or under distributional/structural assumptions.
  • Sum vs. concatenation aggregation does not alter the lower bounds materially (sum reduces to concat via a linear map).
  • Hardware, memory bandwidth, and software engineering can still deliver significant constant-factor gains even when asymptotic arithmetic complexity is tight.
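A rough operation-count model for the two regimes discussed above (constants and lower-order terms dropped; ω ≈ 2.371 is used only for illustration):

```python
def exact_attention_ops(N, m, omega=2.371):
    """Per-head operation count for exact attention, up to N^{o(1)} factors.
    Small embedding (m << N): the N x m times m x N product dominates -> N^2 * m.
    Large embedding (m ~ N): fast matrix multiplication gives N^omega."""
    if m >= N:
        return N ** omega
    return N ** 2 * m
```

The paper's lower bounds say these per-head counts cannot be amortized across the L·H heads: the totals L·H·N^{2+o(1)} (small m, under SETH) and L·H·N^{ω−o(1)} (m = N, in the extended circuit model when ω > 2) are essentially optimal.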

Glossary

  • 3-OV Hypothesis: A fine-grained hardness conjecture asserting that detecting an orthogonal triple across three sets of Boolean vectors needs near-cubic time. "The 3-OV Hypothesis states that finding a triplet of orthogonal vectors among a set of n vectors from {0,1}^{Θ(log n)} requires n^{3−o(1)} time."
  • 3-SUM: The problem of determining whether any three numbers (often integers) sum to zero; widely used as a fine-grained hardness assumption. "For example, it is conjectured that any transformer that is capable of computing 3-SUM requires polynomial size (i.e. mHL = N^{Ω(1)})~\cite{DBLP:conf/nips/SanfordHT23, DBLP:conf/icml/Sanford0T24}."
  • Arithmetic circuit: A computational model using gates for +, −, ×, ÷ over a field or ring to compute algebraic functions; the size is the number of gates. "In the standard arithmetic circuit model, an algorithm only uses standard arithmetic operations {+, −, ×, /}."
  • Backpropagation algorithm: The standard method for efficiently computing gradients in neural networks via the chain rule. "Perhaps the most famous application of the Baur-Strassen theorem is the backpropagation algorithm, a fundamental building block of efficient training of neural networks \cite{rumelhart1986learning}."
  • Baur-Strassen theorem: A result stating that all partial derivatives of a function computed by an arithmetic circuit can be computed with only constant-factor overhead in circuit size. "Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm."
  • Communication complexity: The study of the minimum communication required between parties to compute a function of distributed inputs. "Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design."
  • Concatenation aggregation: A way to combine outputs of multiple attention heads by concatenating their vectors (as opposed to summing). "In \Cref{app:head-aggregation}, we reduce summation aggregation to concatenation aggregation."
  • Denormalization trick: A technique to remove softmax normalization so that exponentiated scores can be accessed directly. "We can use a simple denormalization trick to obtain exp(QK^T) instead (see \Cref{app:omitted-proofs}), from which we can compute AB^T"
  • Denormalized attention head: An attention variant that replaces the row-wise softmax with entry-wise exponentiation, yielding outputs of the form exp(QK^T)V. "we call denormalized attention heads, which replaces the row-wise softmax with entry-wise exp."
  • Direct sum problem: The question of whether solving many instances together can be done asymptotically faster than solving each instance separately. "This is a recurring question in theoretical computer science, typically known as the ``direct sum'' problem, and there are numerous positive and negative examples to this question."
  • Embedding dimension: The dimensionality of token representations (hidden vectors) used inside the model. "where m is known as the embedding dimension,"
  • Extended arithmetic circuit (eAC): An arithmetic circuit augmented with exponential and logarithmic gates to model computations like attention. "Thus, it is natural to also allow exponential gates and logarithmic gates in the arithmetic circuits we consider, and we call them extended arithmetic circuits (eACs)."
  • Fine-grained complexity: A framework studying tight running-time lower bounds under plausible conjectures for specific problems. "a popular conjecture in fine-grained complexity that states that there is no 2^{(1−ε)n} algorithm for satisfiability on n variables"
  • Hardmax attention: An attention mechanism that assigns weight only to keys achieving the maximal score, averaging uniformly over argmax positions. "A hardmax attention head then outputs Z_{h,ℓ} := hardmax(Q_{h,ℓ}(X) K_{h,ℓ}(X)^T) V_{h,ℓ}(X) where hardmax is applied row-wise."
  • Massively Parallel Computation (MPC) model: A parallel computation model with many machines and communication rounds; used for theoretical comparisons to transformers. "connects the transformer model with the Massively Parallel Computation model by proving an equivalence between transformer and MPC protocol under certain parameters,"
  • Matrix multiplication exponent (ω): The infimum exponent such that n×n matrix multiplication can be done in O(n^{ω+o(1)}) operations. "where ω is the matrix multiplication exponent."
  • MLP (multi-layer perceptron): A small feed-forward neural network applied row-wise in transformers as the feed-forward blocks. "before finally applying a row-wise multi-layer perceptron (MLP) function."
  • Multi-head attention: Running multiple attention heads in parallel within a layer and aggregating their outputs. "It consists of LL layers of multi-head attention, where each layer runs HH attention heads in parallel"
  • Orthogonal Vectors (OV) Hypothesis: A conjecture that OV cannot be solved in truly subquadratic time, often used to derive conditional lower bounds. "the trivial algorithm of computing matrix products explicitly is optimal for all m under a generalization of the OV Hypothesis"
  • Query/Key/Value (Q, K, V) embeddings: The three linear embeddings used in attention to produce queries, keys, and values from inputs. "the attention head consists of query, key and value embedding maps Q, K, V : ℝ^{N×m} → ℝ^{N×m}"
  • SETH (Strong Exponential Time Hypothesis): The conjecture that SAT cannot be solved in time 2^{(1−ε)n} for any ε > 0; widely used for conditional lower bounds. "We establish that this is essentially optimal under the Strong Exponential Time Hypothesis (SETH)."
  • Softmax: The normalized exponential function producing a probability distribution; applied row-wise in attention. "while applying a softmax operation to the intermediate product of the first two matrices."
  • split-VC dimension: A complexity measure used in learning theory and recent transformer expressivity lower bounds. "and split-VC dimension~\cite{kozachinskiy2025strassenattentionsplitvc},"
  • Universal approximation theorem: The result that MLPs can approximate any continuous function (on compact sets) given sufficient width. "By the universal approximation theorem which states that any continuous function can be approximated by a MLP (with enough neurons),"
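Several glossary entries above (softmax, denormalized attention head, Q/K/V embeddings) can be tied together in a short NumPy sketch; this is an illustrative forward pass, not code from the paper:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention head: row-wise softmax of QK^T, applied to V."""
    S = Q @ K.T                            # N x N score matrix
    S = S - S.max(axis=1, keepdims=True)   # shift for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)   # row-wise softmax
    return P @ V

def denormalized_attention(Q, K, V):
    """Denormalized head: entry-wise exp replaces the row-wise softmax."""
    return np.exp(Q @ K.T) @ V
```

The denormalization trick quoted above recovers exp(QK^T) itself by probing the denormalized head with suitable value matrices (e.g. V = I).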
