On the Computational Hardness of Transformers
Abstract: The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism involves multiplying three $N \times m$ matrices, applying softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention. Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of ``direct sum'' problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design. Thus, we ask whether transformers can be computed more efficiently than $LH$ independent evaluations of attention. In this paper, we resolve this question in the negative, and give the first non-trivial computational lower bounds for multi-head multi-layer transformers. In the small embedding regime ($m = N^{o(1)}$), computing $LH$ attention heads separately takes $LHN^{2 + o(1)}$ time. We establish that this is essentially optimal under SETH. In the large embedding regime ($m = N$), one can compute $LH$ attention heads separately using $LHN^{\omega + o(1)}$ arithmetic operations (plus exponentiations), where $\omega$ is the matrix multiplication exponent. We establish that this is optimal, by showing that $LHN^{\omega - o(1)}$ arithmetic operations are necessary when $\omega > 2$. Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm.
Explain it Like I'm 14
What is this paper about?
This paper asks a simple-sounding but important question: when you run a Transformer (the kind of AI model used in chatbots and image tools), can you compute all its attention heads and layers much faster by doing them “together,” instead of one by one?
Transformers use "self-attention" to let each word (or image patch) look at all the others and decide what matters. That step is powerful, but expensive. If your input has N tokens (like words), one attention head roughly costs time that grows like N^2. A full Transformer stacks L layers and uses H heads per layer, so the straightforward way takes about LHN^2 work (ignoring some details).
This paper shows that, in general, you cannot do much better than that. In other words, there’s no big shortcut to compute many heads and layers at once—at least not without changing what attention is or relying on breakthroughs in other hard problems.
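To see where the quadratic cost comes from, here is a minimal NumPy sketch of a single exact attention head. The function name, the random weights, and the 1/sqrt(m) scaling are illustrative assumptions for this summary, not details taken from the paper:

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One exact attention head: softmax(Q K^T / sqrt(m)) V.

    X: (N, m) token embeddings; Wq/Wk/Wv: (m, m) projection weights.
    The Q @ K.T product alone touches all N*N token pairs, which is
    where the quadratic dependence on the sequence length N comes from.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # three (N, m) matrices
    scores = Q @ K.T / np.sqrt(X.shape[1])          # (N, N) match scores
    scores -= scores.max(axis=1, keepdims=True)     # stabilize exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                              # (N, m) output

# Tiny usage example with illustrative sizes
rng = np.random.default_rng(0)
N, m = 6, 4
X = rng.standard_normal((N, m))
W = [rng.standard_normal((m, m)) for _ in range(3)]
out = attention_head(X, *W)
print(out.shape)  # (6, 4)
```

A full transformer would run this L·H times, which is exactly the cost the paper proves cannot be amortized away.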
What questions do the authors ask?
- Can we compute many attention heads (and layers) faster than computing each one separately?
- Is there a “bulk discount” when you have lots of identical computations inside a Transformer?
- If not, can we prove that the straightforward approach is basically the best we can do?
These are examples of a “direct sum” question: if you have many copies of the same task, can doing them together be significantly cheaper than doing them one at a time?
How did they study the problem?
To keep things concrete, the authors consider two settings for the “embedding dimension” (the size of each token’s vector):
- Small embedding: m grows slowly with N (think "around N^{o(1)}").
- Large embedding: m is about the same size as N.
They use ideas from theoretical computer science that let you argue about limits on speed—kind of like saying “if you could do this fast, then you could also solve another notoriously hard problem fast,” which most experts believe is unlikely.
Key ideas explained in everyday terms
- Attention as “matching and weighing”: Each token creates “queries” and compares them to “keys” of other tokens. The better the match, the more weight it assigns to those tokens’ “values.” A softmax turns the match scores into smooth weights that sum to 1.
- Why attention is costly: Every token compares with every other token, which is why the cost grows like N^2.
- Conditional lower bounds: They assume widely believed ideas like SETH (the Strong Exponential Time Hypothesis), which basically says some hard problems can’t be solved super quickly. If a fast Transformer algorithm would break those beliefs, that’s strong evidence such an algorithm is unlikely to exist.
- Matrix multiplication exponent ω: This is a number that measures how fast the best-known algorithms can multiply big matrices. Today, ω is a bit above 2 (around 2.37). If you can't beat matrix multiplication for certain tasks, that sets a lower bound on time.
- Circuits with exp/log: Attention uses exponentials (in softmax), so the authors study a realistic “recipe book” of basic operations that includes +, −, ×, ÷, exp, and log. They prove limits in this model.
- Baur–Strassen theorem: A classic result that says if you can compute a function with a certain number of steps, you can also compute all its partial derivatives with only a constant-factor extra cost. Intuitively, it lets them extract a lot of hidden information from a computation, which helps prove lower bounds.
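The Baur–Strassen idea is essentially what reverse-mode automatic differentiation does: one backward sweep over a circuit yields every partial derivative at roughly the cost of evaluating the function once. Here is a hand-rolled sketch on a toy circuit; the function f = exp(x0·x1) + x0 and all names are illustrative, not from the paper:

```python
import math

def forward(x):
    """Evaluate the toy circuit gate by gate, keeping every value."""
    v = list(x)                    # v[0], v[1]: the two inputs
    v.append(v[0] * v[1])          # v[2] = x0 * x1
    v.append(math.exp(v[2]))       # v[3] = exp(x0 * x1)
    v.append(v[3] + v[0])          # v[4] = f = exp(x0*x1) + x0
    return v

def backward(v):
    """One reverse sweep propagates adjoints; each gate is visited once,
    so ALL partials cost only a constant factor more than computing f."""
    bar = [0.0] * len(v)
    bar[4] = 1.0                                      # df/df
    bar[3] += bar[4]; bar[0] += bar[4]                # f = v3 + v0
    bar[2] += bar[3] * v[3]                           # v3 = exp(v2)
    bar[0] += bar[2] * v[1]; bar[1] += bar[2] * v[0]  # v2 = v0 * v1
    return bar[0], bar[1]          # (df/dx0, df/dx1)

x = (0.5, -1.0)
v = forward(x)
g = backward(v)
# Analytic check: df/dx0 = x1*exp(x0*x1) + 1, df/dx1 = x0*exp(x0*x1)
```

The same one-backward-pass structure is what makes backpropagation cheap, and what the paper exploits to extract many matrix products from a fast transformer circuit.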
What did they find, and why is it important?
Here are the main takeaways.
- Small embedding dimension (roughly m = N^{o(1)}):
- Result: Any algorithm needs about LHN^2 time (up to tiny factors), even if you remove the MLPs.
- Why: If you could do much better, you could also solve another famous hard problem much faster than believed, contradicting SETH.
- Meaning: Computing each head separately—the “naive” method—is basically optimal.
- Large embedding dimension (m about N):
- Result: In a realistic model of computation (allowing +, −, ×, ÷, exp, log), any algorithm needs about LHN^ω arithmetic steps (again up to tiny factors), assuming ω > 2.
- How they argue this: They show that if you could compute a Transformer much faster, you could also multiply many pairs of matrices faster than known limits. Using the Baur–Strassen theorem, they turn a fast Transformer computation into a way to recover many matrix products, implying a lower bound.
- Meaning: Even when each token’s vector is large, the obvious approach (which relies on fast matrix multiplication) is basically as fast as possible.
Overall importance:
- These results say there’s no big general-purpose shortcut for exact, full attention across many heads and layers. The cost really does add up linearly in the number of heads and layers.
- This helps explain why practical speedups focus on approximations, sparsity, clever memory layouts (like FlashAttention), or hardware—rather than hoping for a big algorithmic speedup that keeps exact attention.
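A back-of-the-envelope cost model makes these takeaways concrete. The function below is an illustrative sketch only: constants, MLPs, and memory traffic are ignored, and the regime split is simplified:

```python
def naive_transformer_cost(N, m, L, H, omega=2.371):
    """Rough arithmetic-operation count for exact multi-head attention.

    Small-m regime: each head costs about N^2 * m, so total ~ L*H*N^2
    when m is tiny. Large-m regime (m ~ N): fast matrix multiplication
    gives about N^omega per head, total ~ L*H*N^omega. The paper's
    lower bounds say these totals are essentially optimal: no
    amortization across heads or layers is possible in the worst case.
    """
    per_head = N**2 * m if m < N else N**omega
    return L * H * per_head

# Doubling the context length N roughly quadruples cost in the small-m regime:
small = naive_transformer_cost(N=1024, m=64, L=12, H=12)
doubled = naive_transformer_cost(N=2048, m=64, L=12, H=12)
print(doubled / small)  # 4.0
```

Note the linear dependence on L and H: doubling either exactly doubles the estimate, matching the "no bulk discount" message.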
Why this matters and what it means
- For researchers and engineers: Don’t expect dramatic universal speedups for exact multi-head, multi-layer attention just by “sharing work” across heads or layers. The straightforward method is close to the best you can do in the worst case.
- For practice: Real-world systems will likely keep relying on:
- Approximations to attention (which trade some accuracy for speed),
- Architectural tweaks (changing how models use attention),
- Hardware and software optimizations (better kernels, memory layouts),
- Or smaller input lengths via chunking or hierarchical processing.
- For theory: It’s a clean “no” to the direct-sum hope for Transformers: computing many attention heads doesn’t magically get cheaper than doing them one by one. The paper also introduces a neat use of the Baur–Strassen theorem to reason about Transformers, which could inspire new theoretical tools.
In short, the paper clarifies the limits of speeding up Transformers: exact attention is inherently costly, and that cost scales with the number of heads and layers.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of unresolved issues and concrete directions that the paper leaves open for future work:
- Unconditional lower bounds at ω = 2: The large-embedding lower bound is unconditional only when the matrix multiplication exponent ω > 2; when ω = 2 it falls back to a SETH-based bound. Can one prove unconditional LHN^{2−o(1)} lower bounds for transformers when ω = 2 (or rule them out under standard circuit lower bound barriers)?
- Beyond the extended arithmetic circuit (eAC) model: The large-embedding result is proved in an eAC model (allowing +, −, ×, /, exp, ln). Are there matching lower bounds in other standard models (word-RAM with bit complexity, Turing machine, I/O/communication models, PRAM), and do they require new techniques beyond circuit arguments?
- Precision, dynamic range, and numerical stability: The reductions use hardmax via softmax with large scaling (Θ(log NHL)) and exact reals. What lower bounds hold under finite precision, bounded dynamic range, and numerically stable implementations (e.g., log-sum-exp tricks, fused kernels), including explicit dependence on bit precision?
- Randomized and approximate algorithms: The small-embedding lower bound is conditional and worst-case; the large-embedding bound is circuit-based. Do analogous lower bounds hold for randomized (Monte Carlo/Las Vegas) algorithms and for approximate inference (additive/multiplicative error in the final transformer output), with explicit error–time trade-offs?
- Average-case and distributional hardness: Results are worst-case over weights and inputs. Can one show lower bounds for natural distributions (e.g., typical trained weight distributions, natural language token statistics) or under structural priors (e.g., low-rank/sparse Q/K/V, shared or tied weights), and characterize when amortization across heads/layers becomes possible?
- Causal masking and decoding-time settings: The paper studies self-attention without masks. Do the lower bounds extend to causal/triangular masking and incremental (autoregressive) decoding with KV caching, where amortization across time steps is common?
- Cross-attention, encoder–decoder, and multi-query/grouped-query variants: The results focus on standard self-attention. Are there analogous (tight) lower bounds for cross-attention, encoder–decoder transformers, MQA/GQA, and hybrid attention mechanisms used in practice?
- Aggregation mechanisms and head concatenation: While the paper claims a reduction between sum and concatenation aggregation (appendix), a fully explicit, numerically stable, and low-overhead construction is not detailed in the main text. Can one give tight, model-robust lower bounds for the concatenation+output-projection scheme and other head-aggregation variants (e.g., gated or weighted sums)?
- Impact of standard architectural components: LayerNorm/RMSNorm, positional encodings (sinusoidal, RoPE), biases, scaling by 1/√m, dropout, and residual scaling are not modeled in the lower bounds. Which of these components preserve the hardness results, and do any enable provable amortization across heads/layers?
- MLPs and nonlinearities: The main results remove or idealize MLPs and focus on exp/ln gates. Do the lower bounds extend to transformers with practical MLPs and nonlinearities such as ReLU/GELU/swish, LayerNorm (involving sqrt), and other non-analytic operations, including quantized or piecewise-linear implementations?
- Stronger fine-grained assumptions: The small-embedding lower bound relies on O3 (and thus SETH). Can it be based on weaker or different hypotheses (e.g., OVH, 3SUM, APSP) or shown in unconditional restricted models?
- Approximation vs expressivity: For entire multi-layer multi-head transformers, the paper does not give complexity lower bounds for approximate outputs (beyond the hardmax-softmax approximation used inside the reduction). Can we quantify tight time–accuracy trade-offs for approximating full transformer inference?
- Degree and eAC expressiveness: The large-embedding result uses a simulation argument that eACs provide no advantage for low-degree outputs (to import matrix multiplication lower bounds). Can this be generalized to a broader class of analytic gates/nonlinearities, or to higher-degree but structured computations relevant to attention?
- Explicit normalized softmax construction: The sketch uses “denormalized” attention (exp-only) and claims a modification for normalized softmax. A detailed, explicit, and tight construction (including the cost of normalization and stability) is missing from the main text; making this fully explicit would strengthen the applicability of the lower bound.
- Memory/I/O lower bounds: The results count arithmetic operations/gates. Are there matching lower bounds on memory footprint, data movement, and bandwidth (I/O complexity), especially given practical acceleration (e.g., FlashAttention-style tiling)?
- Parallel and distributed settings: The paper does not provide lower bounds for parallel time or communication in distributed systems. Can one prove tight PRAM lower bounds and distributed/communication lower bounds for computing multi-head multi-layer transformers?
- Training-time (backprop) complexity: While the Baur–Strassen theorem underlies the lower bound (and backprop in practice), the paper only analyzes inference. Are there corresponding lower bounds for computing gradients/backprop through multi-head multi-layer transformers (both per-step and amortized across layers/heads)?
- Parameter sharing and reuse across layers/heads: The worst-case constructions assign distinct C-vectors (or matrices) per head/layer. Do the lower bounds persist under common forms of weight sharing (e.g., ALiBi, shared projections, tied layers), or can sharing provably enable faster-than-direct-sum computation?
- Regimes beyond m = O(N): The results assume m ≤ O(N) or m ≈ N. What are the optimal bounds when m ≫ N or m ≪ N under realistic scaling laws, and can one derive tight upper/lower bounds as functions of (N, m, L, H) simultaneously?
- Specialized structures enabling speedups: Can one characterize structural conditions (low rank, block-sparsity, Toeplitz/diagonal-plus-low-rank in Q/K/V or in attention patterns) under which multi-head/multi-layer amortization is algorithmically possible, and quantify the achievable savings?
- Robustness to masking/constraints from tasks: Many practical tasks impose fixed masks (local, block-sparse, routing). Do similar lower bounds hold under these constraints, or can they be exploited for faster computation without accuracy loss?
- Sum-of-products vs independent products: The reduction for the large-embedding regime converts transformers to many independent matrix products. Are there tighter lower bounds for the multi-output setting that directly capture the algebra of attention (e.g., softmax-normalized products), beyond leveraging generic matrix multiplication hardness?
- Empirical thresholds and constants: The lower bounds are asymptotic. What are the concrete constant factors and finite-N regimes where direct-sum optimality manifests in practice, and how do they compare to state-of-the-art kernels and hardware?
- Extensions to other similarity kernels: The analysis targets dot-product attention. Do analogous tight lower bounds hold for cosine-similarity, additive attention, or kernelized attention (e.g., with fixed feature maps), including when kernels admit fast transforms?
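Several of the open questions above concern finite precision and numerically stable softmax implementations. As a minimal illustration of the log-sum-exp trick mentioned there (a sketch for intuition, not the paper's construction):

```python
import numpy as np

def softmax_rows_naive(S):
    """Direct softmax: overflows once scores exceed ~709 in float64."""
    with np.errstate(over="ignore", invalid="ignore"):
        E = np.exp(S)
        return E / E.sum(axis=1, keepdims=True)

def softmax_rows_stable(S):
    """Log-sum-exp trick: softmax is shift-invariant, so subtracting
    the row max changes nothing mathematically but keeps exp() finite."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

# Large scores like those produced by hardmax-via-softmax scaling:
S = np.array([[1000.0, 1001.0, 999.0]])
print(softmax_rows_naive(S))   # [[nan nan nan]]  (inf/inf after overflow)
print(softmax_rows_stable(S))  # finite weights that sum to 1
```

Finite-precision lower bounds would need to account for exactly this kind of rescaling, which the exact-real circuit model abstracts away.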
Practical Applications
Overview
This paper proves tight lower bounds on the computational cost of exact transformer inference. It shows that computing multiple attention heads and layers cannot be amortized beyond evaluating them independently:
- Small embedding regime (m = N^{o(1)}): any algorithm needs LHN^{2−o(1)} time (conditional on SETH/3OV).
- Large embedding regime (m ≈ N): any extended arithmetic circuit (allowing +, −, ×, ÷, exp, ln) needs LHN^{ω−o(1)} operations when ω > 2 (unconditional), matching fast-matrix-multiplication-based implementations.
- The result leverages a novel application of the Baur–Strassen theorem and establishes a “direct-sum” style hardness: many heads/layers are, in the worst case, as hard as evaluating each separately.
Below are practical applications and implications grouped by immediacy.
Immediate Applications
The following items can be incorporated into current tools, workflows, and decision-making.
- Compute planning and cost forecasting for AI workloads
- Sectors: software/cloud, finance, energy, enterprise IT.
- What to do: adopt complexity-based calculators and dashboards that estimate training/inference cost using the tight scaling laws (e.g., small-m bound: cost ∝ LHN^2; large-m bound: cost ∝ LHN^ω with current ω ≈ 2.37).
- Tools/products/workflows: capacity planning tools for clusters; FinOps reporting that ties token length (N), depth (L), and heads (H) to latency and cost; SLA predictors for prompt-length vs. response-time.
- Assumptions/dependencies: exact attention; SETH-conditional for small m; unconditional for large m in the extended arithmetic circuit model; constants and memory/bandwidth still matter in practice.
- Architecture and hyperparameter selection under tight latency/cost budgets
- Sectors: model engineering (all domains using LLMs/ViTs), robotics/edge, mobile.
- What to do: prefer smaller N (context/windowed attention), carefully cap L and H, and control m to meet latency targets; adopt retrieval/chunking to keep effective N small; pick sum vs. concat aggregation only for implementation reasons (the bounds apply to both).
- Tools/products/workflows: design-time rules of thumb (e.g., doubling N quadruples compute for small m); automated hyperparameter tuners constrained by budgets; windowed/sliding attention and chunk-level routing.
- Assumptions/dependencies: applies to exact attention; approximate mechanisms can reduce cost but may degrade accuracy.
- Compiler and kernel engineering priorities
- Sectors: AI systems, hardware vendors, library developers.
- What to do: focus on I/O and memory optimizations (e.g., FlashAttention), kernel fusion, and parallel scheduling across heads/layers to remove overheads—while recognizing arithmetic lower bounds prevent asymptotic speedups for exact attention.
- Tools/products/workflows: graph compilers that parallelize heads (not amortize them); per-head/layer sharding strategies; operator fusion that reduces memory traffic.
- Assumptions/dependencies: bounds target arithmetic complexity; practical gains still achievable via better memory locality/parallelism.
- Serving policies and prompt management
- Sectors: product teams deploying LLM features; consumer apps; security.
- What to do: implement token-length caps and dynamic truncation policies; expose UI/UX feedback on “long prompt → higher latency/cost”; rate-limit or charge for very long contexts.
- Tools/products/workflows: server schedulers that price/queue by estimated compute cost; prompt-length governance; autoscaling rules.
- Assumptions/dependencies: impacts are strongest for exact attention; approximate or sparse attention changes constants/behavior.
- Verification of “faster exact attention” claims
- Sectors: industry R&D, academia, procurement.
- What to do: benchmark and audit claims of subquadratic (small m) or sub-N^ω (large m) exact attention; require stated assumptions (approximations, restricted inputs, or hardware/IO improvements) for any claimed asymptotic gains.
- Tools/products/workflows: reproducible benchmarks; checklists for reviewers and buyers (is the method approximate? does it use structure?).
- Assumptions/dependencies: worst-case lower bounds; structure-dependent speedups may be valid but are not general.
- Domain workflows to keep effective context small
- Sectors: healthcare (EHR summarization), legal, finance (report analysis), education (long documents), media.
- What to do: prefer retrieval-augmented generation (RAG), chunk-and-summarize pipelines, hierarchical encoders, and streaming to control N.
- Tools/products/workflows: document pre-segmentation, hierarchical summaries, adaptive windowing for time series/sensor data, KV-cache compression.
- Assumptions/dependencies: maintains downstream task quality; approximate across-chunk dependencies may lose fidelity versus full attention.
- Parallel resource scheduling across heads and layers
- Sectors: cloud providers, MLOps, HPC.
- What to do: map heads/layers to parallel devices rather than trying to amortize their arithmetic; adopt pipeline and tensor parallelism with per-head sharding.
- Tools/products/workflows: head-wise GPU partitioning; layer-wise pipelining; distributed scheduling that respects non-amortizable compute.
- Assumptions/dependencies: network bandwidth and synchronization can dominate; memory constraints may require checkpointing.
- Research prioritization toward approximate/structured methods
- Sectors: academia, applied research labs.
- What to do: focus on approximate attention (kernelized, sparse, low-rank), problem-structure exploitation (e.g., sparsity, locality), and distributional assumptions to escape worst-case bounds; quantify accuracy–efficiency trade-offs.
- Tools/products/workflows: benchmarks that report both quality loss and speedup; evaluators for task- and distribution-specific gains.
- Assumptions/dependencies: approximate methods can degrade accuracy as scale grows; guarantees are task- and data-dependent.
- Sustainability and policy communication
- Sectors: policy, ESG, datacenter operations.
- What to do: report and plan energy usage using tight scaling laws; set realistic policy targets for efficiency; justify investments in memory- and matmul-optimized hardware.
- Tools/products/workflows: sustainability dashboards that forecast emissions using the LHN^2/LHN^ω scaling laws; procurement standards requesting complexity disclosures.
- Assumptions/dependencies: real-world efficiency influenced by utilization and power management; lower bounds guide, but don’t fix, implementation inefficiencies.
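Relatedly, KV caching (mentioned in the serving and scheduling items above) trades non-amortizable compute for memory that also grows with N, L, and H. A back-of-the-envelope estimator, assuming fp16 storage and a hypothetical model shape (real layouts vary by framework):

```python
def kv_cache_bytes(N, L, H, head_dim, bytes_per_value=2):
    """Illustrative KV-cache footprint for autoregressive decoding.

    Caching keys and values avoids recomputing attention inputs for
    past tokens, but the cache itself grows linearly in context
    length N and in L*H, mirroring the non-amortizable compute cost.
    Assumes fp16 (2 bytes) by default; purely a planning sketch.
    """
    per_token = 2 * L * H * head_dim * bytes_per_value  # K and V per layer/head
    return N * per_token

# e.g. a hypothetical 32-layer, 32-head model with 128-dim heads at N=8192:
gb = kv_cache_bytes(N=8192, L=32, H=32, head_dim=128) / 2**30
print(f"{gb:.1f} GiB")  # 4.0 GiB
```

Estimates like this are why serving policies above emphasize token-length caps: both compute and cache memory scale with the same (N, L, H) parameters.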
Long-Term Applications
These directions require further research, infrastructure, or ecosystem evolution.
- Hardware and algorithm roadmaps tied to matrix multiplication exponent
- Sectors: semiconductor, AI accelerators, HPC.
- Opportunity: reducing ω (via algorithms or hardware primitives) gives proportional gains for large-m transformers (exact attention ≈ LHN^ω).
- Tools/products/workflows: matmul-centric architectures; compiler support for novel fast-matmul schemes; co-design of numerical formats.
- Assumptions/dependencies: fast-matmul constant factors and numerical stability; integration into training/inference stacks.
- Architectures that circumvent worst-case lower bounds
- Sectors: ML research, foundation model labs.
- Opportunity: alternative attention mechanisms, external memory/indexing, routing, or modular architectures that reduce dependence on all-pairs interactions or exact softmax.
- Tools/products/workflows: hybrid search+attention, learned retrieval, adaptive sparsity, hierarchical tokenization, compressed state machines.
- Assumptions/dependencies: may trade exactness for approximation or impose structure; theoretical escape from bounds depends on changing the problem/model class.
- Complexity-aware neural compilers and auto-diff systems
- Sectors: ML compilers, frameworks.
- Opportunity: leverage the paper’s extended Baur–Strassen perspective to design compilers and AD passes that reason about gradient cost and information extraction; establish lower bounds for training-time computations in transformer-like graphs.
- Tools/products/workflows: IR passes that detect non-amortizable subgraphs; gradient reuse schedules guided by circuit-theoretic limits.
- Assumptions/dependencies: mapping between circuit lower bounds and practical graph transformations requires careful engineering.
- Complexity-informed Neural Architecture Search (NAS) and AutoML
- Sectors: AutoML platforms, enterprise ML.
- Opportunity: incorporate tight compute budgets using the LHN^2/LHN^ω scaling laws as constraints in objective functions for multi-objective NAS (latency, energy, cost, quality).
- Tools/products/workflows: NAS controllers with explicit asymptotic penalties; workload-aware hyperparameter tuners.
- Assumptions/dependencies: surrogate latency models must be calibrated to hardware; may need task-specific constraints.
- Standards and auditing for claimed efficiency improvements
- Sectors: standards bodies, regulators, industry consortia.
- Opportunity: develop benchmarks/criteria distinguishing exact vs. approximate attention and worst-case vs. structured inputs, aligned with the proven lower bounds.
- Tools/products/workflows: certification suites; disclosure templates (compute complexity class, approximation regime).
- Assumptions/dependencies: adoption depends on ecosystem consensus and vendor cooperation.
- Tokenization and data governance to reduce effective context length
- Sectors: data engineering, content platforms.
- Opportunity: evolve tokenization standards that yield smaller N without harming utility; pipeline-level summarization and compression policies.
- Tools/products/workflows: semantic tokenizers, chunk-level deduplication, multilingual compression strategies.
- Assumptions/dependencies: downstream task performance and fairness require careful validation.
- Domain-tailored subquadratic guarantees
- Sectors: finance (time series), healthcare (structured EHR), scientific computing (sparse signals).
- Opportunity: prove and exploit subquadratic algorithms under domain structure assumptions (sparsity, bounded alphabet, locality).
- Tools/products/workflows: structured kernels, block-sparse attention with theoretical guarantees, problem-specific caching.
- Assumptions/dependencies: gains are not worst-case; must validate structure holds in production.
- Education and workforce development
- Sectors: academia, training programs.
- Opportunity: integrate direct-sum principles and transformer hardness into curricula to set realistic expectations for acceleration opportunities.
- Tools/products/workflows: course modules, lab exercises using the LHN^2/LHN^ω scaling and Baur–Strassen insights.
- Assumptions/dependencies: materials must bridge theory and systems practice to be effective.
Notes on Assumptions and Scope
- Small-m lower bound relies on SETH/3OV; large-m lower bound is unconditional in the extended arithmetic circuit model (with exp/ln) when ω > 2. If ω = 2, the small-m bound gives the matching rate.
- Results target exact attention; approximate or structured attention can beat these bounds at potential accuracy trade-offs or under distributional/structural assumptions.
- Sum vs. concatenation aggregation does not alter the lower bounds materially (sum reduces to concat via a linear map).
- Hardware, memory bandwidth, and software engineering can still deliver significant constant-factor gains even when asymptotic arithmetic complexity is tight.
Glossary
- 3-OV Hypothesis: A fine-grained hardness conjecture asserting that detecting an orthogonal triple across three sets of Boolean vectors needs near-cubic time. "The 3-OV Hypothesis states that finding a triplet of orthogonal vectors among a set of n vectors from {0,1}^d requires n^{3−o(1)} time."
- 3-SUM: The problem of determining whether any three numbers (often integers) sum to zero; widely used as a fine-grained hardness assumption. "For example, it is conjectured that any transformer that is capable of computing 3-SUM requires polynomial size~\cite{DBLP:conf/nips/SanfordHT23, DBLP:conf/icml/Sanford0T24}."
- Arithmetic circuit: A computational model using gates for +, −, ×, ÷ over a field or ring to compute algebraic functions; the size is the number of gates. "In the standard arithmetic circuit model, an algorithm only uses standard arithmetic operations $\{+, -, \times, /\}$."
- Backpropagation algorithm: The standard method for efficiently computing gradients in neural networks via the chain rule. "Perhaps the most famous application of the Baur-Strassen theorem is the backpropagation algorithm, a fundamental building block of efficient training of neural networks \cite{rumelhart1986learning}."
- Baur-Strassen theorem: A result stating that all partial derivatives of a function computed by an arithmetic circuit can be computed with only constant-factor overhead in circuit size. "Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm."
- Communication complexity: The study of the minimum communication required between parties to compute a function of distributed inputs. "Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design."
- Concatenation aggregation: A way to combine outputs of multiple attention heads by concatenating their vectors (as opposed to summing). "In \Cref{app:head-aggregation}, we reduce summation aggregation to concatenation aggregation."
- Denormalization trick: A technique to remove softmax normalization so that exponentiated scores can be accessed directly. "We can use a simple denormalization trick to obtain the denormalized outputs instead (see \Cref{app:omitted-proofs})."
- Denormalized attention head: An attention variant that replaces the row-wise softmax with entry-wise exponentiation, yielding outputs of the form exp(QKT)V. "we call denormalized attention heads, which replaces the row-wise softmax with entry-wise exponentiation."
- Direct sum problem: The question of whether solving many instances together can be done asymptotically faster than solving each instance separately. "This is a recurring question in theoretical computer science, typically known as the ``direct sum'' problem, and there are numerous positive and negative examples to this question."
- Embedding dimension: The dimensionality of token representations (hidden vectors) used inside the model. "where m is known as the embedding dimension,"
- Extended arithmetic circuit (eAC): An arithmetic circuit augmented with exponential and logarithmic gates to model computations like attention. "Thus, it is natural to also allow exponential gates and logarithmic gates in the arithmetic circuits we consider, and we call them extended arithmetic circuits (eACs)."
- Fine-grained complexity: A framework studying tight running-time lower bounds under plausible conjectures for specific problems. "a popular conjecture in fine-grained complexity that states that there is no 2^{(1−ε)n}-time algorithm for satisfiability on n variables"
- Hardmax attention: An attention mechanism that assigns weight only to keys achieving the maximal score, averaging uniformly over argmax positions. "A hardmax attention head then outputs hardmax(QK^T)V, where hardmax is applied row-wise."
- Massively Parallel Computation (MPC) model: A parallel computation model with many machines and communication rounds; used for theoretical comparisons to transformers. "connects the transformer model with the Massively Parallel Computation model by proving an equivalence between transformer and MPC protocol under certain parameters,"
- Matrix multiplication exponent (ω): The infimum exponent such that n×n matrix multiplication can be done in O(n^{ω+o(1)}) operations. "where ω is the matrix multiplication exponent."
- MLP (multi-layer perceptron): A small feed-forward neural network applied row-wise in transformers as the feed-forward blocks. "before finally applying a row-wise multi-layer perceptron (MLP) function."
- Multi-head attention: Running multiple attention heads in parallel within a layer and aggregating their outputs. "It consists of L layers of multi-head attention, where each layer runs H attention heads in parallel"
- Orthogonal Vectors (OV) Hypothesis: A conjecture that OV cannot be solved in truly subquadratic time, often used to derive conditional lower bounds. "the trivial algorithm of computing matrix products explicitly is optimal for all m under a generalization of the OV Hypothesis"
- Query/Key/Value (Q, K, V) embeddings: The three linear embeddings used in attention to produce queries, keys, and values from inputs. "the attention head consists of query, key and value embedding maps Q, K, V"
- SETH (Strong Exponential Time Hypothesis): The conjecture that SAT cannot be solved in time 2^{(1−ε)n} for any ε > 0; widely used for conditional lower bounds. "We establish that this is essentially optimal under the Strong Exponential Time Hypothesis (SETH)."
- Softmax: The normalized exponential function producing a probability distribution; applied row-wise in attention. "while applying a softmax operation to the intermediate product of the first two matrices."
- split-VC dimension: A complexity measure used in learning theory and recent transformer expressivity lower bounds. "and split-VC dimension~\cite{kozachinskiy2025strassenattentionsplitvc},"
- Universal approximation theorem: The result that MLPs can approximate any continuous function (on compact sets) given sufficient width. "By the universal approximation theorem which states that any continuous function can be approximated by a MLP (with enough neurons),"