Computational Limits of Deep Learning
- Computational limits of deep learning are defined by scaling laws linking data, model size, and compute, imposing hard constraints on performance improvements.
- Optimization challenges and finite-precision errors restrict network expressivity and reproducibility, despite the theoretical potential of deep architectures.
- Complexity and computability analyses reveal that algorithmic, numerical, and dynamical barriers fundamentally cap the achievable accuracy and efficiency of deep learning models.
The computational limits of deep learning are determined by the interplay between function class complexity, algorithmic scaling laws, optimization dynamics, architectural and hardware bottlenecks, finite-precision computation, and fundamental computability barriers. Recent research provides a rigorous quantitative and qualitative framework for understanding these limits, drawing on approximation theory, information theory, statistical mechanics, circuit complexity, ergodic theory, and computability theory.
1. Scaling Laws: Data, Model Size, and Computational Cost
Empirical and theoretical results establish that generalization error in deep learning exhibits predictable scaling with respect to training data size and model size. For state-of-the-art models and tasks, the generalization error follows a power law in the training-set size $m$:

$$\epsilon(m) \approx \alpha\, m^{-\beta_g},$$

with task-dependent constants $\alpha$ and $\beta_g$. The model size (parameter count $p$) required to fit $m$ samples likewise grows sublinearly:

$$p(m) \propto m^{\beta_p}, \qquad \beta_p < 1.$$

Achieving new “frontier” targets (e.g., surpassing human-level error) necessitates immense resource increases: datasets and models must grow by orders of magnitude, with the required factor depending on the domain (Hestness et al., 2019). The training FLOP cost to process $m$ samples at this scale follows

$$C(m) \approx c\, p(m)\, m,$$

where $c$ is the architecture-dependent per-parameter FLOP constant. For recurrent architectures, the memory footprint per step grows linearly with the parameter count $p$, resulting in multi-terabyte RAM requirements far exceeding today’s hardware capabilities. Thus, beyond accuracy plateaus, compute and memory scaling laws form a hard technical ceiling for current deep learning (Thompson et al., 2020, Hestness et al., 2019, Rosenfeld, 2021).
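These power-law relations can be made concrete with a small calculator. All constants and exponents below are illustrative placeholders, not fitted values from the cited studies.

```python
# Illustrative scaling-law calculator. alpha, beta_g, beta_p, and c are
# hypothetical constants chosen for demonstration, not fitted values.

def generalization_error(m, alpha=1.0, beta_g=0.35):
    """Power-law error eps(m) = alpha * m**(-beta_g) in training-set size m."""
    return alpha * m ** (-beta_g)

def required_params(m, k=1.0, beta_p=0.7):
    """Sublinear model-size growth p(m) = k * m**beta_p."""
    return k * m ** beta_p

def training_flops(m, c=6.0):
    """Total cost C(m) ~ c * p(m) * m: compute grows superlinearly in data."""
    return c * required_params(m) * m

# Each halving of the error multiplies the required data by 2**(1/beta_g),
# which is why frontier targets demand order-of-magnitude resource growth.
data_factor_per_halving = 2 ** (1 / 0.35)
```

With the assumed exponent 0.35, each error halving costs roughly a sevenfold increase in data, and the FLOP budget compounds that through the sublinear but still growing parameter count.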
2. Expressivity Versus Circuit and Algorithmic Complexity
Deep neural networks of sufficient depth and width are Kolmogorov-optimal approximators for broad function classes: the best error achievable with $M$ nonzero weights over a class $\mathcal{C}$ scales as $M^{-\gamma^*}$, where the optimal exponent $\gamma^*$ is dictated by the metric entropy of $\mathcal{C}$ (Elbrächter et al., 2019). For many smooth or structured function spaces (e.g., balls in Besov or modulation spaces), deep ReLU networks achieve the minimax-optimal rate. Moreover, for analytic operations, including multiplication, polynomials, sinusoids, oscillatory textures, and certain fractals, constructions exist for networks with exponential-in-$M$ (“spectral”) approximation rates.
Shallow (finite-depth) networks, regardless of width, cannot attain exponential rates for high-frequency or highly oscillatory functions; thus depth is strictly necessary for Kolmogorov-optimality. However, the realizability of these formal rates in practical training is strongly constrained by computational and numerical factors (see below) (Elbrächter et al., 2019).
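The exponential-in-depth rates for analytic operations can be illustrated with the classical sawtooth construction for squaring (Yarotsky-style), which underlies many of these approximation results; the grid resolution and depth values below are arbitrary illustrative choices.

```python
# Yarotsky-style approximation of x^2 on [0, 1]: composing the ReLU-expressible
# "hat" function m times yields a piecewise-linear interpolant whose sup error
# is bounded by 2**(-2m - 2), i.e., it decays exponentially in depth.

def hat(x):
    # hat(x) = 2*min(x, 1 - x), realizable with two ReLU units
    return 2 * x if x < 0.5 else 2 * (1 - x)

def square_approx(x, m):
    approx, g = x, x
    for s in range(1, m + 1):
        g = hat(g)              # s-fold composition corresponds to depth s
        approx -= g / 4 ** s
    return approx

grid = [i / 1000 for i in range(1001)]
def err(m):
    return max(abs(square_approx(x, m) - x * x) for x in grid)
# err(m) <= 2**(-2*m - 2): four-fold error reduction per added layer
```

A shallow network of comparable width cannot match this four-fold-per-layer error reduction, which is the concrete sense in which depth is necessary for spectral rates.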
3. Optimization and Trainability Barriers
In practice, learning error decomposes into three dominant sources:

$$\epsilon \approx \epsilon_{\min} + \epsilon_{\mathrm{unc}}(m) + \epsilon_{\mathrm{def}},$$

where $\epsilon_{\min}$ is the theoretical lower bound (Bayes/realizability error), $\epsilon_{\mathrm{unc}}(m)$ is the uncertainty error due to finite sample size (power-law decay, $\epsilon_{\mathrm{unc}}(m) \propto m^{-\beta}$), and $\epsilon_{\mathrm{def}}$ is a learning-deficiency plateau due to optimization failures or inductive-bias misalignment. Empirically, $\epsilon_{\mathrm{def}}$ persists even for overparametrized models and large $m$. Achieving small total error requires growing both the dataset size $m$ and the model size $p$ as power laws with exponents near unity, which implies rapidly exploding compute, memory, and data requirements (Rosenfeld, 2021).
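A minimal sketch of how the plateau and the power-law exponent can be separated from a measured learning curve; the curve here is synthetic, and all constants are made up for illustration.

```python
import numpy as np

# Synthetic learning curve eps(m) = plateau + a * m**(-b); the goal is to
# recover the plateau and the exponent b. All constants are made up.
plateau_true, a, b = 0.05, 2.0, 0.5
m = np.logspace(2, 6, 20)
eps = plateau_true + a * m ** (-b)

def fit(p):
    # After subtracting a candidate plateau p, a pure power law is linear
    # in log-log coordinates; return the fit residual and slope.
    x, y = np.log(m), np.log(eps - p)
    coef = np.polyfit(x, y, 1)
    return np.sum((np.polyval(coef, x) - y) ** 2), coef[0]

candidates = np.linspace(0.0, 0.049, 50)
results = [fit(p) for p in candidates]
best = int(np.argmin([r for r, _ in results]))
plateau_hat, slope_hat = candidates[best], results[best][1]
# plateau_hat approaches 0.05 and slope_hat approaches -b = -0.5
```

The same grid-search-plus-log-linear-fit procedure is how plateaus are typically distinguished from slow power-law decay in empirical scaling studies.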
Beyond data and model scaling, a fundamental training-time lower bound arises from stochastic-thermodynamic principles: under gradient flow or Langevin dynamics, the minimal time $T$ to morph the weight distribution from the random initialization $\rho_0$ to the trained distribution $\rho_T$ satisfies

$$T \ \ge\ \frac{W_2(\rho_0, \rho_T)^2}{\Sigma},$$

where $W_2$ is the Wasserstein-2 distance between weight distributions and $\Sigma$ is the entropy production (irreversibility) accumulated during training. For linearized models (NTK regime) under plausible label and spectrum statistics, large-scale training achieves near-optimal scaling of this bound (Seroussi et al., 2023).
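For intuition, a speed limit of the form T ≥ W₂(ρ₀, ρ_T)²/Σ can be evaluated in closed form when the initial and final weight marginals are one-dimensional Gaussians. The entropy-production value below is an assumed placeholder, with temperature and mobility constants absorbed into it.

```python
import math

# Speed-limit sketch: T >= W2(rho_0, rho_T)**2 / Sigma, evaluated for 1-D
# Gaussian weight marginals, where W2 has a closed form. Sigma is made up.

def w2_gaussian(mu0, s0, mu1, s1):
    """Wasserstein-2 distance between N(mu0, s0^2) and N(mu1, s1^2)."""
    return math.sqrt((mu0 - mu1) ** 2 + (s0 - s1) ** 2)

w2 = w2_gaussian(0.0, 1.0, 0.5, 0.2)  # wide init -> concentrated trained marginal
sigma = 2.0                           # assumed total entropy production
t_min = w2 ** 2 / sigma               # lower bound on training time
```

The bound captures the trade-off in the text: moving the weight distribution farther (larger W₂) or training more reversibly (smaller Σ) both force longer training times.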
4. Algorithmic and Complexity-Theoretic Barriers
Information-theoretic and computational-complexity results show that there are concept classes for which no descent-based, polynomial-time deep learning algorithm can succeed, even when the network class is fully expressive. The cross-predictability measure, roughly

$$\mathrm{CP} \ =\ \mathbb{E}_{F, F'}\!\left[\ \mathbb{E}_X[F(X)\,F'(X)]^2\ \right]$$

for two target functions $F, F'$ drawn independently from the task distribution, captures this obstruction: for tasks where CP decays super-polynomially in the relevant size parameter, no memory-limited or randomly initialized gradient-based algorithm achieves nontrivial accuracy. Examples include learning parities, random high-degree monomials, certain global graph properties, and some arithmetic tasks. These negative results do not contradict representational completeness; they arise from computational limits of descent-based optimization (Abbe et al., 2018).
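The obstruction is concrete for parities: two independent uniformly random parity functions are uncorrelated unless they have the same index set, so the cross-predictability of the full parity class on {-1, +1}^n equals 2^(-n). A brute-force check for small n:

```python
import itertools

# Exact cross-predictability of the parity class on {-1, +1}**n:
# CP = E_{S,T}[ E_X[chi_S(X) * chi_T(X)]**2 ] = Pr[S == T] = 2**(-n),
# since distinct parity characters are exactly orthogonal on the cube.

def parity(subset, x):
    v = 1
    for i in subset:
        v *= x[i]
    return v

def cross_predictability(n):
    subsets = [s for k in range(n + 1)
               for s in itertools.combinations(range(n), k)]
    cube = list(itertools.product([-1, 1], repeat=n))
    total = 0.0
    for s in subsets:
        for t in subsets:
            corr = sum(parity(s, x) * parity(t, x) for x in cube) / len(cube)
            total += corr ** 2
    return total / len(subsets) ** 2
```

The exponential decay in n is exactly the regime in which the negative results above rule out descent-based learning, despite parities being trivially representable by small networks.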
In deep attention networks and SSM models, sharp sample-complexity phase transitions occur: below a critical sample ratio $\alpha_c$ (with $\alpha = n/d$ for $n$ samples in dimension $d$), weak recovery is information-theoretically impossible; above it, recovery becomes achievable by Bayesian-optimal inference, and in some regimes by approximate message passing (AMP). A gap can exist between the information-theoretic and algorithmic thresholds, i.e., phases where learning is statistically possible but computationally intractable for polynomial-time methods. Further, sequential layer-wise recovery can enforce multi-step learning timescales, evidencing a "grand staircase" of phase transitions (Troiani et al., 2 Feb 2025).
For sequence models, SSMs and Transformers with finite precision can only recognize regular languages. Function-composition tasks require state sizes, step counts, or chain-of-thought passes that grow with the problem domain size $N$. This limitation is reflected in catastrophic performance breakdowns on arithmetic and multi-step reasoning tasks in empirical studies (Zubić et al., 2024).
Recent circuit-complexity results show that FlowAR, Transformer, and softmax-attention architectures are simulable in uniform $\mathsf{TC}^0$, i.e., by constant-depth, polynomial-size threshold circuits. Their expressivity is thus bounded by $\mathsf{TC}^0$: they cannot realize problems believed to lie outside this class, such as $\mathsf{NC}^1$-hard word problems or $\mathsf{NL}$-complete reachability, unless architectural modifications introduce superconstant depth or explicit algorithmic modules (Gong et al., 23 Feb 2025).
5. Numerical and Precision-Induced Barriers
The theoretical expressivity of deep ReLU networks depends on exponential growth of affine (linear) regions with depth. Under floating-point arithmetic, with realistic rounding errors and roundoff accumulation, the number of distinguishable affine pieces produced by gradient descent is polynomial in network size and linear in depth—even after many training steps. Thus, the practical complexity realized by real-world training is orders of magnitude lower than the exponential benchmarks assumed by approximation theory. Achieving exponential region multiplicity would require either exponentially many descent steps or unphysical precision levels. This imposes a polynomial ceiling on network complexity in realistic SGD-based training (Karner et al., 2022).
Empirically, activation region counts, measured under both practical and idealized scenarios, rapidly collapse during training if precision is not extremely high, and under moderate or high noise the complexity cannot be sustained. As a consequence, finite-precision phenomena are an inescapable computational limit for ReLU architectures.
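The precision effect can be probed directly by counting distinct activation patterns along a line through input space at two floating-point precisions. The architecture, layer sizes, and the particular line below are arbitrary choices for illustration.

```python
import numpy as np

# Count distinct ReLU activation (sign) patterns along a 1-D segment through
# input space, comparing float64 against float16 arithmetic. Each distinct
# pattern corresponds to one affine piece of the network along the segment.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((64, 2)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((64, 64)), rng.standard_normal(64)

def pattern_count(ts, dtype):
    pts = np.stack([ts, 1.0 - ts], axis=1).astype(dtype)   # segment in R^2
    h1 = np.maximum(pts @ W1.T.astype(dtype) + b1.astype(dtype), 0)
    h2 = np.maximum(h1 @ W2.T.astype(dtype) + b2.astype(dtype), 0)
    patterns = np.concatenate([h1 > 0, h2 > 0], axis=1)
    return len({tuple(row) for row in patterns})

ts = np.linspace(-3.0, 3.0, 20000)
n64 = pattern_count(ts, np.float64)
n16 = pattern_count(ts, np.float16)
# Reduced precision tends to merge nearby affine pieces, so one typically
# observes n16 <= n64, mirroring the collapse described above.
```

The measured counts also stay far below the exponential-in-depth benchmarks of approximation theory, consistent with the polynomial ceiling discussed in the text.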
6. Computability and Undecidability Constraints
Functional and computational analysis on digital hardware (Turing machines) establishes that certain problems encountered in deep learning are incomputable in a strong sense. For inverse problems such as basis pursuit and Lasso in finite dimensions, Banach–Mazur computability fails when regularization parameters are small: it is impossible to uniformly approximate any single-valued solution map below a fixed “breakdown-$\epsilon$” threshold ($1/4$ for basis pursuit, $1/8$ for Lasso). Therefore, no digital computation—gradient-based or otherwise—can guarantee arbitrary-precision reconstruction. This fundamental accuracy bound holds even for well-posed, well-conditioned problem instances (Boche et al., 2022).
For continuous-data classification and network training, similar computability barriers appear. Most nontrivial continuous classification problems lack computable solution functions; even if some computable approximation is constructed, it is algorithmically undecidable whether a given prediction is within a specified error margin of the true function—thus, a general-purpose “exit flag” (warning of prediction unreliability) is itself uncomputable (Boche et al., 2024). Furthermore, even in well-posed network training scenarios, it is impossible to provide a universal computable training algorithm that, from finite sample data, recovers the exact network realization or certifies exact fitting.
Relaxation to approximate solutions (e.g., $\epsilon$-close interpolation on finite samples) restores computability in principle, and explicit quantization (fixing domain and parameter grids) recovers decidability and algorithmic solvability at the cost of approximation error and practical infeasibility (enumeration-based methods). Still, inherent computability failures on real-valued domains place a non-negotiable ceiling on the trustworthiness of DL systems for safety- or correctness-critical applications (Boche et al., 2024, Boche et al., 2022).
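A toy version of the quantization argument, using a hypothetical one-neuron model and parameter grid: once the parameter domain is finite, the existence of an ε-close fit is settled by a terminating enumeration, exactly the decidability-by-quantization trade described above.

```python
import itertools

# Toy illustration: over a finite (quantized) parameter grid, the question
# "does an eps-close fit of the samples exist?" is decidable by brute force.
# The one-neuron model, grid, and samples are hypothetical choices.
samples = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
grid = [i / 10 for i in range(-20, 21)]   # quantized weight/bias values
eps = 0.05

def relu(z):
    return max(z, 0.0)

solutions = [
    (w, b) for w, b in itertools.product(grid, grid)
    if all(abs(relu(w * x + b) - y) <= eps for x, y in samples)
]
# The enumeration terminates, so existence of an eps-fit is decidable;
# the price is grid-approximation error and exponential enumeration cost.
```

On real-valued (unquantized) domains no such terminating procedure exists in general, which is the computability gap the section describes.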
7. Dynamical, Ergodic, and Predictability Limits
The training dynamics of deep networks, understood as high-dimensional discrete-time dynamical systems, exhibit additional sources of unpredictability. Riddled basins of attraction emerge from a combination of chaotic learning dynamics and invariance under symmetry (parameter-permutation or sign-flip symmetries). These riddled basins have fractal boundaries of co-dimension nearly zero, as quantified by an uncertainty exponent close to zero. In this regime, small perturbations of initial parameters, down to near machine precision, do not suffice to reliably predict the outcome; even huge increases in initialization accuracy yield at best marginal gains in reproducibility. This property is robust to changes in optimizer, batch size, or hardware, and is intrinsically linked to maximizing performance: parameter regimes yielding the highest average accuracy also exhibit severe riddling (Ly et al., 7 Oct 2025).
As a result, outcome-level unpredictability, irreproducibility, and the impossibility of ex-ante certification of learned model behavior are unavoidable features of modern deep-learning optimization, even in fully deterministic, idealized settings.
Collectively, these limits demarcate the boundaries of deep learning’s computational reach. They arise from multiple sources: scaling laws that make brute-force improvement increasingly unsustainable; complexity-theoretic and information-theoretic learning barriers that block polynomial-time algorithms; numerical and finite-precision effects that cap practical expressivity; computability and undecidability phenomena that thwart perfect reliability; and dynamical effects that render outcomes fundamentally irreproducible. Addressing or circumventing these would require paradigm shifts in algorithm design, architecture, hardware, and theoretical understanding.