Deep Learning is Not So Mysterious or Different

Published 3 Mar 2025 in cs.LG and stat.ML | (2503.02113v2)

Abstract: Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized, using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality.

Summary

  • The paper shows that generalization phenomena like benign overfitting and double descent occur in both deep networks and simpler linear models.
  • It employs soft inductive biases and order-dependent regularization to explain robust generalization without relying on model-specific architecture constraints.
  • The study distinguishes deep networks through unique aspects such as representation learning, mode connectivity, and in-context learning.

Authoritative Essay on "Deep Learning is Not So Mysterious or Different" (2503.02113)

Introduction

"Deep Learning is Not So Mysterious or Different" systematically deconstructs the prevailing narrative that deep neural networks exhibit unique, enigmatic generalization behavior unlike other model classes. The paper rigorously examines phenomena such as benign overfitting, double descent, and the effectiveness of overparametrization, positing that these attributes are not exclusive to deep learning and are readily reproduced and understood via simple linear models and established theoretical frameworks. The central thesis is that soft inductive biases, rather than hard architectural constraints, provide a coherent blueprint for achieving robust generalization across all model classes. The discourse is further expanded to delineate genuinely distinctive characteristics of deep networks, such as representation learning, mode connectivity, and universality in in-context learning.

Generalization Phenomena: Bridging Deep Learning and Simple Models

The paper presents compelling evidence that key generalization phenomena attributed to neural networks are neither exclusive nor mysterious. Benign overfitting and double descent are illustrated with high-order polynomials and linear models, emphasizing that these phenomena can be replicated outside deep learning. A 150th-order polynomial with order-dependent regularization demonstrates benign overfitting, fitting both structured and unstructured data while preserving generalization; Gaussian processes are shown to mimic CNN behavior on CIFAR-10, including the capacity to fit noisy labels while maintaining performance.

Figure 1: Generalization phenomena associated with deep learning can be reproduced with simple linear models and understood.

Additionally, double descent is observed in both ResNet architectures and linear random feature models—highlighting the phenomenon's independence from neural network-specific mechanisms.
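The polynomial construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact experiment: the penalty schedule, target function, and order 50 (rather than 150) are assumptions chosen for numerical comfort.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth cubic ground truth with a little label noise.
def f(x):
    return x**3 - x

n_train = 30
x_train = rng.uniform(-1, 1, n_train)
y_train = f(x_train) + 0.05 * rng.normal(size=n_train)

# A deliberately over-flexible polynomial basis: 51 basis functions
# for only 30 training points.
order = 50
def features(x):
    return np.vander(x, order + 1, increasing=True)

# Order-dependent ridge penalty: coefficient k is penalized in
# proportion to gamma**k, a soft preference for low-order structure.
# The schedule 0.01 * 1.5**k is an illustrative assumption.
penalties = 0.01 * 1.5 ** np.arange(order + 1)

Phi = features(x_train)
w = np.linalg.solve(Phi.T @ Phi + np.diag(penalties), Phi.T @ y_train)

# Despite far more basis functions than data points, held-out error
# stays close to the noise floor: flexibility plus a soft simplicity
# bias, rather than a hard restriction on the order.
x_test = np.linspace(-0.95, 0.95, 200)
test_mse = float(np.mean((features(x_test) @ w - f(x_test)) ** 2))
print(f"held-out MSE: {test_mse:.5f}")
```

A hard inductive bias would instead truncate the basis at low order; the soft version keeps all 51 terms available and lets the data recruit higher orders only when the fit genuinely demands it.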

Formal Generalization Frameworks: PAC-Bayes and Compressibility

The paper asserts that classical frameworks such as VC dimension and Rademacher complexity are insufficient to explain these phenomena due to their reliance on hypothesis space cardinality. By contrast, PAC-Bayes and countable-hypothesis bounds encapsulate the effect of compressibility and simplicity biases, offering non-vacuous generalization guarantees for models with millions or billions of parameters. The empirical risk and compressibility—quantified via Kolmogorov complexity—upper bound generalization, providing a procedural recipe for learning: maximize hypothesis flexibility while biasing towards compressible solutions.

Figure 2: Generalization phenomena can be formally characterized by generalization bounds anchored in empirical risk and compressibility.

This approach aligns with Solomonoff induction, which assigns exponential preference to simpler programs, ensuring that, even in maximally flexible hypothesis spaces, compressible (hence simple) solutions generalize.
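The shape of such a bound can be made concrete with the standard countable-hypothesis (Occam) form, which prices each hypothesis by its description length in bits. The specific numbers below are illustrative assumptions, not figures from the paper.

```python
import math

def countable_hypothesis_bound(train_err, n, desc_len_bits, delta=0.05):
    """Occam / countable-hypothesis bound: with probability >= 1 - delta,
    test error <= train error + sqrt((K*ln2 + ln(1/delta)) / (2n)),
    where K is the description length of the hypothesis in bits,
    a practical stand-in for Kolmogorov complexity."""
    complexity = desc_len_bits * math.log(2) + math.log(1 / delta)
    return train_err + math.sqrt(complexity / (2 * n))

# A model that compresses to 100 kilobits, trained on one million examples:
b_small = countable_hypothesis_bound(0.02, n=1_000_000, desc_len_bits=100_000)
# The same training error, but a 10-megabit description, gives a
# vacuous bound (greater than 1) on the same data:
b_large = countable_hypothesis_bound(0.02, n=1_000_000, desc_len_bits=10_000_000)
print(f"compressible model bound:   {b_small:.3f}")
print(f"incompressible model bound: {b_large:.3f}")
```

Parameter count never enters: only the bits needed to describe the trained solution, which is exactly why heavily overparametrized but compressible models can carry meaningful guarantees.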

Soft Inductive Biases: Beyond Restriction

A unifying theme is the concept of soft inductive biases, which prefer rather than restrict certain solutions, enabling flexible hypothesis spaces without deleterious overfitting. Multiple mechanisms implement soft biases: order-dependent regularization, architectural design (e.g., residual pathway priors), and implicit Bayesian priors. The paper presents order-dependent regularization in polynomials as a canonical example, showing that the model adaptively selects simpler explanations in low-data regimes and flexibly fits complex data as needed; regularized high-order polynomials perform as well or better than constrained models across varying data complexities and sizes.

Figure 3: Soft inductive biases enlarge the hypothesis space with preferences for particular solutions over others, enabling flexible learning without overfitting.

Figure 4: Achieving good generalization with soft inductive biases involves steering optimization toward preferred solutions within a rich hypothesis space.

Figure 5: Flexibility combined with simplicity bias enables appropriate model adaptation across diverse problem complexities and dataset sizes.

Overparametrization and Double Descent: Flexibility Amplifies Compression Bias

The paper provides a nuanced perspective on overparametrization, repudiating parameter counting as a valid complexity metric. Larger models not only increase flexibility but also enhance compressibility: flat minima occupy exponentially greater volumes in the parameter space and are more discoverable via optimization, fostering a simplicity bias. Highlighted empirical studies show that wide models manifest lower effective dimensionality and stronger compression (Kolmogorov complexity), ultimately improving generalization, even when training achieves zero loss.

Figure 6: Increasing parameters promotes generalization by increasing the volume of flat, compressible solutions.
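The effective dimensionality mentioned above can be sketched numerically from a Hessian spectrum: it counts directions whose curvature is large relative to a scale z. The spectra below are illustrative assumptions, not measured values from any trained network.

```python
import numpy as np

def effective_dimensionality(eigvals, z=1.0):
    """N_eff = sum_i lam_i / (lam_i + z): a soft count of Hessian
    directions with curvature large relative to the scale z."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + z)))

# A sharp minimum: every direction has high curvature.
sharp = np.full(1000, 100.0)

# A flat minimum with the same nominal parameter count: a handful of
# sharp directions, the rest nearly flat.
flat = np.concatenate([np.full(10, 100.0), np.full(990, 1e-3)])

print(effective_dimensionality(sharp))  # ~990: nearly all 1000 knobs matter
print(effective_dimensionality(flat))   # ~11: few knobs matter; compressible
```

Both minima live in the same 1000-parameter space, yet the flat one behaves like an 11-parameter model, which is the sense in which width can increase nominal size while decreasing effective complexity.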

Double descent is contextualized as a transition from classical bias-variance regimes to interpolation regimes, wherein further parameter increases favor simpler (low-norm) solutions among interpolants, improving generalization. The phenomenon is not unique to neural networks and has historical precedents in statistical models and random matrix theory.
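The linear-model version of this transition can be sketched with random features and a minimum-norm least-squares fit: test error peaks near the interpolation threshold (width equal to the number of training points) and falls again as width grows. The data-generating setup below is an illustrative assumption, not the paper's exact experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: a noisy linear function of 20-dimensional inputs.
d, n_train, n_test = 20, 100, 1000
beta = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ beta + 0.5 * rng.normal(size=n_train)
y_test = X_test @ beta

def random_feature_test_error(width, seed):
    """Minimum-norm least squares on random ReLU features."""
    W = np.random.default_rng(seed).normal(size=(d, width)) / np.sqrt(d)
    Phi_train = np.maximum(X_train @ W, 0.0)
    Phi_test = np.maximum(X_test @ W, 0.0)
    w = np.linalg.pinv(Phi_train) @ y_train  # min-norm (pseudoinverse) fit
    return float(np.mean((Phi_test @ w - y_test) ** 2))

# Sweep the number of random features through the interpolation
# threshold at width = n_train = 100.
errors = {k: random_feature_test_error(k, seed=1)
          for k in [20, 60, 100, 200, 2000]}
for k, e in errors.items():
    print(f"width {k:5d}: test MSE {e:10.2f}")
```

The second descent comes from the minimum-norm choice among interpolants: with many more features than data points, the pseudoinverse selects the lowest-norm (simplest) solution that fits the training set, exactly the soft simplicity bias discussed above.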

Genuine Distinctions of Deep Learning: Representation, Universality, and Mode Connectivity

Although generalization phenomena are not mysterious, deep networks remain distinctive in several respects:

  • Representation Learning: Neural networks adaptively learn basis functions, dynamically capturing relevant features for high-dimensional signals, surpassing fixed-kernel or basis models. This ability underpins successful extrapolation and nuanced similarity metrics tailored to complex tasks.
  • Universality and In-Context Learning: Deep networks, especially transformers, exhibit a compression bias closely matching real-world data distributions, resulting in universal applicability across modalities and tasks. In-context learning is robust, with pre-trained models achieving competitive zero-shot performance on diverse domains, explained via flexible mixture-of-experts analogies.
  • Mode Connectivity: Loss landscapes in deep networks exhibit connected modes, allowing traversal between independently trained solutions along simple curves with negligible loss increase. This property contradicts a conventional view of isolated local minima, facilitating model merging (e.g., SWA) and providing new avenues for understanding generalization.

Figure 7: Modes in deep neural network loss landscapes are connected along curves, not isolated, a property relatively distinct to deep networks.

Implications and Speculative Outlook

The findings challenge foundational assumptions in machine learning theory, advocating a move away from rigid restriction biases and parameter counting. Theoretical frameworks embracing compressibility and soft inductive biases offer scalable, robust guarantees for both neural and non-neural models. Practically, this supports development of model-agnostic learners and motivates research in adaptive priors and compressibility-aware architectures. Future directions include: deeper empirical evaluation of generalization bounds, exploring measures beyond Kolmogorov complexity, and optimizing soft bias implementations for computational efficiency.

Furthermore, distinctive properties such as representation learning and mode connectivity suggest continued progress toward universality and adaptive intelligence in AI, modulated by compressibility rather than explicit architecture constraints.

Conclusion

The paper "Deep Learning is Not So Mysterious or Different" articulates an integrative framework that demystifies generalization behavior observed in deep learning, grounding it in general principles applicable to a broad spectrum of models. Overparametrization, benign overfitting, and double descent are traced to soft inductive biases and compressibility, rather than inherent neural network properties. The theoretical and empirical analyses reinforce the universality of these concepts, while distinguishing neural networks in representation learning, mode connectivity, and in-context universality. This synthesis reframes deep learning as remarkable not for its mysterious generalization, but for its practical expressivity and adaptability, signaling a paradigm shift in both theoretical understanding and model design.

Explain it Like I'm 14

Deep Learning Is Not So Mysterious or Different — A Simple Explanation

Overview: What is this paper about?

This paper argues that some famous “weird” behaviors people see in deep learning—like huge models working well, models that fit random noise yet still do fine on real data, and test error going down-then-up-then-down again as models get bigger—are not actually mysterious or unique to deep neural networks. The author shows that:

  • These behaviors also show up in very simple models.
  • Longstanding theories can explain them clearly.
  • A key idea called soft inductive bias ties everything together.

The main questions the paper asks

  • Why do very large models (sometimes even larger than the dataset) still generalize well to new data?
  • How can a model perfectly fit random noise but still perform well on real, structured data (benign overfitting)?
  • Why does test error sometimes go down, then up, then down again as we increase model size (double descent)?
  • Do we need brand-new theory to explain deep learning, or do existing tools already work?

How the authors approach these questions

The paper uses both simple examples and established theory to make its case:

  • Simple models: The author shows that even a basic model like a high-degree polynomial (a curve with many wiggles) can behave like a deep net if you set it up with the right “preference” for simple solutions. Another example uses linear models with random features. These reproduce the same “mysterious” behaviors seen in deep learning.
  • Soft inductive bias: Instead of banning complicated solutions, you allow all possible solutions but gently prefer simpler ones. Think of it like a dress code that says “prefer plain T‑shirts,” not “T‑shirts only.” This “soft bias” can come from:
    • Regularization (penalties that make messy solutions less attractive),
    • Bayesian priors (preferences built into the model before seeing data),
    • Architecture choices that encourage simple patterns (like symmetry or compression).
  • Generalization theory that already exists:
    • PAC-Bayes and countable-hypothesis bounds: These are math tools that say your future test error is small if two things are true: your training error is small and your final model can be described simply (it’s compressible—like a small zip file).
    • Kolmogorov complexity: A fancy way to measure “how short is the description of your solution?” Shorter is simpler.
    • Marginal likelihood: A Bayesian score that lines up with these bounds and rewards solutions that explain the data well without being overly complicated.
  • A helpful intuition: effective dimensionality. Imagine your model has lots of knobs (parameters), but only some really matter for the final result. Effective dimensionality measures roughly how many knobs are “sharp” and need to be tuned. Flatter solutions (fewer sharp knobs) are simpler, compress better, and usually generalize better.
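The "small zip file" intuition above can be made literal with an off-the-shelf compressor. The two arrays below are illustrative stand-ins for trained weights, not weights from any real model.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Two "weight vectors" of identical size: one effectively simple
# (mostly zeros with a sparse regular pattern), one pure noise.
simple = np.zeros(10_000, dtype=np.int8)
simple[::100] = 3
noisy = rng.integers(-128, 128, size=10_000).astype(np.int8)

# Compressed size in bits is a crude, computable proxy for the
# description-length term in the generalization bounds above.
simple_bits = 8 * len(zlib.compress(simple.tobytes()))
noisy_bits = 8 * len(zlib.compress(noisy.tobytes()))
print(f"structured weights: {simple_bits} bits")
print(f"random weights:     {noisy_bits} bits")
```

Both arrays occupy 10,000 bytes in memory, but only the structured one shrinks under compression: raw size (like raw parameter count) says nothing about how simple the solution really is.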

What the paper finds and why it matters

1) Benign overfitting is not unique to deep nets

  • What people saw: Big neural networks can perfectly fit random labels but still do well on real tasks.
  • What the paper shows: Simple models can do this too if they have soft biases toward simple explanations. For real, structured data, the model naturally picks a simple pattern; for pure noise, it will still fit if forced, but that doesn’t mean it will overfit real structure.
  • Why this matters: You don’t need brand-new theory to explain this—PAC-Bayes and related ideas already do.

2) Overparameterization (having more parameters than data) can help

  • What people assume: More parameters means more overfitting.
  • What actually happens: Bigger models can be both more flexible and more biased toward simple, “flat” solutions. Bigger models often compress better after training and end up with fewer “effective” parameters. They contain many ways to fit the data well, and many of those ways are simple and generalize.
  • Why this matters: Counting parameters isn’t a good measure of how complex a model really is. Simplicity and compressibility matter more.

3) Double descent has a clear, non-mysterious explanation

  • The pattern: As model size grows, test error goes down (good), then up (overfitting), then down again (surprisingly good).
  • Intuition:
    • Small to medium size: The model learns useful structure (test error drops).
    • Mid-size “bump”: The model starts to overfit (test error rises).
    • Very large size: There are so many good, flat (simple) solutions that training naturally finds these compressible ones, and test error drops again.
  • Why this matters: This effect is not special to deep nets; simple linear models can show it too. The second descent is about finding simple, flat solutions in a vast space, not about “more flexibility = more overfitting.”

4) The right theory matches what we see in practice

  • PAC-Bayes and countable-hypothesis bounds say: expected test error ≤ training error + a simplicity/compression term.
  • This matches observations:
    • Large models that compress well (small file size, flat minima) can generalize.
    • These bounds can be made non-vacuous (actually meaningful) even for very large models, including LLMs.

5) Deep learning is still special in other ways

  • While generalization isn’t so mysterious, deep nets do stand out for:
    • Representation learning (discovering useful features),
    • Mode connectivity (different good solutions are connected in parameter space),
    • Broad, often universal usefulness,
    • Strong in-context learning in LLMs.

Why this is important

This paper tells us we don’t need to throw out decades of learning theory to understand why deep learning works. If we focus on:

  • letting models be flexible,
  • nudging them toward simpler explanations,
  • and measuring how compressible and flat their solutions are,

then the “mysteries” like benign overfitting, overparameterization, and double descent make sense. This helps guide how we design and train future models: build flexible systems with strong, soft preferences for simple, compressible solutions.

Key terms in plain language

  • Inductive bias: The model’s built-in preferences for certain types of solutions. Hard bias: strict rules (“must”). Soft bias: gentle preferences (“prefer”).
  • PAC-Bayes: A generalization framework that says you’ll do well on new data if you both fit the training data and keep the final solution simple/compressible.
  • Kolmogorov complexity: How short a description (like the shortest computer program) can produce your solution. Shorter = simpler.
  • Compressibility: If a trained model can be saved as a small file, it’s simpler and more likely to generalize.
  • Effective dimensionality: How many “truly important knobs” the solution is using. Fewer important knobs means flatter, simpler, more robust solutions.
  • Marginal likelihood: A Bayesian score that prefers models that can explain the data well without needing overly complicated settings.

Bottom line

Deep learning’s surprising behaviors aren’t so surprising after all. They appear in simple models too and can be explained by well-known ideas. The winning recipe is: allow lots of possibilities, but softly prefer the simple ones that fit the data—then good generalization follows.
