- The paper establishes that every model trained by gradient descent, including deep networks, can be approximated as a kernel machine via path kernels.
- It demonstrates that path kernels integrate gradient similarities over learning trajectories to capture model behavior.
- This insight enhances interpretability and scalability, uniting deep learning with traditional kernel methods.
Deep Networks Are Kernel Machines
Pedro Domingos' paper "Every Model Learned by Gradient Descent Is Approximately a Kernel Machine" provides significant insights into the fundamental nature of deep networks and their training via gradient descent. The paper posits that deep learning models, often seen as fundamentally different from kernel machines, can be reinterpreted through the lens of kernel methods. This reinterpretation has substantial implications for our understanding and practical application of machine learning techniques.
Summary of Core Findings
Domingos demonstrates that models trained by gradient descent can be approximated as kernel machines employing what he terms "path kernels." The path kernel is the integral, taken along the trajectory that gradient descent traces through parameter space, of the dot product of the model's gradients at two data points.
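Concretely, the result can be sketched as follows (notation paraphrased from the paper; the coefficients a_i are weighted averages of loss derivatives along the training path and b is the output of the initial model, so consult the original for the exact statement and conditions):

```latex
% Kernel-machine form of a gradient-descent-trained model (sketch).
% y(x): model output, w: weights, c(t): the gradient-descent path in weight space.
y(x) \;\approx\; \sum_{i=1}^{m} a_i \, K_{\mathrm{path}}(x, x_i) + b,
\qquad
K_{\mathrm{path}}(x, x') \;=\; \int_{c(t)} \nabla_w y(x) \cdot \nabla_w y(x') \, dt
```

The kernel thus depends not only on the architecture but on the entire optimization trajectory, which is what distinguishes it from a fixed, predefined kernel.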
Key insights from the paper include:
- Approximate Equivalence to Kernel Machines: It is mathematically shown that models trained via gradient descent, regardless of their architecture, behave approximately like kernel machines using path kernels, with the approximation becoming exact in the limit of infinitesimally small gradient descent steps.
- Definition of Path Kernels: Path kernels measure the similarity of model gradients at different data points, effectively capturing how similarly two points influence the learning trajectory.
- Interpretability: Deep network weights can be understood as a superposition of training examples in gradient space, enhancing the interpretability of these models.
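To make the path kernel concrete, here is a minimal numerical sketch (illustrative code, not from the paper): a one-parameter tanh model is trained by gradient descent, the weight trajectory is recorded, and the path kernel is approximated by summing gradient dot products over that trajectory. All names and the toy dataset are hypothetical.

```python
import math

# Toy model: f(x; w) = tanh(w * x), a single scalar weight.
def model(w, x):
    return math.tanh(w * x)

def grad_w(w, x):
    # d/dw tanh(w*x) = (1 - tanh(w*x)^2) * x
    return (1 - math.tanh(w * x) ** 2) * x

# Hypothetical training data: (input, target) pairs.
data = [(0.5, 0.4), (-1.0, -0.7), (2.0, 0.9)]

w = 0.1          # initial weight
lr = 0.05        # learning rate
path = [w]       # record the gradient-descent trajectory in weight space
for _ in range(200):
    # Gradient of mean squared error over the dataset.
    g = sum((model(w, x) - y) * grad_w(w, x) for x, y in data) / len(data)
    w -= lr * g
    path.append(w)

def path_kernel(x1, x2):
    """Discrete approximation of the path kernel:
    K(x1, x2) ~ sum over steps t of grad_w f(x1; w_t) * grad_w f(x2; w_t) * dt,
    with the step size lr playing the role of dt."""
    return sum(grad_w(wt, x1) * grad_w(wt, x2) for wt in path) * lr

# The kernel is symmetric by construction: K(x1, x2) == K(x2, x1).
print(path_kernel(0.5, 2.0), path_kernel(1.0, 1.0))
```

In a real network the per-example gradients would come from automatic differentiation, and the trajectory would be the sequence of weight checkpoints saved during training; the structure of the computation is the same.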
Implications for Deep Learning and Kernel Methods
For the field of deep learning:
- Representation Learning: This finding questions the common assumption that deep learning discovers features in a way fundamentally distinct from methods with predefined features. In this view, the effective features are gradients of a predefined function (the network itself) with respect to its parameters, so deep learning resembles kernel methods more closely than is usually assumed.
- Model Interpretability: By understanding deep networks as kernel machines, we gain straightforward interpretations of network weights, framing them as superpositions of gradients from training data.
For kernel methods:
- Path Kernels: These provide a flexible way to encode domain-specific knowledge into the kernel, incorporating architecture-specific insights from deep learning into kernel machines.
- Scalability: The equivalence suggests a way around the scalability constraints traditionally associated with kernel machines: the superposition of training examples is stored implicitly in the network weights, removing the need to compute and store an explicit Gram matrix and leveraging the efficiency of deep network training.
Theoretical and Practical Impacts
The theoretical implications are profound:
- Bridging Deep Learning and Kernel Methods: This work forms a conceptual bridge, uniting deep learning's empirical success with the well-articulated mathematical framework of kernel machines.
In practical terms:
- Kernel Approximation: Given the equivalence established, one can employ deep learning techniques to approximate and enhance kernel methods.
- Hardware Utilization: Specialized hardware that accelerates deep learning can now be effectively used to train kernel machines as well, opening the door for more scalable machine learning solutions.
Future Research Directions
The paper's findings suggest several directions for future research:
- Improving Gradient Descent: Viewing gradient descent as a method for learning path kernel machines may lead to novel ways to optimize and improve this fundamental optimization algorithm.
- Developing Alternative Algorithms: Exploring methods beyond gradient descent that can form useful superpositions for prediction might yield new machine learning algorithms with improved performance.
- Extensions to Other Training Procedures: Investigating whether similar equivalences hold for models trained with optimization strategies other than gradient descent could further unify diverse machine learning approaches under a shared theoretical framework.
Conclusion
Domingos' analysis shows that models learned via gradient descent, deep networks included, can be understood approximately as kernel machines built on path kernels. This realization deepens our understanding of machine learning models and challenges traditional views of deep learning's unique capabilities. By establishing this approximate equivalence, the paper offers a foundation for more interpretable models and scalable algorithms, potentially driving future innovation in both the theory and practice of machine learning.