
Scalable Neural Network Kernels

Published 20 Oct 2023 in cs.LG and cs.AI (arXiv:2310.13225v2)

Abstract: We introduce the concept of scalable neural network kernels (SNNKs), replacements for regular feedforward layers (FFLs) capable of approximating the latter, but with favorable computational properties. SNNKs effectively disentangle the inputs from the parameters of the neural network in the FFL, only to connect them in the final computation via the dot-product kernel. They are also strictly more expressive, since they allow modeling complicated relationships beyond functions of the dot-products of parameter-input vectors. We also introduce the neural network bundling process that applies SNNKs to compactify deep neural network architectures, resulting in additional compression gains. In its extreme version, it leads to the fully bundled network, whose optimal parameters can be expressed via explicit formulae for several loss functions (e.g. mean squared error), opening the possibility of bypassing backpropagation. As a by-product of our analysis, we introduce the mechanism of universal random features (URFs), applied to instantiate several SNNK variants and interesting on its own in the context of scalable kernel methods. We provide rigorous theoretical analysis of all these concepts as well as an extensive empirical evaluation, ranging from point-wise kernel estimation to Transformers' fine-tuning with novel adapter layers inspired by SNNKs. Our mechanism provides up to a 5x reduction in the number of trainable parameters, while maintaining competitive accuracy.


Summary

  • The paper presents SNNKs as a novel module that replaces traditional FFLs, reducing trainable parameters by up to 5x while maintaining competitive performance.
  • SNNKs leverage Universal Random Features and Fourier transforms to approximate FFL operations, enhancing computational efficiency and lowering storage needs.
  • The approach enables bundling multiple FFLs into a single efficient kernel layer, benefiting resource-constrained applications in NLP and image recognition.

Scalable Neural Network Kernels

Introduction

The paper introduces Scalable Neural Network Kernels (SNNKs) as a novel computational module designed to replace traditional feedforward layers (FFLs). This approach enables efficient computation by disentangling inputs from model parameters and connecting them only in a final dot-product computation through a kernel function. SNNKs offer several benefits, including model compression, computational efficiency, and theoretical insights into neural network architecture (Figure 1).

Figure 1: Architecture of the SNNK layer.

SNNK Design and Implementation

The SNNK module approximates traditional FFL operations using random feature maps that transform the input data and the layer parameters separately, connecting them only through a final dot product. If the number of random features $m$ is much smaller than the input dimension $d$, both the computational complexity and the storage requirements are notably reduced relative to a standard FFL.
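A minimal NumPy sketch of this template follows; here phi and psi are abstract placeholders (the paper instantiates them with URFs), and snnk_layer is a hypothetical helper name used only for illustration:

    import numpy as np

    def snnk_layer(x, W, b, phi, psi):
        # Illustrative SNNK template: approximate the FFL output f(W @ x + b)
        # coordinate-wise by <phi(x), psi(w_i, b_i)>, where phi and psi map
        # into a shared m-dimensional feature space.
        feat_x = phi(x)                                         # shape (m,)
        feat_W = np.stack([psi(w, bi) for w, bi in zip(W, b)])  # shape (n, m)
        return feat_W @ feat_x                                  # shape (n,)

In practice the parameter features feat_W can be stored and trained directly as an (n, m) matrix, so both the matrix-vector product and the trainable state scale with m rather than with the input dimension d.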

These kernels rely on Universal Random Features (URFs) to achieve their approximation of FFLs, an approach that leverages Fourier transforms of activation functions to construct the mappings $\Phi_f$ and $\Psi_f$. This mechanism provides scalable kernel methods that are independent of specific FFL design, offering adaptability and efficiency across various architectures.
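The exact URF construction is given in the paper; as a grounded illustration of the pattern it generalizes, the classical random Fourier feature estimator of Rahimi & Recht (2007) recovers the Gaussian kernel as a dot product of randomized cosine features:

    import numpy as np

    def gaussian_rff(x, y, m=4096, sigma=1.0, seed=0):
        # Classical random Fourier features: in expectation,
        # phi(x) . phi(y) = exp(-||x - y||^2 / (2 * sigma^2)).
        rng = np.random.default_rng(seed)
        omega = rng.normal(scale=1.0 / sigma, size=(m, x.shape[0]))
        beta = rng.uniform(0.0, 2.0 * np.pi, size=m)
        phi = lambda v: np.sqrt(2.0 / m) * np.cos(omega @ v + beta)
        return float(phi(x) @ phi(y))

    x, y = np.ones(8), np.zeros(8)
    print(gaussian_rff(x, y))  # close to np.exp(-4.0) ~= 0.0183

URFs extend this featurize-then-dot-product idea beyond shift-invariant kernels to the kernels induced by activation functions, which is what lets them serve as drop-in approximations of arbitrary FFLs.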

Empirical Evaluation

SNNKs have been empirically assessed in multiple settings, showing effective compression while maintaining competitive accuracy. The mechanism enables up to a 5x reduction in trainable parameters without significant compromise in performance.

Performance on benchmark datasets, such as GLUE for NLP tasks and CIFAR for image recognition, demonstrates that SNNKs can effectively serve as a drop-in replacement for standard FFLs. Notably, SNNK-driven models maintain high accuracy and improve parameter efficiency across different tasks (Figure 2).

Figure 2: Comparison of trainable parameters between various layers/modules and the drop-in replacement SNNK layers.

Neural Network Bundling Process

A significant contribution of the paper is the concept of neural network bundling. Bundling is a method by which multiple FFLs are replaced with a single, efficient SNNK implementation, compressing the depth and parameter load of deep networks. This approach can directly benefit both the inference and training phases of neural networks, offering potential theoretical advantages by providing explicit parameter solutions under certain loss functions.
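The closed-form claim follows from linearity in the trainable parameters: a fully bundled network computes its output as a single matrix applied to a fixed, parameter-independent feature map, so under mean squared error the optimum is a regularized least-squares solution. Below is a minimal sketch, assuming a generic precomputed feature matrix Phi of shape (N, m) for N training points (the paper's actual bundled map is URF-based, and fit_bundled is an illustrative name):

    import numpy as np

    def fit_bundled(Phi, Y, lam=1e-3):
        # Illustrative closed-form step: solve
        # min_Theta ||Phi @ Theta - Y||^2 + lam * ||Theta||^2
        # exactly, bypassing backpropagation entirely.
        m = Phi.shape[1]
        return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ Y)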

Practical Use Cases

The practical implications of SNNKs include their integration into Transformers for NLP tasks, where fine-tuning is performed through efficient adapter layers that notably reduce the trainable parameter count. Furthermore, SNNKs' ability to decrease computational requirements makes them suitable for real-world applications constrained by resources, such as mobile and embedded systems.
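A schematic of what such an SNNK-inspired adapter could look like in PyTorch (an assumed design for illustration, not the paper's exact module): the hidden state passes through a frozen random feature map, and only the small output projection is trained.

    import torch
    import torch.nn as nn

    class SNNKAdapter(nn.Module):
        # Illustrative SNNK-style adapter (assumed design): a frozen
        # random projection followed by a nonlinearity plays the role
        # of the input feature map.
        def __init__(self, d_model, num_features=32):
            super().__init__()
            proj = torch.randn(d_model, num_features) / d_model ** 0.5
            self.register_buffer("proj", proj)           # frozen random projection
            self.out = nn.Linear(num_features, d_model)  # the only trainable part
        def forward(self, h):
            feats = torch.cos(h @ self.proj)  # randomized feature map of the input
            return h + self.out(feats)        # residual connection, as in adapters

Compared with a standard two-projection bottleneck adapter of the same width, only one projection remains trainable here, which is the kind of saving behind the paper's reported up-to-5x reduction in trainable parameters.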

Conclusion

SNNKs present a promising direction for scaling efficient neural networks, combining the power of kernel methods with modern neural architecture design. This approach not only seeks computational efficiency but also provides new avenues for model interpretability and theoretical exploration. The realization of SNNKs in numerous applications demonstrates their viability as a replacement for traditional layers, setting a substantial precedent for further research into efficient neural computation.
