2-D SSM: A General Spatial Layer for Visual Transformers

Published 11 Jun 2023 in cs.CV and cs.LG (arXiv:2306.06635v1)

Abstract: A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2-D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation invariance. To address these goals, we leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible amount of additional parameters and inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias; for example, vision transformers equipped with our layer perform well even without positional encoding.
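To make the abstract concrete, here is a minimal sketch of the kind of 2-D state-space recurrence the multidimensional SSM builds on. This is a hypothetical scalar-state, naive-loop illustration, not the paper's actual layer: the paper's version uses an efficient parameterization, accelerated (convolutional) computation, and a normalization scheme that this toy omits.

```python
import numpy as np

def ssm_2d(u, a1, a2, b, c):
    """Naive 2-D state-space recurrence over an H x W input.

    x[i, j] = a1 * x[i-1, j] + a2 * x[i, j-1] + b * u[i, j]
    y[i, j] = c * x[i, j]

    A scalar state per spatial position; a1 and a2 control vertical and
    horizontal decay, giving a learned, position-aware spatial locality.
    """
    H, W = u.shape
    x = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            x[i, j] = b * u[i, j]
            if i > 0:
                x[i, j] += a1 * x[i - 1, j]
            if j > 0:
                x[i, j] += a2 * x[i, j - 1]
    return c * x

# A single impulse spreads causally down and to the right, decaying with
# distance -- an intuition for the 2-D inductive bias the paper targets.
u = np.zeros((4, 4))
u[0, 0] = 1.0
y = ssm_2d(u, a1=0.5, a2=0.5, b=1.0, c=1.0)
```

In the paper's setting, such a layer is placed at the beginning of each transformer block, where its output (rather than raw patch embeddings) feeds the attention sublayer; the recurrence above is what makes the model position-aware even without explicit positional encoding.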
