Deformable Audio Transformer for Audio Event Detection
Abstract: Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity of self-attention has limited their application, especially in low-resource settings and on mobile or edge devices. Existing works reduce this cost with hand-crafted attention patterns, but such patterns are data-agnostic and may be suboptimal: relevant keys or values can be pruned while less important ones are preserved. Motivated by this observation, we propose DATAR, a deformable audio Transformer for audio recognition that combines a learnable deformable attention mechanism with a pyramid transformer backbone. This architecture is effective on prediction tasks, e.g., event classification. We further observe that the deformable attention map computation may over-simplify the input feature, and we introduce a learnable input adaptor to alleviate this issue, with which DATAR achieves state-of-the-art performance.
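To make the complexity argument concrete, the sketch below illustrates the core idea of deformable attention: instead of attending to all H×W positions (quadratic cost), each query predicts a few sampling offsets and attends only to features bilinearly interpolated at those locations. This is a minimal single-head NumPy sketch, not the paper's implementation; the toy shapes, the random-initialized projections, and the offset-predictor form are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy time-frequency feature map (H, W, C), standing in for a spectrogram feature.
H, W, C = 8, 8, 16
feat = rng.standard_normal((H, W, C))

def bilinear_sample(fm, y, x):
    """Bilinearly interpolate fm (H, W, C) at fractional coordinates (y, x)."""
    h, w, _ = fm.shape
    y, x = np.clip(y, 0, h - 1), np.clip(x, 0, w - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * fm[y0, x0] + (1 - wy) * wx * fm[y0, x1]
            + wy * (1 - wx) * fm[y1, x0] + wy * wx * fm[y1, x1])

def deformable_attention(fm, n_points=4, seed=1):
    """Each query attends to n_points offset locations instead of all H*W keys,
    so cost scales linearly in the number of queries."""
    rng = np.random.default_rng(seed)
    h, w, c = fm.shape
    Wq = rng.standard_normal((c, c)) / np.sqrt(c)
    Wk = rng.standard_normal((c, c)) / np.sqrt(c)
    Wv = rng.standard_normal((c, c)) / np.sqrt(c)
    W_off = rng.standard_normal((c, 2 * n_points)) * 0.01  # offset predictor

    out = np.zeros_like(fm)
    for i in range(h):
        for j in range(w):
            q = fm[i, j] @ Wq
            offsets = (fm[i, j] @ W_off).reshape(n_points, 2)
            keys, vals = [], []
            for dy, dx in offsets:  # sample keys/values at deformed positions
                s = bilinear_sample(fm, i + dy, j + dx)
                keys.append(s @ Wk)
                vals.append(s @ Wv)
            keys, vals = np.stack(keys), np.stack(vals)
            logits = keys @ q / np.sqrt(c)
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()
            out[i, j] = weights @ vals
    return out

out = deformable_attention(feat)
print(out.shape)  # (8, 8, 16)
```

In a trained model the offset predictor `W_off` is learned end-to-end, so the sampling locations adapt to the data, unlike a fixed hand-crafted sparse pattern.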