
SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding

Published 1 Apr 2024 in cs.CV and cs.MM | arXiv:2404.01174v2

Abstract: Temporal video grounding (TVG) is a critical task in video content understanding, requiring precise alignment between video content and natural language instructions. Despite significant advancements, existing methods face challenges in managing confidence bias towards salient objects and in capturing long-term dependencies in video sequences. To address these issues, we introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding. Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their complementary advantages in handling different aspects of the task. Specifically, we use SNNs to develop a spiking saliency detector that generates the proposal set. The detector emits spike signals when the input signal exceeds a predefined threshold, yielding a dynamic, binary saliency proposal set. To enhance the model's capability to retain and infer contextual information, we introduce relevant slots, which are learnable tensors that encode prior knowledge. These slots work with the contextual moment reasoner to balance preserving contextual information against dynamically exploring semantic relevance. The SSMs facilitate selective information propagation, addressing the challenge of long-term dependency in video content. By combining SNNs for proposal generation with SSMs for effective contextual reasoning, SpikeMba addresses confidence bias and long-term dependencies, thereby significantly enhancing fine-grained multimodal relationship capture. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods across mainstream benchmarks.
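The two mechanisms the abstract describes — threshold-triggered spike emission over saliency scores, and the linear state-space recurrence underlying Mamba-style selective propagation — can be sketched in a few lines. This is a minimal illustration of the general ideas, not the paper's implementation: the function names, the leaky-integrate membrane dynamics, and the scalar SSM parameters are all illustrative assumptions.

```python
def spiking_saliency(scores, threshold=0.5, decay=0.9):
    """Binary spike generation over a per-frame saliency score sequence.

    A membrane potential leakily accumulates the input; a spike (1) is
    emitted whenever the potential crosses `threshold`, after which the
    potential is hard-reset. Hypothetical sketch, not the paper's detector.
    """
    membrane = 0.0
    spikes = []
    for s in scores:
        membrane = decay * membrane + s  # leaky accumulation
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


def ssm_scan(x, a=0.95, b=1.0, c=1.0):
    """Minimal 1-D linear state-space recurrence:
    h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t.
    Mamba-style models make (a, b, c) input-dependent ("selective");
    constants are used here for brevity."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return ys
```

In a real model both pieces operate on learned feature sequences rather than raw scores, and the SSM parameters are produced per timestep from the input, which is what lets the recurrence selectively retain or discard long-range context.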
