TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration
Abstract: Active Visual Exploration (AVE) optimizes the use of robotic resources in real-world scenarios by sequentially selecting the most informative observations. However, modern AVE methods carry a high computational budget because they repeatedly pass the same observations through transformer autoencoders. As a remedy, we introduce TOken REcycling (TORE), a novel approach to AVE that divides the encoder into an extractor and an aggregator. The extractor processes each observation separately, so its output tokens can be cached and reused by the aggregator. To further reduce computation, we shrink the decoder to a single block. Through extensive experiments, we demonstrate that TORE outperforms state-of-the-art methods while reducing computational overhead by up to 90%.
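The caching scheme the abstract describes can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: `extract` and `aggregate` are stand-ins for the extractor and aggregator components, and all names are invented for this sketch.

```python
# Hedged sketch of token recycling across sequential glimpses.
# `extract` and `aggregate` are hypothetical stand-ins for the
# extractor/aggregator split described in the abstract.

def extract(glimpse):
    # Stand-in for the extractor: per-glimpse token computation.
    return [f"tok({glimpse})"]

def aggregate(tokens):
    # Stand-in for the aggregator: fuses tokens from all glimpses so far.
    return len(tokens)

class TokenRecyclingEncoder:
    """Caches extractor outputs so each glimpse is processed only once."""

    def __init__(self):
        self.cache = []          # recycled tokens from earlier glimpses
        self.extract_calls = 0   # counts actual extractor invocations

    def step(self, glimpse):
        self.extract_calls += 1
        self.cache.extend(extract(glimpse))  # extractor runs once per glimpse
        return aggregate(self.cache)         # aggregator sees all cached tokens

enc = TokenRecyclingEncoder()
for g in ["g1", "g2", "g3"]:
    prediction = enc.step(g)

# Without recycling, step t would re-run the extractor on all t glimpses
# seen so far (1 + 2 + 3 = 6 calls here); with caching it runs 3 times.
assert enc.extract_calls == 3
```

The saving grows with the number of glimpses: a naive pipeline does O(T^2) extractor passes over T steps, while the cached variant does O(T).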