Adaptive Length Image Tokenization via Recurrent Allocation
Abstract: Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence, and even LLMs, which allocate varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object/part discovery.
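The recurrent allocation loop described above can be sketched in pseudocode-style Python. This is a minimal illustration, not the paper's implementation: `encode_step` and `reconstruction_error` are hypothetical stand-ins for the learned encoder-decoder and its quality measure, and the stopping tolerance `tol` is an assumed hyperparameter. Only the control flow — refine the 2D tokens, update the 1D latents, and grow capacity in steps until reconstruction is good enough or the 256-token budget is reached — reflects the method.

```python
def encode_step(image_tokens, latent_tokens):
    # Stand-in for one recurrent iteration: refine the 2D image tokens
    # and update the 1D latent tokens (trivial placeholder updates here).
    refined = list(image_tokens)
    updated = [t + 1 for t in latent_tokens]
    return refined, updated

def reconstruction_error(image_tokens, latent_tokens):
    # Stand-in quality measure: error shrinks as latent capacity grows.
    return max(0.0, 1.0 - len(latent_tokens) / 256)

def adaptive_tokenize(image_tokens, init_tokens=32, tokens_per_step=32,
                      max_tokens=256, tol=0.2):
    latents = [0] * init_tokens
    while True:
        image_tokens, latents = encode_step(image_tokens, latents)
        if reconstruction_error(image_tokens, latents) <= tol:
            break  # image is represented well enough; stop early
        if len(latents) >= max_tokens:
            break  # token budget exhausted
        latents += [0] * tokens_per_step  # grow representational capacity
    return latents

# With these stand-ins, a 14x14 grid of patch tokens stops before the cap,
# so the allocated count lands somewhere in the 32-256 range.
codes = adaptive_tokenize(list(range(196)))
print(len(codes))
```

In the actual model the stopping decision would come from a learned reconstruction objective rather than a fixed threshold, but the loop structure — iterate, check, append tokens — is what makes the token count adapt to image complexity.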