- The paper demonstrates that multi-scale adaptive cropping (MSAC) significantly reduces the semantic sawtooth effect by preserving object integrity in high-resolution images.
- The paper introduces a scale compression mechanism (SCM) that efficiently drops less significant tokens, lowering computational costs without sacrificing performance.
- The paper reports state-of-the-art results on OCRBench and multiple multimodal and document understanding benchmarks, underscoring its practical impact.
Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping
The paper "Mini-Monkey: Alleviate the Sawtooth Effect by Multi-Scale Adaptive Cropping" improves lightweight multimodal LLMs (MLLMs) at high-resolution image processing through a multi-scale adaptive cropping strategy (MSAC). The authors—Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, and Xiang Bai—propose Mini-Monkey, a lightweight MLLM that addresses the shortcomings of the cropping strategies traditionally used in MLLMs.
Introduction
The primary motivation behind this work stems from current MLLMs' limited ability to handle high-resolution images. Traditional cropping methods often split small or irregularly shaped objects across crop boundaries, impairing the model's ability to recognize detailed scene elements. This issue, referred to as the "sawtooth effect," is particularly detrimental to lightweight MLLMs. To overcome it, the authors introduce Mini-Monkey, which integrates MSAC as a plug-and-play method and incorporates a Scale Compression Mechanism (SCM) to mitigate the computational overhead that MSAC introduces.
Methodology
Multi-Scale Adaptive Cropping Strategy (MSAC)
MSAC enhances the image preprocessing pipeline by generating multi-scale representations: the same image is cropped under several aspect-ratio and resolution configurations, so that an object segmented at one scale remains intact at another, and the model can draw on whichever scale preserves it—without a large increase in computational cost. Unlike existing single-scale cropping strategies, this stratified approach maintains semantic consistency and is particularly effective for detailed scene understanding.
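To make the idea concrete, here is a minimal sketch of multi-scale crop selection: for each of several scale tiers (maximum crop counts), pick the crop grid whose aspect ratio best matches the image, yielding complementary tilings of the same image. The tier values, tie-breaking, and function names are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of MSAC-style grid selection (not the paper's code).
from itertools import product

def candidate_grids(max_tiles):
    """All (rows, cols) grids using at most max_tiles crops."""
    return [(r, c) for r, c in product(range(1, max_tiles + 1), repeat=2)
            if r * c <= max_tiles]

def best_grid(width, height, max_tiles):
    """Grid whose aspect ratio (cols/rows) is closest to the image's."""
    img_ratio = width / height
    return min(candidate_grids(max_tiles),
               key=lambda rc: abs(rc[1] / rc[0] - img_ratio))

def multi_scale_grids(width, height, tiers=(4, 9, 16)):
    """One crop grid per scale tier; together they tile the image
    at several complementary granularities."""
    return [best_grid(width, height, t) for t in tiers]

# A 1344x896 image (3:2): coarse tiers give a near-global view, finer
# tiers settle on a 2x3 grid that matches the aspect ratio exactly.
print(multi_scale_grids(1344, 896))
```

Because each tier produces its own tiling, an object cut by a crop boundary in one grid is likely to fall inside a single crop in another, which is the property MSAC exploits.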
Scale Compression Mechanism (SCM)
To address the computational overhead introduced by multi-scale cropping, SCM is introduced as a training-free, parameter-free compression module. SCM leverages the MLLM's already well-trained attention layers to generate attention weights and selectively drops the less significant tokens according to those weights. This compresses the token sequence efficiently, maintaining high performance without substantially increasing computational requirements.
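A minimal sketch of attention-guided token dropping follows. Each high-resolution token is scored by the total attention it receives from a query set of tokens, and only the top-k are kept. The query source, shapes, `keep_ratio`, and function names are assumptions for illustration; the paper's exact attention-layer reuse may differ.

```python
# Hypothetical sketch of SCM-style token compression (not the paper's code).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(hi_tokens, query_tokens, keep_ratio=0.5):
    """Keep the keep_ratio fraction of hi_tokens that receive the most
    attention from query_tokens; no training or new parameters needed."""
    d = hi_tokens.shape[-1]
    # scaled dot-product attention of each query over the hi-res tokens
    attn = softmax(query_tokens @ hi_tokens.T / np.sqrt(d), axis=-1)
    scores = attn.sum(axis=0)                # total attention per token
    k = max(1, int(keep_ratio * len(hi_tokens)))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k, original order kept
    return hi_tokens[keep]

rng = np.random.default_rng(0)
hi = rng.standard_normal((16, 32))   # 16 high-res tokens, dim 32
q = rng.standard_normal((4, 32))     # 4 query tokens
print(compress_tokens(hi, q, keep_ratio=0.25).shape)  # (4, 32)
```

The key design point is that the scoring reuses attention the model already computes, so compression adds no learned parameters and no extra training.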
Experimental Results
The empirical evaluation demonstrates that Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs on various general multimodal understanding tasks and document understanding benchmarks. Key highlights include:
- OCRBench: Mini-Monkey achieves a score of 802, outperforming the 8B-parameter model InternVL2-8B.
- General Multimodal Understanding: Demonstrates leading performance on benchmarks like MathVista, RealWorldQA, AI2D, SEED Image, and others.
- Document Understanding: Shows significant improvements in benchmarks such as DocVQA, ChartQA, InfoVQA, TextVQA, and OCRBench.
Implications and Future Prospects
The implications of this research are multifaceted. Practically, Mini-Monkey offers an efficient yet robust solution for enhancing lightweight MLLMs, making them more viable for real-world applications where computational resources are limited. Theoretically, the introduction of multi-scale adaptive cropping and targeted token compression provides a new direction for optimizing high-resolution image processing in MLLMs.
Future developments could extend the MSAC and SCM techniques to other architectures and explore further optimizations in token selection and representation fusion. Additionally, the principles established in this work could pave the way for more granular control over semantic consistency in image processing tasks, ultimately contributing to the development of more sophisticated multimodal systems.
Conclusion
Mini-Monkey represents a significant advancement in the capability of lightweight multimodal LLMs to process high-resolution images. By addressing the sawtooth effect through innovative cropping and compression strategies, this work not only improves performance on existing benchmarks but also sets a new precedent for efficient image preprocessing in MLLMs. The findings presented in this paper underscore the importance of adaptive, multi-scale approaches in high-resolution image understanding within the field of multimodal AI.