ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

Published 26 Apr 2026 in cs.CV and cs.CL | (2604.23813v1)

Abstract: Multimodal LLMs (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: the models are effective on intact documents; however, once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.


Authors (9)

Summary

The paper introduces ShredBench, a novel benchmark for assessing semantic reasoning in multimodal LLMs through shredded document reconstruction.
The paper details an automated pipeline that synthesizes realistic document fragmentation using 3D rendering, Voronoi tessellation, and simulated deformations.
The paper reveals that current MLLMs exhibit drastic performance degradation with increased fragmentation, highlighting significant gaps in visual and textual integration.

ShredBench: A Benchmark for Semantic Reasoning in Multimodal LLMs via Shredded Document Reconstruction

Introduction

"ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction" (2604.23813) introduces a rigorous benchmark for probing the semantic and cross-modal reasoning robustness of state-of-the-art Multimodal LLMs (MLLMs) in scenarios where documents are not merely noisy or blurred but physically fragmented ("shredded"). ShredBench delivers a scalable, automated pipeline for generating such benchmarks and provides detailed empirical analysis across multiple data modalities (text, code, and tables), languages, and fragmentation levels. The evaluation exposes severe deficiencies in current leading MLLMs’ ability to integrate visual and semantic cues under significant structural disruption, establishing a critical empirical baseline for future research in VRDU robustness.

The ShredBench Benchmark Design

Data Acquisition and Fragmentation Pipeline

ShredBench encompasses diverse content domains: English/Chinese news, source code (Python, C++, Java), and scientific tables. Content is synthetically "shredded" via a physically realistic 3D rendering pipeline, involving:

High-resolution document rendering: Source text is rendered with randomized noise and fonts for realism.
Voronoi-based tessellation: Documents are partitioned into 8, 12, or 16 irregular fragments, simulating the uneven pieces produced by manual shredding.
3D deformation and simulation: Fragments are subjected to crumpling, lighting, and rotation via Blender, further suppressing low-level visual shortcuts.
Task formulation: Models are tasked with reconstructing the original text sequence from the unordered, possibly rotated and occluded fragments.

The generation process is fully automated and supports integration of recent or previously unseen documents, which helps keep the evaluation data free of training-data contamination.
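The tessellation step can be illustrated with a minimal sketch. It is not the authors' pipeline (which renders fragments in 3D with Blender); it only shows the core idea of a discrete Voronoi partition, where each pixel of a page is assigned to its nearest seed point. The function names `voronoi_labels` and `fragment_mask`, the uniform seed placement, and the page size are all assumptions made for illustration.

```python
import random


def voronoi_labels(height, width, n_fragments, seed=0):
    """Label each pixel with the index of its nearest seed point.

    Hypothetical helper: seeds are drawn uniformly over the page, which is
    an assumption, not a detail taken from the paper.
    """
    rng = random.Random(seed)
    seeds = [(rng.uniform(0, height), rng.uniform(0, width))
             for _ in range(n_fragments)]
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Squared Euclidean distance to every seed; the closest one wins.
            dists = [(y - sy) ** 2 + (x - sx) ** 2 for sy, sx in seeds]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels


def fragment_mask(labels, k):
    """Boolean mask selecting the pixels that belong to fragment k."""
    return [[lab == k for lab in row] for row in labels]


# Partition a small 20 x 30 "page" into 8 fragments, the coarsest granularity.
labels = voronoi_labels(20, 30, 8)
masks = [fragment_mask(labels, k) for k in range(8)]
```

Because every pixel receives exactly one label, the masks cover the page without overlap, so the fragments jointly contain all of the original content.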

Figure 1: ShredBench’s pipeline synthesizes shredded document images from diverse content and simulates realistic physical artifacts.

The resulting dataset comprises 756 documents across four scenarios (English, Chinese, code, tables) and three fragmentation granularities, with human quality control confirming solvability (inter-annotator agreement, Cohen's $\kappa = 0.79$).

Dataset Properties

Input length distributions are balanced across domains, yielding a challenging range of real-world-like document complexities.

Figure 2: ShredBench dataset input lengths, indicating broad coverage across document categories.

Evaluation Metrics and Protocols

The task is formally a set-to-sequence mapping from a set of unordered fragment images $\mathcal{I} = \{f_1,\ldots,f_N\}$ to a target text sequence $\hat{T}$. Evaluation employs:

Normalized Edit Distance (NED): Token-level edit distance, lower is better.
TEDS: Structure- and content-aware, mainly for tables.
BLEU / ROUGE-L: Standard text-generation metrics based on n-gram overlap (BLEU) and longest-common-subsequence overlap (ROUGE-L).
CodeBLEU: AST- and data-flow-aware, critical for precise structural assessment of code restoration.
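To make the first metric concrete, here is a minimal sketch of a normalized edit distance. The summary states only that NED is token-level and that lower is better; dividing the Levenshtein distance by the length of the longer sequence, and tokenizing on whitespace, are assumptions that may differ from the paper's exact definition.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ned(pred_tokens, ref_tokens):
    """Normalized edit distance in [0, 1]; 0 means identical sequences.

    Normalizing by the longer length is an assumption for this sketch.
    """
    longest = max(len(pred_tokens), len(ref_tokens))
    if longest == 0:
        return 0.0
    return levenshtein(pred_tokens, ref_tokens) / longest


# One missing token out of four reference tokens.
score = ned("the cat sat".split(), "the cat sat down".split())  # 0.25
```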

Decoding protocols enforce zero-shot, deterministic settings, and postprocessing removes all extraneous markup and whitespace.
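The postprocessing step might look like the sketch below. The summary does not say which markup or whitespace counts as extraneous, so the rules here (dropping Markdown fence lines, trimming trailing spaces, and removing leading and trailing blank lines while keeping indentation) are assumptions, and `normalize_output` is a hypothetical name.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this example readable


def normalize_output(text):
    """Strip fence lines and surrounding whitespace from a model response.

    Indentation is preserved on purpose, since it carries meaning in
    whitespace-dependent languages such as Python.
    """
    lines = [ln.rstrip() for ln in text.splitlines()
             if not ln.strip().startswith(FENCE)]
    # Drop blank lines at the start and end of the response.
    while lines and lines[0] == "":
        lines.pop(0)
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


raw = FENCE + "python\n\ndef f():   \n    return 1\n" + FENCE + "\n\n"
clean = normalize_output(raw)
```

Applying the same normalization to both prediction and reference before scoring keeps the metrics from penalizing formatting wrappers rather than restoration errors.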

Empirical Results

Aggregate Model Results

Evaluation covers 14 MLLMs (GPT-5, Gemini 3, Qwen-VL, InternVL, Mistral3-Reas, DeepSeek-OCR, etc.). Both proprietary and open-source models are evaluated across all scenarios.

Intact document performance is consistently high, but reconstruction scores collapse dramatically as fragmentation increases, with NED rising and BLEU/ROUGE falling across all models.
Gemini 3 Pro exhibits the best overall performance (NED 0.33 at 8 fragments; ROUGE 0.83), but even this model shows major degradation with higher fragmentation, especially on more structured or dense tasks.

Figure 3: Radar chart of ShredBench performance, highlighting large gaps in MLLMs’ semantic reasoning under fragmentation.

Modality and Language-specific Performance

Natural Language: Models achieve better reconstruction in English than Chinese, a consequence of logogram density and segmentation sensitivity. Minor physical tears in Chinese often obliterate semantic content, compounding metric penalties.
Source Code: AST-aware metrics reveal that even state-of-the-art models routinely fail to reconstruct logical order and indentation, especially in whitespace-dependent languages (Python). CodeBLEU scores for proprietary models (e.g., Gemini 3 Pro: 0.77 on Python at N=8) significantly exceed those of open-source models.
Tables: Rigid 2D structures exacerbate model weaknesses; Gemini 3 Flash, despite inferior text performance, surpasses Pro models on tabular data.

Granularity and Robustness

Performance decays linearly with increasing fragment count for most models. Advanced proprietary models decay more slowly, suggesting that scaling model size and architectural complexity improves, but does not solve, long-context, cross-fragment reasoning.

Qualitative Error Analysis

Case studies reveal characteristic failure modes.

Visual-Semantic Bridging: Success is marked by the model recovering contiguous tokens across physically bisected fragments, and partially preserving layout.
Ordering and Layout Errors: Models misplace code lines or hallucinate paragraph boundaries based on misleading spatial gaps.

Figure 4: Example of a successful news reconstruction; minor layout segmentation error and complete token recovery across fragments.

Figure 5: Code reconstruction failure; ordering errors and loss of narrow-line fragments typically discarded as visual noise.

Control Experiments: Semantic Reasoning vs. Visual Matching

Ablation tests on randomized "nonsense" text demonstrate that models’ performance collapses without semantic context (e.g., Gemini 3 Pro ROUGE drops from 0.73 to 0.33 at 16 fragments), confirming that ShredBench cannot be solved by visual jigsaw assembly alone and requires genuine semantic bridging.
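One way to build such a control input is sketched below. The summary does not describe how the "nonsense" text was generated, so this scheme is an assumption: each ASCII letter is replaced by a random letter of the same case, which keeps word lengths, punctuation, and layout intact while removing the semantic cues a model could use to bridge fragments. The name `scramble_text` is hypothetical.

```python
import random
import string


def scramble_text(text, seed=0):
    """Replace every ASCII letter with a random letter of the same case.

    Digits, punctuation, whitespace, and non-ASCII characters are left
    unchanged, so the rendered page keeps the same visual structure.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)


original = "Models must bridge torn words, line by line."
control = scramble_text(original)
```

If a model scores similarly on `original` and `control` after shredding, it is likely matching fragment edges visually; a large drop on the control suggests that it relies on language priors.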

Implications and Future Directions

The results illustrate a clear limitation in how MLLMs currently align local visual positional embeddings with global semantic continuity, especially when faced with non-canonical, physically disrupted inputs. This has immediate theoretical implications:

Semantic Reassembly: The gap between human and model performance in semantic reassembly under fragmentation exposes fundamental limitations in present-day visual-textual fused architectures. End-to-end OCR-free models and recent vision-language transformers with fine-grained patch alignment do not solve the problem.
Compositional Robustness: Recovery of logical structure (especially in 2D tables or code) remains an open challenge, suggesting the need for architectural innovations in set-to-sequence and permutation-invariant reasoning.
Robust VRDU: Real-world document workflows (forensics, archival restoration, legal discovery) often require robust recovery from partial or physically damaged sources; the ShredBench results indicate that current MLLMs are not yet reliable in these settings.

This benchmark establishes a clear empirical baseline and can inform research on algorithmic advances in VRDU, meta-learning for physical artifact recovery, explicit modeling of spatial syntax, and cross-modal attention mechanisms. Additionally, real-world extensions (e.g., overlapping/occluded fragments, variable lighting, and real paper textures) remain untested and represent clear opportunities for future work.

Conclusion

ShredBench defines a new standard for stress-testing the semantic reasoning capabilities of MLLMs under severe structural noise. Model performance reveals that state-of-the-art architectures—even those approaching human parity on intact document tasks—still fail at reconstructing global semantics from shredded input, especially in dense or highly structured domains. Future progress in robust document understanding will require deeper integration of permutation-invariant, context-aware, and language-prior-driven reasoning mechanisms.
