REORM: Reasoning-Enhanced Object Removal
- The paper introduces REORM, a modular framework that uses MLLMs to perform causal and context-aware object removal through recursive reasoning and refined segmentation.
- It integrates MLLM-driven analysis, mask-guided segmentation, and diffusion-based inpainting to eliminate both primary targets and all interaction-dependent artifacts.
- Benchmark evaluations demonstrate REORM’s state-of-the-art performance in perceptual similarity and artifact removal against both open-source models and closed-source API models.
Reasoning-Enhanced Object Removal with MLLM (REORM) is a modular framework for image-based object removal that operationalizes Interaction-Consistent Object Removal (ICOR). Unlike traditional object removal, which targets only explicitly named objects, ICOR formalizes the requirement to remove all associated interaction evidence—such as shadows, physically connected items, target-produced elements, and contextually linked objects—ensuring the semantic and visual coherence of edited images. REORM leverages multimodal LLMs (MLLMs) for causal and contextual reasoning about what must be excised from the scene, integrates mask-guided segmentation and inpainting, and introduces a self-correction mechanism for error detection and iterative refinement (Huang et al., 1 Feb 2026).
1. Motivation and Problem Formalization
Image object removal models have historically been unable to guarantee that all interaction evidence is eliminated following user instructions. Direct inpainting of a specified region often leaves artifacts or semantically inconsistent remnants: for instance, removing a person but leaving behind their shadow, a rider’s bike, or contextual signage that no longer has a referent (Huang et al., 1 Feb 2026).
Interaction-Consistent Object Removal (ICOR) is formulated as follows: given an input image $I$ and instruction $T$, produce an edited image $I'$ in which both the named targets and all semantically or physically dependent elements are removed while preserving global plausibility:

$$I' = \mathrm{Inpaint}(I, M_{\text{union}}), \qquad M_{\text{union}} = M_{\text{target}} \cup M_{\text{interact}}$$

Here, $M_{\text{union}}$ is the union mask comprising both primary targets and all interaction-dependent elements.
This generalization exposes critical limitations of standard approaches, including the persistence of lighting-dependent artifacts (shadows, reflections), physically attached objects (leftover bikes, leashes), target-produced traces (splashes, footprints), and contextually invalid objects (e.g., standalone fire-extinguisher signs).
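As a toy illustration of the ICOR union mask (the array layout and mask names are illustrative, not from the paper), the named target and its interaction evidence are composed before any inpainting is attempted:

```python
import numpy as np

# Toy 6x6 scene: True marks pixels occupied by the named target (a person)
# and by an interaction-dependent element (the person's cast shadow).
M_target = np.zeros((6, 6), dtype=bool)
M_target[1:4, 2:4] = True          # the person (6 pixels)

M_interact = np.zeros((6, 6), dtype=bool)
M_interact[4:6, 2:5] = True        # the shadow (6 pixels)

# ICOR requires inpainting over the union of both masks,
# not just the explicitly named target.
M_union = M_target | M_interact

print(M_union.sum())   # pixels to inpaint: 6 (person) + 6 (shadow) = 12
```

Removing only `M_target` would leave the shadow pixels untouched, which is exactly the failure mode ICOR formalizes.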
2. Framework Architecture and Core Modules
REORM’s architecture consists of three principal modules: MLLM-driven analysis, mask-guided removal, and a self-correction mechanism (Huang et al., 1 Feb 2026).
2.1 MLLM-Driven Reasoning
A chain-of-thought prompt is issued to a multimodal LLM (such as GPT-4o), which receives the image–instruction pair $(I, T)$ and recursively infers:
- Primary targets specified in $T$
- All secondary elements whose presence is causally linked to the target(s)
- Which items must be collectively removed
The output is a removal list $L$ (e.g., a person together with their shadow and bicycle). The recursive reasoning capability ensures capture of nested dependencies (e.g., a dog's leash and its shadow).
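The paper's pseudocode names a `parse_removal_list` helper that turns the analyzer's free-form reply into the list $L$. A minimal sketch, assuming the MLLM is prompted to end its chain-of-thought with a single `REMOVE:` line (that output contract is an assumption, not specified by the paper):

```python
import re

def parse_removal_list(analysis_text: str) -> list[str]:
    """Extract the removal list from free-form MLLM output.

    Assumes the analyzer ends its reasoning with a line such as
    'REMOVE: person, shadow, bicycle' (illustrative format).
    """
    match = re.search(r"REMOVE:\s*(.+)", analysis_text)
    if not match:
        return []
    return [item.strip() for item in match.group(1).split(",") if item.strip()]

reply = (
    "The instruction names a person. The person casts a shadow and is "
    "riding a bicycle, both causally linked to the target.\n"
    "REMOVE: person, shadow, bicycle"
)
print(parse_removal_list(reply))   # ['person', 'shadow', 'bicycle']
```

Returning an empty list on a malformed reply lets the downstream segmentation loop degrade gracefully instead of crashing on unparseable model output.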
2.2 Mask Generation and Segmentation
Each entry $l_i \in L$ is submitted to an open-vocabulary segmentation model (Grounded SAM), which yields an individual binary mask $M_i$. The masks are aggregated as $M_{\text{all}} = \bigcup_i M_i$ and refined with a morphological closing operator ($5 \times 5$ pixels) to ensure contiguity and completeness: $M_{\text{refined}} = \mathrm{Close}(M_{\text{all}}, k{=}5)$.
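The closing step can be sketched in pure NumPy (a stand-in for a library routine such as `scipy.ndimage.binary_closing`; the $k{=}5$ kernel follows the paper's pseudocode, the helper names are mine):

```python
import numpy as np

def binary_dilate(mask: np.ndarray, k: int) -> np.ndarray:
    # Dilation with a k x k square structuring element via shifted ORs.
    r = k // 2
    padded = np.pad(mask, r, mode="constant")
    out = np.zeros_like(mask)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def binary_erode(mask: np.ndarray, k: int) -> np.ndarray:
    # Erosion: a pixel survives only if its whole k x k window is set.
    r = k // 2
    padded = np.pad(mask, r, mode="constant")
    out = np.ones_like(mask)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def morph_close(mask: np.ndarray, k: int = 5) -> np.ndarray:
    """Closing = dilation followed by erosion: bridges small gaps
    between per-entity masks so the aggregated region is contiguous."""
    return binary_erode(binary_dilate(mask, k), k)

# Two mask fragments separated by a 2-pixel gap (columns 5-6).
m = np.zeros((7, 12), dtype=bool)
m[1:6, 2:5] = True    # fragment A
m[1:6, 7:10] = True   # fragment B
closed = morph_close(m, k=5)
print(bool(closed[3, 5]), bool(closed[3, 6]))   # gap pixels are now filled
```

Note that zero padding means the result can also shrink near image borders; library implementations expose border handling as a parameter.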
2.3 Context-Aware Inpainting
The refined mask $M_{\text{refined}}$ is used by a diffusion-based inpainting network (ObjectClear) to generate an intermediate result $I_1$, synthesizing background consistent in lighting, texture, and semantics.
2.4 Self-Correction Mechanism
To eliminate persistent or hallucinated artifacts, REORM applies a self-correction pipeline:
- Simulation: An MLLM generates the expected post-removal scene description $D_{\text{exp}}$, conditioned on $(I, L)$.
- Examination: A separate MLLM compares $I_1$ against $D_{\text{exp}}$ and outputs a correction list $C$ of any leftover or spurious entities.
- Second-Pass Removal: Each entry $c \in C$ is segmented and re-inpainted via an attention-redirected inpainting model (Attentive Eraser).
- Final Output: Yields the finalized image $I_{\text{final}}$.
An optional joint loss can be conceptualized as the sum of a mask consistency loss and a weighted correction penalty:

$$\mathcal{L}_{\text{joint}} = \mathcal{L}_{\text{mask}} + \lambda\, \mathcal{L}_{\text{corr}}$$
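Since the paper only names the two terms, one concrete instantiation (a Dice-style mask consistency term plus a count-based correction penalty; both choices are illustrative) could look like:

```python
import numpy as np

def dice_loss(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Mask consistency as 1 - Dice overlap between the predicted
    removal mask and the ground-truth union mask (illustrative choice)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    total = pred_mask.sum() + gt_mask.sum()
    return 1.0 - (2.0 * inter / total if total else 1.0)

def joint_loss(pred_mask, gt_mask, correction_list, lam: float = 0.1) -> float:
    # L_joint = L_mask + lambda * L_corr, where L_corr here simply counts
    # the entities the examiner still had to flag after the first pass.
    return dice_loss(pred_mask, gt_mask) + lam * len(correction_list)

gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True      # full union mask
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:4] = True  # missed half of it
print(round(joint_loss(pred, gt, ["shadow"]), 3))
```

A perfect first-pass mask with an empty correction list drives both terms to zero, matching the intuition that self-correction should become a no-op.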
3. Local-Deployment Variant
REORM-Local is designed for deployment on a single 24 GB GPU, eliminating dependency on proprietary cloud APIs. The analysis stage is decomposed into explicit sub-queries—target detection, element enumeration, inconsistency reasoning, and consolidation—using llava-vicuna-13b (quantized) for image-grounded reasoning and llama-3.1-8b-Instruct for text processing. This variant omits the self-correction pass for efficiency. The resultant inference time is approximately 10–12 seconds, compared to 17 seconds for GPT-4o (Huang et al., 1 Feb 2026).
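The decomposed analysis stage can be orchestrated as a simple four-step pipeline. The prompt wording and the callable stubs below are illustrative; the paper specifies only the stage decomposition and the two local models:

```python
def analyze_local(image, instruction, vlm_ask, llm_ask):
    """Decomposed analysis stage of REORM-Local (sketch).

    vlm_ask(image, prompt) -> str : image-grounded model (llava-vicuna-13b)
    llm_ask(prompt) -> str        : text-only model (llama-3.1-8b-Instruct)
    Both are injected as callables so the pipeline stays model-agnostic.
    """
    # 1. Target detection
    targets = vlm_ask(image, f"Which objects does this instruction refer to? {instruction}")
    # 2. Element enumeration
    elements = vlm_ask(image, f"List every visible element interacting with: {targets}")
    # 3. Inconsistency reasoning
    issues = llm_ask(
        f"If {targets} were removed, which of these would become "
        f"inconsistent and must also go? {elements}"
    )
    # 4. Consolidation into a single removal list
    return llm_ask(f"Consolidate into one comma-separated removal list: {targets}; {issues}")

# Stub models so the control flow can be exercised without GPUs.
def fake_vlm(image, prompt):
    return "person" if "refer" in prompt else "shadow, bicycle"

def fake_llm(prompt):
    return "shadow, bicycle" if "inconsistent" in prompt else "person, shadow, bicycle"

print(analyze_local("img.jpg", "remove the person", fake_vlm, fake_llm))
```

Keeping the models behind plain callables is also what makes swapping the quantized local pair for a cloud MLLM a one-line change.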
4. Benchmarking and Empirical Evaluation
The ICOREval benchmark was constructed to comprehensively evaluate interaction-consistent removal:
- 110 instructions/images, each requiring removal of primary and interaction-linked objects
- Four interaction categories: (i) lighting-dependent (53), (ii) physically connected (35), (iii) target-produced (16), (iv) contextual (28)
- Sourced from public video pairs, synthetic insertions, and manual copy-paste scenarios
Quantitative metrics include DINO (perceptual similarity, ↑), LPIPS (↓), PSNR (↑), SSIM (↑), and runtime on a 24 GB GPU. REORM achieved state-of-the-art results against both open-source (MGIE, SmartEdit) and API-closed models (Nano Banana):
| Method | DINO↑ | LPIPS↓ | PSNR↑ | SSIM↑ | Runtime |
|---|---|---|---|---|---|
| MGIE* | 0.709 | 0.261 | 19.217 | 0.634 | 5.2 s |
| SmartEdit* | 0.760 | 0.275 | 20.064 | 0.615 | 10.09 s |
| Nano Banana† | 0.873 | 0.124 | 25.225 | 0.771 | 8.45 s |
| Ours (Local) | 0.897 | 0.121 | 25.674 | 0.820 | 12.54 s |
| Ours (GPT-4o) | 0.937 | 0.104 | 27.063 | 0.825 | ~17 s |
(*open-source local, †closed-source API) (Huang et al., 1 Feb 2026).
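Of the reported metrics, PSNR is the simplest to reproduce exactly; a minimal NumPy version, assuming 8-bit images:

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shape images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.full((4, 4), 100, dtype=np.uint8)
b = np.full((4, 4), 110, dtype=np.uint8)   # uniform error of 10 -> MSE = 100
print(round(psnr(a, b), 2))   # 10 * log10(255^2 / 100) ≈ 28.13
```

DINO similarity and LPIPS, by contrast, require the respective pretrained feature extractors and cannot be reproduced from pixel statistics alone.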
Qualitative analyses indicate REORM's superiority in handling complex cases involving interconnected objects (e.g., riders with bicycles, persons with shadows, dogs with leashes), where baselines commonly leave residuals.
5. Relationship to Other Reasoning-Enhanced Editing Models
REORM adopts the broader paradigm of reasoning-enhanced image editing established by systems such as ReasonEdit. The “thinking and reflection” loop of ReasonEdit—where the MLLM decomposes ambiguous instructions into actionable steps (thinking) and iteratively assesses/corrects the edit (reflection)—is directly transferable to object removal settings (Yin et al., 27 Nov 2025). The thinking stage outputs both segmentation and inpainting sub-instructions. The reflection rounds compute artifact-specific scores (e.g., masked LPIPS) and prompt the MLLM for corrective actions if residuals or color artifacts remain.
A key distinction from methods such as EraseLoRA (Jo et al., 25 Dec 2025) is REORM’s explicit modeling of interaction dependencies, rather than focusing solely on decoupling foreground exclusion and subtype aggregation for faithful inpainting. While EraseLoRA utilizes an MLLM for reasoning about background and non-target foreground segmentation, it does not formalize recursive interaction consistency. REORM also contrasts with 3D-centric approaches such as REALM (Shi et al., 18 Oct 2025), which grounds reasoning across rendered views and fuses global-to-local segmentation for object removal in Gaussian Splatting scenes.
6. Failure Modes, Limitations, and Future Work
Empirical analyses highlight the following limitations of REORM:
- Over-editing: The self-correction module may remove background objects if omitted from the MLLM-generated scene description.
- Intent mismatch: Rigid adherence to total interaction erasure sometimes contradicts user intent, particularly where stylistic or context-dependent objects should be preserved.
- Efficiency-accuracy trade-off: The local-only variant is less effective at eliminating subtle artifacts due to omission of the full correction cycle.
Proposed future directions include the integration of adaptive intent modeling (enabling dialogic clarification), joint end-to-end training of the reasoning, segmentation, and inpainting modules with ICOR-style ground truth, and user-preference conditioning to modulate the strength of semantic consistency versus stylistic freedom (Huang et al., 1 Feb 2026).
7. Implementation Details
REORM is realized as a fully plug-and-play pipeline, requiring no fine-tuning of backbone models. The default implementation uses:
- MLLMs: GPT-4o (cloud) or llava-vicuna-13b-hf + llama-3.1-8b (local)
- Segmentation: Grounded SAM (ViT-H backbone)
- Inpainting: ObjectClear (diffusion), Attentive Eraser (for artifact-focused correction)
- Morphological kernel: $5 \times 5$ pixels
The process can be reproduced directly from the following pseudocode:
```python
# Stage 1: MLLM-driven analysis -> removal list L
analysis_text = MLLM_Analyzer.prompt(T, I)
L = parse_removal_list(analysis_text)

# Stage 2: mask generation and aggregation
M_all = set()
for elem in L:
    M_elem = GroundedSAM.segment(I, elem)
    M_all |= M_elem
M_refined = MorphClose(M_all, k=5)

# Stage 3: context-aware inpainting
I1 = ObjectClear.inpaint(I, M_refined)

# Stage 4: self-correction
D_exp = MLLM_Simulator.prompt(I, L)
C_text = MLLM_Examiner.prompt(I1, D_exp)
C = parse_removal_list(C_text)
if C:
    M_corr = set()
    for c in C:
        M_c = GroundedSAM.segment(I1, c)
        M_corr |= M_c
    I_final = AttentiveEraser.inpaint(I1, M_corr)
else:
    I_final = I1
return I_final
```