
MiLDEAgent: Multi-Layer Document Editing

Updated 15 January 2026
  • MiLDEAgent is a modular, reasoning-driven framework that supports structure-preserving editing of complex documents including scientific papers and design posters.
  • It employs a sequential pipeline combining region segmentation, command reformulation, and multimodal execution using both generalist LMMs and specialized image editors.
  • Experimental results demonstrate improved instruction adherence and layout fidelity, surpassing all open-source document editing baselines.

A Multi-Layer Document Editing Agent (MiLDEAgent) is a modular, reasoning-driven framework for localized, structure-preserving editing of complex documents—including scientific papers, design posters, and structured PDFs—based on natural language instructions. MiLDEAgent decomposes editing tasks into tightly orchestrated stages, combining multimodal vision-language reasoning, precise region segmentation, command reformulation, and modular execution via either generalist large multimodal models (LMMs) or specialized image editors. The architecture is designed to preserve content fidelity and layout integrity, addressing the unique challenges arising from multi-layer composition and atomic edit operations (Suri et al., 2024, Qian et al., 9 Aug 2025, Lin et al., 8 Jan 2026).

1. Architectures and Modular Pipeline

MiLDEAgent can be instantiated through layered architectures that address both structural and semantic complexity of document editing. The foundational DocEdit-v2 framework employs three sequential modules:

  • Doc2Command Layer: Utilizes a Vision Transformer (ViT) backbone and dual decoders to (a) localize editable regions (Region-of-Interest, RoI) via semantic segmentation, and (b) decode structured, software-style edit commands C₀ as token sequences of the form ACTION(<Component>, <Attribute>, <Init>, <Final>).
  • Command Reformulation Layer: Leverages LLMs (e.g., GPT-4, Gemini) to transform underspecified software-centric commands into concise, LMM-friendly instructions C*, tailoring output to the prompt schema expected by the execution backend.
  • Multimodal Execution Layer: Combines the original document (HTML+CSS or image), RoI bounding box, and C* to either generate edited HTML+CSS (rendered via headless browser) or directly edit document images.

A textual diagram of the three-layer pipeline:

┌────────────────┐   ┌─────────────────────┐   ┌───────────────────────┐   ┌───────────────────────┐
│ User Utterance │ → │ Doc2Command         │ → │ Command Reformulation │ → │ Multimodal Execution  │
│ + Document I   │   │ (RoI + Command C₀)  │   │ (produce C*)          │   │ (edit HTML+CSS/image) │
└────────────────┘   └─────────────────────┘   └───────────────────────┘   └───────────────────────┘
(Suri et al., 2024)
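
The control flow above can be sketched in Python. The stage functions below are illustrative stubs, not the paper's implementation; they stand in for the ViT grounding model, the reformulation LLM, and the multimodal execution backend:

```python
from dataclasses import dataclass

@dataclass
class EditRequest:
    document: str   # document image path or HTML+CSS source
    utterance: str  # natural-language edit instruction

def doc2command(req: EditRequest):
    # Stage 1 (stub): ground the request into a software-style command C0
    # plus a Region-of-Interest bounding box [x, y, h, w].
    c0 = f"MODIFY(<component>, <attribute>, <init>, <final>)  # from: {req.utterance}"
    roi = (0, 0, 100, 200)  # placeholder RoI
    return c0, roi

def reformulate(c0: str) -> str:
    # Stage 2 (stub): an LLM rewrites the terse command into an
    # LMM-friendly instruction C*; a trivial string transform stands in.
    return f"Please apply the following edit to the highlighted region: {c0}"

def execute(req: EditRequest, c_star: str, roi) -> str:
    # Stage 3 (stub): a multimodal backend edits the document given C* and the RoI.
    return f"<edited document at roi={roi} per: {c_star}>"

def pipeline(req: EditRequest) -> str:
    c0, roi = doc2command(req)
    return execute(req, reformulate(c0), roi)
```

The value of the decomposition is that each stub can be swapped independently: a different grounding model, reformulation LLM, or execution backend slots in without touching the other stages.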

In scientific document processing, DocRefine extends the concept to a six-agent pipeline covering layout parsing, multimodal semantic analysis, instruction decomposition, iterative content refinement, summarization, and feedback verification, operating in a closed-loop for maximal fidelity (Qian et al., 9 Aug 2025).

MiLDEdit further advances layer-aware editing by integrating a VLM-based reasoner trained with Group Relative Policy Optimization (GRPO) RL, explicitly deciding which RGBA layers to edit and generating layer-conditioned prompts for each atomic modification (Lin et al., 8 Jan 2026).

2. Reasoning and Edit Localization Mechanisms

Localized, layer-aware editing is achieved via structured grounding and reasoning:

Doc2Command (Structure and Mask Generation)

  • Vision Transformer encoder processes image patches overlaid with user instruction.
  • Semantic segmentation splits input into “RoI,” “U-text,” and “background” classes (K=3).
  • Largest connected component extraction and centroid thresholding yield the RoI bounding box b = [x, y, h, w].
  • Simultaneous decoding of textual command C₀ ensures coupling between localization (“where”) and operation (“what”).

High-level pseudocode:

function Doc2Command(I, U):
  I' = overlay_text(I, U)              # render instruction U onto image I
  patches = extract_patches(I', P)     # split into fixed-size patches
  Z = ViT_encoder(patches)             # shared visual-text encoding
  C0 = Text_decoder(Z)                 # structured edit command C0
  masks = Mask_transformer(Z)          # per-class masks (K = 3)
  mask_up = upsample(masks, H, W)      # restore input resolution
  roi_mask = mask_up[:, :, ROI_class]
  b = mask_to_bbox(roi_mask)           # RoI bounding box [x, y, h, w]
  return C0, b
(Suri et al., 2024)
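
For concreteness, the `mask_to_bbox` step can be sketched with a stdlib-only flood fill that keeps the largest connected component of the binary RoI mask; the paper's centroid-thresholding details are omitted, and the mask is assumed to contain at least one foreground pixel:

```python
from collections import deque

def mask_to_bbox(mask):
    """Largest 4-connected component of a binary mask -> bbox [x, y, h, w]."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []  # pixels of the largest component found so far
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                comp, q = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while q:  # BFS flood fill of one component
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    ys = [y for y, _ in best]
    xs = [x for _, x in best]
    return [min(xs), min(ys), max(ys) - min(ys) + 1, max(xs) - min(xs) + 1]
```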

RL-Based Layer Reasoner (MiLDEdit)

  • State: (D, L_i, I_D) — document, candidate layer, instruction.
  • Action: (y_i, I_i) — binary edit decision and layer-specific prompt.
  • Per-layer reward $\mathcal{R}_i$ combines format validity, layer-decision accuracy, and a BLEU match on the instruction.
  • GRPO applies group-normalized advantage and PPO-style surrogate with KL regularization.

GRPO policy objective: $$J_\mathrm{GRPO}(\phi) = \mathbb{E}\left[ \sum_{i=1}^{G} \min\left( r_i(\phi)A_i,\ \mathrm{clip}(r_i(\phi), 1-\epsilon, 1+\epsilon)\,A_i \right) - \beta\, D_\mathrm{KL}(\pi_\phi \,\|\, \pi_\mathrm{ref}) \right]$$ (Lin et al., 8 Jan 2026)
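
A minimal sketch of the group-normalized advantage and the clipped surrogate term; computing the reward components and the KL penalty is assumed external:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group of G sampled rollouts."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_surrogate(ratio, adv, eps_clip=0.2):
    """PPO-style clipped surrogate for one rollout.

    ratio: pi_phi(a|s) / pi_old(a|s); adv: group-normalized advantage A_i.
    """
    clipped = max(min(ratio, 1 + eps_clip), 1 - eps_clip)
    return min(ratio * adv, clipped * adv)  # pessimistic of the two terms
```

The `min` keeps the objective pessimistic: large policy-ratio moves stop earning extra reward (positive advantage) or are fully penalized (negative advantage) once the ratio leaves the clip band.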

3. Command Reformulation and Layer-Specific Prompting

Multi-stage command reformulation addresses ambiguity and adapts to the expectations of different editing backends:

  • Initial commands often lack the specificity required by generalist LMMs. Reformulation via LLMs inserts context and actionable detail, formatting output with the angle-bracket schema (“<Action> <Component> <Attribute?> <InitialState?> <FinalState?>”).
  • For multi-layer design documents, VLM or LLM reasoners generate layer-conditioned prompts, ensuring each edit is contextually grounded before being dispatched to the editor (Suri et al., 2024, Lin et al., 8 Jan 2026).
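
As an illustration of this reformulation step, a hypothetical helper can expand the angle-bracket schema into a natural-language instruction for a generalist LMM; the phrasing template is an assumption, not from the papers:

```python
def reformulate(action, component, attribute=None, init=None, final=None):
    """Expand an <Action> <Component> <Attribute?> <Init?> <Final?> command
    into a plain-English edit instruction (illustrative template)."""
    parts = [f"{action.lower()} the {component}"]
    if attribute:
        parts.append(f"'s {attribute}")
    if init and final:
        parts.append(f" from {init} to {final}")
    elif final:
        parts.append(f" to {final}")
    return "".join(parts) + "."
```

For example, `MODIFY(title, color, black, red)` becomes an explicit sentence an LMM can act on, while an attribute-free `DELETE(logo)` still yields a well-formed instruction.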

4. Editing Execution and Composite Output

Editing agents orchestrate atomic operations across layers or structural regions:

  • In HTML+CSS-based flows, the grounded edit instruction is injected into a constrained template marking relevant regions. LMMs (e.g., GPT-4V, Gemini) produce modified markup, which is rendered to image.
  • In multi-layer document editing, only selected layers undergo modification by a frozen image editor, controlled via alpha masks and layer prompts. The final output recomposes edited and untouched layers: $D' = L_1' \oplus L_2' \oplus \cdots \oplus L_n'$ (Lin et al., 8 Jan 2026)
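
The recomposition operator ⊕ can be read as per-pixel alpha compositing. A minimal sketch, with each "layer" reduced to a single RGBA pixel and layers folded bottom-to-top with the standard "over" operator:

```python
def over(top, bottom):
    """Porter-Duff 'over': composite one premature-free RGBA pixel onto another.
    Channels are floats in [0, 1]; alpha is straight (not premultiplied)."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    a = ta + ba * (1 - ta)
    if a == 0:
        return (0.0, 0.0, 0.0, 0.0)
    blend = lambda t, b: (t * ta + b * ba * (1 - ta)) / a
    return (blend(tr, br), blend(tg, bg), blend(tb, bb), a)

def compose(layers):
    """Fold RGBA 'layers' (ordered bottom-to-top) with 'over'."""
    out = (0.0, 0.0, 0.0, 0.0)
    for layer in layers:
        out = over(layer, out)
    return out
```

Because composition is per-layer, untouched layers pass through bit-exact, which is what lets MiLDEdit guarantee that unedited regions are preserved.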

Closed-loop feedback architectures (DocRefine) further introduce verification agents scoring semantic consistency, layout fidelity, and instruction adherence. Subthreshold outputs invoke iterative refinement until all criteria are satisfied (Qian et al., 9 Aug 2025).
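
Such a verification loop can be sketched generically; `refine` and `score` below are stand-ins for DocRefine's refinement and verification agents, and the threshold and step budget are illustrative:

```python
def refine_until_valid(draft, refine, score, threshold=0.8, max_steps=5):
    """Iteratively refine `draft` until every criterion clears `threshold`.

    score(draft) -> dict of per-criterion scores in [0, 1]
    refine(draft, scores) -> improved draft
    """
    for _ in range(max_steps):
        scores = score(draft)
        if all(s >= threshold for s in scores.values()):
            return draft  # all criteria satisfied
        draft = refine(draft, scores)
    return draft  # budget exhausted; return best effort
```

Passing the per-criterion scores into `refine` lets the refinement agent target exactly the failing criterion (e.g. layout fidelity) rather than regenerating blindly.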

5. Training, Datasets, and Benchmarking Protocols

MiLDEAgent components are trained/fine-tuned on multi-modal, multi-layer datasets with careful decomposition of edit instructions:

  • DocEdit-PDF: 17,808 image/request pairs with region-level annotations.
  • MiLDEBench: 20,000 design documents, average 4.45 layers per doc, 50,000 instructions, 87,000 atomic edit steps (Lin et al., 8 Jan 2026).
  • Joint fine-tuning employs multi-task objectives, e.g.,

$$L_\mathrm{total} = \lambda_\mathrm{text} L_\mathrm{text} + \lambda_\mathrm{seg} L_\mathrm{seg}$$

with cross-entropy for edit command tokens, focal and dice losses for segmentation; bounding-box regression is optional.
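
A stdlib sketch of this joint objective, using token-level cross-entropy plus a binary focal + soft-dice pair over a flattened mask; the weights and the binary simplification are illustrative:

```python
import math

def cross_entropy(probs):
    """Mean negative log-likelihood; probs[j] = p(correct token at position j)."""
    return -sum(math.log(p) for p in probs) / len(probs)

def focal(p, y, gamma=2.0):
    """Binary focal loss for one pixel: down-weights easy examples."""
    pt = p if y == 1 else 1 - p
    return -((1 - pt) ** gamma) * math.log(pt)

def dice(pred, target, eps=1e-6):
    """Soft dice loss over flattened masks (0 = perfect overlap)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)

def total_loss(tok_probs, pred_mask, gt_mask, lam_text=1.0, lam_seg=1.0):
    """L_total = lam_text * L_text + lam_seg * L_seg."""
    l_seg = sum(focal(p, t) for p, t in zip(pred_mask, gt_mask)) / len(pred_mask)
    l_seg += dice(pred_mask, gt_mask)
    return lam_text * cross_entropy(tok_probs) + lam_seg * l_seg
```

Pairing focal and dice is a common segmentation recipe: focal handles the pixel-level class imbalance (most pixels are background), while dice directly optimizes region overlap.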

MiLDEEval benchmarks span:

  • Instruction Following (IF)
  • Layout Consistency (LC)
  • Aesthetics (A)
  • Text Rendering (TR)
  • MiLDEScore: Composite metric integrating normalized scores with a gated sigmoid on IF.

Sample scoring equations:

Semantic Consistency Score (SCS): $\mathrm{SCS} = \frac{1}{|J|}\sum_{j=1}^{|J|}\frac{\varphi(g_j)\cdot\varphi(p_j)}{\|\varphi(g_j)\|\,\|\varphi(p_j)\|}$

Layout Fidelity Index (LFI): $\mathrm{LFI} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{SSIM}(I_i, I_i')$ (Qian et al., 9 Aug 2025, Lin et al., 8 Jan 2026)
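
The SCS term reduces to a mean cosine similarity over segment embeddings; a minimal sketch treating the outputs of φ as plain vectors (φ itself, e.g. a text encoder, is assumed external):

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def scs(gt_embs, pred_embs):
    """Mean cosine similarity between paired ground-truth / predicted
    segment embeddings phi(g_j), phi(p_j)."""
    return sum(cosine(g, p) for g, p in zip(gt_embs, pred_embs)) / len(gt_embs)
```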

6. Experimental Results and Comparative Analysis

MiLDEAgent sets new performance baselines for instruction adherence and layout preservation relative to prior approaches:

| Model | IF (%) | LC (%) | A | TR (%) | MiLDEScore (%) | Layer Acc (%) |
|---|---|---|---|---|---|---|
| MiLDEAgent 7B | 20.71 | 93.24 | 4.19 | 36.75 | 25.90 | 80.46 |
| MiLDEAgent 3B | 13.29 | 90.15 | 4.32 | 27.52 | 16.10 | 42.90 |
| Best open-source | ~14.2 | ≤90 | ≤4.2 | ≤28.7 | ~14.17 | — |
| Closed-source | ~25 | ~58 | ~4.5 | ~40 | ~27.1 | — |

MiLDEAgent achieves MiLDEScore up to 25.9%, outperforming all open-source baselines while maintaining explicit layer-awareness (Lin et al., 8 Jan 2026). RL training of the reasoner yields a fourfold improvement in layer selection accuracy.

7. Limitations, Extensions, and Generalization

MiLDEAgent architectures exhibit several limitations:

  • Independent layer decisions may induce conflicting edits across adjacent layers.
  • Command specificity and prompt quality strongly affect edit precision.
  • Frozen image editors may inadequately process fine-grained textual or decorative requests, a limitation explicit in MiLDEdit (Lin et al., 8 Jan 2026).

Future improvements include integrating self-check/refinement loops for error correction, co-training editors with reasoning modules, and expanding the instruction interface to handle multi-turn dialogue and user feedback. A plausible implication is that these developments will enhance applicability to broader document genres, as preliminary results in contracts and medical reports demonstrate strong generalization with SCS≈0.80–0.85, LFI≈0.89–0.92, and IAR≈0.78–0.82 (Qian et al., 9 Aug 2025).

Overall, MiLDEAgent defines the contemporary paradigm for fine-grained, reasoning-based document editing across multi-layer, multimodal formats, supported by rigorous evaluation and modular, decomposable pipelines (Suri et al., 2024, Qian et al., 9 Aug 2025, Lin et al., 8 Jan 2026).
