RADR: Relation-Aware Design Reconstruction
- The paper introduces RADR, a framework that leverages relation graphs and multi-modal models to achieve structure-preserving design layout editing.
- It formulates layout modifications as a self-supervised reconstruction problem using serialized edge sequences and standardized operations.
- Compared with existing methods, the approach demonstrates superior accuracy and efficiency by integrating geometric relation awareness into design editing.
Relation-Aware Design Reconstruction (RADR) is a framework for autonomous design layout editing that achieves structure-preserving modifications in graphical designs. Conceived as the core architectural element of ReLayout, RADR addresses the challenges of ambiguous natural language instructions and limited annotation data, formulating layout editing as a self-supervised reconstruction problem informed by explicit element relations and standardized editing operations. The approach unifies multiple editing actions (add, delete, move, resize) within a multi-modal LLM (MLLM) backbone, enabling versatile, data-efficient, and accurate design editing while robustly maintaining the geometric structure of unedited regions (Lin et al., 1 Feb 2026).
1. Formalization of Design Elements and Relation Graph
In RADR, a design is a set of elements $E = \{e_1, \dots, e_N\}$, where each $e_i$ is characterized by its content $c_i$ (image or text) and geometric attributes $a_i$. For images, $a_i = (x, y, w, h)$ specifies position and size; for text, $a_i = (x, y, w, h, s, \theta, \mathrm{align})$, with $s$ as font size, $\theta$ as angle, and $\mathrm{align}$ as alignment.
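The element formalization above can be sketched as a small data structure; the field names and the `kind` discriminator are illustrative choices, not the paper's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Element:
    """One design element: content plus geometric attributes."""
    content: str                       # image path or text string
    kind: str                          # "image" or "text"
    x: float                           # top-left position
    y: float
    w: float                           # size
    h: float
    font_size: Optional[float] = None  # text only (s)
    angle: Optional[float] = None      # text only (theta, degrees)
    align: Optional[str] = None        # text only: "left" | "center" | "right"

# Hypothetical elements for illustration
title = Element(content="Sale!", kind="text", x=40, y=20, w=200, h=60,
                font_size=32, angle=0.0, align="center")
logo = Element(content="logo.png", kind="image", x=10, y=10, w=48, h=48)
```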
RADR introduces a relation graph $G = (V, R)$ encoding layout structure. Nodes $V$ comprise the design elements and a canvas node $v_0$. Directed edges describe pairwise relations, partitioned into size relations $R_{\mathrm{size}}$ and position relations $R_{\mathrm{pos}}$:
- Size relations: the area ratio $\mathrm{area}(e_s)/\mathrm{area}(e_t)$ between elements is classified as "small", "equal", or "large" with tolerance $\tau$: "equal" if the ratio lies within $[1-\tau,\, 1+\tau]$, "small" below this band, and "large" above it.
No size edges involve the canvas.
- Position relations: each reference (target) bounding box induces a 3×3 grid; the relation takes a value in {TL, T, TR, L, C, R, BL, B, BR} according to which cell contains the source element's center.
In practice, rather than dense adjacency tensors, RADR serializes relations into edge sequences for input.
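The two relation classifiers can be sketched as follows; the tolerance value and the handling of centers that fall exactly on a grid boundary are assumptions, not specified by the paper:

```python
def size_relation(src, tgt, tol=0.1):
    """Classify the area ratio of src over tgt with tolerance tol.
    Boxes are (x, y, w, h) tuples; canvas pairs are excluded by the caller."""
    ratio = (src[2] * src[3]) / (tgt[2] * tgt[3])
    if ratio < 1 - tol:
        return "small"
    if ratio > 1 + tol:
        return "large"
    return "equal"

def position_relation(src, tgt):
    """Locate src's center in the 3x3 grid induced by tgt's bounding box."""
    cx = src[0] + src[2] / 2
    cy = src[1] + src[3] / 2
    tx, ty, tw, th = tgt
    col = "L" if cx < tx else ("R" if cx > tx + tw else "")
    row = "T" if cy < ty else ("B" if cy > ty + th else "")
    return (row + col) or "C"
```

For example, a source box far up and to the left of the target yields "TL", while one whose center falls inside the target yields "C".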
2. Self-Supervised Reconstruction Objective
The training process emulates layout editing as a conditional attribute reconstruction problem. The model predicts the edited attributes $\hat{a}$ given the design contents $c$, the (possibly pruned) relation graph $G'$, and a synthesized editing operation $o$. The objective is the negative log-likelihood (NLL):

$$\mathcal{L} = -\sum_{t} \log p_\theta\big(y_t \mid y_{<t},\, c,\, G',\, o\big),$$

where $y_t$ are attribute tokens in an autoregressive factorization.
Weight-decay regularization is applied to the LoRA-adapted LLM parameters. To synthesize supervision, a random operation $o$ is sampled and the edges incident to the target element are removed from $G$, compelling the model to infer the new attributes from the remaining structure and $o$. This bypasses the need for explicit (original, operation, edited) triplets.
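The NLL objective is computed only over the attribute tokens, with the conditioning prompt masked out of the loss. A minimal sketch, assuming the standard label-masking convention (sentinel value -100 for ignored positions):

```python
import math

IGNORE = -100  # label value for prompt tokens excluded from the loss

def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over supervised (attribute) tokens only.
    token_logprobs[t] is log p(labels[t] | y_<t, c, G', o) from the model;
    positions where labels[t] == IGNORE (the conditioning) are skipped."""
    losses = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE]
    return sum(losses) / len(losses)

# One prompt token (masked) followed by three attribute tokens (hypothetical)
lps = [math.log(0.9), math.log(0.5), math.log(0.25), math.log(0.5)]
labs = [IGNORE, 7, 7, 7]
loss = masked_nll(lps, labs)  # mean over the last three positions only
```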
3. Standardized Editing Operations and Data Synthesis
Every operation is represented in tuple format $o = (\mathrm{action}, \mathrm{target}, \mathrm{params})$, with $\mathrm{action} \in \{\mathrm{add}, \mathrm{delete}, \mathrm{move}, \mathrm{resize}\}$. Actions are defined as:
| Action | Target | Parameters |
|---|---|---|
| add | element index | none |
| delete | element index | none |
| move | element index | new position $(x, y)$ |
| resize | element index | new size $(w, h)$ |
For self-supervised samples, a design is selected, the relation graph $G$ is built, an operation $o$ is sampled, the affected edges are removed to yield $G'$, and the model learns to reconstruct the target attributes from $(c, G', o)$.
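The synthesis pipeline can be sketched as below; the edge encoding (source, target, relation) and dictionary layout are illustrative, and the real pipeline would treat "add" specially since the added element is absent from the input:

```python
import random

ACTIONS = ("add", "delete", "move", "resize")

def synthesize_sample(elements, edges, rng=None):
    """Build one self-supervised training sample: sample an operation,
    prune every relation edge incident to its target, and return the
    (contents, pruned graph, operation) input plus the target element's
    original attributes as the reconstruction label."""
    rng = rng or random.Random(0)
    target = rng.randrange(len(elements))
    action = rng.choice(ACTIONS)
    pruned = [e for e in edges if target not in (e[0], e[1])]
    op = {"action": action, "target": target}
    label = elements[target]["attrs"]
    sample = {"contents": [e["content"] for e in elements],
              "graph": pruned, "op": op}
    return sample, label
```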
4. Multi-Modal Model Architecture
RADR leverages an MLLM composed of:
- Vision encoder (e.g., CLIP ViT-L/14, frozen): produces visual tokens per image element.
- Projector: a 2-layer MLP with GELU that maps vision embeddings into the LLM token space.
- LLM backbone (e.g., Llama-3.1-8B): receives all tokenized inputs and autoregressively emits attribute predictions as JSON.
Inputs are concatenated from projected image tokens, text tokens, serialized relation tokens, and operation tokens (e.g., "MOVE element 3 TO (120, 450)"). This token sequence is processed with positional embeddings, and the output is parsed to extract the new attributes for rendering.
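The textual part of this serialization can be sketched as follows; the delimiter tokens and exact formatting are assumptions, with only the operation phrasing ("MOVE element 3 TO (120, 450)") taken from the paper's example:

```python
def serialize_input(edges, op):
    """Render relation edges and the editing operation as token-ready text.
    edges are (source, target, relation) triples; op is a dict with
    'action', 'target', and (for move/resize) 'params'."""
    rel = " ; ".join(f"{s} {r} {t}" for s, t, r in edges)
    action = op["action"].upper()
    if action in ("MOVE", "RESIZE"):
        px, py = op["params"]
        op_str = f"{action} element {op['target']} TO ({px}, {py})"
    else:
        op_str = f"{action} element {op['target']}"
    return f"<relations> {rel} </relations> <op> {op_str} </op>"
```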
5. Training and Inference Process
5.1 Self-Supervised Fine-Tuning
- Dataset: Crello v4, approximately 23k designs, filtered to exclude designs with over 25 elements.
- Sampling: For each design , build , synthesize , remove edges for the operation target, yielding the input for one training sample.
- Optimization: Llama-3.1-8B backbone (LoRA-adapted, AdamW optimizer, lr = 2e-4, weight decay 0.01); frozen vision encoder; LoRA rank 32; batch size 64; trained on 8 GPUs for approximately 50k steps.
5.2 Inference Flow
Given a new design and a user-specified editing operation:
- Extract the current element contents and attributes.
- Build the relation graph, removing the edges affected by the operation.
- Tokenize the contents, serialized relations, and operation.
- Forward pass through the fine-tuned MLLM.
- Parse the JSON output to obtain the edited attributes.
- Render the final design from the predicted attributes.
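The JSON-parsing step can be sketched with the standard library; the flat `{"x": ..., "y": ..., "w": ..., "h": ...}` schema is an assumption, since the paper does not specify the output format beyond "JSON":

```python
import json

def parse_prediction(raw):
    """Extract predicted geometric attributes from the model's raw output,
    tolerating surrounding non-JSON text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    attrs = json.loads(raw[start:end + 1])
    return {k: float(v) for k, v in attrs.items() if k in ("x", "y", "w", "h")}

pred = parse_prediction('Edited: {"x": 120, "y": 450, "w": 64, "h": 32}')
```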
6. Experimental Results and Comparative Analysis
RADR was benchmarked against GPT-4o (multi-modal assistant), FlexDM (masked-field layout generation), LaDeCo (layered design composer), and PosterLLaVA (relation-conditioned layout). Key evaluation metrics include design/layout, content, typography/color, graphics/images, and innovation scores; overlap (Ove, lower is better); alignment (Ali, lower is better); size- and position-relation preservation; and operation accuracy.
| Model | Design/Layout–Innovation | Ove | Ali | Size-Rel Pres. | Pos-Rel Pres. | Op-Acc |
|---|---|---|---|---|---|---|
| RADR (Ours) | 8.25–7.10 | 0.0996 | 0.0013 | 0.9150 | 0.8684 | 0.9991 |
| GPT-4o | 7.41–6.37 | 0.1942 | 0.0011 | 0.8386 | 0.5544 | 0.9983 |
| FlexDM | 5.34–4.54 | 0.3242 | 0.0016 | – | – | – |
| LaDeCo | 8.08–6.98 | 0.0865 | 0.0013 | – | – | – |
| PosterLLaVA | – | – | – | 0.8822 | 0.8458 | – |
In the generalization setting, RADR further improves size and position relation preservation (0.9475 and 0.9157, respectively), outperforming all baselines, including GPT-4o (0.8447 and 0.5444). Human preference surveys (200 samples) indicate significantly higher preference for RADR output over GPT-4o, particularly regarding visual quality and structural preservation (up to 86.0% preferred structure retention in both reconstruction and generalization).
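A relation-preservation score of this kind can be sketched as the fraction of pre-edit relation edges that still hold after editing; the exact protocol (e.g., whether edges touching the edited element are excluded) is an assumption:

```python
def relation_preservation(before, after):
    """Fraction of pre-edit relation edges (source, target, relation)
    that recur unchanged in the post-edit graph; 1.0 for an empty graph."""
    before, after = set(before), set(after)
    return len(before & after) / len(before) if before else 1.0
```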
7. Ablation Studies and Structure Preservation
Ablation studies demonstrate the critical contributions of RADR and the explicit relation graph. Removing RADR or the relation graph substantially degrades size and position relation preservation (to 0.8008/0.3750 and 0.7943/0.3751, respectively), in contrast to the full model (0.9150 and 0.8684). Using a dense adjacency matrix is less effective than a serialized edge sequence.
| Setting | Design/Layout–Innovation | Ove | Ali | Size Rel | Pos Rel | Op |
|---|---|---|---|---|---|---|
| w/o RADR | 8.12–7.01 | 0.1007 | 0.0011 | 0.8008 | 0.3750 | 0.9978 |
| w/o relation graph | 8.21–7.07 | 0.0980 | 0.0013 | 0.7943 | 0.3751 | 0.9987 |
| w/ matrix | 8.17–7.04 | 0.0954 | 0.0012 | 0.8892 | 0.8529 | 0.9909 |
| Ours (serialized) | 8.25–7.10 | 0.0996 | 0.0013 | 0.9150 | 0.8684 | 0.9991 |
These results confirm the necessity of both relation awareness and the reconstruction formulation for reliable structure preservation in automatic design layout editing (Lin et al., 1 Feb 2026).