OmniText-Bench: 1D Controllable Image-Text Testbed
- The testbed presents a controllable framework for fine-grained text insertion, removal, and editing using diffusion-based models with advanced attention control.
- It employs precise annotation protocols, including input masks and style references, to ensure reproducible evaluations and benchmark diverse text-image tasks.
- Standardized metrics such as PSNR, MS-SSIM, and FID rigorously assess both content accuracy and style fidelity in manipulated text regions.
A controllable 1D image-text testbed, exemplified by the OmniText-Bench protocol, provides a structured experimental environment for research in controllable text-image manipulation (TIM). This testbed is designed to evaluate methods handling fine-grained, spatially localized text insertion, removal, editing, style transfer, and more, using precise annotation schemas and standardized evaluation routines. The architecture centers on diffusion-based generative models augmented for advanced attention manipulation and latent optimization. It includes task pipelines for developing, benchmarking, and comparing TIM systems under unified and reproducible conditions.
1. Dataset Structure and Annotation Protocol
Each sample within the OmniText-Bench testbed consists of four primary components:
- Input Image $I$: An RGB image or cropped region that may contain existing text relevant to the specified TIM operation.
- Target Mask $M$: A binary mask ($1$ for pixels to be removed, edited, or inserted; $0$ otherwise), registered to the input image to identify operative regions.
- Target Text $T$: The ground-truth string specifying the intended textual content of the region of interest (empty for removal).
- Style Reference $S$: Optional. An image and mask denoting the region(s) whose text style (font, color, stroke) serves as the target for style-aware manipulation. For tasks without explicit style transfer, $S$ is omitted.
Annotations are organized by sample directory, with files input.png (image), mask.png (mask), text.txt (target text), and—if necessary—ref.png and ref_mask.png for style reference.
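The per-sample directory layout above can be read with a small helper; the class and function names below are illustrative, not part of any released toolkit:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class TIMSample:
    """One OmniText-Bench sample: image, mask, target text, optional style ref."""
    input_path: Path
    mask_path: Path
    target_text: str
    ref_path: Optional[Path] = None
    ref_mask_path: Optional[Path] = None

def load_sample(sample_dir: str) -> TIMSample:
    """Read the fixed filenames: input.png, mask.png, text.txt, ref.png, ref_mask.png."""
    d = Path(sample_dir)
    ref = d / "ref.png"
    ref_mask = d / "ref_mask.png"
    return TIMSample(
        input_path=d / "input.png",
        mask_path=d / "mask.png",
        target_text=(d / "text.txt").read_text(encoding="utf-8").strip(),
        ref_path=ref if ref.exists() else None,          # style ref is optional
        ref_mask_path=ref_mask if ref_mask.exists() else None,
    )
```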
Task mappings to these elements are as follows:
| Task | Input $I$ | Target $T$ | Mask $M$ | Style Ref $S$ |
|---|---|---|---|---|
| Text Removal | Contains text | "" (empty) | Where text is | None |
| Text Insertion | Blank in masked region | New string | Where to insert | Optional |
| Text Editing | Old text in masked region | New string | Where to edit | Optional |
| Text Rescaling | Old text present | New string | Rescaled region | Optional |
| Text Repositioning | Original text | Unchanged | New position | Optional |
| Style-Based Operation | As above | Varies | Varies | Required |
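The table rows can be encoded as a small lookup that an evaluation harness might use to validate each sample before running a task; the field names below are illustrative:

```python
# Which annotation fields each task requires; the target text may be empty only for removal.
TASK_SPECS = {
    "removal":       {"needs_style_ref": False, "target_may_be_empty": True},
    "insertion":     {"needs_style_ref": False, "target_may_be_empty": False},
    "editing":       {"needs_style_ref": False, "target_may_be_empty": False},
    "rescaling":     {"needs_style_ref": False, "target_may_be_empty": False},
    "repositioning": {"needs_style_ref": False, "target_may_be_empty": False},
    "style_based":   {"needs_style_ref": True,  "target_may_be_empty": False},
}

def validate(task: str, target_text: str, has_mask: bool, has_style_ref: bool) -> bool:
    """Check that a sample carries the annotations its task needs (mask is always required)."""
    spec = TASK_SPECS[task]
    if not has_mask:
        return False
    if spec["needs_style_ref"] and not has_style_ref:
        return False
    if not spec["target_may_be_empty"] and not target_text:
        return False
    return True
```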
This structure enables isolation and control of content and style manipulations in both input and ground-truth specifications.
2. Evaluation Metrics
Evaluation is conducted using standardized image-based and text-based metrics:
- Text Removal: assessed over the full output image using
  - PSNR (Peak Signal-to-Noise Ratio)
  - MS-SSIM (Multi-Scale Structural Similarity)
  - FID (Fréchet Inception Distance)
- Text Insertion, Editing, Rescaling, Repositioning: assessed on cropped regions defined by the target mask $M$ (or the style reference mask), using
  - Content Accuracy: word-level accuracy (ACC) and Normalized Edit Distance (NED), computed via scene-text OCR.
  - Style Fidelity: pixel-wise MSE, PSNR, MS-SSIM, and FID.
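A minimal sketch of the region-level metrics, assuming NumPy arrays for image crops and plain strings for OCR output (MS-SSIM and FID require pretrained networks and are omitted here):

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio over a cropped region."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ned(pred: str, gt: str) -> float:
    """Normalized edit distance in [0, 1]; lower is better."""
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))

def word_acc(pred: str, gt: str) -> float:
    """Word-level accuracy: exact match of the recognized string."""
    return float(pred == gt)
```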
For style-aware tasks, fidelity metrics are computed over style-referenced insertions, ensuring that both semantic content and style attributes are jointly evaluated.
3. Model Architecture and Attention Mechanisms
OmniText builds upon a pre-trained latent-diffusion text-inpainting U-Net backbone (TextDiff-2), incorporating two dedicated operational modules:
- Text Removal (TR), performed via attention modulation at sampling time:
  - Self-Attention Inversion (SAI): Inverts the self-attention activations $A^{(l)}$ over masked regions in early sampling steps, suppressing residual focus on removed text; here $A^{(l)}$ denotes the self-attention matrix of U-Net decoder block $l$.
  - Cross-Attention Reassignment (CAR): Constrains cross-attention within the masked region to align only with the prompt's start/end tokens.
  SAI is applied during an initial fraction of the sampling steps; CAR is applied at every step.
- Controllable Inpainting (CI), enabled by latent-space optimization:
  - Grid construction: a sampling grid is built over the operative region defined by $M'$, where $M'$ is a possibly shrunk version of the target mask used for editing.
  - Latent optimization proceeds during an initial fraction of the diffusion steps. At each such step, losses are computed and gradients are backpropagated to the latent $z_t$.
The combined use of direct attention manipulation and optimization-driven inpainting delivers fine-grained control over both appearance and semantics in the manipulated text regions.
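The two attention edits can be sketched as plain array operations. This is a simplified reading, not the released implementation: SAI is rendered as flipping the masked rows' attention mass and re-normalizing, and CAR as zeroing all text-token columns except the kept start/end tokens:

```python
import numpy as np

def self_attention_inversion(attn: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Invert self-attention over masked query positions (simplified SAI sketch).

    attn: (N, N) row-stochastic self-attention map; mask: (N,) binary query mask.
    """
    out = attn.copy()
    rows = mask.astype(bool)
    inverted = 1.0 - out[rows]                                   # flip activations
    out[rows] = inverted / inverted.sum(axis=1, keepdims=True)   # restore row-stochasticity
    return out

def cross_attention_reassignment(attn: np.ndarray, keep_tokens: list) -> np.ndarray:
    """Zero cross-attention for all but the kept tokens, then re-normalize (CAR sketch).

    attn: (N, T) map over T text tokens; keep_tokens: indices of start/end tokens.
    """
    out = np.zeros_like(attn)
    out[:, keep_tokens] = attn[:, keep_tokens]
    return out / out.sum(axis=1, keepdims=True)
```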
4. Latent Optimization: Loss Formulations and Update Procedure
Two novel loss functions stabilize and guide latent inpainting:
- Cross-Attention Content Loss ($\mathcal{L}_{\text{content}}$): Ensures content-accurate rendering of each character by maximizing cross-attention activations for each target token over its sub-mask region. With $p_i$ the fraction of token $i$'s cross-attention mass falling inside character sub-mask $M_i$, the loss takes the focal form

$$\mathcal{L}_{\text{content}} = \frac{1}{K}\sum_{i=1}^{K} -\left(1 - p_i\right)^{\gamma} \log p_i,$$

employing Focal Loss with focusing parameter $\gamma$.
- Self-Attention Style Loss ($\mathcal{L}_{\text{style}}$): Aligns the distribution of self-attention within the masked region to a normalized reference mask $\bar{M}_{\text{ref}}$ by minimizing the KL divergence

$$\mathcal{L}_{\text{style}} = D_{\mathrm{KL}}\!\left(\bar{M}_{\text{ref}} \,\Vert\, \bar{A}\right),$$

where $\bar{A}$ is the normalized self-attention distribution over the masked region. The total loss is $\mathcal{L} = \lambda_{\text{content}}\,\mathcal{L}_{\text{content}} + \lambda_{\text{style}}\,\mathcal{L}_{\text{style}}$. Latent variables $z_t$ are updated by Adam during the designated optimization stages along the diffusion timeline.
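The focal-loss and KL-divergence formulations can be sketched in NumPy. An autograd framework would be used in practice to backpropagate into the latents; the focal exponent value 2.0 below is a common default, not the benchmark's setting:

```python
import numpy as np

def content_loss(cross_attn: np.ndarray, sub_masks: np.ndarray, gamma: float = 2.0) -> float:
    """Focal-style cross-attention content loss (sketch).

    cross_attn: (K, N) attention of K target tokens over N spatial positions.
    sub_masks:  (K, N) binary character sub-masks.
    """
    total = 0.0
    for a, m in zip(cross_attn, sub_masks):
        p = (a * m).sum() / max(a.sum(), 1e-8)       # attention mass inside the sub-mask
        p = float(np.clip(p, 1e-8, 1.0))
        total += -((1.0 - p) ** gamma) * np.log(p)   # focal term: easy tokens down-weighted
    return total / len(cross_attn)

def style_loss(self_attn: np.ndarray, ref_mask: np.ndarray) -> float:
    """KL divergence from the normalized reference mask to the self-attention distribution."""
    q = self_attn / self_attn.sum()
    r = ref_mask / ref_mask.sum()
    eps = 1e-8
    return float(np.sum(r * (np.log(r + eps) - np.log(q + eps))))
```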
5. End-to-End Testbed Instantiation and Usage
The testbed runtime consists of well-defined pipelines and hyperparameters:
- Sampling steps: a fixed total budget shared by TR and CI
- SAI: applied over an initial fraction of the steps (for TR tasks)
- Latent optimization: performed at designated stages along the timeline
- Adam learning rate and loss weights $\lambda_{\text{content}}$, $\lambda_{\text{style}}$: fixed defaults supplied with the release
Reference pseudocode specifies the operational logic for all task types, leveraging a function such as denoise_step(z, mask, grid, token_embed) to carry out diffusion steps with attention hooks. The pipeline branch depends on the task (TR for erasure, CI for insertion, editing, and transfer), using the provided sample masks, text, and style references; example CLI commands accompany the release for reproducibility.
Generated outputs are benchmarked against ground-truth data using the metrics and crop protocols stipulated above.
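The task branching can be sketched as a driver loop. Here `denoise_step` and `optimize_latent` are passed in as callables standing in for the attention-hooked diffusion step and the latent-optimization routine; the step counts and stage schedule are schematic placeholders, not the released values:

```python
def run_pipeline(task, z, mask, grid, token_embed,
                 denoise_step, optimize_latent,
                 num_steps, sai_frac=0.5, opt_stages=()):
    """Schematic OmniText-Bench driver: TR uses attention edits, CI adds latent optimization.

    sai_frac and opt_stages stand in for the release's schedule, which is not fixed here.
    """
    for t in range(num_steps):
        if task == "removal":
            # TR branch: SAI only during the early steps, CAR throughout.
            hooks = {"car": True, "sai": t < int(sai_frac * num_steps)}
            z = denoise_step(z, mask, grid=None, token_embed=token_embed, hooks=hooks)
        else:
            # CI branch: optimize the latent at designated stages, then denoise.
            if t in opt_stages:
                z = optimize_latent(z, mask, grid, token_embed)  # content + style losses
            z = denoise_step(z, mask, grid=grid, token_embed=token_embed,
                             hooks={"car": False, "sai": False})
    return z
```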
6. Research Significance and Applications
OmniText-Bench establishes a versatile and controllable 1D image-text testbed, directly enabling universal TIM research and development:
- Broader TIM Applicability: Encompasses removal, insertion, arbitrary editing, geometric and stylistic variants.
- Fine-Grained Style Control: Through referential guidance, it supports tasks involving heterogeneous font, stroke, and color transfer.
- Unified Metrics and Protocols: Enables rigorous comparison across methods, including both generalist and specialist architectures.
A plausible implication is that this design paradigm facilitates rapid iteration for both foundational TIM models and downstream applied research in real-world signage, document editing, and graphic design scenarios. All operational details, annotation formats, and benchmarking routines are specified for transparent reproduction.