OmniText-Bench: 1D Controllable Image-Text Testbed
- The testbed presents a controllable framework for fine-grained text insertion, removal, and editing using diffusion-based models with advanced attention control.
- It employs precise annotation protocols, including input masks and style references, to ensure reproducible evaluations and benchmark diverse text-image tasks.
- Standardized metrics such as PSNR, MS-SSIM, and FID rigorously assess both content accuracy and style fidelity in manipulated text regions.
A controllable 1D image-text testbed, exemplified by the OmniText-Bench protocol, provides a structured experimental environment for research in controllable text-image manipulation (TIM). This testbed is designed to evaluate methods handling fine-grained, spatially localized text insertion, removal, editing, style transfer, and more, using precise annotation schemas and standardized evaluation routines. The architecture centers on diffusion-based generative models augmented for advanced attention manipulation and latent optimization. It includes task pipelines for developing, benchmarking, and comparing TIM systems under unified and reproducible conditions.
1. Dataset Structure and Annotation Protocol
Each sample within the OmniText-Bench testbed consists of four primary components:
- Input Image $I$: An RGB image or cropped region that may contain existing text relevant to the specified TIM operation.
- Target Mask $M$: A binary mask ($1$ for pixels to be removed, edited, or inserted; $0$ otherwise), registered to the input image to identify operative regions.
- Target Text $T$: The ground-truth string specifying the intended textual content of the region of interest (empty for removal).
- Style Reference $S$: Optional. An image and mask denoting the region(s) whose text style (font, color, stroke) serves as the target for style-aware manipulation. For tasks without explicit style transfer, $S$ is omitted.
Annotations are organized by sample directory, with files input.png (image), mask.png (mask), text.txt (target text), and—if necessary—ref.png and ref_mask.png for style reference.
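The per-sample directory layout above can be read with a small helper; the class and function names below are illustrative, not part of any released toolkit:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class TIMSample:
    """One OmniText-Bench sample: image, mask, target text, optional style ref."""
    input_path: Path
    mask_path: Path
    target_text: str
    ref_path: Optional[Path] = None
    ref_mask_path: Optional[Path] = None

def load_sample(sample_dir: str) -> TIMSample:
    """Read the fixed filenames: input.png, mask.png, text.txt, ref.png, ref_mask.png."""
    d = Path(sample_dir)
    ref = d / "ref.png"
    ref_mask = d / "ref_mask.png"
    return TIMSample(
        input_path=d / "input.png",
        mask_path=d / "mask.png",
        target_text=(d / "text.txt").read_text(encoding="utf-8").strip(),
        ref_path=ref if ref.exists() else None,          # style ref is optional
        ref_mask_path=ref_mask if ref_mask.exists() else None,
    )
```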
Task mappings to these elements are as follows:
| Task | Input $I$ | Target $T$ | Mask $M$ | Style Ref $S$ |
|---|---|---|---|---|
| Text Removal | Contains text | "" (empty) | Where text is | None |
| Text Insertion | Blank in masked region | New string | Where to insert | Optional |
| Text Editing | Old text in masked region | New string | Where to edit | Optional |
| Text Rescaling | Old text present | New string | Rescaled region | Optional |
| Text Repositioning | Original text | Unchanged | New position | Optional |
| Style-Based Operation | As above | Varies | Varies | Required |
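The table rows can be encoded as a small lookup that an evaluation harness might use to validate each sample before running a task; the field names below are illustrative:

```python
# Which annotation fields each task requires; the target text may be empty only for removal.
TASK_SPECS = {
    "removal":       {"needs_style_ref": False, "target_may_be_empty": True},
    "insertion":     {"needs_style_ref": False, "target_may_be_empty": False},
    "editing":       {"needs_style_ref": False, "target_may_be_empty": False},
    "rescaling":     {"needs_style_ref": False, "target_may_be_empty": False},
    "repositioning": {"needs_style_ref": False, "target_may_be_empty": False},
    "style_based":   {"needs_style_ref": True,  "target_may_be_empty": False},
}

def validate(task: str, target_text: str, has_mask: bool, has_style_ref: bool) -> bool:
    """Check that a sample carries the annotations its task needs (mask is always required)."""
    spec = TASK_SPECS[task]
    if not has_mask:
        return False
    if spec["needs_style_ref"] and not has_style_ref:
        return False
    if not spec["target_may_be_empty"] and not target_text:
        return False
    return True
```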
This structure enables isolation and control of content and style manipulations in both input and ground-truth specifications.
2. Evaluation Metrics
Evaluation is conducted using standardized image-based and text-based metrics:
- Text Removal: assessed over the full output image using
  - PSNR (Peak Signal-to-Noise Ratio)
  - MS-SSIM (Multi-Scale Structural Similarity)
  - FID (Fréchet Inception Distance)
- Text Insertion, Editing, Rescaling, Repositioning: assessed on cropped regions defined by the target mask $M$ (or the style reference mask), using
  - Content Accuracy: word-level accuracy (ACC) and Normalized Edit Distance (NED), computed via scene-text OCR.
  - Style Fidelity: pixel-wise MSE, PSNR, MS-SSIM, and FID.
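A minimal sketch of the region-level metrics, assuming NumPy arrays for image crops and plain strings for OCR output (MS-SSIM and FID require pretrained networks and are omitted here):

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio over a cropped region."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ned(pred: str, gt: str) -> float:
    """Normalized edit distance in [0, 1]; lower is better."""
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))

def word_acc(pred: str, gt: str) -> float:
    """Word-level accuracy: exact match of the recognized string."""
    return float(pred == gt)
```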
For style-aware tasks, fidelity metrics are computed over style-referenced insertions, ensuring that both semantic content and style attributes are jointly evaluated.
3. Model Architecture and Attention Mechanisms
OmniText builds upon a pre-trained latent-diffusion text-inpainting U-Net backbone (TextDiff-2), incorporating two dedicated operational modules:
- Text Removal (TR), performed via attention modulation at sampling time:
  - Self-Attention Inversion (SAI): Inverts the self-attention activations $A^{(l)}$ over masked regions in early sampling steps, suppressing residual focus on removed text; here $A^{(l)}$ denotes the self-attention matrix of U-Net decoder block $l$.
  - Cross-Attention Reassignment (CAR): Constrains cross-attention within the masked region to align only with the prompt's start/end tokens.
  SAI is applied during an initial fraction of the sampling steps; CAR is applied at every step.
- Controllable Inpainting (CI), enabled by latent-space optimization:
  - Grid construction: a sampling grid is built over the operative region defined by $M'$, where $M'$ is a possibly shrunk version of the target mask used for editing.
  - Latent optimization proceeds during an initial fraction of the diffusion steps. At each such step, losses are computed and gradients are backpropagated to the latent $z_t$.
The combined use of direct attention manipulation and optimization-driven inpainting delivers fine-grained control over both appearance and semantics in the manipulated text regions.
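The two attention edits can be sketched as plain array operations. This is a simplified reading, not the released implementation: SAI is rendered as flipping the masked rows' attention mass and re-normalizing, and CAR as zeroing all text-token columns except the kept start/end tokens:

```python
import numpy as np

def self_attention_inversion(attn: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Invert self-attention over masked query positions (simplified SAI sketch).

    attn: (N, N) row-stochastic self-attention map; mask: (N,) binary query mask.
    """
    out = attn.copy()
    rows = mask.astype(bool)
    inverted = 1.0 - out[rows]                                   # flip activations
    out[rows] = inverted / inverted.sum(axis=1, keepdims=True)   # restore row-stochasticity
    return out

def cross_attention_reassignment(attn: np.ndarray, keep_tokens: list) -> np.ndarray:
    """Zero cross-attention for all but the kept tokens, then re-normalize (CAR sketch).

    attn: (N, T) map over T text tokens; keep_tokens: indices of start/end tokens.
    """
    out = np.zeros_like(attn)
    out[:, keep_tokens] = attn[:, keep_tokens]
    return out / out.sum(axis=1, keepdims=True)
```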
4. Latent Optimization: Loss Formulations and Update Procedure
Two novel loss functions stabilize and guide latent inpainting:
- Cross-Attention Content Loss ($\mathcal{L}_{\text{content}}$): Ensures content-accurate rendering of each character by maximizing cross-attention activations for each target token over its sub-mask region. With $p_i$ the fraction of token $i$'s cross-attention mass falling inside character sub-mask $M_i$, the loss takes the focal form

$$\mathcal{L}_{\text{content}} = \frac{1}{K}\sum_{i=1}^{K} -\left(1 - p_i\right)^{\gamma} \log p_i,$$

employing Focal Loss with focusing parameter $\gamma$.
- Self-Attention Style Loss ($\mathcal{L}_{\text{style}}$): Aligns the distribution of self-attention within the masked region to a normalized reference mask $\bar{M}_{\text{ref}}$ by minimizing the KL divergence

$$\mathcal{L}_{\text{style}} = D_{\mathrm{KL}}\!\left(\bar{M}_{\text{ref}} \,\Vert\, \bar{A}\right),$$

where $\bar{A}$ is the normalized self-attention distribution over the masked region. The total loss is $\mathcal{L} = \lambda_{\text{content}}\,\mathcal{L}_{\text{content}} + \lambda_{\text{style}}\,\mathcal{L}_{\text{style}}$. Latent variables $z_t$ are updated by Adam during the designated optimization stages along the diffusion timeline.
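The focal-loss and KL-divergence formulations can be sketched in NumPy. An autograd framework would be used in practice to backpropagate into the latents; the focal exponent value 2.0 below is a common default, not the benchmark's setting:

```python
import numpy as np

def content_loss(cross_attn: np.ndarray, sub_masks: np.ndarray, gamma: float = 2.0) -> float:
    """Focal-style cross-attention content loss (sketch).

    cross_attn: (K, N) attention of K target tokens over N spatial positions.
    sub_masks:  (K, N) binary character sub-masks.
    """
    total = 0.0
    for a, m in zip(cross_attn, sub_masks):
        p = (a * m).sum() / max(a.sum(), 1e-8)       # attention mass inside the sub-mask
        p = float(np.clip(p, 1e-8, 1.0))
        total += -((1.0 - p) ** gamma) * np.log(p)   # focal term: easy tokens down-weighted
    return total / len(cross_attn)

def style_loss(self_attn: np.ndarray, ref_mask: np.ndarray) -> float:
    """KL divergence from the normalized reference mask to the self-attention distribution."""
    q = self_attn / self_attn.sum()
    r = ref_mask / ref_mask.sum()
    eps = 1e-8
    return float(np.sum(r * (np.log(r + eps) - np.log(q + eps))))
```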
5. End-to-End Testbed Instantiation and Usage
The testbed runtime consists of well-defined pipelines and hyperparameters:
- Sampling steps: a fixed total budget shared by TR and CI
- SAI: applied over an initial fraction of the steps (for TR tasks)
- Latent optimization: performed at designated stages along the timeline
- Adam learning rate and loss weights $\lambda_{\text{content}}$, $\lambda_{\text{style}}$: fixed defaults supplied with the release
Reference pseudocode specifies the operational logic for all task types, leveraging a function such as denoise_step(z, mask, grid, token_embed) to carry out diffusion steps with attention hooks. The pipeline branch depends on the task (TR for erasure, CI for insertion, editing, and transfer), using the provided sample masks, text, and style references; example CLI commands accompany the release for reproducibility.
Generated outputs are benchmarked against ground-truth data using the metrics and crop protocols stipulated above.
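The task branching can be sketched as a driver loop. Here `denoise_step` and `optimize_latent` are passed in as callables standing in for the attention-hooked diffusion step and the latent-optimization routine; the step counts and stage schedule are schematic placeholders, not the released values:

```python
def run_pipeline(task, z, mask, grid, token_embed,
                 denoise_step, optimize_latent,
                 num_steps, sai_frac=0.5, opt_stages=()):
    """Schematic OmniText-Bench driver: TR uses attention edits, CI adds latent optimization.

    sai_frac and opt_stages stand in for the release's schedule, which is not fixed here.
    """
    for t in range(num_steps):
        if task == "removal":
            # TR branch: SAI only during the early steps, CAR throughout.
            hooks = {"car": True, "sai": t < int(sai_frac * num_steps)}
            z = denoise_step(z, mask, grid=None, token_embed=token_embed, hooks=hooks)
        else:
            # CI branch: optimize the latent at designated stages, then denoise.
            if t in opt_stages:
                z = optimize_latent(z, mask, grid, token_embed)  # content + style losses
            z = denoise_step(z, mask, grid=grid, token_embed=token_embed,
                             hooks={"car": False, "sai": False})
    return z
```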
6. Research Significance and Applications
OmniText-Bench establishes a versatile and controllable 1D image-text testbed, directly enabling universal TIM research and development:
- Broader TIM Applicability: Encompasses removal, insertion, arbitrary editing, geometric and stylistic variants.
- Fine-Grained Style Control: Through referential guidance, it supports tasks involving heterogeneous font, stroke, and color transfer.
- Unified Metrics and Protocols: Enables rigorous comparison across methods, including both generalist and specialist architectures.
A plausible implication is that this design paradigm facilitates rapid iteration for both foundational TIM models and downstream applied research in real-world signage, document editing, and graphic design scenarios. All operational details, annotation formats, and benchmarking routines are specified for transparent reproduction.