
OmniText-Bench: 1D Controllable Image-Text Testbed

Updated 24 January 2026
  • The testbed presents a controllable framework for fine-grained text insertion, removal, and editing using diffusion-based models with advanced attention control.
  • It employs precise annotation protocols, including input masks and style references, to ensure reproducible evaluations and benchmark diverse text-image tasks.
  • Standardized metrics such as PSNR, MS-SSIM, and FID rigorously assess both content accuracy and style fidelity in manipulated text regions.

A controllable 1D image-text testbed, exemplified by the OmniText-Bench protocol, provides a structured experimental environment for research in controllable text-image manipulation (TIM). The testbed is designed to evaluate methods for fine-grained, spatially localized text insertion, removal, editing, and style transfer, using precise annotation schemas and standardized evaluation routines. The architecture centers on diffusion-based generative models augmented with advanced attention manipulation and latent optimization, and includes task pipelines for developing, benchmarking, and comparing TIM systems under unified, reproducible conditions.

1. Dataset Structure and Annotation Protocol

Each sample within the OmniText-Bench testbed consists of four primary components:

  • Input Image $(I)$: A $512 \times 512$ RGB image or cropped region that may contain existing text relevant to the specified TIM operation.
  • Target Mask $(M)$: A binary mask ($M(p) = 1$ for pixels $p$ to be removed, edited, or inserted; $0$ otherwise), registered to the input image to identify operative regions.
  • Target Text $(T)$: The ground-truth string specifying the intended textual content of the region of interest.
  • Style Reference $(I_{\mathrm{ref}}, m_{\mathrm{ref}})$: Optional. An image and mask denoting the region(s) whose text style (font, color, stroke) serves as the target for style-aware manipulation. For tasks without explicit style transfer, $(I_{\mathrm{ref}}, m_{\mathrm{ref}}) = (I, M)$.

Annotations are organized by sample directory, with files input.png (image), mask.png (mask), text.txt (target text), and—if necessary—ref.png and ref_mask.png for style reference.
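The directory layout above can be read with a short loader. This is a minimal sketch: the function name is illustrative, and the fallback for missing style references follows the $(I_{\mathrm{ref}}, m_{\mathrm{ref}}) = (I, M)$ convention stated above.

```python
import tempfile
from pathlib import Path

def load_sample(sample_dir):
    """Load one sample from the per-directory layout described above."""
    d = Path(sample_dir)
    sample = {
        "image": d / "input.png",
        "mask": d / "mask.png",
        "text": (d / "text.txt").read_text().strip(),
    }
    ref, ref_mask = d / "ref.png", d / "ref_mask.png"
    if ref.exists() and ref_mask.exists():
        sample["style_ref"] = (ref, ref_mask)
    else:
        # Default for tasks without explicit style transfer:
        # (I_ref, m_ref) = (I, M).
        sample["style_ref"] = (sample["image"], sample["mask"])
    return sample

# Demo on a dummy sample directory (placeholder bytes stand in for images).
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "input.png").write_bytes(b"")
    (d / "mask.png").write_bytes(b"")
    (d / "text.txt").write_text("OPEN 24/7")
    s = load_sample(d)

print(s["text"])                                  # OPEN 24/7
print(s["style_ref"] == (s["image"], s["mask"]))  # True
```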

Task mappings to these elements are as follows:

| Task | Input $I$ | Target $T$ | Mask $M$ | Style Reference |
| --- | --- | --- | --- | --- |
| Text Removal | Contains text | "" (empty) | Where text is | None |
| Text Insertion | Blank in $M$ | New string | Where to insert | Optional |
| Text Editing | Old text in $M$ | New string | Where to edit | Optional |
| Text Rescaling | Old text present | New string | Rescaled region | Optional |
| Text Repositioning | Original text | Unchanged | New position | Optional |
| Style-Based Operation | As above | Varies | Varies | Required |

This structure enables isolation and control of content and style manipulations in both input and ground-truth specifications.

2. Evaluation Metrics

Evaluation is conducted using standardized image-based and text-based metrics:

  • Text Removal: Assessed over the full output image using
    • PSNR (Peak Signal-to-Noise Ratio; higher is better)
    • MS-SSIM (Multi-Scale Structural Similarity; higher is better)
    • FID (Fréchet Inception Distance; lower is better)
  • Text Insertion, Editing, Rescaling, Repositioning: Assessed on cropped regions defined by $M$ or $m_{\mathrm{ref}}$, using
    • Content Accuracy: Word-level accuracy (ACC) and Normalized Edit Distance (NED), computed via scene-text OCR.
    • Style Fidelity: Pixel-wise MSE, PSNR, MS-SSIM, and FID.

For style-aware tasks, fidelity metrics are computed over style-referenced insertions, ensuring that both semantic content and style attributes are jointly evaluated.
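As an illustration of the content-accuracy metrics, a plain-Python Normalized Edit Distance can be sketched as follows. This uses one common normalization (edit distance divided by the longer string's length); the benchmark's exact OCR model and normalization convention may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, target: str) -> float:
    """Normalized Edit Distance between OCR output and the target text."""
    if not pred and not target:
        return 0.0
    return levenshtein(pred, target) / max(len(pred), len(target))

print(ned("SALE", "SALE"))  # 0.0 (exact match)
print(ned("SALE", "SAL3"))  # 0.25 (one substitution over four characters)
```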

3. Model Architecture and Attention Mechanisms

OmniText builds upon a pre-trained latent-diffusion text-inpainting U-Net backbone (TextDiff-2), incorporating two dedicated operational modules:

  • Text Removal (TR), performed via attention modulation at sampling time:

    • Self-Attention Inversion (SAI): Inverts the self-attention activations of each U-Net decoder block over masked regions during early sampling steps, suppressing residual focus on the removed text.
    • Cross-Attention Reassignment (CAR): Constrains cross-attention within the masked region to align only with the prompt's start/end tokens.

    SAI is applied during an initial fraction of the sampling steps; CAR is applied at all steps.
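The two attention modulations can be illustrated on toy row-stochastic attention maps. This NumPy sketch is not the benchmark's implementation; the complement-and-renormalize form of the inversion is an assumption consistent with the description above.

```python
import numpy as np

def invert_self_attention(attn, mask):
    """SAI sketch: for query positions inside the text mask, replace the
    attention weights with their complement and renormalize, so masked
    queries stop focusing on the text being removed.

    attn: (N, N) row-stochastic self-attention map.
    mask: (N,) boolean array, True at masked (text) positions.
    """
    out = attn.copy()
    inv = 1.0 - out[mask]
    out[mask] = inv / (inv.sum(axis=1, keepdims=True) + 1e-8)
    return out

def reassign_cross_attention(attn, start_idx, end_idx):
    """CAR sketch: zero out all prompt-token columns except the start and
    end tokens, then renormalize each row.

    attn: (N, T) row-stochastic cross-attention map over T prompt tokens.
    """
    out = np.zeros_like(attn)
    out[:, [start_idx, end_idx]] = attn[:, [start_idx, end_idx]]
    return out / (out.sum(axis=1, keepdims=True) + 1e-8)

# Toy demo: 4 spatial positions, 5 prompt tokens.
rng = np.random.default_rng(0)
sa = rng.random((4, 4)); sa /= sa.sum(axis=1, keepdims=True)
ca = rng.random((4, 5)); ca /= ca.sum(axis=1, keepdims=True)

sa_mod = invert_self_attention(sa, np.array([True, True, False, False]))
ca_mod = reassign_cross_attention(ca, start_idx=0, end_idx=4)
print(np.allclose(sa_mod.sum(axis=1), 1.0, atol=1e-6))  # True: still stochastic
print(np.allclose(ca_mod[:, 1:4], 0.0))                 # True: middle tokens off
```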

  • Controllable Inpainting (CI), enabled by latent-space optimization:

    1. Grid construction: a grid of character-level sub-masks is built from the target mask (a possibly shrunk mask in the editing case).
    2. Latent optimization: during an early portion of the diffusion steps, the content and style losses (Section 4) are computed at each step and their gradients are backpropagated to the latent $z_t$.

The combined use of direct attention manipulation and optimization-driven inpainting delivers fine-grained control over both appearance and semantics in the manipulated text regions.

4. Latent Optimization: Loss Formulations and Update Procedure

Two novel loss functions stabilize and guide latent inpainting:

  • Cross-Attention Content Loss: Ensures content-accurate rendering of each character by maximizing the cross-attention activation of each target token over its character sub-mask region, using a Focal Loss formulation that down-weights tokens that are already attended correctly.

  • Self-Attention Style Loss: Aligns the distribution of self-attention within the masked region to a normalized reference mask by minimizing the KL divergence between the two distributions.

The total loss is a weighted sum of the content and style terms. Latent variables are updated by Adam during the specified optimization stages along the diffusion timeline.
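The two losses can be sketched in NumPy under stated assumptions: the focal parameter `gamma`, the clipping constants, and the KL direction below are illustrative defaults, since the benchmark's exact formulations are not reproduced here.

```python
import numpy as np

def focal_content_loss(attn_token, sub_mask, gamma=2.0):
    """Cross-attention content loss sketch: push attention for a target
    token toward 1 inside its character sub-mask, with a focal-style
    weighting (gamma is an illustrative default).

    attn_token: (H, W) cross-attention map for one token, values in [0, 1].
    sub_mask:   (H, W) binary character sub-mask.
    """
    p = np.clip(attn_token[sub_mask.astype(bool)], 1e-6, 1.0)
    return float(np.mean(-((1.0 - p) ** gamma) * np.log(p)))

def kl_style_loss(attn_region, ref_mask):
    """Self-attention style loss sketch: KL divergence from the normalized
    reference mask to the attention distribution over the masked region."""
    q = attn_region / attn_region.sum()
    r = ref_mask / ref_mask.sum()
    eps = 1e-8
    return float(np.sum(r * np.log((r + eps) / (q + eps))))

# Perfect attention inside the mask -> zero content loss; identical
# distributions -> (near-)zero KL style loss.
print(focal_content_loss(np.ones((8, 8)), np.ones((8, 8))) == 0.0)   # True
print(abs(kl_style_loss(np.ones((8, 8)), np.ones((8, 8)))) < 1e-6)   # True
```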

5. End-to-End Testbed Instantiation and Usage

The testbed runtime consists of well-defined pipelines and hyperparameters:

  • Sampling steps: a fixed total budget shared across TR and CI
  • SAI steps: applied over an initial fraction of sampling (for TR tasks)
  • Latent optimization: performed at designated stages of the diffusion timeline
  • Optimizer: Adam, with fixed learning rate and fixed loss weights

Reference pseudocode specifies the operational logic for all task types, built around a function such as denoise_step(z, mask, grid, token_embed) that carries out diffusion steps with attention hooks. Pipeline branches depend on the task—TR for erasure, CI for insertion/editing/style transfer—using the provided sample masks, text, and style references. Example CLI commands are provided for reproducibility.
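The task dispatch described above can be sketched with stub components. Every function here is a stand-in (a real system would plug in the diffusion backbone), and the step fraction and optimization stages are illustrative, not the benchmark's values.

```python
# Trivial stand-ins so the control flow can be exercised end to end.
def init_latent(image, mask):     return {"steps": 0, "sai": 0, "opt": 0}
def build_grid(sample):           return None
def embed(text):                  return None
def decode(z):                    return z
def is_optimization_stage(t, n):  return t in (0, n // 2)  # illustrative stages
def latent_opt_step(z, sample):   z["opt"] += 1; return z

def denoise_step(z, mask, grid=None, token_embed=None, hooks=None):
    """Stub diffusion step; counts calls and active SAI hooks."""
    z["steps"] += 1
    if hooks and hooks.get("sai"):
        z["sai"] += 1
    return z

def run_pipeline(task, sample, num_steps=50, sai_frac=0.5):
    """Dispatch sketch: TR uses attention hooks (SAI early, CAR at every
    step); all other tasks run CI with periodic latent optimization."""
    z = init_latent(sample["image"], sample["mask"])
    for t in range(num_steps):
        if task == "removal":
            hooks = {"car": True, "sai": t < int(sai_frac * num_steps)}
            z = denoise_step(z, sample["mask"], hooks=hooks)
        else:  # insertion / editing / rescaling / repositioning / style
            z = denoise_step(z, sample["mask"], grid=build_grid(sample),
                             token_embed=embed(sample["text"]))
            if is_optimization_stage(t, num_steps):
                z = latent_opt_step(z, sample)
    return decode(z)

sample = {"image": None, "mask": None, "text": "NEW TEXT"}
tr = run_pipeline("removal", sample, num_steps=10)
ci = run_pipeline("editing", sample, num_steps=10)
print(tr["steps"], tr["sai"])  # 10 5
print(ci["steps"], ci["opt"])  # 10 2
```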

Generated outputs are benchmarked against ground-truth data using the metrics and crop protocols stipulated above.

6. Research Significance and Applications

OmniText-Bench establishes a versatile and controllable 1D image-text testbed, directly enabling universal TIM research and development:

  • Broader TIM Applicability: Encompasses removal, insertion, arbitrary editing, geometric and stylistic variants.
  • Fine-Grained Style Control: Through referential guidance, it supports tasks involving heterogeneous font, stroke, and color transfer.
  • Unified Metrics and Protocols: Enables rigorous comparison across methods, including both generalist and specialist architectures.

A plausible implication is that this design paradigm facilitates rapid iteration for both foundational TIM models and downstream applied research in real-world signage, document editing, and graphic design scenarios. All operational details, annotation formats, and benchmarking routines are specified for transparent reproduction.
