UI-Styler: Ultrasound Image Style Transfer

Updated 28 November 2025

UI-Styler is a specialized framework for ultrasound image style transfer that employs dual-level stylization and class-aware prompting to address cross-device diagnostic variances.
It integrates a three-stage pipeline using Vision Transformer encoders, pattern-matching modules, and prompt-based supervision to ensure semantic consistency.
Experimental results demonstrate improved classification and segmentation performance, outperforming conventional unpaired image translation methods.

UI-Styler is a specialized framework for ultrasound image style transfer designed to address cross-device domain shifts in medical imaging diagnostics, with a distinct focus on class-aware semantic alignment under a black-box inference network constraint (Do-Tran et al., 21 Nov 2025). Its architecture introduces novel dual-level stylization and prompt-based supervision, delivering substantive improvements over prior unpaired image translation (UIT) approaches in both distributional alignment and downstream diagnostic accuracy.

1. Problem Setting and Motivation

Diagnostic ultrasound images exhibit substantial inter-device appearance variation due to hardware differences, acquisition protocols, and preprocessing. Such domain shifts frequently degrade the performance of machine learning models, particularly when these models are treated as frozen black boxes, i.e., their parameters, gradients, and intermediate outputs are inaccessible and only final predictions can be observed. Standard UIT techniques, which seek to match appearance distributions across domains, do not account for semantic consistency with downstream tasks and often produce misaligned class-content mappings. UI-Styler addresses this directly by enforcing class-specific semantic alignment during image translation, using class-aware prompts derived from pseudo labels produced by the black-box inference network (Do-Tran et al., 21 Nov 2025).

2. Core Architecture and Pipeline

The UI-Styler framework is structured as a three-stage pipeline:

Feature Extraction: Parallel Vision Transformer (ViT) encoders $E_s$ and $E_t$ map source and target images $x_s, x_t \in\mathbb{R}^{H \times W \times 3}$ to patch embeddings $F_s, F_t \in\mathbb{R}^{L \times d}$ , with $L = (H/P) \cdot (W/P)$ and embedding dimension $d$ .
Dual-Level Stylization:
- Pattern-Matching (PM) Module: Applies cross-attention for domain-level style injection, aligning patchwise source and target statistics. Each attention head computes a query-key-value transformation between $F_s$ and $F_t$ , outputting stylized embeddings $\widetilde{F}_{s\to t}$ .
- Class-Aware Prompting (CP) Module: Introduces a prompt bank $P \in \mathbb{R}^{C \times L \times d}$ , where each prototype $P_c$ encodes the visual characteristics associated with category $c$ . Features $\widetilde{F}_{s\to t}$ are shifted toward class-aligned prototypes via inner-product scoring and prompt addition, resulting in class-semantic stylized features $\widetilde{F}_{s\to t}^+$ .
Image Reconstruction: A lightweight decoder $D$ upsamples and maps $\widetilde{F}_{s\to t}^+$ back to the image domain, yielding the stylized source $\tilde{x}_s = D(\widetilde{F}_{s\to t}^+)$ (Do-Tran et al., 21 Nov 2025).
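The token count and the two stylization steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single attention head, the absence of residual connections and normalization, and the hard arg-max choice of prototype are all simplifications introduced here, and the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

def num_tokens(height, width, patch):
    """L = (H/P) * (W/P); assumes H and W are divisible by P."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pattern_matching(f_s, f_t, w_q, w_k, w_v):
    """Single-head cross-attention: queries from source tokens,
    keys and values from target tokens (assumed simplification)."""
    q, k, v = f_s @ w_q, f_t @ w_k, f_t @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (L, L)
    return attn @ v                                  # (L, d)

def class_aware_prompt(f_st, prompts):
    """Score each class prototype by inner product with the stylized
    tokens and add the best-scoring one (arg-max is an assumption)."""
    scores = np.einsum("ld,cld->c", f_st, prompts)   # (C,)
    c = int(np.argmax(scores))
    return f_st + prompts[c], c

# Toy sizes for illustration only (not the paper's settings).
rng = np.random.default_rng(0)
L, d, C = num_tokens(64, 64, 8), 16, 2
f_s, f_t = rng.normal(size=(L, d)), rng.normal(size=(L, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
f_st = pattern_matching(f_s, f_t, w_q, w_k, w_v)
f_plus, chosen = class_aware_prompt(f_st, rng.normal(size=(C, L, d)))
print(f_st.shape, f_plus.shape)
```

Both outputs keep the $L \times d$ token layout, which is what allows a decoder to map them back to an image grid.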

3. Class-Aware Prompting and Supervision

Class-aware prompting operates as follows:

Pseudo-Label Assignment: Each target image $x_t$ is pseudo-labeled using the frozen black-box network (BDM), yielding a hard class assignment $\hat{y}_t \in \{1, ..., C\}$ .
Prompt Assignment: During prompt learning, each $F_t$ is associated with its corresponding $P_{\hat{y}_t}$ . For stylized source features, the prompt scoring enforces alignment toward the same prototype.
Prompt Losses:
- Direction Loss: Encourages prompt correlation vector $a = \sigma(E_f(Z) E_p(P)^\top)$ to align with the one-hot encoding of $\hat{y}_t$ :

$\mathcal{L}_{\mathrm{dir}} = -\frac{1}{C}\sum_{c=1}^C [\hat{y}_c \log a_c + (1-\hat{y}_c)\log(1-a_c)]$

- Supervised Prompt Loss: After prompt addition and a classifier head $H$ :

$p = \mathrm{softmax}(H(F_t + P_{\hat{y}_t})) \qquad \mathcal{L}_{\mathrm{sup}} = -\sum_{c=1}^C \hat{y}_c \log p_c$

These operate in tandem to steer features toward the semantic boundary of the BDM (Do-Tran et al., 21 Nov 2025).
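As a numerical illustration of the two prompt losses, the sketch below evaluates them on raw score vectors. It assumes the pre-sigmoid correlation scores and the classifier-head logits are already available as length-$C$ vectors; the function names are hypothetical, and class indices are 0-based here, whereas the text uses $1, \dots, C$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def direction_loss(scores, label, eps=1e-12):
    """Mean binary cross-entropy between a = sigmoid(scores) and the
    one-hot encoding of the pseudo label (0-based index)."""
    a = sigmoid(np.asarray(scores, dtype=float))
    y = np.zeros_like(a)
    y[label] = 1.0
    return float(-np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps)))

def supervised_prompt_loss(head_logits, label):
    """Cross-entropy of softmax(head_logits) against the hard pseudo label."""
    z = np.asarray(head_logits, dtype=float)
    z = z - z.max()                      # stabilise the log-sum-exp
    log_p = z - np.log(np.exp(z).sum())
    return float(-log_p[label])

# Scores that agree with the pseudo label give a smaller loss than
# scores that contradict it.
print(direction_loss([5.0, -5.0], 0), direction_loss([-5.0, 5.0], 0))
print(supervised_prompt_loss([5.0, -5.0], 0), supervised_prompt_loss([-5.0, 5.0], 0))
```

With uninformative scores (all zeros) and $C = 2$, both losses equal $\log 2$, which is a convenient sanity check.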

4. Objective Functions and Training Protocol

The overall loss for joint training is:

$\mathcal{L}_{\text{total}} = \lambda_{\text{dir}}\mathcal{L}_{\text{dir}} + \lambda_{\text{sup}}\mathcal{L}_{\text{sup}} + \lambda_{c}\mathcal{L}_c + \lambda_{s}\mathcal{L}_s$

with all $\lambda$ weights set to 1 in reported experiments. Key loss components include:

Content Loss $\mathcal{L}_c$ : Ensures source structure preservation by minimizing the $\ell_2$ distance between $E_s(x_s)$ and $E_s(\tilde{x}_s)$ .
Style Loss $\mathcal{L}_s$ : Matches Gram matrices (texture) at decoder layers for $\tilde{x}_s$ and a randomly sampled $F_t$ .
Prompt Losses $\mathcal{L}_{\text{dir}}$ , $\mathcal{L}_{\text{sup}}$ : Enforce semantic alignment as described above.
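The content and style terms and their combination can be sketched as below. This is an assumed, simplified form: the $\ell_2$ distance is taken as a mean squared error, the Gram matrix is normalized by the token count, and per-layer features are passed in as plain lists of $L \times d$ arrays. None of these choices or function names are confirmed by the source.

```python
import numpy as np

def content_loss(f_src, f_stylized):
    """Mean squared distance between encoder features of x_s and the
    stylized output (an assumed form of the l2 content term)."""
    return float(np.mean((f_src - f_stylized) ** 2))

def gram(features):
    """Gram matrix of (L, d) features, normalised by the token count."""
    return features.T @ features / features.shape[0]

def style_loss(feats_stylized, feats_target):
    """Sum over layers of the mean squared Gram-matrix difference."""
    return float(sum(np.mean((gram(a) - gram(b)) ** 2)
                     for a, b in zip(feats_stylized, feats_target)))

def total_loss(l_dir, l_sup, l_c, l_s,
               lam_dir=1.0, lam_sup=1.0, lam_c=1.0, lam_s=1.0):
    """Weighted sum of the four terms; all weights default to 1,
    matching the setting reported in the text."""
    return lam_dir * l_dir + lam_sup * l_sup + lam_c * l_c + lam_s * l_s

rng = np.random.default_rng(0)
a, b = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
print(total_loss(0.3, 0.2, content_loss(a, b), style_loss([a], [b])))
```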

Network details: patch size $P=8$ , token dimension $d=512$ , three ViT blocks per encoder, Adam optimizer with learning rate $5\times 10^{-4}$ , batch size 8, and 50K training iterations. The decoder comprises three upsampling-plus-convolution layers (Do-Tran et al., 21 Nov 2025).
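For reference, the reported hyperparameters can be collected in a small configuration object. The class name `UIStylerConfig`, its field names, the 256×256 input size in the usage lines, and the assumption that each decoder stage upsamples by a factor of 2 are all illustrative and do not come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UIStylerConfig:
    """Hypothetical container for the hyperparameters reported in the text."""
    patch_size: int = 8
    embed_dim: int = 512
    vit_blocks: int = 3
    learning_rate: float = 5e-4
    batch_size: int = 8
    iterations: int = 50_000
    decoder_stages: int = 3

    def tokens(self, height: int, width: int) -> int:
        """Token count L = (H/P) * (W/P) for an H x W input."""
        if height % self.patch_size or width % self.patch_size:
            raise ValueError("image size must be divisible by the patch size")
        return (height // self.patch_size) * (width // self.patch_size)

    def decoder_scale(self, factor_per_stage: int = 2) -> int:
        """Total upsampling factor, ASSUMING each stage upsamples by 2."""
        return factor_per_stage ** self.decoder_stages

cfg = UIStylerConfig()
print(cfg.tokens(256, 256))   # 256x256 is an assumed input size
print(cfg.decoder_scale())
```

Under the 2× per-stage assumption, three decoder stages give an overall factor of 8, which matches the patch size and would restore the token grid to the input resolution.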

5. Black-Box Inference Constraint

A defining feature of UI-Styler is its strict integration with a frozen downstream black-box model:

No Gradient or Logit Access: Prompt targets are derived exclusively from BDM predictions; no end-to-end gradient information or confidence scores are available.
Semantic Regularization: The only constraint to maintain diagnostic accuracy is encoded in the prompt mechanism, which guides the stylized features toward the correct BDM decision boundary.
Inference Protocol: During deployment, stylized images are directly evaluated by the same frozen BDM for classification or segmentation without any model retraining (Do-Tran et al., 21 Nov 2025).
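The black-box constraint amounts to treating the downstream model as an opaque function from image to class index. A minimal sketch, in which `pseudo_label` and `toy_black_box` are hypothetical names and the thresholding rule is only a stand-in for a real diagnostic model:

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")

def pseudo_label(images: Sequence[Image],
                 predict: Callable[[Image], int]) -> List[Tuple[Image, int]]:
    """Query the frozen black box once per image and keep only the hard
    class index; no logits, confidences, or gradients are requested."""
    return [(image, int(predict(image))) for image in images]

def toy_black_box(image) -> int:
    """Stand-in for the frozen model (assumption for illustration):
    thresholds mean intensity and returns a class index only."""
    flat = [pixel for row in image for pixel in row]
    return 1 if sum(flat) / len(flat) > 0.5 else 0

bright = [[0.9, 0.8], [0.7, 0.9]]
dark = [[0.1, 0.2], [0.0, 0.1]]
labels = [label for _, label in pseudo_label([bright, dark], toy_black_box)]
print(labels)
```

Because only `predict` is called, the same interface serves both pseudo-label generation on target images during training and evaluation of stylized images at deployment.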

6. Experimental Evaluation

The experimental setup uses four breast-ultrasound datasets (BUSBRA, BUSI, UCLM, and UDIAT), organized into 12 cross-domain adaptation tasks (one per ordered source→target pair, with 70/30 train/test splits). Evaluation metrics include:

Distribution Alignment: Kernel Inception Distance (KID) between stylized source and target images.
Classification Performance: Accuracy and AUC on the black-box downstream model.
Segmentation Performance: Dice coefficient and IoU using SAMUS segmentation on stylized images.
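The task count and the segmentation metrics can be illustrated briefly. Four datasets yield $4 \times 3 = 12$ ordered source-to-target pairs; the `dice_iou` helper below is a hypothetical name using the standard definitions on binary masks, with the convention (an assumption here) that two empty masks score 1.

```python
from itertools import permutations
import numpy as np

DATASETS = ["BUSBRA", "BUSI", "UCLM", "UDIAT"]
# Every ordered (source, target) pair of distinct datasets.
TASKS = [f"{src}->{tgt}" for src, tgt in permutations(DATASETS, 2)]

def dice_iou(pred, target):
    """Dice coefficient and IoU for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dice = 2.0 * inter / total if total else 1.0   # empty masks: score 1
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

print(len(TASKS))
print(dice_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

In the toy example the masks overlap in one pixel out of three marked pixels in total, giving a Dice of 2/3 and an IoU of 1/2.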

UI-Styler outperforms state-of-the-art UIT baselines (TransColor, S2WAT, Mamba-ST), attaining the lowest KID in 10 out of 12 tasks. Gains in downstream metrics are substantial: classification accuracy increases by 2–12 percentage points and the Dice coefficient by 1–3 points. For instance, on UCLM→BUSI, accuracy improves from 75.0% (prior best) to 80.0%, and Dice from 77.11 to 80.22.

An ablation study isolates the contributions of PM and CP: PM alone reduces KID by 20–40% and boosts AUC by 5–8 points; adding CP yields a further 3–5 points in accuracy and 1–2 points in Dice. t-SNE analysis indicates that PM reduces the domain gap but leaves class clusters near the BDM decision boundary, whereas CP aligns samples more distinctly with their correct classes. Images stylized by UI-Styler also receive higher BDM confidence (median >0.8) with reduced prediction variance (Do-Tran et al., 21 Nov 2025).

7. Practical Considerations and Limitations

Key implementation considerations:

The implementation relies on patch-based ViT encoding ( $P=8$ ), high-dimensional token embeddings ( $d=512$ ), the Adam optimizer, and a convolutional upsampling decoder.
Prompts are global class prototypes; extending to multi-scale or spatially localized prompts is suggested for finer structure preservation.
Although demonstrated on binary breast-ultrasound tasks, UI-Styler is expected to extend to multi-class settings and other imaging modalities by increasing $C$ and the prompt bank's capacity (Do-Tran et al., 21 Nov 2025).

Limiting factors include the potentially insufficient granularity of global prompting and the reliance on the quality of the BDM's pseudo labels, since no ground-truth labels from the target domain are used. Extensions could include multi-scale prompting and broader modality adaptation.

For related research on non-medical UI style transfer and citizen-led personalization in user interface design, see "ImagineNet: Restyling Apps Using Neural Style Transfer" (Fischer et al., 2020) and "Citizen-Led Personalization of User Interfaces" (Alves et al., 2024). However, UI-Styler remains unique in its class-aware, black-box-constrained UIT approach tailored for diagnostic imaging.