
TeleStyle: Content-Preserving Style Transfer

Updated 4 February 2026
  • TeleStyle is a content-preserving image and video style transfer model that disentangles style cues to maintain precise content details.
  • It employs lightweight LoRA modules within a DiT backbone and a multi-stage curriculum learning framework to achieve state-of-the-art performance.
  • The model supports both image and video stylization, ensuring high temporal consistency and enhanced aesthetic quality in outputs.

TeleStyle is a content-preserving image and video style transfer model that generates stylized outputs from paired content and style references. It addresses the central challenge in Diffusion Transformers (DiTs), the entanglement of content and style information in their latent representations, by isolating and routing style cues while maintaining precise content fidelity. TeleStyle is implemented as a lightweight extension of Qwen-Image-Edit, inheriting that model's robust content retention while adding effective style modulation, and is trained on a hybrid dataset of curated and synthetic triplets using a multi-stage curriculum continual learning framework. Both image and video stylization are supported, with a specialized video-to-video module ensuring high temporal consistency. TeleStyle achieves state-of-the-art results across quantitative and qualitative benchmarks for style similarity, content preservation, and perceptual aesthetics (Zhang et al., 28 Jan 2026).

1. Model Architecture and Disentanglement

TeleStyle’s architecture builds upon the Qwen-Image-Edit transformer, which itself employs an MMDiT-based Diffusion Transformer backbone augmented with Multi-Scale Rotary Positional Embeddings (MS-RoPE) to facilitate processing of multiple reference images. The model ingests a content reference image and a style reference image, each processed by a distinct “patch embedder” network (a lightweight convolution-projection module) that encodes the input image into a token sequence $Z_{\rm content}, Z_{\rm style} \in \mathbb{R}^{N \times d}$. These are concatenated channel-wise with the diffusion latent variable $x_t$ and any accompanying (potentially empty) text tokens and supplied to a stack of $N$ DiT blocks.
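The patch-embedding and token-assembly step above can be sketched as follows. The function names, patch size, and the NumPy matrix standing in for the learned convolution-projection are illustrative assumptions, not the released implementation; the text branch is omitted for brevity.

```python
import numpy as np

def patch_embed(image: np.ndarray, patch: int, weight: np.ndarray) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and project each
    flattened patch to a d-dimensional token, yielding an (N, d) sequence."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return x @ weight  # (N, d)

def assemble_tokens(z_content: np.ndarray, z_style: np.ndarray,
                    x_t: np.ndarray) -> np.ndarray:
    """Channel-wise concatenation of the conditioning streams with the noisy
    latent tokens before they enter the DiT blocks."""
    return np.concatenate([z_content, z_style, x_t], axis=-1)
```

With an 8×8×3 input and a patch size of 4, `patch_embed` yields N = 4 tokens, and concatenating three d = 16 streams channel-wise produces a (4, 48) tensor.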

A central mechanism in TeleStyle is the use of low-rank adaptation (LoRA) modules (rank = 32) inserted into cross-attention and feed-forward layers. These adapters concentrate style transfer learning within the velocity prediction head $v_\theta(\cdot)$, leaving the broader content-preservation pathway of the frozen base model undisturbed. This enables effective disentanglement of style and content features, overcoming a core limitation of prior DiT-based stylization frameworks.
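A minimal sketch of such an adapter, using NumPy and the standard LoRA parameterization (frozen base weight plus a trainable low-rank update, with a zero-initialized up-projection so the adapted layer matches the base exactly at the start of training). Class and attribute names are illustrative:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update A @ B,
    scaled by alpha / rank (the standard LoRA parameterization)."""

    def __init__(self, w: np.ndarray, rank: int = 32,
                 alpha: float = 32.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_in, _ = w.shape
        self.w = w                                  # frozen base weight
        self.a = rng.normal(0.0, 0.02, (d_in, rank))  # trainable down-projection
        self.b = np.zeros((rank, w.shape[1]))         # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w + self.scale * (x @ self.a @ self.b)
```

Because `b` starts at zero, the adapter initially computes exactly the frozen layer's output; only the low-rank path receives gradient updates during stylization training.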

For video stylization, TeleStyle extends the Wan2.1-1.3B DiT backbone (as used in FullDiT), adopting a similar dual-patch-embedder interface. The positional encoding scheme assigns temporal index 0 to style anchors (the style reference or stylized key frame) and increments through subsequent source video frames (indices $1, \ldots, T-1$), allowing the Transformer’s learned positional dynamics to propagate style coherently across time.
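The temporal indexing scheme can be illustrated with a small helper that maps each token to its temporal index: 0 for every style-anchor token, then index i for all tokens of source frame i. The function name and per-frame token counts are hypothetical:

```python
def assign_time_indices(style_len: int, frame_lens: list[int]) -> list[int]:
    """Per-token temporal index: 0 for every style-anchor token, then index i
    for all tokens belonging to source frame i (i = 1..T-1, in order)."""
    indices = [0] * style_len
    for i, n in enumerate(frame_lens, start=1):
        indices += [i] * n
    return indices
```

For example, a style anchor of 2 tokens followed by two frames of 3 tokens each yields `[0, 0, 1, 1, 1, 2, 2, 2]`, anchoring style at index 0 ahead of the temporally ordered frames.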

2. Dataset Construction and Triplet Synthesis

Training robust style transfer models requires diverse and well-matched triplets of content, style, and result images. TeleStyle’s training data comprises:

  • Clean (Curated) Triplets: $D_{\rm collected}$, drawn from sources including OmniConsistency, GPT-4o-generated examples, and manually vetted LoRA community outputs. After intensive manual filtering, this dataset yields 300,000 triplets covering 30 distinct artistic styles, such as oil, watercolor, and ukiyo-e.
  • Noisy (Synthetic) Triplets: $D_{\rm synthetic}$, comprising 1 million triplets synthesized via a reverse pipeline that starts from an in-the-wild stylized target image $I_{\rm target}$. A photorealistic content reference $I_{\rm content}$ is generated with FLUX, and the corresponding style reference $I_{\rm style}$ is extracted using the CDST method and DINOv2 descriptors. Triplets are completed with randomly sampled textual prompts.

Combined, these datasets encompass thousands of style clusters spanning classical, modern, and digital genres. During preprocessing, content references are resized with preserved aspect ratio so that the shorter edge measures 1024 pixels, while style references are center-cropped to squares.
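The stated preprocessing reduces to simple geometry; a sketch with hypothetical helper names (a real pipeline would feed these results into an image library's resize and crop operations):

```python
def content_resize_dims(h: int, w: int, min_edge: int = 1024) -> tuple[int, int]:
    """Target (height, width) that preserves aspect ratio while scaling the
    shorter edge of a content reference up (or down) to min_edge pixels."""
    scale = min_edge / min(h, w)
    return round(h * scale), round(w * scale)

def style_center_crop_box(h: int, w: int) -> tuple[int, int, int, int]:
    """Square center crop of side min(H, W) for a style reference:
    returns (top, left, height, width)."""
    side = min(h, w)
    return (h - side) // 2, (w - side) // 2, side, side
```

A 512×1024 content image scales to 1024×2048, and a 600×800 style image is cropped to the centered 600×600 square.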

3. Curriculum Continual Learning Paradigm

TeleStyle leverages a curriculum continual learning framework to maximize both style generalization and content fidelity through three sequential training stages (denoted by LoRA weight sets $Q_1$, $Q_2$, $Q_3$):

  1. Stage 1: Capability Activation. LoRA parameters $Q_1$ are trained on the full collected dataset $D_1 = D_{\rm collected}$ to acquire general style transfer ability.
  2. Stage 2: Content Fidelity Refinement. The network, initialized from $Q_1$, is fine-tuned on a reweighted subset $D_2$ favoring high-fidelity content preservation (notably facial characteristics and fine structures), producing $Q_2$.
  3. Stage 3: Robust Generalization. A mixture dataset $D_3$, formed by blending $D_2$ with approximately 5% of $D_{\rm synthetic}$, is used to train further from $Q_2$, yielding $Q_3$ with improved cross-domain style generalization.
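The Stage-3 data blend can be sketched as follows. The 5% synthetic fraction follows the text; the helper name and the uniform-random sampling scheme are assumptions:

```python
import random

def build_stage3_mixture(d2: list, d_synthetic: list,
                         synth_frac: float = 0.05, seed: int = 0) -> list:
    """Stage-3 training set D3: all of the reweighted clean data D2 plus a
    small random slice (~5%) of the synthetic triplets D_synthetic."""
    rng = random.Random(seed)
    k = int(len(d_synthetic) * synth_frac)
    return d2 + rng.sample(d_synthetic, k)
```

Blending a 300K-item clean set with 5% of a 1M-item synthetic pool would thus add roughly 50K noisy triplets, nudging the model toward cross-domain robustness without swamping the curated signal.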

The learning objective in each phase is a rectified flow-matching loss:

$$L_{\rm flow} = \mathbb{E}_{t,\,\epsilon \sim \mathcal{N}(0, I)} \left\| v_\theta(x_t, t, c_{\rm style}, c_{\rm content}, c_P) - (\epsilon - x_0) \right\|_2^2$$

where $x_t$ is the noisy latent at time $t$, $x_0$ is the target latent, $c_P$ is a fixed prompt embedding describing the style transfer task, and $v_\theta(\cdot)$ is the predicted velocity field. Optimization is performed via AdamW with decoupled weight decay.
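A NumPy sketch of this objective for a single sample. In real training the DiT serves as `v_pred_fn` and the loss is batched over latents; everything here, including the function names, is illustrative:

```python
import numpy as np

def flow_matching_loss(x0: np.ndarray, v_pred_fn, rng=None) -> float:
    """Rectified flow-matching loss for one sample: draw t ~ U(0,1) and
    eps ~ N(0, I), build the interpolant x_t = (1 - t) * x0 + t * eps,
    and regress the predicted velocity onto the target (eps - x0)."""
    rng = rng or np.random.default_rng(0)
    t = rng.uniform()
    eps = rng.normal(size=x0.shape)
    x_t = (1 - t) * x0 + t * eps
    target = eps - x0
    v = v_pred_fn(x_t, t)
    return float(np.mean((v - target) ** 2))
```

As a sanity check, when $x_0 = 0$ the interpolant is $x_t = t\,\epsilon$, so a predictor returning $x_t / t$ recovers the target velocity $\epsilon$ exactly and drives the loss to zero.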

4. Video-to-Video Stylization and Temporal Consistency

The TeleStyle-Video module begins with a stylized key frame ($\hat{v}_0 \approx I_{\rm style}$) alongside source video frames $\{v_1, \ldots, v_{T-1}\}$. Separate patch embedders generate feature tokens for the style and for each video frame, which, together with the corresponding noisy latents, are processed by the DiT architecture. Positional encoding assigns index 0 to the style and increments through the temporally ordered frames, anchoring the style at the first frame and guiding propagation.

Temporal consistency is enforced via a flow-matching loss applied between the clean stylized video $x_0$ and noise $x_1$, using the linear interpolation $x_t = (1-t) x_0 + t x_1$:

$$\mathcal{L}_{\rm FM} = \mathbb{E}_{t, x_0, x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2$$

This loss ensures smooth transitions between frames without requiring explicit optical flow estimation or test-time fine-tuning, preserving local and global stylistic coherence throughout temporal sequences.

5. Benchmarks, Metrics, and Quantitative Performance

TeleStyle is evaluated across three principal dimensions:

  • Style Similarity: measured by the CSD score (↑, higher indicates a closer style match).
  • Aesthetic Quality: assessed with the LAION Aesthetic Predictor (↑, higher indicates more pleasing outputs).
  • Content Preservation: quantified via a thresholded CPC score:

$$\mathrm{CPC}@\tau = \begin{cases} \mathrm{CLIP}(I_{\rm res}, T_{\rm vlm}), & \text{if } \mathrm{CSD}(I_{\rm res}, I_{\rm style}) \geq \tau \\ 0, & \text{otherwise} \end{cases}$$

This metric penalizes degenerate outputs with little or no style transfer.
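Per sample, the metric reduces to a simple gating function; a sketch with hypothetical argument names (the similarity scores themselves would come from the CLIP and CSD models, which are not shown):

```python
def cpc_at_tau(clip_content_sim: float, csd_style_sim: float, tau: float) -> float:
    """Thresholded content-preservation score: the CLIP content similarity
    counts only when the CSD style similarity clears the threshold tau;
    otherwise the sample scores 0, penalizing outputs with little or no
    style transfer."""
    return clip_content_sim if csd_style_sim >= tau else 0.0
```

A sample with strong content similarity but sub-threshold style similarity therefore scores 0, which is exactly how the metric rules out degenerate near-copies of the content image.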

A summary of benchmark performance is presented below:

| Method | CSD ↑ | [email protected] ↑ | [email protected]:0.9 ↑ | Aesthetics ↑ |
|---|---|---|---|---|
| CSGO | 0.535 | 0.379 | 0.224 | 5.969 |
| DreamO | 0.402 | 0.193 | 0.102 | 6.149 |
| TeleStyle | 0.577 | 0.441 | 0.304 | 6.317 |

TeleStyle achieves a 7.8% relative improvement in style similarity over CSGO, with CPC content scores increasing by 16–20%, and demonstrates superior aesthetic ratings compared to previous DiT-based stylizers. Qualitative analysis confirms strong retention of edges, object shapes, and intricate textures across diverse styles, including previously unseen ones.

6. Training Procedures and Implementation Specifics

Key training and implementation parameters are as follows:

  • TeleStyle-Image:
    • LoRA rank: 32
    • Base model: Qwen-Image-Edit-2509
    • Gradient checkpointing enabled
    • Minimum image edge: 1024 pixels
    • Hardware: 4 × NVIDIA H100 GPUs, batch size 1/GPU, learning rate 1e-4, 100,000–200,000 LoRA updates per stage
  • TeleStyle-Video:
    • Backbone: Wan2.1-1.3B
    • Training data: synthetic set plus internal filtered clips (filtered via CLIP-based motion assessment)
    • Hardware: 8 × NVIDIA H100 GPUs, batch size 4/GPU, learning rate 1e-5, ~500,000 steps
  • Data Augmentation: Random prompt sampling, random crop/resize of style references, and standard diffusion time schedules
  • Inference: The content reference's aspect ratio is preserved, and the style reference is resized to a square of side $\min(H, W)$. The default prompt configuration yields the best stability.

7. Significance and Applications

TeleStyle demonstrates that lightweight, LoRA-driven adaptation atop a robust DiT backbone—combined with curriculum-based exposure to clean and synthetic style-content triplets—produces highly generalizable and efficient style transfer in both images and videos. The ability to preserve content fidelity while enabling strong style generalization, coupled with minimal computational overhead and consistent video stylization, positions TeleStyle as an advance in cross-modal stylization research domains. The availability of the codebase and pre-trained models further facilitates investigation, adoption, and extension in style transfer and creative AI workflows (Zhang et al., 28 Jan 2026).
