TeleStyle: Content-Preserving Style Transfer
- TeleStyle is a content-preserving image and video style transfer model that disentangles style cues to maintain precise content details.
- It employs lightweight LoRA modules within a DiT backbone and a multi-stage curriculum learning framework to achieve state-of-the-art performance.
- The model supports both image and video stylization, ensuring high temporal consistency and enhanced aesthetic quality in outputs.
TeleStyle is a content-preserving image and video style transfer model that generates stylized outputs from paired content and style references. The model addresses the central challenge in Diffusion Transformers (DiTs), namely the entanglement of content and style information in their latent representations, by isolating and routing style cues while maintaining precise content fidelity. TeleStyle is implemented as a lightweight extension of Qwen-Image-Edit, inheriting its robust content retention while adding effective style modulation, and is trained on a hybrid dataset of curated and synthetic triplets using a multi-stage curriculum continual learning framework. Both image and video stylization are supported, with a specialized video-to-video module ensuring high temporal consistency. TeleStyle achieves state-of-the-art results across quantitative and qualitative benchmarks for style similarity, content constancy, and perceptual aesthetics (Zhang et al., 28 Jan 2026).
1. Model Architecture and Disentanglement
TeleStyle’s architecture builds upon the Qwen-Image-Edit transformer, which employs an MMDiT-based Diffusion Transformer backbone augmented with Multi-Scale Rotary Positional Embeddings (MS-RoPE) to facilitate processing of multiple reference images. The model ingests a content reference image and a style reference image, each processed via a distinct “patch embedder” network, a lightweight convolution-projection module that encodes the input image into a token sequence. These tokens are concatenated channel-wise with the diffusion latent variable and any accompanying (potentially empty) text tokens, then supplied to a stack of DiT blocks.
A central mechanism in TeleStyle is the use of low-rank adaptation (LoRA) modules (rank = 32) inserted into the cross-attention and feed-forward layers. These adapters concentrate style-transfer learning within the velocity prediction head, leaving the broader content-preservation pathway of the frozen base model undisturbed. This enables effective disentanglement of style and content features, addressing a core limitation of prior DiT-based stylization frameworks.
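As a minimal NumPy sketch (with illustrative dimensions and names; the actual adapters wrap the DiT's attention and feed-forward projections), a rank-32 LoRA layer adds a trainable low-rank update on top of a frozen weight matrix:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A of rank r."""
    def __init__(self, d_in, d_out, rank=32, alpha=32.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weights
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection, zero-init
        self.scale = alpha / rank                           # standard LoRA scaling

    def __call__(self, x):
        # frozen base path + scaled low-rank adapter path
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=64, d_out=64, rank=32)
x = np.ones((2, 64))
y = layer(x)
base = x @ layer.W.T
print(np.allclose(y, base))  # True
```

Because the up-projection `B` is zero-initialized, training starts exactly at the frozen model's behavior, and the adapter injects style-specific changes only gradually.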
For video stylization, TeleStyle extends the Wan2.1-1.3B DiT backbone (as used in FullDiT), adopting a similar dual-patch-embedder interface. The positional encoding scheme assigns temporal index 0 to style anchors (the style reference or stylized key frame) and increments through the subsequent source video frames, allowing the Transformer’s learned positional dynamics to propagate style coherently across time.
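The positional bookkeeping can be illustrated with a small helper (the function name is hypothetical; the real model folds these indices into its rotary embeddings):

```python
def temporal_indices(num_frames):
    """Index 0 is reserved for the style anchor (style reference or stylized
    key frame); source video frames follow in temporal order starting at 1."""
    return [0] + [t + 1 for t in range(num_frames)]

print(temporal_indices(4))  # [0, 1, 2, 3, 4]
```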
2. Dataset Construction and Triplet Synthesis
Training robust style transfer models requires diverse and well-matched triplets of content, style, and result images. TeleStyle’s training data comprises:
- Clean (Curated) Triplets: Drawn from sources including OmniConsistency, GPT-4o-generated examples, and manually vetted LoRA community outputs. Following intensive manual filtering, this dataset yields 300,000 triplets covering 30 distinct artistic styles, such as oil painting, watercolor, and ukiyo-e.
- Noisy (Synthetic) Triplets: 1 million triplets synthesized via a reverse pipeline that starts from an in-the-wild stylized target image. A photorealistic content reference is generated with FLUX, and the corresponding style reference is extracted using the CDST method together with DINOv2 descriptors. Triplets are completed with randomly sampled textual prompts.
Combined, these datasets encompass thousands of style clusters spanning classical, modern, and digital genres. During preprocessing, content references are resized with aspect ratio preserved so that the shorter edge measures 1024 pixels, while style references are center-cropped to squares.
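These preprocessing rules can be sketched as pure coordinate arithmetic (illustrative helpers; an actual pipeline would apply them through an image library):

```python
def resize_min_edge(w, h, target=1024):
    """Aspect-ratio-preserving resize so the shorter edge equals `target`."""
    scale = target / min(w, h)
    return round(w * scale), round(h * scale)

def center_square_crop(w, h):
    """Center-crop box (left, top, right, bottom) for the largest square."""
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return left, top, left + side, top + side

print(resize_min_edge(1920, 1080))     # (1820, 1024)
print(center_square_crop(1920, 1080))  # (420, 0, 1500, 1080)
```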
3. Curriculum Continual Learning Paradigm
TeleStyle leverages a curriculum continual learning framework to maximize both style generalization and content fidelity through three sequential training stages, each producing an updated set of LoRA weights:
- Stage 1: Capability Activation—LoRA parameters are trained on the full collected dataset to acquire general style transfer ability.
- Stage 2: Content Fidelity Refinement—The network, initialized from the Stage 1 weights, is fine-tuned on a reweighted subset favoring high-fidelity content preservation (notably facial characteristics and fine structures).
- Stage 3: Robust Generalization—A mixture dataset, formed by blending the content-fidelity subset with approximately 5% of the full collected dataset, is used for further training from the Stage 2 weights, yielding final weights with improved cross-domain style generalization.
The learning objective in each phase is a rectified flow-matching loss:

$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| v_\theta(z_t, t, c) - (\epsilon - x_0) \right\|^2\right], \qquad z_t = (1 - t)\, x_0 + t\, \epsilon,$$

where $z_t$ is the noisy latent at time $t$, $x_0$ is the target latent, $c$ is a fixed prompt embedding describing the style transfer task, and $v_\theta$ is the predicted velocity field. Optimization is performed via AdamW with decoupled weight decay.
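A minimal NumPy sketch of this objective, assuming the convention $z_t = (1-t)x_0 + t\epsilon$ with velocity target $\epsilon - x_0$ (the `predict_velocity` callable stands in for the LoRA-adapted DiT):

```python
import numpy as np

def rectified_flow_loss(x0, eps, t, predict_velocity):
    """Interpolate z_t = (1 - t) * x0 + t * eps and regress the predicted
    velocity onto the straight-line target eps - x0."""
    zt = (1.0 - t) * x0 + t * eps
    target = eps - x0
    v = predict_velocity(zt, t)
    return float(np.mean((v - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16))   # clean target latent
eps = rng.standard_normal((4, 16))  # Gaussian noise
# An oracle predicting the exact straight-line velocity yields zero loss.
print(rectified_flow_loss(x0, eps, 0.3, lambda z, t: eps - x0))  # 0.0
```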
4. Video-to-Video Stylization and Temporal Consistency
The TeleStyle-Video module begins with a stylized key frame alongside the source video frames. Separate patch embedders generate feature tokens for the style anchor and for each video frame, which, together with the corresponding noisy latents, are processed by the DiT architecture. Positional encoding assigns index 0 to the style anchor and incrementing indices to the temporally ordered frames, anchoring the style at the first position and guiding its propagation.
Temporal consistency is enforced via a flow-matching loss between the clean stylized video $x_0$ and noise $\epsilon$, using the linear interpolation $z_t = (1 - t)\, x_0 + t\, \epsilon$:

$$\mathcal{L}_{\text{video}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| v_\theta(z_t, t) - (\epsilon - x_0) \right\|^2\right].$$

This loss ensures smooth transitions between frames without requiring explicit optical flow estimation or test-time fine-tuning, preserving local and global stylistic coherence throughout temporal sequences.
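Applied per clip, the objective treats all frames jointly; a sketch under the same interpolation convention, with illustrative toy shapes:

```python
import numpy as np

def video_flow_matching_loss(x0_frames, eps_frames, t, predict_velocity):
    """Flow-matching objective applied jointly over all frames of a clip:
    z_t = (1 - t) * x0 + t * eps, with velocity target eps - x0. Every frame
    shares the same diffusion time t and is denoised in one pass, so no
    explicit optical-flow supervision is required."""
    zt = (1.0 - t) * x0_frames + t * eps_frames
    v = predict_velocity(zt, t)
    return float(np.mean((v - (eps_frames - x0_frames)) ** 2))

frames0 = np.zeros((3, 2, 8))  # toy clip: 3 clean stylized frames
eps_f = np.ones((3, 2, 8))     # matching noise
# An oracle predicting the exact velocity drives the loss to zero.
print(video_flow_matching_loss(frames0, eps_f, 0.5, lambda z, t: eps_f - frames0))  # 0.0
```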
5. Benchmarks, Metrics, and Quantitative Performance
TeleStyle is evaluated across three principal dimensions:
- Style Similarity: Measured by the CSD Score (higher = better style match).
- Aesthetic Quality: Assessed with the LAION Aesthetic Predictor (higher = more pleasing).
- Content Preservation: Quantified via a thresholded CPC Score, which penalizes degenerate outputs that exhibit little or no style transfer.
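The exact thresholding rule is not reproduced here, so the following is an illustrative sketch of one plausible form, in which outputs below a hypothetical style-similarity floor are zeroed out before averaging content similarity:

```python
import numpy as np

def thresholded_cpc(content_sims, style_sims, style_floor=0.3):
    """Hypothetical thresholded CPC: average content similarity, but counted
    as 0 for outputs whose style similarity falls below `style_floor`, so
    identity-like outputs that skip stylization cannot inflate the score."""
    sims = np.where(np.asarray(style_sims) >= style_floor,
                    np.asarray(content_sims), 0.0)
    return float(sims.mean())

# The second output copies the content but barely stylizes it, so it is zeroed.
print(thresholded_cpc([0.9, 0.95], [0.6, 0.1]))  # 0.45
```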
A summary of benchmark performance is presented below:
| Method | CSD ↑ | CPC@τ ↑ | CPC@τ:0.9 ↑ | Aesthetics ↑ |
|---|---|---|---|---|
| CSGO | 0.535 | 0.379 | 0.224 | 5.969 |
| DreamO | 0.402 | 0.193 | 0.102 | 6.149 |
| TeleStyle | 0.577 | 0.441 | 0.304 | 6.317 |
TeleStyle achieves a 7.8% relative improvement in style similarity over CSGO, with CPC content scores increasing by 16–20%, and demonstrates superior aesthetic ratings compared to previous DiT-based stylizers. Qualitative analysis confirms TeleStyle’s strong retention of edges, object shapes, and intricate textures across diverse styles, including previously unseen ones.
6. Training Procedures and Implementation Specifics
Key training and implementation parameters are as follows:
- TeleStyle-Image:
- LoRA rank: 32
- Base model: Qwen-Image-Edit-2509
- Gradient checkpointing enabled
- Minimum image edge: 1024 pixels
- Hardware: 4 × NVIDIA H100 GPUs, batch size 1/GPU, learning rate 1e-4, 100,000–200,000 LoRA updates per stage
- TeleStyle-Video:
- Backbone: Wan2.1-1.3B
- Training data: synthetic set plus internally filtered clips (screened via CLIP-based motion assessment)
- Hardware: 8 × NVIDIA H100 GPUs, batch size 4/GPU, learning rate 1e-5, ~500,000 steps
- Data Augmentation: Random prompt sampling, random crop/resize of style references, and standard diffusion time schedules
- Inference: The content reference’s aspect ratio is preserved, while the style reference is resized to a square. The default prompt configuration yields the best stability.
7. Significance and Applications
TeleStyle demonstrates that lightweight, LoRA-driven adaptation atop a robust DiT backbone—combined with curriculum-based exposure to clean and synthetic style-content triplets—produces highly generalizable and efficient style transfer in both images and videos. The ability to preserve content fidelity while enabling strong style generalization, coupled with minimal computational overhead and consistent video stylization, positions TeleStyle as an advance in cross-modal stylization research domains. The availability of the codebase and pre-trained models further facilitates investigation, adoption, and extension in style transfer and creative AI workflows (Zhang et al., 28 Jan 2026).