TeleStyle: Content-Preserving Style Transfer
- TeleStyle is a content-preserving image and video style transfer model that disentangles style cues to maintain precise content details.
- It employs lightweight LoRA modules within a DiT backbone and a multi-stage curriculum learning framework to achieve state-of-the-art performance.
- The model supports both image and video stylization, ensuring high temporal consistency and enhanced aesthetic quality in outputs.
TeleStyle is a content-preserving image and video style transfer model that generates stylized outputs from paired content and style references. The model addresses the central challenge in Diffusion Transformers (DiTs), namely the entanglement of content and style information in their latent representations, by isolating and routing style cues while maintaining precise content fidelity. TeleStyle is implemented as a lightweight extension of Qwen-Image-Edit, inheriting its robust content retention while adding effective style modulation, and is trained on a hybrid dataset of curated and synthetic triplets using a multi-stage curriculum continual learning framework. Both image and video stylization are supported, with a specialized video-to-video module ensuring high temporal consistency. TeleStyle achieves state-of-the-art results across quantitative and qualitative benchmarks for style similarity, content constancy, and perceptual aesthetics (Zhang et al., 28 Jan 2026).
1. Model Architecture and Disentanglement
TeleStyle’s architecture builds upon the Qwen-Image-Edit transformer, which employs an MMDiT-based Diffusion Transformer backbone augmented with Multi-Scale Rotary Positional Embeddings (MS-RoPE) to facilitate processing of multiple reference images. The model ingests a content reference image and a style reference image, each processed via a distinct “patch embedder” network, a lightweight convolution-projection module that encodes the input image into a token sequence. These tokens are concatenated channel-wise with the diffusion latent variable and any accompanying (potentially empty) text tokens, then supplied to a stack of DiT blocks.
A central mechanism in TeleStyle is the use of low-rank adaptation (LoRA) modules (rank = 32) inserted into the cross-attention and feed-forward layers. These adapters concentrate style-transfer learning within the velocity prediction head, leaving the broader content-preservation pathway of the frozen base model undisturbed. This enables effective disentanglement of style and content features, addressing a core limitation of prior DiT-based stylization frameworks.
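As a minimal NumPy sketch (with illustrative dimensions and names; the actual adapters wrap the DiT's attention and feed-forward projections), a rank-32 LoRA layer adds a trainable low-rank update on top of a frozen weight matrix:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A of rank r."""
    def __init__(self, d_in, d_out, rank=32, alpha=32.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weights
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection, zero-init
        self.scale = alpha / rank                           # standard LoRA scaling

    def __call__(self, x):
        # frozen base path + scaled low-rank adapter path
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=64, d_out=64, rank=32)
x = np.ones((2, 64))
y = layer(x)
base = x @ layer.W.T
print(np.allclose(y, base))  # True
```

Because the up-projection `B` is zero-initialized, training starts exactly at the frozen model's behavior, and the adapter injects style-specific changes only gradually.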
For video stylization, TeleStyle extends the Wan2.1-1.3B DiT backbone (as used in FullDiT), adopting a similar dual-patch-embedder interface. The positional encoding scheme assigns temporal index 0 to style anchors (the style reference or stylized key frame) and increments through the subsequent source video frames, allowing the Transformer’s learned positional dynamics to propagate style coherently across time.
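The positional bookkeeping can be illustrated with a small helper (the function name is hypothetical; the real model folds these indices into its rotary embeddings):

```python
def temporal_indices(num_frames):
    """Index 0 is reserved for the style anchor (style reference or stylized
    key frame); source video frames follow in temporal order starting at 1."""
    return [0] + [t + 1 for t in range(num_frames)]

print(temporal_indices(4))  # [0, 1, 2, 3, 4]
```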
2. Dataset Construction and Triplet Synthesis
Training robust style transfer models requires diverse and well-matched triplets of content, style, and result images. TeleStyle’s training data comprises:
- Clean (Curated) Triplets: Drawn from sources including OmniConsistency, GPT-4o-generated examples, and manually vetted LoRA community outputs. Following intensive manual filtering, this dataset yields 300,000 triplets covering 30 distinct artistic styles, such as oil painting, watercolor, and ukiyo-e.
- Noisy (Synthetic) Triplets: 1 million triplets synthesized via a reverse pipeline that starts from an in-the-wild stylized target image. A photorealistic content reference is generated with FLUX, and the corresponding style reference is extracted using the CDST method together with DINOv2 descriptors. Triplets are completed with randomly sampled textual prompts.
Combined, these datasets encompass thousands of style clusters spanning classical, modern, and digital genres. During preprocessing, content references are resized with aspect ratio preserved so that the shorter edge measures 1024 pixels, while style references are center-cropped to squares.
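These preprocessing rules can be sketched as pure coordinate arithmetic (illustrative helpers; an actual pipeline would apply them through an image library):

```python
def resize_min_edge(w, h, target=1024):
    """Aspect-ratio-preserving resize so the shorter edge equals `target`."""
    scale = target / min(w, h)
    return round(w * scale), round(h * scale)

def center_square_crop(w, h):
    """Center-crop box (left, top, right, bottom) for the largest square."""
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return left, top, left + side, top + side

print(resize_min_edge(1920, 1080))     # (1820, 1024)
print(center_square_crop(1920, 1080))  # (420, 0, 1500, 1080)
```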
3. Curriculum Continual Learning Paradigm
TeleStyle leverages a curriculum continual learning framework to maximize both style generalization and content fidelity through three sequential training stages, each producing an updated set of LoRA weights:
- Stage 1: Capability Activation—LoRA parameters are trained on the full collected dataset to acquire general style transfer ability.
- Stage 2: Content Fidelity Refinement—The network, initialized from the Stage 1 weights, is fine-tuned on a reweighted subset favoring high-fidelity content preservation (notably facial characteristics and fine structures).
- Stage 3: Robust Generalization—A mixture dataset, formed by blending the content-fidelity subset with approximately 5% of the full collected dataset, is used for further training from the Stage 2 weights, yielding final weights with improved cross-domain style generalization.
The learning objective in each phase is a rectified flow-matching loss:

$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| v_\theta(z_t, t, c) - (\epsilon - x_0) \right\|^2\right], \qquad z_t = (1 - t)\, x_0 + t\, \epsilon,$$

where $z_t$ is the noisy latent at time $t$, $x_0$ is the target latent, $c$ is a fixed prompt embedding describing the style transfer task, and $v_\theta$ is the predicted velocity field. Optimization is performed via AdamW with decoupled weight decay.
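A minimal NumPy sketch of this objective, assuming the convention $z_t = (1-t)x_0 + t\epsilon$ with velocity target $\epsilon - x_0$ (the `predict_velocity` callable stands in for the LoRA-adapted DiT):

```python
import numpy as np

def rectified_flow_loss(x0, eps, t, predict_velocity):
    """Interpolate z_t = (1 - t) * x0 + t * eps and regress the predicted
    velocity onto the straight-line target eps - x0."""
    zt = (1.0 - t) * x0 + t * eps
    target = eps - x0
    v = predict_velocity(zt, t)
    return float(np.mean((v - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16))   # clean target latent
eps = rng.standard_normal((4, 16))  # Gaussian noise
# An oracle predicting the exact straight-line velocity yields zero loss.
print(rectified_flow_loss(x0, eps, 0.3, lambda z, t: eps - x0))  # 0.0
```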
4. Video-to-Video Stylization and Temporal Consistency
The TeleStyle-Video module begins with a stylized key frame alongside the source video frames. Separate patch embedders generate feature tokens for the style anchor and for each video frame, which, together with the corresponding noisy latents, are processed by the DiT architecture. Positional encoding assigns index 0 to the style anchor and incrementing indices to the temporally ordered frames, anchoring the style at the first position and guiding its propagation.
Temporal consistency is enforced via a flow-matching loss between the clean stylized video $x_0$ and noise $\epsilon$, using the linear interpolation $z_t = (1 - t)\, x_0 + t\, \epsilon$:

$$\mathcal{L}_{\text{video}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| v_\theta(z_t, t) - (\epsilon - x_0) \right\|^2\right].$$

This loss ensures smooth transitions between frames without requiring explicit optical flow estimation or test-time fine-tuning, preserving local and global stylistic coherence throughout temporal sequences.
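Applied per clip, the objective treats all frames jointly; a sketch under the same interpolation convention, with illustrative toy shapes:

```python
import numpy as np

def video_flow_matching_loss(x0_frames, eps_frames, t, predict_velocity):
    """Flow-matching objective applied jointly over all frames of a clip:
    z_t = (1 - t) * x0 + t * eps, with velocity target eps - x0. Every frame
    shares the same diffusion time t and is denoised in one pass, so no
    explicit optical-flow supervision is required."""
    zt = (1.0 - t) * x0_frames + t * eps_frames
    v = predict_velocity(zt, t)
    return float(np.mean((v - (eps_frames - x0_frames)) ** 2))

frames0 = np.zeros((3, 2, 8))  # toy clip: 3 clean stylized frames
eps_f = np.ones((3, 2, 8))     # matching noise
# An oracle predicting the exact velocity drives the loss to zero.
print(video_flow_matching_loss(frames0, eps_f, 0.5, lambda z, t: eps_f - frames0))  # 0.0
```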
5. Benchmarks, Metrics, and Quantitative Performance
TeleStyle is evaluated across three principal dimensions:
- Style Similarity: Measured by the CSD Score (higher = better style match).
- Aesthetic Quality: Assessed with the LAION Aesthetic Predictor (higher = more pleasing).
- Content Preservation: Quantified via a thresholded CPC Score, which penalizes degenerate outputs that exhibit little or no style transfer.
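The exact thresholding rule is not reproduced here, so the following is an illustrative sketch of one plausible form, in which outputs below a hypothetical style-similarity floor are zeroed out before averaging content similarity:

```python
import numpy as np

def thresholded_cpc(content_sims, style_sims, style_floor=0.3):
    """Hypothetical thresholded CPC: average content similarity, but counted
    as 0 for outputs whose style similarity falls below `style_floor`, so
    identity-like outputs that skip stylization cannot inflate the score."""
    sims = np.where(np.asarray(style_sims) >= style_floor,
                    np.asarray(content_sims), 0.0)
    return float(sims.mean())

# The second output copies the content but barely stylizes it, so it is zeroed.
print(thresholded_cpc([0.9, 0.95], [0.6, 0.1]))  # 0.45
```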
A summary of benchmark performance is presented below:
| Method | CSD ↑ | CPC@τ ↑ | CPC@τ:0.9 ↑ | Aesthetics ↑ |
|---|---|---|---|---|
| CSGO | 0.535 | 0.379 | 0.224 | 5.969 |
| DreamO | 0.402 | 0.193 | 0.102 | 6.149 |
| TeleStyle | 0.577 | 0.441 | 0.304 | 6.317 |
TeleStyle achieves a 7.8% relative improvement in style similarity over CSGO, with CPC content scores increasing by 16–20%, and demonstrates superior aesthetic ratings compared to previous DiT-based stylizers. Qualitative analysis confirms TeleStyle’s strong retention of edges, object shapes, and intricate textures across diverse styles, including previously unseen ones.
6. Training Procedures and Implementation Specifics
Key training and implementation parameters are as follows:
- TeleStyle-Image:
- LoRA rank: 32
- Base model: Qwen-Image-Edit-2509
- Gradient checkpointing enabled
- Minimum image edge: 1024 pixels
- Hardware: 4 × NVIDIA H100 GPUs, batch size 1/GPU, learning rate 1e-4, 100,000–200,000 LoRA updates per stage
- TeleStyle-Video:
- Backbone: Wan2.1-1.3B
- Training data: synthetic set plus internally filtered clips (screened via CLIP-based motion assessment)
- Hardware: 8 × NVIDIA H100 GPUs, batch size 4/GPU, learning rate 1e-5, ~500,000 steps
- Data Augmentation: Random prompt sampling, random crop/resize of style references, and standard diffusion time schedules
- Inference: The content reference’s aspect ratio is preserved, while the style reference is resized to a square. The default prompt configuration yields the best stability.
7. Significance and Applications
TeleStyle demonstrates that lightweight, LoRA-driven adaptation atop a robust DiT backbone—combined with curriculum-based exposure to clean and synthetic style-content triplets—produces highly generalizable and efficient style transfer in both images and videos. The ability to preserve content fidelity while enabling strong style generalization, coupled with minimal computational overhead and consistent video stylization, positions TeleStyle as an advance in cross-modal stylization research domains. The availability of the codebase and pre-trained models further facilitates investigation, adoption, and extension in style transfer and creative AI workflows (Zhang et al., 28 Jan 2026).