TV2TV: Unified Generative Video Framework
- The paper presents a novel Mixture-of-Transformers (MoT) architecture that interleaves text planning and video synthesis to enhance visual quality and prompt alignment.
- It employs dual towers for language and video with classifier-free guidance and ODE-based denoising to ensure temporally coherent outputs.
- Empirical results demonstrate significant improvements in instruction correctness and visual fidelity, establishing TV2TV as a foundation for controllable video generation.
TV2TV is a unified generative modeling framework for complex video synthesis that decomposes video generation into an interleaved process of natural language reasoning and video frame generation. By integrating LLM reasoning with diffusion-based video generation in a Mixture-of-Transformers (MoT) architecture, TV2TV enables models to plan "in words" and "act in pixels," yielding improved visual quality, prompt alignment, and editability throughout generation. The system's inference mechanism and architecture allow tight coupling between controllable high-level semantics and temporally coherent video outputs, positioning TV2TV as a foundation for unified, flexible video-text generation (Han et al., 4 Dec 2025).
1. Architectural Design
TV2TV employs a MoT backbone with two parallel towers per transformer layer:
- Text Tower: Implements a pretrained Llama-3 LLM, specializing in autoregressive next-token prediction.
- Video Tower: Utilizes a U-Net with downsampling and upsampling paths, operating on continuous video latent variables extracted by a VAE tokenizer.
At each transformer layer, separate QKV (query/key/value) projections are computed for the text and video modalities; these attend globally over the entire interleaved sequence before outputs are routed through modality-specific feedforward networks. Intermediate representations comprise linearly embedded text tokens and video latents, with the latter combined with Gaussian noise via rectified-flow interpolation z_t = (1 − t) z + t ε, where t ∈ [0, 1] and ε ~ N(0, I).
The model receives both clean and noisy video latents, enabling conditioning on the clean history for learning denoising of the current video chunk.
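The two-tower routing described above can be sketched in a few lines. The following is a toy, illustrative implementation under stated assumptions (single head, toy random initialization, no residuals or normalization); shapes and names are not the paper's:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTLayer:
    """Toy Mixture-of-Transformers layer: separate QKV and feedforward
    weights per modality, but a single global attention over the whole
    interleaved text/video sequence (illustrative, not the paper's code)."""

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.d = d
        # one parameter set per modality tower
        self.w = {m: {k: rng.standard_normal((d, d)) / np.sqrt(d)
                      for k in ("q", "k", "v", "ffn")}
                  for m in ("text", "video")}

    def __call__(self, x, modalities):
        # x: (seq_len, d); modalities: "text"/"video" label per position
        pick = lambda k: np.stack(
            [x[i] @ self.w[m][k] for i, m in enumerate(modalities)])
        q, k_, v = pick("q"), pick("k"), pick("v")
        attn = softmax(q @ k_.T / np.sqrt(self.d))  # attends across modalities
        h = attn @ v
        # route each position through its modality-specific feedforward
        return np.stack([h[i] @ self.w[m]["ffn"]
                         for i, m in enumerate(modalities)])
```

The key design point this illustrates is that attention is shared (every position sees every other, regardless of modality) while projections and feedforward networks are modality-specific.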
2. Joint Training Objectives
Training in TV2TV minimizes a weighted combination of language and video objectives:
- Text Loss: Standard cross-entropy for next-token prediction, L_text = −Σ_i log p_θ(y_i | y_<i).
- Video Loss: Mean squared error on the predicted flow for denoising video latents, L_video = E_{t, ε} ‖v_θ(z_t, t) − (ε − z)‖².
- Combined Loss: L = λ_text · L_text + λ_video · L_video,
where λ_text and λ_video balance the two contributions.
Classifier-free guidance is incorporated during training by stochastically dropping text tokens and by stochastically switching the conditioning history from clean to noisy video latents, each at a fixed rate.
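The joint objective and the CFG-style conditioning dropout can be sketched as follows; the function names, default weights, and drop mechanics are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rectified_flow_target(z, eps, t):
    """Interpolant z_t = (1 - t) z + t eps and its velocity target eps - z."""
    return (1 - t) * z + t * eps, eps - z

def joint_loss(logits, targets, v_pred, v_target,
               lam_text=1.0, lam_video=1.0):
    # text tower: cross-entropy for next-token prediction
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    ce = -np.mean(np.log(p[np.arange(len(targets)), targets]))
    # video tower: MSE on the predicted flow (velocity)
    mse = np.mean((v_pred - v_target) ** 2)
    return lam_text * ce + lam_video * mse

def drop_for_cfg(text_tokens, clean_latents, noisy_latents,
                 p_text, p_video, rng):
    """Classifier-free-guidance training: sometimes drop text conditioning,
    sometimes swap clean history latents for noisy ones. The rates p_text
    and p_video are hyperparameters; values are not given here."""
    text = None if rng.random() < p_text else text_tokens
    hist = noisy_latents if rng.random() < p_video else clean_latents
    return text, hist
```

Note that the velocity target ε − z is consistent with the interpolant z_t = (1 − t) z + t ε, since dz_t/dt = ε − z.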
3. Inference Mechanism and Interleaved Generation
TV2TV inference alternates between text and video generation, managed by a boundary token (BOF, "begin-of-frame"):
- Beginning in text mode, the model autoregressively samples tokens.
- As long as ordinary text tokens are emitted, text generation continues.
- If BOF appears, TV2TV:
- Allocates a new video latent initialized from Gaussian noise.
- Executes a fixed number of ODE-solver steps (e.g., Euler) to denoise this latent, conditioned on the current key-value cache.
- Appends the denoised latent to the sequence, after which text generation resumes.
- Generation halts at EOS or context limit.
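The alternation above can be sketched as a simple loop. Here `sample_token` and `denoise_chunk` stand in for the text and video towers respectively; all names are illustrative assumptions:

```python
import numpy as np

BOF, EOS = "<bof>", "<eos>"

def euler_denoise(v_fn, z1, n_steps):
    """Euler ODE solver from t = 1 (noise) to t = 0 (clean) for
    dz/dt = v(z, t), the rectified-flow sampling direction."""
    z, dt = z1, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        z = z - v_fn(z, t) * dt
    return z

def interleaved_generate(sample_token, denoise_chunk, max_len=64):
    """Alternate autoregressive text decoding with video-chunk denoising,
    switching modes whenever a BOF token is emitted."""
    seq = []
    while len(seq) < max_len:
        tok = sample_token(seq)       # text mode
        if tok == EOS:
            break
        seq.append(tok)
        if tok == BOF:                # video mode: denoise one chunk
            seq.append(denoise_chunk(seq))
    return seq
```

As a sanity check on the solver: if the model predicted the exact constant velocity ε − z₀, Euler integration from pure noise would recover z₀ exactly, since the true trajectory is linear in t.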
A sliding-window generation strategy allows scaling beyond fixed context: half of the oldest tokens/chunks are discarded, retaining recent history for conditioning. This approach permits generation of extended video sequences.
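A minimal sketch of the eviction policy, assuming the oldest half of the interleaved sequence is simply dropped once the context fills (the exact eviction granularity is an assumption):

```python
def slide_window(seq, context_limit):
    """Drop the oldest half of the interleaved token/chunk sequence once
    the context limit is reached, keeping recent history for conditioning."""
    if len(seq) >= context_limit:
        return seq[len(seq) // 2:]
    return seq
```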
4. Controllability and Editability
TV2TV enables fine-grained, mid-sequence control by decoupling "planning" (text) from "acting" (video):
- Users can modify the textual narrative at any BOF boundary by editing injected text prompts (e.g., inserting "jump now" or "reload weapon").
- The model continues video synthesis conditioned on the revised text trajectory, yielding prompt-aligned visual changes in subsequent frames.
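A toy illustration of such an intervention: the text emitted since the most recent BOF boundary is replaced with the user's instruction, and generation then continues from the edited trajectory. The splice point and helper name are assumptions for illustration only:

```python
BOF = "<bof>"

def intervene(seq, instruction):
    """Replace the text emitted since the last BOF boundary with a user
    instruction; subsequent video chunks then condition on the edited
    text trajectory (toy sketch, not the paper's editing interface)."""
    last_bof = max(i for i, tok in enumerate(seq) if tok == BOF)
    return seq[:last_bof + 1] + instruction.split()
```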
In controlled interventions on CS:GO gameplay, this mechanism increased instruction-following accuracy from 59% (Think2V baseline) to 78% and improved visual quality scores (Han et al., 4 Dec 2025).
5. Empirical Results
TV2TV demonstrates significant advancements in both visual quality and controllability across synthetic and natural video domains:
- CS:GO Experiments:
- TV2TV outperforms T2V (no text planning) in video preference (91% vs 1% ties) and Think2V (plan-then-generate) in long sequences.
- Controllability: 78% instruction correctness versus 59% for Think2V under action interventions.
- Real-world Sports Videos:
- TV2TV surpasses Cosmos-2 and matches or trails other baselines (MAGI-1, WAN-2.2 5B) in measures of prompt alignment, fidelity, and holistic preference.
- Against T2V and Think2V, TV2TV achieves holistic preference improvements of +19% and +12%, and prompt alignment improvements of +20 and +12 points, respectively.
These gains are attributed to the offloading of high-level semantic planning to the text tower, leading to enhanced alignment between user prompts and video content, and to the interleaving mechanism that supports dynamic, user-driven interventions.
6. Limitations and Comparison to Related Work
- TV2TV’s dependence on synthetic vision-language model (VLM) captions, particularly in open-domain datasets, can result in coarse temporal segmentation (~1.9-second chunks) and occasional caption hallucinations.
- The inference procedure incurs higher computational cost due to alternating text decoding and ODE-based denoising of video latents.
- Compared to frameworks such as UniVG (Ruan et al., 2024), which unifies text-to-video and image-to-video via latent diffusion and multi-condition cross attention, TV2TV uniquely addresses fine-grained temporal reasoning by interleaving text planning throughout generation, not just as an initial prompt.
- Both TV2TV and UniVG support classifier-free guidance and enable multi-modal conditional video generation, but TV2TV is singular in combining flexible, interleaved language-driven planning with pixel-level synthesis in an autoregressive loop.
7. Prospects and Future Research
Future work on TV2TV involves:
- Developing denser, more accurate segment-level captions, potentially through real-time VLM tagging.
- Optimizing diffusion solvers for greater inference efficiency.
- Extending the MoT routing mechanism to additional modalities (e.g., audio, dialogue).
- Creating end-to-end interactive user interfaces for live textual intervention and video editing.
TV2TV represents a step toward unified, foundation models for video generation that allow models to "reason in words and act in pixels," foreshadowing a new class of controllable, open-ended generative systems for complex sequential domains (Han et al., 4 Dec 2025).