TV2TV: Unified Generative Video Framework
- The paper presents a novel Mixture-of-Transformers (MoT) architecture that interleaves text planning and video synthesis to enhance visual quality and prompt alignment.
- It employs dual towers for language and video with classifier-free guidance and ODE-based denoising to ensure temporally coherent outputs.
- Empirical results demonstrate significant improvements in instruction correctness and visual fidelity, establishing TV2TV as a foundation for controllable video generation.
TV2TV is a unified generative modeling framework for complex video synthesis that decomposes video generation into an interleaved process of natural language reasoning and video frame generation. By integrating LLM reasoning with diffusion-based video generation in a Mixture-of-Transformers (MoT) architecture, TV2TV enables models to plan "in words" and "act in pixels," yielding improved visual quality, prompt alignment, and editability throughout generation. The system's inference mechanism and architecture allow tight coupling between controllable high-level semantics and temporally coherent video outputs, positioning TV2TV as a foundation for unified, flexible video-text generation (Han et al., 4 Dec 2025).
1. Architectural Design
TV2TV employs a MoT backbone with two parallel towers per transformer layer:
- Text Tower: Implements a pretrained Llama-3 LLM, specializing in autoregressive next-token prediction.
- Video Tower: Utilizes a U-Net with downsampling and upsampling paths, operating on continuous video latent variables extracted by a VAE tokenizer.
At each transformer layer, separate QKV (query/key/value) projections are computed for the text and video modalities; these attend globally over the entire interleaved sequence before outputs are routed through modality-specific feedforward networks. Intermediate representations comprise linearly embedded text tokens and video latents, with the latter combined with Gaussian noise via rectified-flow interpolation z_t = (1 − t) z + t ε, where t ∈ [0, 1] and ε ~ N(0, I).
The model receives both clean and noisy video latents, enabling conditioning on the clean history for learning denoising of the current video chunk.
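The two-tower routing described above can be sketched in a few lines. The following is a toy, illustrative implementation under stated assumptions (single head, toy random initialization, no residuals or normalization); shapes and names are not the paper's:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTLayer:
    """Toy Mixture-of-Transformers layer: separate QKV and feedforward
    weights per modality, but a single global attention over the whole
    interleaved text/video sequence (illustrative, not the paper's code)."""

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.d = d
        # one parameter set per modality tower
        self.w = {m: {k: rng.standard_normal((d, d)) / np.sqrt(d)
                      for k in ("q", "k", "v", "ffn")}
                  for m in ("text", "video")}

    def __call__(self, x, modalities):
        # x: (seq_len, d); modalities: "text"/"video" label per position
        pick = lambda k: np.stack(
            [x[i] @ self.w[m][k] for i, m in enumerate(modalities)])
        q, k_, v = pick("q"), pick("k"), pick("v")
        attn = softmax(q @ k_.T / np.sqrt(self.d))  # attends across modalities
        h = attn @ v
        # route each position through its modality-specific feedforward
        return np.stack([h[i] @ self.w[m]["ffn"]
                         for i, m in enumerate(modalities)])
```

The key design point this illustrates is that attention is shared (every position sees every other, regardless of modality) while projections and feedforward networks are modality-specific.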
2. Joint Training Objectives
Training in TV2TV minimizes a weighted combination of language and video objectives:
- Text Loss: Standard cross-entropy for next-token prediction, L_text = −Σ_i log p_θ(y_i | y_<i).
- Video Loss: Mean squared error on the predicted flow for denoising video latents, L_video = E_{t, ε} ‖v_θ(z_t, t) − (ε − z)‖².
- Combined Loss: L = λ_text · L_text + λ_video · L_video,
where λ_text and λ_video balance the two contributions.
Classifier-free guidance is incorporated during training by stochastically dropping text tokens and by stochastically switching the conditioning history from clean to noisy video latents, each at a fixed rate.
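The joint objective and the CFG-style conditioning dropout can be sketched as follows; the function names, default weights, and drop mechanics are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rectified_flow_target(z, eps, t):
    """Interpolant z_t = (1 - t) z + t eps and its velocity target eps - z."""
    return (1 - t) * z + t * eps, eps - z

def joint_loss(logits, targets, v_pred, v_target,
               lam_text=1.0, lam_video=1.0):
    # text tower: cross-entropy for next-token prediction
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    ce = -np.mean(np.log(p[np.arange(len(targets)), targets]))
    # video tower: MSE on the predicted flow (velocity)
    mse = np.mean((v_pred - v_target) ** 2)
    return lam_text * ce + lam_video * mse

def drop_for_cfg(text_tokens, clean_latents, noisy_latents,
                 p_text, p_video, rng):
    """Classifier-free-guidance training: sometimes drop text conditioning,
    sometimes swap clean history latents for noisy ones. The rates p_text
    and p_video are hyperparameters; values are not given here."""
    text = None if rng.random() < p_text else text_tokens
    hist = noisy_latents if rng.random() < p_video else clean_latents
    return text, hist
```

Note that the velocity target ε − z is consistent with the interpolant z_t = (1 − t) z + t ε, since dz_t/dt = ε − z.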
3. Inference Mechanism and Interleaved Generation
TV2TV inference alternates between text and video generation, managed by a boundary token (BOF, "begin-of-frame"):
- Beginning in text mode, the model autoregressively samples tokens.
- As long as ordinary text tokens are emitted, text generation continues.
- If BOF appears, TV2TV:
- Allocates a new video latent initialized from Gaussian noise.
- Executes a fixed number of ODE-solver steps (e.g., Euler) to denoise this latent, conditioned on the current key-value cache.
- Appends the denoised latent to the sequence, after which text generation resumes.
- Generation halts at EOS or context limit.
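The alternation above can be sketched as a simple loop. Here `sample_token` and `denoise_chunk` stand in for the text and video towers respectively; all names are illustrative assumptions:

```python
import numpy as np

BOF, EOS = "<bof>", "<eos>"

def euler_denoise(v_fn, z1, n_steps):
    """Euler ODE solver from t = 1 (noise) to t = 0 (clean) for
    dz/dt = v(z, t), the rectified-flow sampling direction."""
    z, dt = z1, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        z = z - v_fn(z, t) * dt
    return z

def interleaved_generate(sample_token, denoise_chunk, max_len=64):
    """Alternate autoregressive text decoding with video-chunk denoising,
    switching modes whenever a BOF token is emitted."""
    seq = []
    while len(seq) < max_len:
        tok = sample_token(seq)       # text mode
        if tok == EOS:
            break
        seq.append(tok)
        if tok == BOF:                # video mode: denoise one chunk
            seq.append(denoise_chunk(seq))
    return seq
```

As a sanity check on the solver: if the model predicted the exact constant velocity ε − z₀, Euler integration from pure noise would recover z₀ exactly, since the true trajectory is linear in t.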
A sliding-window generation strategy allows scaling beyond fixed context: half of the oldest tokens/chunks are discarded, retaining recent history for conditioning. This approach permits generation of extended video sequences.
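A minimal sketch of the eviction policy, assuming the oldest half of the interleaved sequence is simply dropped once the context fills (the exact eviction granularity is an assumption):

```python
def slide_window(seq, context_limit):
    """Drop the oldest half of the interleaved token/chunk sequence once
    the context limit is reached, keeping recent history for conditioning."""
    if len(seq) >= context_limit:
        return seq[len(seq) // 2:]
    return seq
```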
4. Controllability and Editability
TV2TV enables fine-grained, mid-sequence control by decoupling "planning" (text) from "acting" (video):
- Users can modify the textual narrative at any BOF boundary by editing injected text prompts (e.g., inserting "jump now" or "reload weapon").
- The model continues video synthesis conditioned on the revised text trajectory, yielding prompt-aligned visual changes in subsequent frames.
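A toy illustration of such an intervention: the text emitted since the most recent BOF boundary is replaced with the user's instruction, and generation then continues from the edited trajectory. The splice point and helper name are assumptions for illustration only:

```python
BOF = "<bof>"

def intervene(seq, instruction):
    """Replace the text emitted since the last BOF boundary with a user
    instruction; subsequent video chunks then condition on the edited
    text trajectory (toy sketch, not the paper's editing interface)."""
    last_bof = max(i for i, tok in enumerate(seq) if tok == BOF)
    return seq[:last_bof + 1] + instruction.split()
```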
In controlled interventions on CS:GO gameplay, this mechanism increased instruction-following accuracy from 59% (Think2V baseline) to 78% and improved visual quality scores (Han et al., 4 Dec 2025).
5. Empirical Results
TV2TV demonstrates significant advancements in both visual quality and controllability across synthetic and natural video domains:
- CS:GO Experiments:
- TV2TV outperforms T2V (no text planning) in video preference (91% vs 1% ties) and Think2V (plan-then-generate) in long sequences.
- Controllability: 78% instruction correctness versus 59% for Think2V under action interventions.
- Real-world Sports Videos:
- TV2TV surpasses Cosmos-2 and matches or trails other baselines (MAGI-1, WAN-2.2 5B) in measures of prompt alignment, fidelity, and holistic preference.
- Against T2V and Think2V, TV2TV achieves holistic preference improvements of +19% and +12%, and prompt alignment improvements of +20 and +12 points, respectively.
These gains are attributed to the offloading of high-level semantic planning to the text tower, leading to enhanced alignment between user prompts and video content, and to the interleaving mechanism that supports dynamic, user-driven interventions.
6. Limitations and Comparison to Related Work
- TV2TV’s dependence on synthetic vision-language model (VLM) captions, particularly in open-domain datasets, can result in coarse temporal segmentation (~1.9-second chunks) and occasional caption hallucinations.
- The inference procedure incurs higher computational cost due to alternating text decoding and ODE-based denoising of video latents.
- Compared to frameworks such as UniVG (Ruan et al., 2024), which unifies text-to-video and image-to-video via latent diffusion and multi-condition cross attention, TV2TV uniquely addresses fine-grained temporal reasoning by interleaving text planning throughout generation, not just as an initial prompt.
- Both TV2TV and UniVG support classifier-free guidance and enable multi-modal conditional video generation, but TV2TV is singular in combining flexible, interleaved language-driven planning with pixel-level synthesis in an autoregressive loop.
7. Prospects and Future Research
Future work on TV2TV involves:
- Developing denser, more accurate segment-level captions, potentially through real-time VLM tagging.
- Optimizing diffusion solvers for greater inference efficiency.
- Extending the MoT routing mechanism to additional modalities (e.g., audio, dialogue).
- Creating end-to-end interactive user interfaces for live textual intervention and video editing.
TV2TV represents a step toward unified, foundation models for video generation that allow models to "reason in words and act in pixels," foreshadowing a new class of controllable, open-ended generative systems for complex sequential domains (Han et al., 4 Dec 2025).