
Generative Image & Video Models

Updated 10 February 2026
  • Generative image and video models are machine learning frameworks that generate new visual content by learning complex spatial and temporal distributions.
  • They employ diverse architectures—such as GANs, VAEs, diffusion models, and normalizing flows—to capture appearance, motion, and 3D consistency.
  • These models drive applications like creative synthesis, controllable editing, 3D reconstruction, and compression, leveraging metrics like FID, FVD, and STREAM.

Generative image and video models are a family of machine learning frameworks designed to synthesize new visual data—images, videos, or scene sequences—by sampling from learned or conditioned distributions over complex, high-dimensional spaces. Distinct from pure discriminative models, these generative systems must capture and reconstruct rich appearance, structure, motion, and temporal dynamics, providing a basis for tasks spanning creative synthesis, controllable editing, 3D reconstruction, data augmentation, compression, and more.

1. Architectural Frameworks

Generative image and video models are realized through several core architectural paradigms: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs, VQ-VAEs), diffusion models (DMs), continuous normalizing flows (CNFs), and neural ODEs. Each approach encodes different inductive biases with respect to visual structure or temporal evolution.

  • GANs and Adversarial Pipelines: GANs pit a generator against one or more discriminators that score samples on realism. Extensions such as SIV-GAN apply multi-branch discriminators to disentangle content realism ("what is present") from layout realism ("where objects appear"), enabling learning in extremely low-data regimes such as single images or a handful of video frames (Sushko et al., 2021). 3D-aware GANs fuse neural implicit representations (e.g., NeRF fields) with either time-aware or view-aware discriminators to support spatial, temporal, and even viewpoint consistency in generated videos (Bahmani et al., 2022).
  • Latent Variable Models (VAEs, VQ-VAEs, Residual-Latent Models): VAEs and their discrete (vector-quantized) extensions encode inputs into structured, compressed latent codes, then decode to images or frames. Jointly trained image–video VAEs enable a shared prior for still and dynamic samples, accommodating both static content and per-frame (or per-sequence) deviations (Dandi et al., 2019). Low-rank video VAEs further regularize the multi-frame latent space, supporting tasks like video inpainting or interpolated prediction with subspace constraints (Hyder et al., 2019).
  • Diffusion and Flow-based Models: Diffusion models, including DDPMs and their v-predicting/generalized forms, now dominate text-to-image and text-to-video synthesis, with video extensions employing space–time UNet backbones, cascaded super-resolution modules, or autoregressive/conditional prediction (Ho et al., 2022, Liu et al., 2024, Wang et al., 2024). Rectified-flow Transformers offer an alternative, leveraging linear interpolation in the latent space and direct ODE-based sampling for unified image-video generation (Chen et al., 7 Feb 2025).
  • Continuous and Masked Pretraining: Masked video modeling (e.g., VideoMAE-style) pretrains asymmetric encoder–decoder Transformers using high-ratio tube masking and reconstruction, learning strong spatiotemporal features from large-scale unlabeled video corpora (Wang et al., 2022). Unified CNF-based models provide reversible mappings from Gaussian noise to visual data, with multi-resolution variants reducing computational costs (Voleti, 2023).
  • Hybrid Structures for Multi-view and 3D Imaging: Recent approaches leverage video-based diffusion models as backbones for multi-view image synthesis and 3D asset creation, exploiting the inherent spatio-temporal consistency of video generators and augmenting sampling with explicit 3D scene reconstructions (e.g., 3D Gaussian splats) (Zuo et al., 2024).
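Of the paradigms above, rectified flow admits a particularly compact statement: interpolate linearly between noise and data, regress the constant velocity between them, and sample by integrating an ODE. The NumPy sketch below is illustrative only; the function names and the t: 0 → 1 (noise → data) convention are assumptions, not drawn from any cited paper.

```python
import numpy as np

def rectified_flow_pair(x0, noise, t):
    """Linear interpolation x_t = (1 - t) * noise + t * x0.
    The regression target for the model is the constant velocity
    v = x0 - noise, independent of t."""
    xt = (1.0 - t) * noise + t * x0
    v_target = x0 - noise
    return xt, v_target

def euler_sample(v_model, noise, steps=50):
    """Integrate dx/dt = v(x, t) from t = 0 (pure noise) to t = 1 (data)
    with plain Euler steps; real systems use better ODE solvers."""
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_model(x, t)
    return x
```

With an oracle velocity field for a fixed (data, noise) pair, Euler integration recovers the data point exactly, which is the sense in which the flow is "rectified": the sampling trajectory is a straight line.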

2. Training Protocols and Objectives

The learning objectives for generative image and video models are tailored to encourage both high output quality and diversity among generated samples. Widely adopted losses include:

  • Adversarial Losses: Min-max games are employed to pit generator and discriminator against each other, with separate branches enforcing appearance, layout, or temporal consistency (Sushko et al., 2021, Bahmani et al., 2022).
  • Reconstruction and Likelihood-based Losses: VAEs and normalizing flows maximize log-likelihood (or surrogate ELBOs), via pixel-wise or perceptual reconstruction losses and regularization by KL-divergence or code similarity (Dandi et al., 2019, Voleti, 2023).
  • Noise-Prediction and Diffusion Losses: Diffusion models minimize mean squared error between predicted and actual noise at each time step, optionally using v-parameterization for stability (Ho et al., 2022). Classifier-free guidance is now standard for improving conditional fidelity (Ho et al., 2022, Arkhipkin et al., 19 Nov 2025).
  • Diversity and Regularization Terms: Explicit diversity-promoting objectives, such as pairwise feature distance regularization, are incorporated to avoid mode collapse and improve output variety in limited data regimes (Sushko et al., 2021).
  • Self-supervised and Masked Objectives: High-ratio tube masking (for videos) or patch masking (for images) compels models to learn robust representations by reconstructing occluded regions (Wang et al., 2022).
  • Task-specific Supervisory Signals: In video editing and propagation, mask prediction decoders and region-aware losses allow the model to selectively reconstruct and propagate changes, aligning generative capacity with downstream requirements (Liu et al., 2024).
  • Auxiliary Losses: These include perceptual or adversarial reconstruction for enhanced image realism at low or variable bitrate (e.g., in latent compression codecs) (Qi et al., 22 May 2025).
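Several of these objectives are simple enough to state directly in code. The sketch below, in NumPy with hypothetical function names, shows the closed-form forward diffusion step, the noise-prediction MSE, and the classifier-free guidance combination; it is a minimal illustration, not any specific paper's implementation.

```python
import numpy as np

def diffuse(x0, alpha_bar_t, eps):
    """Forward process in closed form:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def noise_prediction_loss(eps_pred, eps):
    """Simple DDPM objective: MSE between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one; scale > 1 sharpens fidelity
    to the condition at some cost in diversity."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Note that `cfg` with a scale of 1 recovers the plain conditional prediction, and a scale of 0 ignores the condition entirely.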

3. Temporal, Spatial, and Semantic Consistency

Ensuring both spatial and temporal coherence is a central challenge for generative video models. Techniques include:

  • Temporal Modules: Temporal self-attention, temporal convolutions, and cross-frame attention are embedded within denoising networks to encourage consistent content and motion (Ho et al., 2022, Zuo et al., 2024). For instance, cascaded temporal SR modules in Imagen Video progressively increase frame rate and maintain dynamics (Ho et al., 2022).
  • Content–Layout–Motion Disentanglement: Explicit separation of appearance (content), spatial arrangement (layout), and motion (temporal dynamics) is realized in both generator and discriminator components—SIV-GAN’s dual-branch discriminator is extended to multi-branch architectures for spatio-temporal video consistency (Sushko et al., 2021).
  • Editing and Propagation via Generative Priors: Frameworks such as GenProp distinguish editable from static regions through a Selective Content Encoder, then “propagate” edits via a frozen image-to-video diffusion model, eschewing flow-based or mask-propagation heuristics (Liu et al., 2024).
  • Pseudo-Video Augmentation: Training on “pseudo videos”—constructed sequences from augmented versions of single images—provides self-supervised intermediate targets for diffusion and VAE models, increasing sample quality and regularization (Chen et al., 2024).
  • Latent Space Compatibility and Compression: Video VAEs are designed for strict alignment with image VAEs (e.g., Stable Diffusion), permitting seamless swapping and joint training across modalities, yielding both spatial and temporal compression with plug-and-play capabilities (Zhao et al., 2024).
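The pseudo-video idea can be illustrated concretely: stack lightly augmented copies of one image into a short clip with image-like content but genuine frame-to-frame change. The augmentation below (integer translations via `np.roll`) is a stand-in for whatever transforms a given method actually uses.

```python
import numpy as np

def make_pseudo_video(image, num_frames, max_shift=4, seed=0):
    """Build a (T, H, W, C) 'pseudo video' by applying small random
    translations to a single image. Each frame is a shifted copy, so a
    video model sees plausible temporal variation without real footage."""
    rng = np.random.default_rng(seed)
    frames = []
    for _ in range(num_frames):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        frames.append(np.roll(image, shift=(dy, dx), axis=(0, 1)))
    return np.stack(frames, axis=0)
```

Because `np.roll` only permutes pixels, every frame preserves the source image's content exactly, which makes the sequence a cheap self-supervised target.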

4. Evaluation Metrics and Benchmarking

Comprehensive evaluation of generative image and video models requires metrics sensitive to spatial, temporal, and semantic quality:

  • Image Metrics: FID, Inception Score, LPIPS, and SIFID (single-image FID) are commonly used for image fidelity and diversity (Sushko et al., 2021, Dandi et al., 2019).
  • Video Metrics: Fréchet Video Distance (FVD), Video Inception Score (VIS), and frame-wise measures such as PSNR, SSIM, and mean absolute error are used for video sequence comparison (Ho et al., 2022, Cho et al., 2024). However, these may conflate spatial and temporal errors or cap evaluation to limited frame lengths (Kim et al., 2024).
  • Decoupled Spatio-Temporal Metrics: The STREAM metric independently measures spatial fidelity/diversity (via mean embedding coverage and precision–recall) and temporal naturalness (via power spectrum skewness of feature time-series), supporting arbitrary video lengths and better reflecting human judgment of realism and motion (Kim et al., 2024).
  • Human Annotation Corpora: The GeneVA benchmark provides large-scale, manually labeled data on spatio-temporal artifacts (bounding boxes, free-form descriptions) across video generators, enabling fine-grained detection, understanding, and reward shaping for artifact mitigation (Kang et al., 10 Sep 2025).
  • Task-Level and Bandwidth Metrics: For compression codecs and controllable synthesis, rate-distortion curves, F1 (for canny alignment), and VL scores (subject-driven) are applied to capture trade-offs between byte-efficiency, fidelity, and controllability (Qi et al., 22 May 2025, Cao et al., 29 May 2025).
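FID and FVD both reduce to the same core quantity: the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated samples. A minimal sketch is below, assuming SciPy is available for the matrix square root; real metrics compute the moments over Inception (FID) or I3D (FVD) features, which are omitted here.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians, the quantity behind
    FID and FVD:  ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2})."""
    diff = mu1 - mu2
    covmean = np.real(sqrtm(cov1 @ cov2))  # drop tiny imaginary numerics
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```

The distance is zero for identical distributions, and grows with any mismatch in either the means or the covariances of the embedded samples.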

5. Applications and Specializations

Generative image and video models have been adapted to an array of specialized workflows:

  • Low-Data Synthesis and One-Shot Learning: Models such as SIV-GAN can learn to generate diverse samples from a single image or video, surpassing conventional patch-based models by explicit architectural disentanglement and heavy data augmentation (Sushko et al., 2021).
  • Controllable Editing and Propagation: GenProp unifies video editing (insertion, removal, tracking, shape deformation) through a generative propagation formalism, completely removing the dependence on optical flow or per-frame mask propagation at test time (Liu et al., 2024).
  • Image–to–Video and Keyframe Interpolation: Modern architectures enable interpolation between two frames (“generative inbetweening”) by lightweight adaptation of forward-trained video diffusion models into backward-time variants and dual-directional sampling (Wang et al., 2024).
  • Compression and Bandwidth Reduction: Generative latent codecs move quantization and entropy coding into perceptually aligned latent spaces, drastically reducing bitrate while maintaining fidelity, e.g., via GLC’s vector-quantized VAEs and semantic hyper modules (Qi et al., 22 May 2025).
  • 3D-Aware and Multi-View Generation: Video-based diffusion backbones fine-tuned for multi-view consistency produce large numbers of viewpoint-consistent images and even explicit 3D Gaussian asset representations, supporting downstream 3D reconstruction or rendering tasks (Zuo et al., 2024).
  • Unified Multimodal Generation: Foundation models such as Kandinsky 5.0 and Goku implement single architectures with joint attention and fully compatible VAEs to synthesize both images and videos, leveraging vast training corpora and modular inference/sampling regimes (Arkhipkin et al., 19 Nov 2025, Chen et al., 7 Feb 2025).
  • Downstream Video Understanding: Models pretrained by masked video modeling objectives (as in InternVideo) provide state-of-the-art representations for action recognition, temporal localization, and video–language alignment, confirming the foundational efficacy of generation-conditioned learning (Wang et al., 2022).
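The vector-quantized latent coding used by codecs like GLC reduces, at its core, to nearest-codebook assignment: each latent vector is snapped to its closest code, and only the integer indices need to be entropy-coded and transmitted. The sketch below uses illustrative names and a toy brute-force distance computation.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """VQ-VAE-style quantization: map each (D,)-dim latent to its nearest
    codebook entry. Returns the quantized vectors and the integer indices
    that a codec would entropy-code in place of the raw latents."""
    # (N, D) latents vs (K, D) codebook -> (N, K) squared distances
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d2.argmin(axis=1)
    return codebook[indices], indices
```

The bitrate is then governed by the entropy of the index stream (at most log2 K bits per latent), rather than by the precision of the continuous latents.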

6. Limitations and Future Directions

Despite the substantial progress, key challenges remain:

  • Temporal Coherence at Scale: Many models, especially in ultra-low data settings, lack explicit temporal discrimination mechanisms and may produce flicker or physically implausible motion (Sushko et al., 2021, Kang et al., 10 Sep 2025). Multi-branch discriminators, temporal regularization, or explicit motion models are active research directions.
  • Artifact and Failure Mode Characterization: Human-annotated datasets such as GeneVA expose gaps in automatic metric alignment and ground-truth artifact frequency, highlighting the need for better alignment between algorithmic and perceptual quality (Kang et al., 10 Sep 2025).
  • Data and Compute Efficiency: Foundation-scale models require extensive data curation, filtering, and distributed infrastructure. Efficient distillation and attention mechanisms (e.g., NABLA) are essential for deployment, but further work on lightweight, adaptable models is ongoing (Arkhipkin et al., 19 Nov 2025, Chen et al., 7 Feb 2025).
  • Cross-Modal and Conditional Generation: Extending models to robustly handle complex tasks such as scene-conditioned, multi-object, or audio-visual generation remains open, as does better control of inter-object relations and high-level dynamics (Zuo et al., 2024, Chen et al., 7 Feb 2025).
  • Benchmarking and Standardization: The community is actively developing unified and interpretable metrics for spatio-temporal consistency (e.g., STREAM (Kim et al., 2024)) and more nuanced human-centric benchmarks (e.g., GeneVA (Kang et al., 10 Sep 2025)).

In sum, generative image and video models encapsulate a rapidly evolving domain where advances in architecture, training, evaluation, and practical applications continue to expand the boundaries of data-efficient and controllable visual synthesis.
