Light-X: Generative 4D Video Rendering with Camera and Illumination Control
Abstract: Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
Explain it Like I'm 14
Overview
This paper introduces Light-X, a tool that can take a regular video shot with a single camera and re-create it as a new video where you can control both the camera path (where it moves and looks) and the lighting (like “sunset” or “neon lights”). The authors call this “4D” video because it includes 3D space plus time. Light-X lets you “revisit” the same scene from new angles and under different lighting, all while keeping things stable and realistic across frames.
Key Questions
Here are the main problems the paper tries to solve, explained simply:
- How can we change both the camera’s viewpoint and the lighting in a video at the same time, without the video flickering or looking fake?
- How can we train such a system when it’s really hard to get real videos of the same scene from many viewpoints and with many different lighting setups?
How It Works
To make the system easy to understand, think of a scene as two parts: the “shape and motion” (what’s where and how it moves) and the “light” (how bright things are, where shadows fall, what color the light is). Light-X carefully separates these two pieces and handles each one in a way that helps the model learn better.
Step 1: Geometry and Motion with “Point Clouds”
- A “point cloud” is like a 3D scatter of tiny dots that build your scene, similar to a 3D Lego set made of points.
- The system first estimates a “depth map” for each frame. A depth map tells how far each pixel is from the camera.
- Using these depth maps, it turns the original video into a dynamic point cloud that moves over time—this captures the scene’s geometry and motion.
- Then, using a user-chosen camera path (like “circle around the subject”), it projects that point cloud into the new viewpoints, making rough, geometry-aligned renders and visibility masks that say which parts are visible from that view. These cues strongly guide the model to keep the scene’s shape and motion correct in the new camera path.
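The lift-and-reproject idea in Step 1 can be sketched in a few lines of NumPy. This is a simplified, hypothetical helper (ideal pinhole camera, no occlusion handling), not the paper's implementation:

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map into a 3D point cloud in camera coordinates.

    depth: (H, W) array of per-pixel distances; K: 3x3 camera intrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T          # unit-depth viewing rays
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth -> 3D points

def project(points, K, R, t):
    """Project 3D points into a new camera pose (R, t); returns pixel coords and depths."""
    cam = points @ R.T + t                   # transform into the new camera frame
    z = cam[:, 2:3]
    pix = (cam @ K.T) / z                    # perspective division
    return pix[:, :2], z.ravel()
```

Round-tripping through the same camera recovers the original pixel grid, which is a quick sanity check that the projection geometry is consistent.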
Step 2: Lighting from a Single “Relit” Frame
- Changing video lighting is hard because it can flicker frame-to-frame. Light-X solves this by taking just one frame of the original video and “relighting” it (using a powerful image relighting model called IC-Light).
- This “relit frame” encodes the desired lighting style (e.g., “warm sunlight,” “cool neon”).
- The relit frame is lifted into its own point cloud using the same depths, so geometry matches perfectly.
- It’s then projected along the target camera path, creating lighting-aligned views and masks that tell the model where lighting information is available. This gives the model clear, frame-by-frame lighting hints tied to the actual scene geometry.
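A minimal sketch of how a relit frame's colors could be carried into a new view via its point cloud, producing a rough render plus the visibility mask described above. The function name and the depth-sorted painter's-algorithm splatting are illustrative assumptions; the paper uses a proper point-cloud renderer:

```python
import numpy as np

def splat_to_view(points, colors, K, R, t, H, W):
    """Project colored 3D points into a target camera view.

    Returns a rough (H, W, 3) render and a boolean visibility mask marking
    pixels that received at least one point. Points are drawn far-to-near
    so nearer points overwrite farther ones (painter's algorithm).
    """
    cam = points @ R.T + t                   # move points into the target camera frame
    z = cam[:, 2]
    front = z > 1e-6                         # keep only points in front of the camera
    pix = (cam[front] @ K.T) / z[front, None]
    u = np.round(pix[:, 0]).astype(int)
    v = np.round(pix[:, 1]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)   # clip to the image bounds
    u, v, zf, cf = u[ok], v[ok], z[front][ok], colors[front][ok]

    img = np.zeros((H, W, 3))
    mask = np.zeros((H, W), dtype=bool)
    for i in np.argsort(-zf):                # far-to-near so near points win
        img[v[i], u[i]] = cf[i]
        mask[v[i], u[i]] = True
    return img, mask
```

The mask is exactly the kind of cue described above: it tells the generator which target-view pixels actually carry lighting information and which must be hallucinated.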
Step 3: Putting It All Together with a Video Diffusion Model
- A “diffusion model” is like a smart cleaner: it starts with noisy video and repeatedly removes noise, guided by the cues you provide, until a sharp video appears.
- Light-X feeds both sets of cues (geometry/motion views and lighting views) into a Video Diffusion Transformer (DiT) along with text tokens (like lighting prompts) and special “illumination tokens” extracted from the relit frame using a Q-Former (a small network that learns to ask the right questions to extract lighting features).
- These illumination tokens provide “global” control so the lighting stays consistent even in frames far from the original relit frame, reducing fading or sudden shifts.
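The "global control" idea — every video token attending to a small set of illumination tokens — can be sketched as single-head cross-attention with a residual connection. This is an illustrative NumPy sketch, not the paper's light-DiT layer (which is a multi-head transformer block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def light_cross_attention(video_tokens, light_tokens, Wq, Wk, Wv):
    """Video tokens (queries) attend to global illumination tokens (keys/values).

    Single-head sketch: the same few lighting tokens are visible to every
    frame's tokens, which is what keeps the lighting globally consistent.
    """
    Q = video_tokens @ Wq
    K = light_tokens @ Wk
    V = light_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (num_video, num_light) weights
    return video_tokens + attn @ V                   # residual injection of lighting context
```

Because the illumination tokens are shared across all frames, a frame far from the relit keyframe still receives the same lighting signal as the first frame, which is the mechanism behind the reduced fading described above.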
Step 4: Training Data with “Light-Syn”
- Real training pairs with many camera views and lighting variations are very rare.
- The authors create a synthetic training pipeline called Light-Syn: they start from real videos (as “targets”), make a “degraded” version (like edited or relit) to serve as “input,” and then use inverse transformations to align geometry and lighting cues.
- This strategy builds many training pairs across static scenes, dynamic scenes, and AI-generated videos, helping the model learn robustly without needing real multi-view, multi-light captures.
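The Light-Syn pairing logic can be summarized as a tiny data-flow sketch. Here `degrade` and `build_cues` are hypothetical stand-ins for the IC-Light-style relighting step and the point-cloud cue construction with inverse mapping; the names and dictionary layout are illustrative, not the paper's API:

```python
def make_training_pair(target_video, degrade, build_cues):
    """Sketch of the Light-Syn idea.

    The real clip serves as the *target*; a degraded (e.g., relit or edited)
    copy serves as the *input*; conditioning cues are built by inverse-mapping
    the degradation so geometry/lighting cues stay aligned with the input.
    """
    degraded = degrade(target_video)            # synthetic "source" video
    cues = build_cues(degraded, target_video)   # geometry + lighting conditions
    return {"input": degraded, "condition": cues, "target": target_video}
```

Training on such triples teaches the model to map a degraded input plus cues back to the pristine clip, without ever needing a real multi-view, multi-light capture.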
Main Findings
After testing Light-X against other methods, the authors found:
- It can control camera movement and lighting together, which previous tools couldn’t do well.
- It produces videos with better lighting quality and less flicker over time than state-of-the-art video relighting tools.
- It stays consistent when the camera moves to new viewpoints, thanks to the point-cloud geometry cues.
- It works under different kinds of lighting instructions:
  - Text prompts (e.g., “neon light” or “sunlight”)
  - Background images (relight a foreground video to match a background)
  - HDR environment maps and even reference images that provide lighting style
In simple terms: Light-X makes videos that look realistic, stay stable across frames, and match both your camera path and your lighting choices better than previous methods.
Why It Matters
Light-X opens up creative and practical possibilities:
- Filmmaking and video editing: Change camera motion and lighting after filming, saving time and money.
- AR/VR: Recreate scenes from new angles with lighting that matches virtual environments, making experiences more immersive.
- Content creation: Easily apply specific styles (like “golden hour” or “stage spotlight”) and move the camera smoothly around subjects.
This brings us closer to fully controllable, high-quality, generative videos of real-world scenes.
Limitations and Future Directions
The authors note a few current challenges:
- The quality of lighting depends on the single-image relighting step; if that step struggles, the final video may suffer.
- The method relies on estimated depth; bad depth leads to geometry errors, especially with very large camera moves (like turning all the way around a subject).
- Like other diffusion models, generating long, detailed videos can be slow and may struggle with fine details (e.g., hands).
They suggest future work on stronger video backbones, better point-cloud handling for wider camera ranges, and techniques to generate longer videos more efficiently.
Summary
- Light-X separates “shape/motion” from “lighting” using two aligned point clouds and fuses them with a diffusion transformer to generate stable, realistic videos with user-controlled camera paths and lighting.
- A clever training data pipeline (Light-Syn) makes it possible to learn this without needing rare multi-view, multi-light captures.
- Experiments show it outperforms previous methods in both lighting quality and temporal consistency across several tasks and input types.
- It could have significant impact on filmmaking, AR/VR, and creative content, while future improvements aim to handle wider camera motions, finer details, and longer videos.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable future research.
- Physically-grounded lighting representation
  - The model relies on a single relit frame and learned tokens rather than estimating explicit lighting (e.g., spatially varying light sources, environment maps) and material BRDFs; how to jointly infer physically meaningful illumination and reflectance to enable accurate view-dependent effects, cast shadows, and interreflections?
- Single-frame lighting cue and temporal decay
  - Illumination is injected from one relit frame; lighting strength is observed to decay over time. What are effective strategies (e.g., multi-frame relit supervision, keyframe selection, recurrent or time-aware lighting tokens) for maintaining stable, time-consistent illumination across long sequences?
- Dynamic, time-varying light editing
  - The current setup assumes largely static lighting. How to support temporally evolving lighting (moving light sources, flicker, day–night transitions) with user-controllable trajectories in time?
- Geometry prior limitations (per-frame depth, dynamic point clouds)
  - Point clouds derived from monocular depth are error-prone and struggle under wide baselines and severe parallax. Can joint optimization of geometry (e.g., multi-view refinement during training), or explicit 3D/4D scene representations (GS/NeRF/meshes) improve occlusion handling, hole filling, and wide-range camera redirection (up to 360°)?
- Intrinsics/extrinsics calibration and scale ambiguity
  - Camera intrinsics are set empirically, and global scale is not calibrated. How to perform self-calibration and resolve scale ambiguities from monocular videos to stabilize projection-based conditioning?
- Large motion, long-sequence scalability
  - The model trains on 49-frame clips and struggles with very wide camera motions. What architectural or training strategies (streaming/online generation, memory-efficient DiTs, recurrent conditioning) enable long videos and large camera arcs without drift?
- Light-Syn data bias and ground-truth scarcity
  - Training supervision is derived via “degraded” videos produced by other algorithms (IC-Light, LAV), risking bias inheritance and ceiling performance. How to build or simulate unbiased paired multi-view/multi-illumination datasets, or devise self-supervised disentanglement that avoids reliance on external models?
- Inverse mapping validity and conditioning uncertainty
  - The inverse-mapping step that transfers geometry/illumination from target to source is assumed accurate but not quantitatively validated. Can we measure and model uncertainties of projected cues (depth/masks/lighting) and incorporate them via confidence-aware conditioning?
- Mutual illumination with background conditioning
  - Background-conditioned relighting composites foregrounds and backgrounds but does not model bidirectional light transport (cast shadows, color bleeding). How to simulate physically plausible FG–BG light interaction when relighting?
- Non-Lambertian, transparent, and participating media
  - View-dependent effects (specular highlights, glossy reflections), transparency, and volumetric scattering are not explicitly modeled or evaluated. What conditioning or representations better handle these phenomena under camera and lighting changes?
- Illumination control interfaces and disentanglement
  - Illumination control relies on text/background/HDR/ref-image cues with fixed soft-mask weights. How to learn content-adaptive, uncertainty-aware fusion of multiple illumination modalities and expose interpretable controls (direction, intensity, color temperature) with predictable, disentangled effects?
- Optimal relit frame selection
  - The relit frame is chosen arbitrarily (often the first). How to automatically select one or more keyframes (based on coverage, depth quality, saliency) and fuse their lighting cues for maximal temporal and spatial robustness?
- Handling dynamic scenes with severe non-rigid deformations
  - The dynamic point-cloud approach may break under topological changes, occlusion swaps, or fast articulations. Can motion segmentation, layered representations, or motion-aware priors improve stability in complex dynamics?
- Shadow geometry and visibility consistency under novel views
  - There is no constraint ensuring that occluder–occludee relationships and shadow placements remain consistent after camera redirection. How to incorporate differentiable visibility/shadow constraints to maintain physically plausible shading across views?
- Evaluation protocols and metrics
  - Reported FID is computed against IC-Light results (a biased reference), and the “unbiased” setting still uses LAV for degradation and LLaVA-identified prompts. Can we establish standardized benchmarks with ground-truth multi-illumination videos and metrics targeting lighting fidelity (shadow accuracy, highlight placement, shading consistency) beyond aesthetic/CLIP/flow-based proxies?
- Material and albedo consistency
  - The method preserves “identity,” but reflectance and albedo are not explicitly evaluated or controlled. How to disentangle and preserve/edit material properties under relighting while avoiding color drift?
- Robustness to camera/sensor artifacts
  - Exposure changes, auto white balance, rolling shutter, motion blur, and noise can degrade depth and lighting cues. What pre/post-processing or model-internal normalization improves robustness to these factors?
- Conflict resolution between camera and illumination edits
  - Joint control can introduce conflicts (e.g., implausible lighting under extreme camera motion). How to detect and resolve conflicts, provide user feedback, or constrain edits to remain physically plausible?
- Compute and latency
  - Multi-step diffusion is computationally expensive; inference times are reported without detailed hardware/runtime analysis. Can distillation, consistency models, or latent compression reduce latency and memory while maintaining quality?
- Generalization, fairness, and domain coverage
  - The training set composition (static/dynamic/AI-generated) may not cover all domains (e.g., portraits, crowded scenes, cultural artifacts). How to assess and improve generalization across demographics, object categories, and scene types, and report fairness metrics?
- Reproducibility and openness of Light-Syn
  - The paper does not specify whether Light-Syn data or generation scripts will be released. Can releasing datasets, inverse-mapping code, and standardized prompts/trajectories enable reproducible benchmarking and accelerate progress?
- Uncertainty estimation and confidence-aware control
  - Depth, masks, and lighting cues carry uncertainty, but conditioning treats them deterministically. How to model and propagate uncertainty (e.g., probabilistic masks, Bayesian conditioning) to improve robustness and prevent artifacts?
- Multi-modal illumination fusion at inference
  - Simultaneous use of text + HDR + reference + background is not explored. What are principled strategies (gating, attention-based arbitration, causal constraints) for resolving ambiguous or conflicting lighting hints at test time?
Practical Applications
Below is an overview of practical applications of Light-X’s findings and innovations, grouped by deployment horizon. Each item briefly specifies the sector(s), likely tools or workflows, and assumptions or dependencies that affect feasibility.
Immediate Applications
The following applications are feasible with current offline pipelines and commodity GPU infrastructure, using Light-X’s disentangled camera–illumination control, dynamic point-cloud guidance, and the Light-Syn data pipeline.
- Post-production relighting with camera redirection for existing footage
  - Sector: Film/TV, advertising, short-form content
  - Tools/Workflow: NLE plugins for Adobe Premiere/After Effects/DaVinci Resolve; “Relight-Redirect” batch pipeline; text- or background-guided light edits; export in editorial codecs
  - Assumptions/Dependencies: Sufficient GPU compute; reliable monocular depth for dynamic point clouds; legal rights to modify footage; IC-Light or equivalent single-image relighting quality
- Background-conditioned foreground relighting for compositing
  - Sector: VFX, social media, virtual production
  - Tools/Workflow: Foreground masks + background reference image/HDR map; soft illumination masks; Light-X relight pass prior to compositing
  - Assumptions/Dependencies: Quality of foreground segmentation; accurate HDR/environment estimation; consistent camera intrinsics
- Text-prompt lighting for previsualization and storyboarding
  - Sector: Entertainment production, indie creators
  - Tools/Workflow: Rapid “look exploration” by prompting (“neon light,” “soft morning sunlight”) on existing video; iterate camera beats and lighting without reshoots
  - Assumptions/Dependencies: Prompt-to-illumination alignment; single-image relighting fidelity; offline rendering time
- E-commerce product video variants under diverse lighting and viewpoints
  - Sector: Retail/e-commerce, marketplace sellers
  - Tools/Workflow: Batch generation of style-consistent lighting sets (studio, daylight, warm/cool) and reframed camera moves; A/B testing creative
  - Assumptions/Dependencies: Stable product identity preservation; avoidance of material misrepresentation; clear disclosure of edits
- AR/VR content conversion from monocular footage (offline)
  - Sector: XR content studios, education demos
  - Tools/Workflow: Lift existing monocular videos to view-consistent sequences for limited free-viewpoint playback in Unity/Unreal; apply scene lighting variants
  - Assumptions/Dependencies: Depth quality limits camera range; temporal consistency over longer sequences; offline preprocessing
- Social/UGC enhancement: relight and reframe mobile videos
  - Sector: Consumer apps, creator tools
  - Tools/Workflow: Cloud inference “relight+redirect” filters; text prompts; stylized lighting references
  - Assumptions/Dependencies: Cloud latency and costs; IP and consent handling; model robustness for handheld camera shake
- Dataset augmentation for vision models with joint camera–illumination variations
  - Sector: ML/CV research and industry (perception, tracking)
  - Tools/Workflow: Light-Syn pipeline to produce paired variants; controlled lighting and viewpoint trajectories to stress-test algorithms
  - Assumptions/Dependencies: Domain gap versus in-the-wild data; labeling integrity; compute for bulk generation
- Cinematography and lighting pedagogy
  - Sector: Education (film schools, online courses)
  - Tools/Workflow: Interactive exercises: change light type/direction/intensity via prompts and observe motion/geometry impacts; compare background- vs text-conditioned relighting
  - Assumptions/Dependencies: Access to curated clips; instructor guidance on limitations (e.g., physical realism vs perceptual quality)
- Improved virtual backgrounds and meeting lighting (offline pre-processing)
  - Sector: Enterprise communications, telepresence
  - Tools/Workflow: Pre-process recorded sessions to harmonize subject foreground with target background lighting; reduce flicker across frames
  - Assumptions/Dependencies: Privacy and disclosure; segmentation quality; not real-time
- Interior design mood boards from phone videos
  - Sector: AEC/Design marketing
  - Tools/Workflow: Quick “lighting mood” variants (day/night, warm/cool, accent lighting) for room walkthroughs captured on mobile
  - Assumptions/Dependencies: Monocular geometry limits accuracy; relighting is generative/harmonization, not physically accurate daylighting
- Forensic and inspection pre-analysis (qualitative)
  - Sector: Security, insurance claims review
  - Tools/Workflow: Normalize lighting across frames to better inspect surfaces/IDs; explore viewpoint changes for context
  - Assumptions/Dependencies: Must not be used as evidentiary truth; disclose manipulation; depth/relight inaccuracies can bias interpretation
- 3D reconstruction assistance via dynamic point cloud priors
  - Sector: Photogrammetry workflows, archival digitization
  - Tools/Workflow: Use Light-X projections to guide camera path selection and fill-in novel views to aid human-in-the-loop reconstruction
  - Assumptions/Dependencies: Not a replacement for multi-view capture; geometry fidelity bounded by monocular depth estimation
Long-Term Applications
These rely on further research, scaling, and engineering to address noted limitations (depth accuracy, extreme camera motions, real-time performance, and physical lighting models).
- Real-time live broadcast relighting and camera control
  - Sector: News, sports, live entertainment
  - Tools/Workflow: Studio pipelines that harmonize on-air lighting with creative intent; operator-driven camera redirects; GPU clusters or edge accelerators
  - Assumptions/Dependencies: Real-time diffusion acceleration; robust depth under motion/occlusion; transparent disclosure standards
- On-device mobile AR relighting with viewpoint adjustments
  - Sector: Consumer mobile, AR platforms
  - Tools/Workflow: Lightweight model variants; hardware acceleration (NPUs/GPUs); streaming or hybrid inference
  - Assumptions/Dependencies: Model compression; battery/thermal constraints; user privacy
- 360-degree free-viewpoint generation in dynamic scenes
  - Sector: XR, live events capture
  - Tools/Workflow: Progressive point-cloud expansion; multi-trajectory control; integration with multi-view priors or SLAM
  - Assumptions/Dependencies: Improved depth/geometry priors; scene completion beyond observed views; long-sequence consistency
- Robotics and autonomy simulation with domain-randomized lighting and camera paths
  - Sector: Robotics, autonomous driving, drones
  - Tools/Workflow: Integrate Light-X into synthetic data engines; vary illumination and viewpoints to train robust perception stacks
  - Assumptions/Dependencies: Sufficient physical plausibility for sensor models; performance at scale; validation against real-world benchmarks
- Telepresence XR: ambient-matched relighting for remote collaboration
  - Sector: Enterprise collaboration, remote training
  - Tools/Workflow: Real-time environment-light estimation; harmonize user video with shared virtual spaces; consistent multi-user camera control
  - Assumptions/Dependencies: Environment map accuracy; multi-user synchronization; latency constraints
- Digital twin content pipelines with controllable lighting for scenario testing
  - Sector: Industrial IoT, safety training
  - Tools/Workflow: Use Light-X to create observational “what-if” variants for training/communications; link with simulation engines
  - Assumptions/Dependencies: Need physically-grounded rendering for engineering decisions; audit logs of generative edits
- Physically informed relighting for architecture/energy analysis
  - Sector: AEC, energy modeling
  - Tools/Workflow: Couple Light-X with physically-based daylight simulators; use generated variants for early-phase stakeholder communication
  - Assumptions/Dependencies: Requires integration with physical light transport models; accuracy certification; camera calibration
- Cinematography co-pilot (“DP-in-the-loop”)
  - Sector: Film/TV toolchains
  - Tools/Workflow: Interactive system that suggests camera paths and lighting prompts; learns from scene metadata and director notes; version control of edits
  - Assumptions/Dependencies: Human oversight; shot continuity rules; production asset tracking
- Synthetic media governance and provenance tooling
  - Sector: Policy, platforms, media companies
  - Tools/Workflow: Watermarking, edit provenance logs, mandatory disclosure UI, classifier assistance for manipulated lighting/camera
  - Assumptions/Dependencies: Standards cooperation; platform adoption; privacy and consent frameworks
- Academic benchmarks and shared datasets for joint camera–illumination control
  - Sector: Academia/Research
  - Tools/Workflow: Extend Light-Syn to produce community datasets with diverse scenes and lighting references; define evaluation metrics and testbeds
  - Assumptions/Dependencies: Licensing and ethics; diversity and bias audits; compute grants for reproducibility
- Multi-modal assistant for lighting prompt authoring
  - Sector: Creative software
  - Tools/Workflow: Prompt recommender tied to environment maps/reference images; style libraries; “illumination tokens” editing UI
  - Assumptions/Dependencies: UI/UX refinement; creators’ acceptance; consistent mapping from prompts to controllable visual outcomes
Cross-cutting assumptions and dependencies
- Technical: High-quality monocular depth estimation; stable dynamic point clouds; robustness under large camera motion; diffusion model compute/time; integration with IC-Light or equivalent single-image relighting priors; environment map accuracy.
- Legal/ethical: Rights to modify content; transparent disclosure of synthetic edits; watermarking/provenance; consent for relighting of people; avoidance of deceptive use.
- Operational: GPU availability and cost; workflow integration with existing toolchains (NLEs, game engines); prompt engineering expertise; QA of identity and albedo preservation for brand/products.
Glossary
- 4D consistency (4DC): Consistency of appearance and geometry across 3D space and time. Example: "4D consistency (4DC, spatio-temporal coherence in the novel-view setting)"
- Aesthetic Preference metric: A combined measure of visual appeal and perceived image quality used for evaluation. Example: "Aesthetic Preference metric, defined as the mean of the aesthetic score and image quality"
- albedo: The intrinsic, lighting-independent reflectance of a surface. Example: "ID preservation (IP, consistency of the object's identity and albedo after relighting)"
- autoregressive transformers: Generative models that predict the next element in a sequence conditioned on previous ones. Example: "to autoregressive transformers~\citep{wu2022nuwa}"
- back-projection: Mapping image pixels with known depth back into 3D space. Example: "Each frame is then back-projected to 3D space to form a dynamic point cloud"
- camera extrinsics: External camera parameters (rotation and translation) defining pose relative to a world frame. Example: "camera intrinsics and extrinsics "
- camera intrinsics: Internal camera parameters like focal length and principal point. Example: " is the camera intrinsics matrix."
- CLIP similarity: A measure of semantic alignment using CLIP embeddings, here used across frames for temporal consistency. Example: "average CLIP~\citep{clip} similarity between consecutive frames"
- cross-attention: An attention mechanism that conditions one set of tokens on another. Example: "through cross-attention:"
- degradation-based pipeline: A data synthesis approach that creates paired training data by degrading videos and inverting the transformations. Example: "a degradation-based pipeline with inverse-mapping"
- Diffusion Transformers (DiT): Transformer-based architectures used as the backbone for diffusion models. Example: "Diffusion Transformers (DiT)~\citep{dit}"
- domain indicators: Signals that specify which conditioning domain is in use to aid multi-modal generalization. Example: "These soft masks act as domain indicators~\citep{4dnex}, enabling a single model to generalize across diverse illumination conditions."
- dynamic point cloud rendering: Rendering technique that projects time-varying point clouds along specified camera trajectories. Example: "Camera trajectories are modeled through dynamic point cloud rendering like~\citep{trajectorycrafter}"
- dynamic point clouds: Sequences of 3D points across frames representing moving scenes or cameras. Example: "we leverage dynamic point clouds as an explicit inductive bias"
- FID (Fréchet Inception Distance): A metric comparing the distribution of generated and reference images. Example: "Relighting quality is measured by FID~\citep{fid}"
- Fréchet Video Distance (FVD): A metric comparing the distribution of generated and reference videos. Example: "including PSNR, SSIM~\citep{ssim}, LPIPS~\citep{lpips}, and FVD~\citep{fvd}"
- geometric prior: Structural cues (e.g., projected views and masks) that guide the model toward geometrically consistent outputs. Example: "serve as a strong geometric prior"
- global illumination control module: A component that enforces consistent lighting across frames globally. Example: "we introduce a global illumination control module."
- HDR environment map: A high dynamic range panoramic map representing scene lighting used for relighting. Example: "an HDR environment map"
- IC-Light: A diffusion-based image relighting method used for high-fidelity illumination editing. Example: "IC-Light~\citep{iclight} employs a light-transport consistency loss"
- inverse mapping: Applying the inverse of a degradation transformation to align cues between domains. Example: "By applying the inverse mapping of the degradation process"
- inverse perspective projection: Mapping from 2D image coordinates and depth back to 3D coordinates. Example: "where denotes the inverse perspective projection"
- Light-DiT layer: A specialized DiT layer introduced to inject global illumination tokens for consistent lighting. Example: "we introduce a light-DiT layer that enforces global illumination consistency."
- Light-Syn: The proposed data pipeline that synthesizes paired training data via degradation and inverse mapping. Example: "we introduce Light-Syn, a degradation-based pipeline"
- LPIPS: A perceptual distance metric measuring similarity based on deep features. Example: "LPIPS~\citep{lpips}"
- monocular videos: Videos captured from a single camera/viewpoint. Example: "from monocular videos"
- Motion Preservation: A metric quantifying how well motion is preserved relative to the source video. Example: "Motion Preservation, computed as the deviation between RAFT~\citep{raft} estimated optical flow and that of the source video."
- novel-view video synthesis: Generating video frames from viewpoints not present in the source footage. Example: "novel-view video synthesis with accurate camera motion and strong spatio-temporal consistency."
- optical flow: The pixel-wise apparent motion between consecutive frames. Example: "optical flow"
- patchified: The process of converting spatial feature maps into sequences of patch tokens for transformers. Example: "patchified into a sequence of vision tokens"
- PSNR: A signal fidelity metric measuring reconstruction quality in decibels. Example: "PSNR"
- Q-Former: A query-based transformer that extracts compact features from visual tokens. Example: "we employ a Q-Former~\citep{li2023blip} to extract illumination information."
- Ref-DiT: A reference-conditioned DiT module used to preserve consistency with input video cues. Example: "we retain the original DiT and Ref-DiT modules"
- relighting (video relighting): Editing or re-rendering a scene/video under different lighting conditions. Example: "video relighting"
- soft-weighted illumination mask: A mask with fractional weights indicating the confidence or strength of lighting cues. Example: "a soft-weighted illumination mask enables seamless integration of diverse lighting cues"
- spatio-temporal consistency: Coherence of content across both spatial dimensions and time. Example: "strong spatio-temporal consistency"
- SSIM: A perceptual similarity metric focusing on structural information. Example: "SSIM~\citep{ssim}"
- VAE decoder: The decoder component of a Variational Autoencoder used to reconstruct videos from latents. Example: "a VAE decoder reconstructs a high-fidelity video"
- video diffusion priors: Learned generative priors from diffusion models applied to video synthesis tasks. Example: "we leverage video diffusion priors for controllable video synthesis."
- viewpoint control: Conditioning generation on camera parameters to control the rendered view. Example: "for viewpoint control."
- visibility masks: Binary or soft masks indicating which projected points are visible in the target view. Example: "visibility masks"