Papers
Topics
Authors
Recent
Search
2000 character limit reached

MoCha:End-to-End Video Character Replacement without Structural Guidance

Published 13 Jan 2026 in cs.CV | (2601.08587v2)

Abstract: Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha

Summary

  • The paper presents the first end-to-end non-reconstruction framework for video character replacement that requires only a single mask input.
  • It leverages condition-aware RoPE and a diffusion transformer to achieve robust temporal consistency and enhanced identity preservation.
  • Experimental results show significant gains in SSIM, LPIPS, and PSNR over baselines, enabling practical deployment in real-world scenarios.

MoCha: An End-to-End Paradigm for Video Character Replacement without Structural Guidance

Introduction and Motivation

MoCha presents a significant methodological departure in the video character replacement task, addressing long-standing challenges in generating high-fidelity, temporally coherent videos when replacing a character given only sparse input cues. Previous paradigms are predominantly reconstruction-based, relying on dense, per-frame segmentation masks and explicit structural guidance (e.g., pose, depth) to perform spatial locking and motion transfer. This approach, while effective in constrained scenarios, suffers severely in practical video settings where occlusions, complex poses, challenging illumination, and multi-character interactions are the norm. The strict requirements on mask sequences and explicit guidance pose scalability and generalization bottlenecks—limiting their efficacy for film, gaming, and content creation pipelines.

MoCha advances the state of the art by introducing the first end-to-end, non-reconstruction, implicit learning architecture for video character replacement. It requires only a single arbitrary frame mask as spatial localization input and forgoes any explicit structural signals, leveraging the inherent spatiotemporal reasoning capabilities of modern video diffusion models. This design obviates dense annotation and increases robustness, scalability, and deployment feasibility for real-world applications. Figure 1

Figure 1: Contrast between reconstruction-based paradigms requiring dense masks/structural guidance (a), versus MoCha's non-reconstruction, single-frame mask approach (b).

Methodology

Data Construction Pipeline

Key to the development of MoCha is a scalable, high-quality, paired training dataset construction pipeline, comprising:

  1. High-Fidelity Synthetic Data: Leveraging Unreal Engine 5, paired videos are rendered with identical motion but varied character appearances, with explicit character masks and diverse reference imagery. This affords perfect ground truth alignment, critical for supervised replacement.
  2. Expression-Driven Face Animation Data: Using image inpainting and advanced portrait animation techniques, pairs are constructed to decouple identity and expression, enhancing facial expressiveness and robustness.
  3. Augmented Real-World Video Data: Large-scale real video-mask pairs from public datasets (e.g., VIVID-10M, VPData) enhance realism, mitigate synthetic domain artifacts, and increase the ecological validity of training. Figure 2

    Figure 2: Overview of the three-pronged data construction: synthetic rendered pairs, expression-driven face animation, and augmented video-mask pair integration.

End-to-End Replacement Architecture

MoCha is trained using a DiT-based (Diffusion Transformer) video latent diffusion model under the Rectified Flow framework, with essential innovations:

  • In-Context Learning with Condition-Aware RoPE: All conditioning inputs—source video, single-frame mask, and multiple reference images—are encoded via a 3D VAE and concatenated temporally. A novel variant of Rotary Positional Embedding (condition-aware RoPE) is constructed to flexibly distinguish, align, and fuse these heterogeneous tokens. Critically, this enables frame-to-frame correspondence, variable mask positioning, and the seamless aggregation of multi-referential conditioning cues.
  • Identity-Enhancing RL-Based Post-Training: Post-finetuning, a reinforcement learning phase leverages a differentiable ArcFace-based reward to explicitly optimize facial identity consistency between outputs and reference images. This is combined with a dense MSE reconstruction objective to avoid reward hacking (i.e., naïve reference copying). LoRA (Low-Rank Adaptation) is used for parameter-efficient tuning in this stage. Figure 3

    Figure 3: MoCha's two-stage pipeline: (a) In-context learning with condition-aware RoPE and (b) RL-based post-training for refined facial consistency.

Experimental Evaluation

Quantitative and Qualitative Results

In comprehensive benchmarks both on synthetic data (rendered, perfectly aligned) and on a challenging real-world set, MoCha is shown to surpass prior state-of-the-art systems, specifically in:

  • Identity Preservation: MoCha achieves a substantial improvement in SSIM (0.746), LPIPS (0.152), and PSNR (23.09) over competitive baselines such as Wan-Animate and HunyuanCustom.
  • Temporal and Structural Consistency: On VBench suite metrics, MoCha attains the highest scores for subject and background consistency, as well as motion smoothness and temporal flicker suppression.
  • Qualitative Fidelity: Outputs show superior lighting, facial expression transfer, and seamless integration into original scenes, including in scenes with occlusion, unusual motion, and multi-subject interaction. Figure 4

    Figure 4: Qualitative comparison against state-of-the-art methods: MoCha demonstrates improved animation consistency, facial expressivity, and lighting fidelity.

Ablation Studies

Ablations underscore the importance of both real-human video data for photorealism and expression transfer, and the RL-based post-training phase for robust facial identity. The tracking ability of MoCha's diffusion model, needing only a single-frame mask, is visually confirmed by attention score maps, which show stable tracking across frames. Figure 5

Figure 5: Ablation illustrating that real-human data significantly increases realism and facial fidelity.

Figure 6

Figure 6: RL-based identity post-training is critical for enhancing facial consistency with minimal degradation in other quality aspects.

Figure 7

Figure 7: Attention visualization shows MoCha robustly tracks the character specified by only a single mask frame across the sequence.

Performance in Complex Scenarios

MoCha robustly handles cases involving complex illumination, occlusion, object interaction, and multi-character scenes, demonstrating resilience and high visual quality under practical production conditions. Figure 8

Figure 8: MoCha's output on real-world challenging cases, confirming robustness and suitability for production-grade applications.

Implications and Future Directions

MoCha's non-reconstruction, implicit paradigm represents a foundational shift in controllable video editing, substantially lowering the barriers for practical deployment in content creation, entertainment, and advertising. By leveraging robust spatiotemporal reasoning, the need for dense structural annotations is eliminated, significantly reducing the annotation and computational overhead.

Notably, the architecture generalizes beyond human character replacement, accommodating non-human subject swaps and downstream editorial tasks (e.g., face swapping, virtual try-on) with minimal modification. The compositional, multi-reference conditioning and RL-based post-training align with evolving trends in controllable video generation, and set precedents for leveraging transformer-based in-context learning in video, potentially influencing future research in multimodal understanding, few-shot adaptation, and self-supervised video analysis.

Further research is anticipated in scaling richer conditioning, improving sample efficiency, and extending to personalized or interactive generative editing.

Conclusion

MoCha introduces an end-to-end, structurally agnostic framework for high-fidelity video character replacement. Through the introduction of condition-aware RoPE, RL-based post-training, and a comprehensive multi-source data pipeline, MoCha demonstrates significant empirical gains over existing methods in identity transfer, temporal consistency, and visual realism. Its capacity to generalize to complex, real-world scenarios without dense annotation establishes MoCha as a practical and theoretically compelling advance in controllable video generative modeling (2601.08587).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

Overview

This paper introduces MoCha, a new tool that can replace a person or character in a video with another character while keeping everything else the same—like the background, camera motion, lighting, and the way the person moves and makes expressions. Unlike older methods that need lots of extra information (like a detailed cut-out of the person in every single frame, or a stick-figure skeleton of their pose), MoCha works end-to-end and only needs a single mask from one frame to show which person should be replaced.

What questions is the paper trying to answer?

  • Can we swap a character in a video while keeping the original motion, expressions, background, and lighting—without using complicated extra inputs?
  • Is it possible to guide a powerful video generation model using minimal hints (a single-frame mask and a few reference images) so it learns to track and replace the right person across the whole video?
  • Can we make the replaced face look more like the target identity, consistently across frames?
  • How can we train such a system if there aren’t many perfect “paired” videos (where the same motion is performed by two different characters) available in the real world?

How MoCha works (in everyday language)

Think of a video like a flipbook: a stack of pictures that create motion when you flip through them. MoCha takes the original flipbook and a few sample photos of the new character you want to insert. It also gets one “sticky note” (a mask on one frame) that points to which person to replace. Then it generates a new flipbook where the chosen person is swapped, but everything else—in the scene and the motion—stays consistent.

Here are the key ideas behind MoCha:

  • Video diffusion modeling: This is a way computers “paint” videos by starting from random noise and gradually turning it into a realistic sequence, step by step—like carving a statue from a rough block, refining it over time.
  • In-context learning: Instead of giving the model strict rules, MoCha feeds it all the helpful clues at once: the source video, the single-frame mask, and reference images of the new character. By seeing these together, the model learns how to copy the motion and expressions from the source and apply them to the new identity, while keeping the background and lighting intact.
  • Condition-aware RoPE (positional clues): Computers don’t naturally know “what comes first” or “which token belongs to which frame.” MoCha uses special positional tags (like time-and-place labels) to tell the model:
    • These tokens belong to the source video and target video and should line up frame-by-frame.
    • This token is the mask that marks the person to replace (on a specific frame).
    • These tokens are reference images of the new character (they get a special “reference” label).
    • This structured tagging helps the model understand how to combine all the inputs correctly.
  • Identity-Enhancing post-training (reward-based fine-tuning): After the main training, MoCha gives itself a “score” for how similar the generated face is to the reference identity, using a face-recognition tool (ArcFace). It then tweaks its behavior to improve that score. To avoid “cheating” (like pasting the reference photo directly onto the video), it also checks that the overall video still looks natural and close to the ground truth, and only adjusts the later steps where fine details appear.
  • Smart training data creation: Perfect training pairs are rare (we want two videos with exactly the same motion in the same scene, but different characters). So the authors built their own:
    • High-quality rendered pairs using Unreal Engine 5: They create scenes where different characters perform the exact same motion with the same camera and lighting.
    • Expression-driven pairs: They swap a face in a photo (using an inpainting model) and then animate both the original and the swapped image with the same expression video, so the motion matches but the identity differs.
    • Augmented real video data: They add real videos with character masks (from public datasets) and carefully filter and improve them, so the model learns from real-world textures and lighting too.

What did they find?

  • MoCha beats other state-of-the-art methods in how well it keeps the person’s identity, how smooth the motion looks over time, and how accurately expressions are reproduced.
  • It maintains original lighting and background details better than methods that “reconstruct” the character region with heavy guidance.
  • MoCha works even in hard cases: occlusions (when people or objects overlap), unusual poses (like acrobatics), and multi-person interactions (like hugging or fighting).
  • The model can track the chosen character across the entire video with only a single-frame mask. This shows that modern video diffusion models have strong built-in “tracking” ability.
  • Adding real-human data and the reward-based post-training both noticeably improve realism and face consistency.

Why this matters:

  • It reduces the amount of extra input and manual work (no need for a perfect per-frame cut-out or pose skeletons).
  • It makes character replacement more robust and practical for real-world editing.

What’s the potential impact?

MoCha can simplify and speed up many tasks:

  • Film and TV post-production: replacing actors or avatars while preserving performance, camera work, and scene lighting.
  • Personalized advertising and digital avatars: swapping in a customer’s face or a brand character while keeping natural motion.
  • Virtual try-on and face swapping: editing clothes or faces in videos more realistically.
  • Research: It shows that video diffusion models can do reliable tracking and identity transfer with minimal guidance, pointing to future tools that are easier to use and more flexible.

In short, MoCha shows that with smart conditioning and training, we can cleanly replace characters in videos using just a single-frame hint and a few reference photos—keeping everything else believable and consistent.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, formulated to guide future research:

  • Sensitivity to the single-frame mask: how performance degrades with imperfect masks (coarse, off-target, low IoU, wrong frame) and whether weaker inputs (points, boxes, scribbles) or automatic mask selection could suffice.
  • Ambiguity with multiple similar subjects: reliability of single-frame masking to consistently track the intended target when multiple look-alike characters appear or re-enter the scene.
  • Multi-target replacement: ability to replace several characters in the same video (simultaneously or sequentially), required architectural or training changes, and dataset construction for multi-subject supervision.
  • Long-form and multi-shot videos: robustness across shot boundaries, hard scene cuts, large time gaps, and re-identification after occlusions or exits/entries; need for shot detection and per-shot reconditioning strategies.
  • Extreme motions and viewpoints: systematic evaluation on rare poses (e.g., acrobatics), rapid motion, motion blur, extreme scale/zoom, and large viewpoint changes beyond training lengths.
  • Temporal identity drift: quantitative analysis of identity consistency over long durations and under heavy occlusion; potential need for temporal identity constraints or memory mechanisms.
  • Lighting preservation quantification: lack of photometric or physically grounded lighting metrics; propose evaluating estimated illumination (e.g., SH coefficients), reflectance consistency, and relighting-invariance.
  • Physical plausibility without explicit structure: risks of limb/body interpenetration and contact violations in interactions; explore contact-aware losses, depth reasoning, or learned 3D priors to improve physical realism.
  • Audio and lip-sync: handling speaking subjects, audio-driven mouth motion preservation, and audiovisual synchronization during replacement.
  • Condition-aware RoPE ablation: missing comparisons against alternative positional schemes (learned relative positions, rotary variants, explicit temporal alignment, gating between conditions) and analysis of generalization to arbitrary lengths.
  • Frame-index mapping for the mask (f_M = (F-1)//4 + 1): rationale, sensitivity, and failure modes for long sequences; whether learned or continuous time encodings would be more robust.
  • Conditioning design and efficiency: full self-attention over concatenated tokens is O(T2); investigate sparse attention, cross-attention to conditions, or key-value caching to reduce memory/latency at inference.
  • Reference image requirements: how many, which viewpoints/expressions, and what quality are needed; automatic selection/curation, active prompting, or retrieval strategies to cover pose/lighting diversity.
  • Copy-paste avoidance beyond MSE: evaluate more principled anti-copy objectives (e.g., patch-level contrastive penalties, identity-preserving but pose-invariant constraints) to prevent literal reference imprinting.
  • RL post-training side effects: thorough analysis of reward hacking, background degradation, oversharpening, and identity overfitting; ablate number of optimized steps K, reward scales, and alternative face-identity rewards.
  • Reward design and normalization: the loss mixes (1 - ArcFace similarity) and pixel MSE without reported weighting; examine scaling, normalization, and composition to stabilize optimization and trade-offs.
  • Identity metrics at evaluation: include explicit face-ID similarity (ArcFace/CosFace), temporal face-ID stability, and re-ID scores (e.g., IDF1) instead of relying primarily on SSIM/LPIPS/PSNR/VBench.
  • Generative video quality metrics: add FVD and long-horizon temporal coherence measures; report per-attribute breakdowns (motion, identity, background consistency) to pinpoint failure modes.
  • Fair quantitative comparison to closed-source systems: design reproducible human studies or controlled small-batch evaluation for tools like Kling to avoid selection bias.
  • High-resolution scalability: current training/evaluation at 480×832; assess 1080p/4K via tiling, latent upscalers, or multi-scale schemes, and quantify identity/temporal trade-offs at higher resolutions.
  • Runtime and resource footprint: report training/inference latency, memory, and cost; study speed–quality trade-offs, batching strategies, and deployability on commodity GPUs.
  • Dataset realism and coverage: synthetic-heavy paired data and proxy “paired” real data may underrepresent real-world complexities (loose clothing, hair, accessories, crowded scenes); expand with curated, consented real paired sets and annotate difficult attributes (occlusion types, contacts).
  • Demographic, stylistic, and domain bias: quantify performance across skin tones, ages, genders, body shapes, non-human/stylized subjects; analyze base-model and reward bias (e.g., ArcFace bias to frontal faces).
  • Non-human and cross-domain replacement: systematically benchmark on pets, fantasy/CG characters, and objects; identify failure patterns and required adaptations (category-specific priors or rewards).
  • Robustness to heavy occlusion and interactions: controlled tests for hand-object and body-body contacts; measure physical plausibility and identity retention under complex interplays.
  • Multi-reference usage policy: study how different combinations (frontal vs profile, close-up vs full-body) affect outcomes; design learned reference fusion or dynamic reference memory over time.
  • Generalization to other base models: portability of condition-aware RoPE and the training recipe to different T2V architectures (e.g., CogVideoX, HunyuanVideo) and smaller backbones.
  • 3D consistency under camera motion: assess view-consistent identity across significant parallax and camera moves; consider integrating monocular geometry cues or 3D-aware diffusion.
  • Automatic or interactive mask acquisition: explore point/box prompts with SAM2, uncertainty-aware mask refinement, and user-in-the-loop corrections; quantify sensitivity to mask errors.
  • Failure case documentation: provide a taxonomy of typical failures (mis-tracking, identity drift, lighting mismatch, limb artifacts), with diagnostics and suggested mitigations.
  • Reproducibility details: specify missing hyperparameters (e.g., exact K for RL steps, reward scaling, RoPE offsets), dataset splits, seed control, and code/data licensing (film image legality).
  • Safety and misuse prevention: add consent workflows, watermarking, provenance, deepfake detection compatibility, and policy controls to mitigate identity misuse in deployment scenarios.

Practical Applications

Practical Applications Derived from MoCha

Below is a structured synthesis of practical uses that stem from MoCha’s findings, methods, and innovations, organized by time-to-deploy horizon and linked to relevant sectors. Each item includes potential tools/workflows and assumptions or dependencies that affect feasibility.

Immediate Applications

  • High-fidelity character replacement for post-production
    • Sector: Film/TV, VFX
    • What: Swap an actor or character in a shot while preserving background, lighting, motion, occlusions, and interactions using only a single-frame mask and a set of reference images.
    • Tools/Workflows: NLE plugin for Premiere/DaVinci; SAM2-assisted single-frame mask capture; multi-reference library management; batch rendering; QA with SSIM/LPIPS/VBench.
    • Assumptions/Dependencies: GPU access; licenses/consents for likeness usage; robust reference images; integration with studio asset pipelines.
  • Identity-preserving face swap for influencer marketing and social media
    • Sector: Advertising, Creator Economy
    • What: Rapidly customize short-form videos with brand ambassadors or campaign-specific avatars, maintaining expression and lighting realism.
    • Tools/Workflows: Mobile app/API; multi-reference image support; automated watermarking/provenance metadata.
    • Assumptions/Dependencies: Explicit consent; misuse safeguards; platform policies for synthetic media disclosure; reliable ArcFace identity signal across demographics.
  • Localization of video content via actor replacement
    • Sector: Media localization, Entertainment
    • What: Replace on-screen talent in localized versions while keeping motion and scene dynamics intact.
    • Tools/Workflows: Localization Studio pipeline; per-market reference sets; single-frame mask capture; paired audio dubbing/ADR.
    • Assumptions/Dependencies: Talent IP/licensing; audio-lip sync handled by separate modules; legal compliance with local deepfake laws.
  • Privacy-preserving video anonymization
    • Sector: Security/Privacy, Compliance
    • What: Replace identifiable faces with approved synthetic avatars instead of blurring, preserving action context for training, audits, and public sharing.
    • Tools/Workflows: “Privacy Masker” service; SAM2 for mask selection; identity/consent ledger; secure watermarking and audit logs.
    • Assumptions/Dependencies: Organizational policy alignment; demographic bias audits for ArcFace; governance for reversibility and chain-of-custody.
  • Virtual try-on in video
    • Sector: Retail/Fashion, E-commerce
    • What: Replace face or apparel via edited reference images (e.g., outfit swaps) to showcase products in motion under real lighting.
    • Tools/Workflows: Product image editing with Flux/other image editors; MoCha inference; catalog-aware batching; conversion analytics.
    • Assumptions/Dependencies: Clothing dynamics not explicitly modeled; style/fit realism depends on reference quality and scene complexity; brand approvals.
  • Game machinima and mod content creation
    • Sector: Gaming, UGC
    • What: Swap character skins/NPC identities in recorded gameplay videos without per-frame masks.
    • Tools/Workflows: “Modder’s Studio” app; single-frame mask selection; multi-reference input; batch exports.
    • Assumptions/Dependencies: Game EULA/IP rights; compute availability; content disclosure for synthetic edits.
  • Personalized educational and corporate training videos
    • Sector: Education, Enterprise L&D
    • What: Replace presenters with localized avatars or inclusive representations while preserving gestures/facial expressions.
    • Tools/Workflows: Presenter Swap pipeline; TTS/voice dubbing; QA checks for motion smoothness and temporal flicker (VBench).
    • Assumptions/Dependencies: Compliance with institutional policies; lip sync handled externally; stakeholder acceptance and disclosure guidelines.
  • Shot continuity fixes and archival restoration
    • Sector: Post-Production, Restoration
    • What: Replace or adjust talent in archival or continuity-challenged footage while keeping original lighting and background intact.
    • Tools/Workflows: Single-frame mask workflows; curated reference retrieval; version control and editorial review.
    • Assumptions/Dependencies: Ethical review; rights clearance; model behavior on degraded footage.
  • Multi-character crowd edits for event videography
    • Sector: Event Production, Security
    • What: Replace or remove specific individuals in multi-person scenes with occlusions/interactions handled robustly.
    • Tools/Workflows: Per-subject single-frame mask; attention visualization for tracking; batch replacements with review gates.
    • Assumptions/Dependencies: Scalability when many subjects are edited; operational throughput; consent and privacy requirements.
  • Reproducible dataset pipelines for research and product teams
    • Sector: Academia, Applied Research
    • What: Adopt the UE5 rendered paired dataset, expression-driven animation (Flux + LivePortrait), and augmented video-mask pipelines to train/evaluate custom models.
    • Tools/Workflows: UE5 asset orchestration; procedural camera trajectories; YOLO-based filtering; ArcFace reward-based LoRA post-training.
    • Assumptions/Dependencies: Asset licensing (UE5/MetaHuman/Mixamo); reproducibility of composition scripts; balanced real/synthetic data to avoid artifacts.

Long-Term Applications

  • Real-time character replacement for live broadcasts and streaming
    • Sector: Media/Live Events
    • What: Near-real-time swaps during live shows, sports, or concerts with broadcast-grade stability.
    • Tools/Workflows: Model distillation/quantization; hardware acceleration; streaming inference servers; low-latency mask capture.
    • Assumptions/Dependencies: Significant model optimization; strict latency budgets; robust tracking under motion blur and lighting changes.
  • Personalized casting for on-demand entertainment
    • Sector: Streaming Platforms, Interactive Storytelling
    • What: Viewers choose who plays each role; platform renders personalized versions preserving scene dynamics and acting motion.
    • Tools/Workflows: Likeness licensing marketplace; scalable identity libraries; per-viewer batch pipelines; usage disclosures.
    • Assumptions/Dependencies: IP/rights frameworks; compute cost management; ethics and user safety policies.
  • Dynamic, contextual advertising overlays in live and recorded content
    • Sector: Advertising, Sports Media
    • What: Replace spokespersons, apparel, or branded items contextually while maintaining lighting and motion realism.
    • Tools/Workflows: Context-aware scheduling; sponsor asset libraries; broadcast integration; automated QC.
    • Assumptions/Dependencies: Regulatory constraints; content provenance/watermark requirements; high reliability under occlusions.
  • Licensed synthetic actors and consent-aware identity management
    • Sector: Legal/Policy, Creator Economy
    • What: Standardized workflows to license and insert digital likenesses with traceable provenance and disclosure.
    • Tools/Workflows: Consent registries; rights/usage contracts; embedded provenance signals; detection and audit services.
    • Assumptions/Dependencies: Harmonized regulations across regions; platform adoption; robust detection against misuse.
  • Telepresence and AR with motion-preserving avatars
    • Sector: Communications, AR/VR
    • What: Replace user identity in teleconferencing or AR video with stylized avatars that follow real motion and expressions.
    • Tools/Workflows: Single-mask capture at session start; device-side accelerated inference; avatar customization.
    • Assumptions/Dependencies: On-device compute or edge acceleration; integration with AR frameworks; acceptable latency and battery constraints.
  • Automated studio-grade post-production pipelines
    • Sector: VFX/Studios
    • What: End-to-end pipelines that include single-mask detection (SAM2), condition-aware RoPE orchestration, project-specific LoRA reward tuning, and automated QC.
    • Tools/Workflows: MLOps for model versioning; data curation; identity reward monitoring; failure mode dashboards.
    • Assumptions/Dependencies: Organizational MLOps maturity; standardized QC metrics; staff training and process integration.
  • Platform-scale privacy compliance for UGC
    • Sector: Social Platforms, Policy
    • What: Default anonymization options (face replacement) for minors or protected identities, preserving video utility.
    • Tools/Workflows: Consent-aware toggles; redaction-at-upload pipelines; provenance watermarking; audit trails.
    • Assumptions/Dependencies: False positive/negative management; user transparency and opt-in/out; evolving regional regulations.
  • Personalized learning content at scale
    • Sector: EdTech
    • What: Swap instructors with familiar local avatars across thousands of lessons, retaining gesture realism and classroom context.
    • Tools/Workflows: LMS integration; avatar libraries; automated dubbing/lip-sync; student-facing disclosure UI.
    • Assumptions/Dependencies: Acceptance by institutions; rigorous evaluation of cultural fairness; audio-visual syncing.
  • Forensic redaction that preserves scene semantics
    • Sector: Public Safety, Legal
    • What: Replace identities rather than blur to maintain context for analysis while respecting privacy.
    • Tools/Workflows: Evidence chain management; reversible transformations under court order; dual-provenance exports.
    • Assumptions/Dependencies: Legal standards and admissibility; strict access controls; bias and accuracy audits.
  • Cross-subject replacement beyond humans
    • Sector: Marketing, Conservation/Education
    • What: Replace non-human subjects (animals, products) in educational or promotional videos, keeping scene physics and lighting.
    • Tools/Workflows: Domain-specific reference collections; non-human identity encoders; QA for motion plausibility.
    • Assumptions/Dependencies: Training data coverage for non-human subjects; adaptation of identity reward signals; validation of ecological messaging accuracy.

Glossary

  • ArcFace: A deep face recognition method that produces discriminative embeddings for measuring identity similarity via cosine distance. "cosine similarity between the Arcface~\cite{deng2019arcface} embeddings of the generated video and the reference images."
  • Condition-aware RoPE: A customized extension of rotary positional embeddings that assigns coherent spatiotemporal indices to different input conditions (source, target, mask, references) to fuse them effectively. "we propose a condition-aware RoPE, an extension of the original 3D RoPE~\cite{rope, wan2025} mechanism."
  • Conditional Flow Matching (CFM): A training objective for flow-based generative models that aligns the model’s predicted velocity with a condition-dependent target slope. "We use the Conditional Flow Matching~\cite{CFM} loss to optimize the model:"
  • DiT (Diffusion Transformer): A transformer-based diffusion architecture that predicts denoising velocities or scores over latent representations. "During training, ztz_t is fed to a Transformer-based diffusion model~\cite{Peebles2022DiT} (DiT)."
  • Flux inpainting model: A generative inpainting system from the FLUX family used to replace foreground content in images. "We first collect a large set of film images and use the Flux inpainting model \cite{flux2024} to substitute the foreground character."
  • Flux Kontext: A FLUX-based latent flow-matching method for in-context image generation and editing, used here to augment poses of reference images. "we augment its pose with Flux Kontext \cite{batifol2025flux}."
  • In-Context Learning: Conditioning generation by concatenating inputs (e.g., source video, mask, references) so the model learns to use them implicitly during inference. "through efficient in-context learning by integrating the video content, frame mask, and reference character identity."
  • LivePortrait: A portrait animation technique that drives facial expressions to animate images. "Then we use LivePortrait \cite{liveportrait} to animate both images with the same facial-driven video."
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that injects low-rank updates into existing model layers to adapt large models without full retraining. "we employ parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA)~\cite{hu2022lora} instead of full model tuning."
  • LPIPS (Learned Perceptual Image Patch Similarity): A perceptual similarity metric based on deep features, used to assess visual fidelity beyond pixel-wise errors. "we employ widely-used quantitative metrics, including SSIM~\cite{wang2004ssim}, LPIPS~\cite{zhang2018lpips}, and PSNR."
  • PSNR (Peak Signal-to-Noise Ratio): A reconstruction-quality metric that quantifies the ratio between signal power and noise power in generated outputs. "we employ widely-used quantitative metrics, including SSIM~\cite{wang2004ssim}, LPIPS~\cite{zhang2018lpips}, and PSNR."
  • Rectified Flow: A flow-based generative modeling framework where the model predicts velocities along a trajectory from noise to data, enabling efficient sampling. "MoCha builds on the pretrained text-to-video latent diffusion model using the Rectified Flow framework~\cite{flow-matching}."
  • Reward hacking: An RL failure mode where the model exploits the reward objective in unintended ways (e.g., copying references) rather than learning the desired behavior. "To avoid reward hacking that the model simply pastes the reference image into the generated video, we also use a pixel-wise Mean Squared Error (MSE) loss between the generated video and the GT video to provide dense supervision."
  • RoPE (Rotary Positional Embedding): A positional encoding technique that encodes positions via complex rotations, supporting extrapolation and better inductive biases; here used in 3D for spatiotemporal tokens. "However, a naive application of 3D Rotary Positional Embeddings (RoPE) by assigning a different temporal index to each condition would result in inflexible generation, such as being constrained to a fixed output length."
  • SAM2 (Segment Anything Model 2): A segmentation model that can produce masks in images and videos with minimal prompting. "For each video, we use SAM2~\cite{sam2} to generate a randomly selected frame mask."
  • SSIM (Structural Similarity Index Measure): An image similarity metric that captures structural changes in luminance, contrast, and structure. "we employ widely-used quantitative metrics, including SSIM~\cite{wang2004ssim}, LPIPS~\cite{zhang2018lpips}, and PSNR."
  • Unreal Engine 5 (UE5): A real-time 3D rendering engine used to synthesize high-fidelity paired video data with consistent scenes, motions, and lighting. "Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5)"
  • VAE (Variational Autoencoder): A probabilistic encoder–decoder model that compresses inputs into a latent space for generative modeling. "The input video V0V_0 will first be compressed to a latent z0z_0 through a video variational encoder~\cite{vae} (VAE)."
  • VBench: A comprehensive benchmark suite for evaluating video generative models across multiple dimensions (consistency, quality, motion, etc.). "we adopt the comprehensive evaluation suite from VBench~\cite{huang2024vbench} for a thorough comparison."
  • YOLOv12: An attention-centric, real-time object detector used here to filter non-human videos in dataset construction. "We apply a YOLOv12 detector \cite{tian2025yolov12} to filter non-human videos."

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.

GitHub

  1. MoCha