
Personalized Martial Arts Video Generation

Updated 12 January 2026
  • Personalized martial arts combat video generation is a synthesis task that creates custom multi-agent videos using tailored fighter identities, choreographed actions, and dynamic scenes.
  • State-of-the-art methods employ dual-adapter diffusion, multi-stage attention, and reinforcement learning to maintain temporal coherence and ensure visual fidelity.
  • Key challenges include preserving subject identity, reducing motion artifacts, and managing complex contact dynamics during fast-paced combat sequences.

Personalized martial arts combat video generation refers to the synthesis of high-fidelity, multi-subject video sequences in which specified combatants—typically defined via user-provided images or short videos—execute user-controlled martial arts actions within customizable environments. This task requires models to maintain subject-level appearance fidelity, precise interactive motion control, temporal coherence, and semantic consistency, with applications spanning digital entertainment, sports analytics, and embodied AI research.

1. Definition and Problem Formulation

Personalized martial arts combat video generation is a distinct video synthesis task characterized by the joint conditioning on:

  • Fighter identity (e.g., facial features, body type),
  • Interactive martial arts choreography (explicit pose sequences, text action scripts, or motion style exemplars),
  • Scene and background layout.

The formal input-output structure, as instantiated in MagicFight (Huang et al., 5 Jan 2026), comprises:

  • Reference images $I_1, I_2$ for each fighter's appearance,
  • A sequence of paired skeletons $P = \{p_t^1, p_t^2\}_{t=1}^T$ or a choreography script describing combat actions,
  • An optional text prompt $T_s$ and background image $B$,
  • Output: $V = \{v_t\}_{t=1}^T$, a video at 60 fps depicting both specified fighters executing the given choreography in a consistent scene.
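The conditioning bundle above can be sketched as a small typed container; the class and field names here are illustrative, not from MagicFight:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class FightCondition:
    """Conditioning inputs for two-fighter video generation (illustrative)."""
    ref_images: tuple                        # (I1, I2): one appearance still per fighter
    skeletons: np.ndarray                    # P: shape (T, 2, J, 2) — T frames, 2 fighters, J joints (x, y)
    text_prompt: Optional[str] = None        # T_s
    background: Optional[np.ndarray] = None  # B

    @property
    def num_frames(self) -> int:
        return self.skeletons.shape[0]

# A 48-frame clip with 17 COCO-style joints per fighter:
cond = FightCondition(
    ref_images=(np.zeros((512, 704, 3)), np.zeros((512, 704, 3))),
    skeletons=np.zeros((48, 2, 17, 2)),
    text_prompt="a roundhouse kick countered by a block",
)
print(cond.num_frames)  # → 48
```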

Distinct challenges versus single-subject or single-action generation include severe risk of identity confusion (e.g., limb/texture swaps between fighters), anomalous limb artifacts (hallucinated or missing limbs, especially during kicks), and complex temporal contact dynamics (blocking, parrying, counter-attacks) that are not represented in solo or dance datasets (Huang et al., 5 Jan 2026).

2. Architectures and Key Methodological Variants

Current state-of-the-art approaches exhibit several architectural paradigms, each with mechanisms tailored to multi-subject personalization, motion style transfer, and temporal coherence.

2.1 Dual-Adapter Diffusion Models

VideoMage (Huang et al., 27 Mar 2025), DreamVideo (Wei et al., 2023), and related frameworks insert subject-specific and motion-specific LoRA-style adapters into video diffusion backbones. Spatial LoRAs encode appearance identity, typically via textual inversion and adapter training on reference stills, while temporal LoRAs are trained on reference motion clips with appearance-agnostic objectives that disentangle motion from appearance:

  • Subject LoRA: inserted into spatial blocks (cross-attention, feedforward) and trained on reference images; $\mathcal{L}_{\text{sub}}$ is a regularized MSE against the target noise.
  • Motion LoRA: inserted into temporal-attention blocks and trained on motion clips with appearance-agnostic guidance, i.e., $\mathcal{L}_{\text{mot}} = \mathbb{E}\big[\Vert \epsilon_{\theta+\Delta\theta_m}(x_{m,t}, c_m, t) - \epsilon_{\text{ap-free}} \Vert_2^2\big]$, which facilitates disentanglement.

At generation time, spatial-temporal collaborative sampling (SCS) fuses the identity and motion branches. Multi-subject scenarios initialize the fused adapter as $\Delta\hat\theta_s = \frac{1}{N}\sum_{n}\Delta\theta_{s,n}$, and spatial attention regularization keeps each subject confined to its own region.
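The multi-subject initialization is a parameter-wise mean over per-subject LoRA deltas. A minimal sketch with plain tensors (not VideoMage's actual code):

```python
import numpy as np

def fuse_subject_loras(deltas: list) -> dict:
    """Initialize a fused adapter as the parameter-wise mean of N subject LoRAs:
    delta_hat = (1/N) * sum_n delta_n, applied per parameter tensor."""
    keys = deltas[0].keys()
    return {k: np.mean([d[k] for d in deltas], axis=0) for k in keys}

# Two fighters, each with a rank-4 LoRA on one attention projection:
lora_a = {"attn.to_q.lora_A": np.ones((4, 64)), "attn.to_q.lora_B": np.ones((64, 4))}
lora_b = {"attn.to_q.lora_A": 3 * np.ones((4, 64)), "attn.to_q.lora_B": np.zeros((64, 4))}
fused = fuse_subject_loras([lora_a, lora_b])
print(fused["attn.to_q.lora_A"][0, 0])  # → 2.0
```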

2.2 Multi-Stage Diffusion and Personalized Attention

MagicFight (Huang et al., 5 Jan 2026) implements staged training: Stage I on frames (freezing temporal blocks) and Stage II on video clips (freezing spatial blocks), with per-fighter identity encoders (ReferenceNet), pose-guiding modules, and personalized spatial attention masks. The masked attention links appearance and pose encodings to the conditioned UNet attention blocks: $\bar O_{l,i} = \text{MaskAttn}(Q_l, K_i, M_i)\,V_i$, $\quad O_l = M_1\bar O_{l,1} + M_2\bar O_{l,2} + (1 - M_1 - M_2)\,\text{Attn}(Q_l, K_l)\,V_l$. Additional modules (BackgroundCrafter) support dynamic environments.
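The mask-weighted combination in the formula above can be sketched directly on precomputed attention outputs; the tensors and disjoint binary masks here are illustrative stand-ins:

```python
import numpy as np

def blend_identity_attention(o_self, o_id1, o_id2, m1, m2):
    """Combine per-fighter reference attention with self-attention:
    O = M1*O_id1 + M2*O_id2 + (1 - M1 - M2)*O_self,
    where M1, M2 are disjoint binary spatial masks (illustrative)."""
    return m1 * o_id1 + m2 * o_id2 + (1.0 - m1 - m2) * o_self

h = w = 4
m1 = np.zeros((h, w, 1)); m1[:, :2] = 1.0   # fighter 1 occupies the left half
m2 = np.zeros((h, w, 1)); m2[:, 2:] = 1.0   # fighter 2 the right half
o = blend_identity_attention(
    o_self=np.full((h, w, 8), 0.5),
    o_id1=np.ones((h, w, 8)),
    o_id2=np.full((h, w, 8), 2.0),
    m1=m1, m2=m2,
)
print(o[0, 0, 0], o[0, 3, 0])  # → 1.0 2.0
```

Where the masks cover the whole frame, the self-attention term vanishes; unmasked background pixels fall back to ordinary self-attention.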

2.3 MLLM-Based Multi-Subject Guidance

CINEMA (Deng et al., 13 Mar 2025) replaces manual prompt-to-image correspondence by leveraging a multimodal LLM (e.g., Qwen2-VL) to produce semantic tokens $s \in \mathbb{R}^{K \times d}$ that encode fighter identity, action sequences, and inter-personal relations from interleaved image/text templates and action scripts. These tokens are mapped into the diffusion backbone's conditioning space via AlignerNet, then concatenated with visual entity codes.
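At its core, the AlignerNet step maps $K \times d$ MLLM tokens into the backbone's conditioning dimension; a single linear projection serves as a minimal stand-in (dimensions are illustrative, not CINEMA's):

```python
import numpy as np

rng = np.random.default_rng(0)

def aligner_project(tokens: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map K MLLM semantic tokens (K x d) into the diffusion model's
    conditioning space (K x d_cond) — a one-layer stand-in for AlignerNet."""
    return tokens @ w + b

K, d, d_cond = 8, 1536, 768   # dimensions are illustrative
tokens = rng.standard_normal((K, d))
w = rng.standard_normal((d, d_cond)) * 0.01
b = np.zeros(d_cond)
cond_tokens = aligner_project(tokens, w, b)
print(cond_tokens.shape)  # → (8, 768)
```

The projected tokens would then be concatenated with visual entity codes before entering cross-attention.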

Losses include a denoising score-matching loss, an explicit spatial-temporal consistency penalty, and MLLM alignment loss to ensure that multi-agent choreography and identity cues are accurately linked to the generated frames.

2.4 ControlNet-Driven and One-Shot Methods

PoseCrafter (Zhong et al., 2024) adapts Stable Diffusion+ControlNet pipelines with one-shot temporal attention tuning and affine latent editing of facial/hand landmarks, guided by flexible pose sequences. Key-frame insertion and latent smoothness regularization enhance identity retention and temporal fidelity, essential for fast and violent combat actions.

2.5 Reinforcement Learning for Physics-Based Avatars

MAAIP (Younes et al., 2023) applies multi-agent adversarial imitation learning to physics-based character simulation. Here, policies for each agent are trained with rewards given by motion and interaction discriminators, encouraging both style-correct movement and realistic, emergent interactivity between fighters, which can be rendered to generate 3D martial arts animations.
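The discriminator-derived rewards in this setup can be sketched with a standard GAIL-style shaping; the weights and exact reward form below are assumptions for illustration, not MAAIP's published formulation:

```python
import numpy as np

def imitation_reward(d_style: float, d_interact: float,
                     w_style: float = 0.5, w_interact: float = 0.5) -> float:
    """GAIL-style reward from two discriminator scores in (0, 1):
    r = -log(1 - D), weighted over the style and interaction critics.
    Weights and reward shaping are illustrative."""
    r_style = -np.log(np.clip(1.0 - d_style, 1e-6, 1.0))
    r_inter = -np.log(np.clip(1.0 - d_interact, 1e-6, 1.0))
    return w_style * r_style + w_interact * r_inter

# A transition both critics rate as plausible earns a higher reward:
print(imitation_reward(0.9, 0.8) > imitation_reward(0.2, 0.1))  # → True
```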

3. Datasets and Evaluation Protocols

The development of personalized martial arts combat generators was catalyzed by the KungFu-Fiesta (KFF) dataset (Huang et al., 5 Jan 2026), the first large-scale collection tailored to the domain:

  • 540 videos, 180 unique identities, 120 action pairs, 20 environments, 704×512 px at 60 fps,
  • For each clip: paired identity stills, per-frame whole-body pose maps, and background frames.

For training, KFF enables robust modeling of two-person interactions with precise control over identity, motion, and context.

Evaluation metrics are multi-faceted:

  • Video realism: Fréchet Video Distance (FVD), NIQE, SSIM, PSNR, LPIPS.
  • Subject identity: CLIP-Image consistency, ArcFace-based similarity.
  • Motion/pose: pose-accuracy (joint error), action recognition accuracy, per-frame foot-ground contact consistency, Fréchet/optical flow statistics.
  • Temporal and dynamic fidelity: motion smoothness (temporal jerk), collision safety, temporal coherence loss.
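One simple proxy for the motion-smoothness (temporal jerk) metric above is the mean third finite difference of joint trajectories; this sketch assumes 2D joint tracks and is not any paper's exact implementation:

```python
import numpy as np

def temporal_jerk(joints: np.ndarray) -> float:
    """Mean absolute third finite difference over joint trajectories (T, J, 2);
    lower values indicate smoother motion."""
    jerk = np.diff(joints, n=3, axis=0)   # (T-3, J, 2)
    return float(np.abs(jerk).mean())

t = np.linspace(0, 1, 60)
smooth = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=-1)[:, None, :]
jittery = smooth + np.random.default_rng(0).normal(0, 0.05, smooth.shape)
print(temporal_jerk(smooth) < temporal_jerk(jittery))  # → True
```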

User studies typically quantify perceived identity consistency, pose fidelity, and visual appeal on a 1–5 scale.

4. Losses, Training Strategies, and Combat-Specific Regularization

Advanced objective functions extend beyond the standard denoising score-matching loss:

  • Identity preservation: ArcFace or CLIP-based embedding losses on reconstructed frames,
  • Temporal coherence: penalties for frame-to-frame velocity/acceleration mismatches or foot-planting regularization,
  • Contact dynamics: collision-penalty or minimum distance rewards for reducing body interpenetration,
  • Semantic consistency: text-video CLIP similarity or simulated prompt augmentation (PersonalVideo (Li et al., 2024)) to maintain scene and choreographic alignment under novel prompts.

Reward-weighted regularization, e.g., $\mathcal{L}_{\text{total}} = \lambda_{\text{diff}}\mathcal{L}_{\text{diff}} + \lambda_{\text{id}}\mathcal{L}_{\text{id}} + \dots$, balances visual fidelity against accurate martial arts execution.
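The weighted objective reduces to a per-term sum; the loss values and weights below are placeholders, not tuned hyperparameters from any cited work:

```python
def total_loss(losses: dict, weights: dict) -> float:
    """Weighted sum L_total = sum_k lambda_k * L_k over the per-term losses
    above (diffusion, identity, temporal, contact). Values are illustrative."""
    return sum(weights[k] * losses[k] for k in losses)

losses = {"diff": 0.20, "id": 0.10, "temporal": 0.05, "contact": 0.02}
weights = {"diff": 1.0, "id": 0.5, "temporal": 0.1, "contact": 2.0}
print(round(total_loss(losses, weights), 3))  # → 0.295
```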

Prompt augmentation and multi-modal guidance (e.g., compositional prompts, 2D/3D skeleton encoding, or physics-informed discriminators) are widely recommended for capturing the complexity of combat (Deng et al., 13 Mar 2025, Fang et al., 2024).

5. Adaptation for Customization, Multi-Subject Fusion, and Robustness

For generalization to open-set identities, body shape mismatches, and long-form generation:

  • Spatial and temporal adapter fusion, as in VideoMage (Huang et al., 27 Mar 2025), allows incremental plugging-in of multiple character LoRAs.
  • Affine keypoint scaling corrects pose-shape inconsistencies (e.g., MagicFight Section 8, (Huang et al., 5 Jan 2026)).
  • Long video synthesis is achieved by latent “clip fusion”—overlapping the last frames of one segment with the start of the next to ensure temporal continuity.
  • For cases lacking sufficient customized video data, approaches like Still-Moving (Chefer et al., 2024) and PersonalVideo (Li et al., 2024) inject only spatial adapters and use “frozen videos” or reward supervision, respectively, maintaining both identity and motion priors.
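The latent "clip fusion" step above can be sketched as a linear cross-fade over the overlapping frames; actual systems may blend in latent space with different schedules, so treat this as an assumption-laden illustration:

```python
import numpy as np

def fuse_clips(clip_a: np.ndarray, clip_b: np.ndarray, overlap: int) -> np.ndarray:
    """Cross-fade the last `overlap` frames of one segment into the first
    `overlap` frames of the next, then concatenate (illustrative)."""
    w = np.linspace(0, 1, overlap)[:, None, None, None]   # per-frame fade-in weights
    blended = (1 - w) * clip_a[-overlap:] + w * clip_b[:overlap]
    return np.concatenate([clip_a[:-overlap], blended, clip_b[overlap:]], axis=0)

a = np.zeros((16, 4, 8, 8))   # 16 latent frames of segment A
b = np.ones((16, 4, 8, 8))    # 16 latent frames of segment B
video = fuse_clips(a, b, overlap=4)
print(video.shape[0])  # → 28
```

The overlap region transitions monotonically from segment A's content to segment B's, avoiding a hard cut at the seam.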

Adversarial domain adaptation and mixture data finetuning (e.g., including two-person fashion runway data to improve detail) improve real/fantasy transfer, texture sharpness, and open-world generalization (Huang et al., 5 Jan 2026). MLLM-based pipelines (CINEMA) and autoregressive scene/interaction encoding permit scalable expansion to arbitrary numbers of combatants and choreography complexity (Deng et al., 13 Mar 2025).

6. Limitations, Open Problems, and Future Directions

Despite rapid progress, several technical and conceptual challenges remain:

  • Role confusion, particularly under occlusions or rapid motion, necessitates robust instance-level masking and stronger user-controlled correspondences.
  • High-speed, high-impact interactions (leg sweeps, mid-air collisions) tax the temporal resolution and may produce artifacts such as limb discontinuities or identity drift.
  • Real-world fine-tuning, domain transfer from synthetic (Unity) to authentic combat footage, and camera control (dynamic panning, cuts) are open areas (MagicFight, Future Directions).
  • Richer multi-character group interactions, fluid background/environment adaptation, and end-to-end differentiable rendering for 3D-2D hybrid workflows represent active frontiers (Huang et al., 5 Jan 2026).

7. Representative Frameworks: Comparative Summary

| Framework | Key Customization Mechanisms | Multi-Subject Support | Motion Control | Notable Datasets |
|---|---|---|---|---|
| MagicFight (Huang et al., 5 Jan 2026) | ReferenceNet + PoseGuider + ID-attention | Yes (2 fighters, extendable) | Per-frame skeleton, text prompt | KungFu-Fiesta (540+ videos) |
| VideoMage (Huang et al., 27 Mar 2025) | Subject & motion LoRA, SCS sampling | Yes | LoRA from reference video | - |
| CINEMA (Deng et al., 13 Mar 2025) | MLLM (Qwen2-VL) + AlignerNet + cross-attention | Yes (N subjects) | Action script, semantic tokens | - |
| PersonalVideo (Li et al., 2024) | ID-adapter + textual inversion + reward supervision | Yes (multi-reference) | Simulated prompt, pose ControlNet | - |
| AnyCharV (Wang et al., 12 Feb 2025) | Stage I/II mask-guided, pose transformer | Yes | Pose heatmaps, collision loss | - |
| MAAIP (Younes et al., 2023) | Multi-agent GAN imitation (physics sim) | Yes | RL with style-conditioned rewards | Custom mocap + interaction |

These methods highlight the growing sophistication and diversity of approaches—encompassing diffusion, transformer, and RL paradigms—for personalized, high-fidelity martial arts combat video generation grounded in rigorous empirical and theoretical developments.
