FrankenMotion: Part-level Human Motion Generation and Composition

Published 15 Jan 2026 in cs.CV | (2601.10909v1)

Abstract: Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level text annotations, leveraging the reasoning capabilities of LLMs. Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.


Summary

The paper introduces an LLM-driven annotation pipeline and a hierarchical diffusion model that enable fine-grained, part-level human motion control.
The paper reports a part-level R@1 of 47.21 and FID values between 0.04 and 0.06, outperforming baselines that were adapted and retrained for the part-level setting.
The paper demonstrates practical applications in AR/VR, digital animation, and embodied AI by enabling compositional, interpretable, and editable motion synthesis.

Part-level Hierarchical Control in Human Motion Generation: FrankenMotion

Motivation and Contributions

Text-conditioned human motion generation has advanced markedly, yet current models primarily operate at coarse granularity, lacking precise spatiotemporal and body-part control. This limitation stems from the absence of datasets containing temporally aligned, atomic part-level annotations, restricting models' ability to synthesize motions responsive to fine-grained user intent. "FrankenMotion: Part-level Human Motion Generation and Composition" (2601.10909) addresses this gap through two key contributions:

an automated pipeline for generating structured, temporally aligned part-level annotations using LLMs,
a novel transformer-based diffusion model enabling hierarchical conditioning at sequence, action, and part levels.

Dataset Construction via LLM Reasoning

The FrankenStein dataset is constructed by leveraging existing motion-language resources (KIT-ML, BABEL, HumanML3D) and deploying a custom LLM agent (FrankenAgent) to infer temporally and spatially granular labels from high-level action descriptions. The annotation pipeline operates at three levels: sequence-wide, coarse atomic actions, and body-part segments, generating temporally segmented, part-aware textual labels.

The LLM is explicitly prompted to avoid hallucinated annotations and to output "unknown" when a motion segment is ambiguous. The resulting dataset covers 39 hours of motion, with around 138k total labels (over 46k at the part level) and 28.8k new atomic annotations inferred by FrankenAgent. Human expert validation of the FrankenStein labels, carried out on a sample of 50 sequences, yields 93.08% accuracy and a Gwet’s AC1 reliability score of 0.91.
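To make the reliability figure concrete, the following is a minimal sketch of Gwet's AC1 for two raters, using the standard formula AC1 = (p_a - p_e) / (1 - p_e) with chance agreement p_e = (1 / (Q - 1)) * sum of pi_q * (1 - pi_q) over categories. The function name, the example ratings, and the choice to take the category set from the observed labels are illustrative assumptions; the paper's own evaluation protocol (number of raters, categories) is not described here.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 agreement coefficient for two raters (illustrative sketch)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    # Assumption: the category set is whatever labels were actually used.
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    if q < 2:
        # Only one category in use: raters agree everywhere and the
        # chance-agreement term is undefined, so report full agreement.
        return 1.0
    # Observed agreement: fraction of items with identical ratings.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # pi_q: average share of category q across both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    chance = sum(
        (counts[c] / (2 * n)) * (1 - counts[c] / (2 * n)) for c in categories
    ) / (q - 1)
    return (observed - chance) / (1 - chance)


# Hypothetical audit of ten labels by two reviewers.
reviewer_a = ["correct"] * 9 + ["wrong"]
reviewer_b = ["correct"] * 8 + ["wrong", "wrong"]
ac1 = gwet_ac1(reviewer_a, reviewer_b)
```

Unlike Cohen's kappa, the chance term here stays small when one category dominates, which is why AC1 is often preferred when most labels are "correct".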

Figure 1: LLM-based annotation pipeline generating temporally and spatially granular part-level motion descriptions from coarse action-labeled data.

Hierarchical Diffusion Model for Motion Composition

FrankenMotion utilizes a transformer-based diffusion model supporting conditioning on multi-granular textual input: sequence, action, and part-level prompts. The pose space is based on SMPL parameters, joint positions, and velocities. The model fuses CLIP feature embeddings of all textual inputs, aligning temporal windows and body-part segments, then concatenates these with the motion representation and projects through an MLP to form the final conditioning token sequence.
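As a rough illustration of what "aligning temporal windows and body-part segments" involves, the sketch below rasterizes timed text prompts onto a per-frame grid, one row per body part plus one row for the action level, and fills uncovered frames with "unknown". The `TimedPrompt` type, the function names, the part name, and the frame rate are all hypothetical; the paper's actual preprocessing and the subsequent CLIP encoding are not reproduced.

```python
from dataclasses import dataclass

UNKNOWN = "unknown"  # placeholder for frames without a label


@dataclass(frozen=True)
class TimedPrompt:
    """A text prompt active over [start_s, end_s) seconds (hypothetical type)."""
    text: str
    start_s: float
    end_s: float


def rasterize(prompts, num_frames, fps):
    """Expand timed prompts into one label per frame; later prompts overwrite."""
    labels = [UNKNOWN] * num_frames
    for prompt in prompts:
        first = max(0, round(prompt.start_s * fps))
        last = min(num_frames, round(prompt.end_s * fps))
        for frame in range(first, last):
            labels[frame] = prompt.text
    return labels


def build_condition_grid(part_prompts, action_prompts, num_frames, fps):
    """Return {row_name: per-frame labels} for every body part and the action row."""
    grid = {
        part: rasterize(prompts, num_frames, fps)
        for part, prompts in part_prompts.items()
    }
    grid["action"] = rasterize(action_prompts, num_frames, fps)
    return grid


# Hypothetical 3-second clip at 10 frames per second.
grid = build_condition_grid(
    part_prompts={"left arm": [TimedPrompt("waves", 1.0, 2.2)]},
    action_prompts=[TimedPrompt("sit down", 0.0, 1.5)],
    num_frames=30,
    fps=10,
)
```

In a full pipeline each cell's text would then be embedded (the paper uses frozen CLIP features) before being concatenated with the motion representation; the grid form simply makes asynchronous part timings explicit.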

Masking and stochastic dropout strategies are used during training to handle sparse or unknown annotations, increasing robustness under incomplete supervision. The model is optimized with a DDPM objective, predicting clean motion from noisy input and hierarchical text context. Training is executed for 47.5 hours on H100 GPUs using AdamW and frozen CLIP encoders.
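The paper excerpt quoted in the glossary samples the zero-out probability of a part label as p ~ Beta(5r, 5(1-r)), whose mean is r. A minimal sketch of that masking step follows. It assumes that p is drawn once per call and then applied independently to each known label, and that a masked label is replaced by the string "unknown"; the paper may sample and encode masks differently, and the function names are hypothetical.

```python
import random


def sample_drop_probability(r, rng):
    """Draw p ~ Beta(5r, 5(1 - r)); the mean of this distribution is r."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    return rng.betavariate(5.0 * r, 5.0 * (1.0 - r))


def mask_part_labels(labels, r, rng, unknown="unknown"):
    """Randomly hide known labels; already-unknown labels are left untouched.

    Assumption: one p per call, applied independently to each known label.
    """
    p = sample_drop_probability(r, rng)
    masked = [
        unknown if (label != unknown and rng.random() < p) else label
        for label in labels
    ]
    return masked, p


rng = random.Random(0)  # seeded only to make the sketch reproducible
masked, p = mask_part_labels(["raise", "unknown", "lower", "hold"], 0.5, rng)
```

Drawing p from a Beta distribution rather than fixing it exposes the model to a range of sparsity levels during training, from almost fully labeled to almost unlabeled sequences.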

Figure 2: Architecture of FrankenMotion, highlighting hierarchical input conditioning and transformer-based diffusion for multi-granularity motion generation.

Comparative Evaluation and Numerical Results

FrankenMotion is benchmarked against state-of-the-art baselines adapted for part-level control (STMC, UniMotion, DARTControl). These prior methods either lack end-to-end spatial-temporal reasoning, impose synchronized temporal intervals that limit flexibility, or do not support atomic part composition. FrankenMotion scores consistently higher on semantic alignment and realism in part-, action-, and sequence-level evaluations.

Specifically, in averaged part-level semantic correctness, FrankenMotion achieves R@1 = 47.21 (vs. UniMotion: 45.72, STMC: 40.67, DARTControl: 38.67), R@3 = 58.97, and M2T = 0.69. For realism, the Fréchet Inception Distance (FID) at the part and action levels is 0.04–0.06 for FrankenMotion, compared with 0.05 to 0.28 for the baselines. The authors also report that FrankenMotion can compose motions not observed during training while retaining fine-grained control.
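For readers unfamiliar with R@k, the sketch below shows how a retrieval-style score of this kind is typically computed: each generated motion's embedding is compared with all candidate text embeddings, and a hit is counted when the matching text ranks in the top k. The embeddings, function names, and use of cosine similarity are illustrative assumptions; the paper's learned evaluators and its MPNet-based filtering of paraphrased false negatives are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(motion_embs, text_embs, k):
    """Percentage of motions whose own text (same index) ranks in the top k."""
    hits = 0
    for i, motion in enumerate(motion_embs):
        sims = [cosine(motion, text) for text in text_embs]
        ranked = sorted(range(len(text_embs)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(motion_embs)


# Toy 2-D embeddings (hypothetical): motion i is paired with text i.
motions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
r_at_1 = recall_at_k(motions, texts, k=1)
r_at_3 = recall_at_k(motions, texts, k=3)
```

In this toy case only the first motion retrieves its own text at rank 1, so R@1 is one third, while every pair is recovered within the top 3.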

Figure 3: Qualitative comparison: FrankenMotion accurately composes multiple part-level controls and atomic actions; baselines fail to capture part composition and produce unnatural, repetitive behaviors.

Implications and Future Directions

The work establishes a robust pipeline for dataset expansion through LLM-based reasoning, enabling scalable annotation for fine-grained motion elements. The hierarchical model design provides modular and flexible user control, supporting editing at multiple temporal and spatial resolutions. Practically, FrankenMotion facilitates sophisticated motion generation tools for AR/VR, digital character animation, and embodied AI, especially in scenarios demanding precise part-level control and semantic coherence.

The structured fusion of multi-level textual conditioning advances motion synthesis towards compositional, interpretable, and editable frameworks. Because the model cannot generate minute-long sequences in a single pass, long-horizon modeling is a natural direction for future research, possibly through memory architectures or hierarchical temporal abstractions.

Conclusion

FrankenMotion introduces a comprehensive solution to part-level, temporally aware human motion generation from text, comprising a large-scale LLM-annotated dataset and a transformer-based diffusion model for hierarchical composition. Through explicit spatial-temporal multi-level control, FrankenMotion achieves state-of-the-art accuracy and realism, sets a new baseline for controllable motion synthesis, and establishes a strong foundation for further research into compositional and interpretable motion generation models.

Explain it Like I'm 14

What is this paper about?

This paper shows a new way to make 3D human animations from text descriptions, with very precise control over which body parts move and when they move. The authors build a special dataset and a model so you can tell the system things like “raise the left arm while sitting, then turn around,” and it generates a smooth, realistic motion that follows those instructions exactly.

What questions does it ask?

The researchers wanted to solve three simple but important problems:

How can we control individual body parts (like arms, legs, head, spine) with text, not just the whole body at once?
How can we control not only what moves, but also when it moves over time?
Can we mix and match small motion pieces (like “bend knees,” “turn head,” “step forward”) to create new, more complex motions the system hasn’t seen before?

How did the researchers do it?

They tackled the problem in two main steps: building a better dataset and creating a motion generator that understands three levels of instructions.

Building a better dataset (FrankenStein)

Most existing motion datasets have only big-picture labels, like “walking” or “sitting,” but not what each body part does at each moment. To fix that, the authors used an AI language assistant (FrankenAgent) to read existing motion descriptions and infer detailed, per-body-part actions with timing. Think of it like turning a rough script (“tie shoes”) into a detailed screenplay that says exactly when the spine bends, when the arms move, and for how long.

They:

Collected motions and their descriptions from well-known datasets.
Asked a strong LLM to break each sequence into smaller, timed parts for each body area (arms, legs, head, spine, and trajectory).
Told the AI to say “unknown” when it wasn’t sure, to avoid making things up.
Checked the quality with human experts, who agreed the new labels were correct about 93% of the time.

This created FrankenStein: a dataset that links motions to multi-level text labels with precise timing, from full-sequence summaries down to per-body-part actions.

The motion generator (FrankenMotion)

They built a model that composes motions from small building blocks, guided by text at three levels. You can think of it like directing a puppet show:

Sequence-level: the story of the whole scene.
Action-level: what happens in chunks of time (like “stand up,” then “turn,” then “walk”).
Part-level: per-frame instructions for specific body parts (like “left arm rises now,” “head turns now”).

Under the hood:

They use a transformer, a type of AI that’s good at reading sequences and finding patterns over time.
They use a diffusion process, which starts from noise and gradually “cleans it up” into a realistic motion—like sharpening a blurry picture step-by-step.
They turn text into numbers the model can understand using a text encoder (so the model can match “raise left arm” to motion patterns).
During training, they sometimes hide some text inputs on purpose (a “masking” trick) so the model learns to handle missing or partial instructions.

What did they find?

The new dataset and model make motion generation more accurate and more controllable than previous methods.

Better control: The model followed part-level and timing instructions more precisely than other systems they adapted for comparison.
More realism: The resulting motions looked smoother and more natural.
New combinations: The model could combine small motion pieces to make motions it never saw during training (for example, “sit down while raising the left arm”).
High-quality labels: Human reviewers found the dataset’s fine-grained, per-part, timed labels were correct about 93% of the time, with strong agreement among reviewers.

In simple terms: it both listened better to the instructions and moved more naturally.

Why does it matter?

This work makes it easier to create detailed, believable human animations from plain text. That helps:

Game developers and film/animation creators quickly craft complex character movements.
AR/VR experiences feel more responsive and lifelike.
Robotics and embodied AI learn precise, step-by-step human-like actions.
Researchers explore motion as a mix-and-match “language” of body-part actions, enabling more flexible and creative motion design.

Big picture: the paper turns motion generation into a kind of “motion Lego,” where small, timed, body-part actions can be composed into rich, realistic sequences guided by simple language.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves the following gaps and open questions for future research:

Long-horizon generation: The model cannot produce minute-long sequences in a single pass; strategies for modeling and evaluating long-term temporal dependencies (e.g., hierarchical sampling, memory mechanisms, chunked diffusion with smooth stitching) are not explored.
Compositional generalization: Claims of composing unseen motions are not systematically validated; a benchmark that holds out specific cross-part/action combinations and measures performance on truly novel compositions is missing.
LLM annotation reliability at scale: Human auditing covers only 50 sequences; larger-scale, fine-grained validation of per-part temporal alignment and label correctness (including timing offset/error distributions) is needed, especially on edge cases (fast motions, simultaneous multi-part actions, complex transitions).
Cross-LLM robustness and calibration: FrankenAgent is instantiated with DeepSeek-R1 only; comparative studies across different LLMs (e.g., GPT-4, Claude, Llama-3) and calibration methods (uncertainty-aware outputs, ensemble agreement) are absent.
Definition and granularity of body parts: The paper does not specify the exact set of $K$ body parts; inclusion and impact of finer-grained parts (e.g., hands/fingers, facial features) and their labels on controllability and realism are not evaluated.
Physical plausibility and contact modeling: No explicit constraints or metrics for physical realism (e.g., contact consistency, foot sliding, joint limit violations, ground penetration) are reported; incorporating physics priors or post-hoc correction and evaluating with contact-based metrics remains open.
Multi-person and human–object interactions: The dataset/model focuses on single-person motion without objects; extending annotations and generation to interactions (contact events, affordances, relative motion constraints) and evaluating in such settings is unexplored.
Scene and trajectory constraints: Although “trajectory” is mentioned as a control signal, consistent global path following, obstacle avoidance, and scene-aware constraints (e.g., surfaces, stairs) are not modeled or evaluated.
Language coverage and robustness: The approach relies on English CLIP embeddings; generalization to non-English, code-switching, domain-specific jargon, and compositional linguistic structures (e.g., nested constraints, conditional instructions) is untested.
Conflict resolution for contradictory inputs: How the model handles conflicting or infeasible part/action/sequence prompts (e.g., “run while sitting”) and prioritizes constraints is unspecified.
Impact of sparse/missing labels: The masking strategy (Beta-dropout on known labels) is introduced but not ablated; sensitivity to varying rates/patterns of missingness and strategies for imputing/denoising unknown labels require study.
Text encoder and embedding choices: CLIP is frozen and part/action embeddings are PCA-reduced to 50D; ablations on encoder choice, dimensionality, and fine-tuning for motion semantics are missing.
Architectural alternatives for conditioning: The current design concatenates embeddings; exploring cross-attention, conditioning adapters, or gating mechanisms for part/action/sequence fusion and their effects is not investigated.
Evaluation metrics validity: Semantic correctness relies on learned evaluators trained on the same dataset and MPNet embeddings; independent human studies, fine-grained spatio-temporal alignment metrics (e.g., DTW per joint, per-part timing IoU), and physics-based metrics are lacking.
Baseline adaptation fairness: Baselines are modified to fit the task (e.g., merging texts), which may disadvantage them; stronger part-aware baselines or re-implementations designed for part-level control should be compared under matched training budgets.
Editing and controllability benchmarks: While qualitative editing is shown, quantitative evaluation protocols for per-part editing fidelity, minimality (edit locality), and user-in-the-loop control latency/accuracy are absent.
Efficiency and real-time usage: Training/inference costs (47.5 hours on H100; 100 diffusion steps) are reported without latency/throughput benchmarks; approaches for low-latency interactive control (e.g., LCMs, distillation) and their quality trade-offs need assessment.
Retargeting and rig generalization: The method is SMPL-centric; generalization to diverse skeletal rigs, retargeting quality, and cross-skeleton consistency metrics are not addressed.
Dataset scope and bias: FrankenStein covers ~39 hours; coverage of diverse motion genres (dance, sports, acrobatics), long-tail actions, and demographic/anthropometric variation is unclear; distribution analyses and bias audits are needed.
Annotation protocol transparency: Detailed prompt templates and alignment procedures are relegated to the supplementary; releasing full protocols, alignment heuristics, and failure case analyses would improve reproducibility and trustworthiness.
Diffusion schedule and objective choices: The model uses $x_0$ prediction with 100 cosine-scheduled steps; ablations on step counts, prediction parameterization (e.g., $\epsilon$, $v$), guidance strategies, and their effects on controllability/realism are missing.
Safety and ethics: Potential biases introduced by LLM-generated labels, misuse risks (e.g., synthesizing harmful or deceptive motions), and guidelines for responsible dataset/model use are not discussed.
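To ground the "Diffusion schedule and objective choices" item above, here is a minimal sketch of a 100-step cosine noise schedule with an $x_0$-prediction loss. It assumes the widely used cosine formulation alpha_bar(t) = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2) and offset s = 0.008; the reference cited by the paper may define its schedule differently, and all function names are hypothetical.

```python
import math
import random


def cosine_alpha_bar(num_steps=100, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps (assumed form)."""
    def f(t):
        return math.cos(((t / num_steps + s) / (1 + s)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]


def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    a = alpha_bar[t]
    noise_scale = math.sqrt(max(0.0, 1.0 - a))
    return [math.sqrt(a) * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


def x0_loss(pred_x0, x0):
    """Mean squared error between the predicted and the true clean sample."""
    return sum((p - x) ** 2 for p, x in zip(pred_x0, x0)) / len(x0)


alpha_bar = cosine_alpha_bar(100)
rng = random.Random(0)  # seeded only to make the sketch reproducible
clean = [0.5, -1.0, 0.25]  # stand-in for one motion feature vector
noisy = add_noise(clean, 50, alpha_bar, rng)
# A real model would predict the clean sample from (noisy, t, text conditions);
# here the noisy input itself stands in for that prediction.
loss = x0_loss(noisy, clean)
```

With $x_0$ prediction the network regresses the clean motion directly, whereas $\epsilon$ or $v$ parameterizations regress the noise or a mix of signal and noise; the ablation the item calls for would compare these under the same schedule.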


Practical Applications

Overview

FrankenMotion introduces a diffusion-based, transformer motion generator with hierarchical control at sequence, action, and body-part levels, and FrankenStein, the first large-scale dataset with atomic, temporally aligned part-level text annotations produced via an LLM agent (FrankenAgent). The key innovations—fine-grained spatiotemporal control, compositional motion synthesis, and scalable LLM-assisted annotation—unlock practical applications across content creation, simulation, robotics, human factors, and research workflows.

Below are actionable use cases, grouped by deployment horizon, with sector links, potential tools/products/workflows, and feasibility assumptions or dependencies.

Immediate Applications

Text-to-motion authoring and editing for animation/VFX
- Sector: Media/Entertainment, Software
- Tool/product/workflow: Blender/Maya/Unreal plugins to generate or edit skeletal animation from sequence/action/part-level prompts; timeline UI to author part-specific constraints (e.g., “left arm waves from 1.0–2.2 s”).
- Assumptions/dependencies: Rig retargeting from SMPL to studio rigs; offline/near-real-time is acceptable; post-processing for foot-skate and contacts; GPU availability.
Rapid previsualization and storyboarding
- Sector: Film/TV, Advertising
- Tool/product/workflow: Previz storyboard tool where directors sketch global actions and refine with part-level prompts for key beats; export to DCC packages.
- Assumptions/dependencies: Non-real-time acceptable; human-in-the-loop refinement; proper licensing for trained models and datasets.
Game development prototyping for NPC behaviors and cutscenes
- Sector: Gaming
- Tool/product/workflow: Editor toolkit to produce cutscene clips and NPC idle/ambient animations by composing limb-specific instructions; integrate into motion matching systems.
- Assumptions/dependencies: Offline generation pipelines; engine integration (UE/Unity); careful quality control for transitions.
VTuber/virtual presenter gesture design
- Sector: Creator Economy, Live Streaming, Education
- Tool/product/workflow: Library of reusable gesture “macros” (part-level prompts) that can be triggered via text commands during streams or recording.
- Assumptions/dependencies: Latency may limit live real-time use; requires avatar retargeting and basic calibration.
Motion editing and mocap data augmentation
- Sector: Media/Entertainment, Sports Tech
- Tool/product/workflow: Patch/extend mocap by inserting atomic actions or editing specific limbs without re-shooting; generate long-tail variants for motion libraries.
- Assumptions/dependencies: Consistency checks for continuity and contacts; compatibility with existing mocap formats.
Human factors and ergonomics what-if testing
- Sector: Manufacturing, Workplace Safety
- Tool/product/workflow: Compose task variants (reach, bend, twist) to stress-test digital human models; evaluate posture risks.
- Assumptions/dependencies: Motions are not guaranteed physically accurate; may require physics or biomechanical validation layers.
Simulation assets for robotics imitation in sim
- Sector: Robotics
- Tool/product/workflow: Generate families of semantically labeled reference motions (with part constraints) to train imitation policies and curriculum learning.
- Assumptions/dependencies: Sim-to-real transfer gap; needs physics consistency and contact modeling if used directly for policy learning.
Kinesiology/biomechanics pedagogy
- Sector: Education, Academia
- Tool/product/workflow: Interactive web app demonstrating part-specific motion decomposition and timing; students edit body-part prompts and observe outcomes.
- Assumptions/dependencies: Educational use only; not a clinical tool.
Semantic motion retrieval and library organization
- Sector: Media/Entertainment, Software
- Tool/product/workflow: Index existing motion libraries using generated part/action/sequence annotations; query by fine-grained semantics (e.g., “right leg stutter step while turning left”).
- Assumptions/dependencies: FrankenAgent QA to minimize annotation hallucination; human spot checks.
LLM-assisted annotation to bootstrap new datasets
- Sector: Academia, R&D Labs, Data Operations
- Tool/product/workflow: FrankenAgent-style pipelines to create part-level, temporally aligned labels from legacy motion datasets; integrated uncertainty flags and human-in-the-loop QA.
- Assumptions/dependencies: Access to an LLM with comparable reasoning (e.g., DeepSeek-R1 or equivalent), robust prompt templates, and annotation governance.
Synthetic 3D motion for text-to-video training data
- Sector: Foundation Models, GenAI
- Tool/product/workflow: Render 3D characters driven by FrankenMotion under controlled part/action prompts to create high-quality text-aligned video for multimodal pretraining.
- Assumptions/dependencies: High-fidelity rendering pipeline; ethics review and watermarking; domain match to downstream tasks.

Long-Term Applications

Real-time, low-latency avatar control from language
- Sector: AR/VR, Social Telepresence
- Tool/product/workflow: On-device or cloud-in-the-loop controller that maps voice/text to continuous motions with part-level constraints, using distilled/accelerated diffusion (e.g., LCM variants) for sub-50 ms latency.
- Assumptions/dependencies: Model compression and inference optimization; robust safety filters; bandwidth constraints.
Whole-body humanoid robot control via compositional part commands
- Sector: Robotics, Industrial Automation
- Tool/product/workflow: Translate part-level natural language constraints into whole-body motion plans with contact/dynamics feasibility; integrate with MPC and safety monitors.
- Assumptions/dependencies: Physics-grounded controllers, contact planning, compliance control, and rigorous safety certification.
Multi-human and human–object interaction synthesis with scene constraints
- Sector: Gaming, Film, Robotics Simulation
- Tool/product/workflow: Extend control to interactions (grasp, handover) and multi-agent coordination, conditioned on object affordances and scene geometry.
- Assumptions/dependencies: New datasets with HOI/scene labels; contact and collision handling; improved physical plausibility.
Clinical-grade rehab and physical therapy guidance
- Sector: Healthcare
- Tool/product/workflow: Personalized exercise generation and progress tracking with part-level targets and timing (e.g., “increase shoulder abduction to 70° over 2 s”); tele-rehab supervision.
- Assumptions/dependencies: Clinical validation, biomechanics fidelity, regulatory clearance (e.g., MDR/FDA), patient safety and privacy compliance.
Workplace safety training simulators at scale
- Sector: EHS (Environment, Health & Safety), Insurance
- Tool/product/workflow: Generate parameterized task catalogs exploring hazardous motion variants (awkward lifts, slips) with annotated part-level risk factors for immersive training.
- Assumptions/dependencies: Physics-based risk estimation; expert-reviewed hazard models; content governance.
Autonomous driving and robotics perception simulation
- Sector: Mobility, AV/ADAS, Robotics
- Tool/product/workflow: Populate simulation with semantically diverse pedestrian/cyclist motions, composed at part-level for rare behaviors and edge cases.
- Assumptions/dependencies: Validation against real-world distributions; transferability of behavior statistics; simulation fidelity.
Inclusive and accurate sign language and gesture production
- Sector: Accessibility, Education, Media
- Tool/product/workflow: Controlled handshape, facial expression, and upper-body motions for sign language avatars; curriculum for gesture-rich presenters.
- Assumptions/dependencies: Specialized linguistic datasets and constraints (phonology, prosody); community co-design; high-fidelity finger/face rigs.
Industry standards for motion annotation and provenance
- Sector: Policy, Standards Bodies, Data Governance
- Tool/product/workflow: Define schema and best practices for atomic part-level annotations, LLM-assisted labeling protocols, audit trails, and uncertainty tagging.
- Assumptions/dependencies: Multi-stakeholder coordination (academia, studios, robotics firms); alignment with existing data standards.
Watermarking and content provenance for synthetic human motion
- Sector: Policy, Platforms, Media
- Tool/product/workflow: Motion-level watermarking to mark synthetic or edited sequences; platform-side provenance checks for uploads.
- Assumptions/dependencies: Technically robust watermarking for motion data; platform adoption; legal/policy frameworks.
Cross-modal agentic NPCs and training-by-description
- Sector: Gaming, Simulation, AI Agents
- Tool/product/workflow: LLM-driven agents that plan and execute actions by emitting part-level motion constraints in real time, coupled with environment feedback.
- Assumptions/dependencies: Fast inference, safety filters, and tight engine integration; sandboxing to prevent exploitative behaviors.
Long-horizon, multi-stage tasks with robust temporal planning
- Sector: Robotics, Training Simulators, Film/Games
- Tool/product/workflow: Author minute-long sequences with staged constraints (e.g., “approach, kneel, pick, stand, turn, place”) and smooth inter-stage transitions.
- Assumptions/dependencies: Advances in long-context modeling and memory; hierarchical controllers; temporal consistency across minutes.

Cross-cutting dependencies and assumptions

Retargeting and rig compatibility: Most workflows require mapping SMPL-style output to production rigs/avatars with consistent skeleton topology and scale.
Physical plausibility: Raw outputs may need physics correction for contact forces, balance, and constraints; crucial in robotics/ergonomics/clinical uses.
Dataset licenses and domain shift: Training on AMASS/BABEL/KIT-ML/HumanML3D-derived assets may have licensing constraints; performance may degrade on out-of-domain actions.
LLM-based annotation quality: FrankenAgent accuracy is high but not perfect; pipelines should include uncertainty reporting and human QA.
Compute and latency: The baseline diffusion pipeline (100 steps) is not real-time; deployment-grade systems need distillation, caching, or LCM-like acceleration.
Safety, ethics, and governance: Synthetic human motion entails risks (misuse, deepfakes, biased portrayals); adopt watermarking, provenance, and ethical review.


Glossary

6D representation: A rotation encoding scheme that avoids discontinuities by representing 3D rotations with six parameters. "encoded using the 6D representation~\cite{zhou2019continuity},"
AdamW optimizer: A variant of Adam that decouples weight decay from the gradient update to improve generalization. "We use the AdamW optimizer~\cite{loshchilov2017decoupled} with a learning rate of $2\times10^{-4}$ and a batch size of 32."
Affordance reasoning: Inferring possible actions enabled by objects in an environment, used to guide interaction modeling. "Human–object interaction modeling~\cite{zhang2022couch,Zeng_2025_CVPR,xu2024interdreamer,wu2025human,zhang2024hoi,xu2025intermimic,xu2025interact,zeng2025chainhoi,li2023object} further introduces affordance reasoning and contact-based control,"
AMASS: A large unified motion capture dataset that standardizes human motion parameterization across sources. "AMASS later unified motion capture data from 15 publicly available mocap datasets into a consistent parameterization of human motion,"
Autoregressive: A modeling approach that predicts future outputs based on previously generated outputs. "DART~\cite{Zhao:DartControl:2025} employs a latent diffusion model for real-time, autoregressive motion generation,"
Beta distribution: A probability distribution over [0,1], often used to sample masking probabilities in regularization. "We adopt Beta distribution to randomly decide the zero out probability $p$ of a body part text label $L_k^i$: $p\sim \mathrm{Beta}(5r, 5(1-r))$,"
CLIP: A multimodal model that encodes text and images into a shared embedding space for alignment. "we use CLIP~\cite{radford2021learning} to extract text features for all input prompts."
Cosine noise schedule: A diffusion training schedule that varies noise levels following a cosine curve to improve sampling quality. "we employ a cosine noise schedule with 100 diffusion steps, as introduced by~\cite{chen2023importance}."
DDPM objective: The training loss used in Denoising Diffusion Probabilistic Models to recover clean data from noisy inputs. "We train our diffusion model $f_\theta$ parametrized by $\theta$ using the standard DDPM objective~\cite{Ho2020DDPM}:"
Diffusion model: A generative model that iteratively denoises samples from a noise distribution to produce data. "We adopt a transformer-based diffusion model as the framework for our text conditioned motion generation."
Diffusion timestep embedding: A learned representation of the current diffusion step injected into the model for conditioning. "The diffusion timestep embedding, after an MLP projection, is also added as a separate token."
Diversity (metric): A measure of variability across generated samples indicating non-redundant outputs. "realism, consisting of Fréchet Inception Distance (FID) and Diversity~\cite{Guo2022CVPR_humanml3d}."
Fréchet Inception Distance (FID): A metric comparing distributions of real and generated data via feature statistics to assess realism. "realism, consisting of Fréchet Inception Distance (FID) and Diversity~\cite{Guo2022CVPR_humanml3d}."
Gwet’s AC1 coefficient: A reliability statistic assessing inter-rater agreement less biased by prevalence than Cohen’s kappa. "We also report inter-annotator agreement using Gwet’s AC1 coefficient ($AC_1$)~\cite{gwet2001handbook} to assess the reliability of human evaluation,"
Hierarchical annotation: A structured labeling paradigm that spans sequence-, action-, and part-level descriptions over time. "This hierarchical annotation design enables a richer and more structured representation of human motion."
Hierarchical conditioning: Conditioning a generative model on multiple levels of semantic inputs (sequence, action, part) simultaneously. "learning to compose complex motions through hierarchical conditioning on part-, action-, and sequence-level text,"
Joint embedding: A unified representation space that fuses multiple modalities (e.g., text and motion) for learning alignment. "we use a joint embedding for sequence, action, part-level text and motion,"
Latent diffusion model: A diffusion process applied in a compressed latent space to improve efficiency and quality. "DART~\cite{Zhao:DartControl:2025} employs a latent diffusion model for real-time, autoregressive motion generation,"
Latent space: A learned, lower-dimensional representation where cross-modal alignment and generation are performed. "fine tune the pre-trained LLMs to align text and motion in the latent space."
LLMs (large language models): Large-scale neural language models with strong reasoning and long-context capabilities, used here to produce annotations. "leveraging the reasoning capabilities of LLMs."
M2T: A motion-to-text alignment metric used to quantify semantic correctness of generated motions. "semantic correctness, consisting of R-Precision~\cite{Guo2022CVPR_humanml3d} and M2T~\cite{petrovich2024stmc},"
MPNet embeddings: Text representations from the MPNet model used to mitigate paraphrase-induced false negatives. "we follow~\cite{petrovich23tmr} and use MPNet~\cite{song2020mpnet} embeddings to remove false negative pairs due to paraphrased text labels"
Motion capture (mocap): The process and datasets capturing human movement for modeling and generation. "largely driven by the growing availability of motion capture (mocap) datasets and their corresponding textual annotations."
PCA (Principal Component Analysis): A dimensionality reduction technique applied to text features to control model complexity. "For action and part labels, we apply PCA to reduce the embedding dimension to $D=50$,"
Rotation-invariant representation: A representation that remains consistent regardless of global rotation, aiding robust pose modeling. "forming a rotation-invariant representation by defining $\mathbf{j}$ in a local coordinate frame aligned with the body."
R-Precision: A retrieval-based metric (R@K) that checks whether the text paired with a generated motion ranks among the top K candidates retrieved for that motion. "semantic correctness, consisting of R-Precision~\cite{Guo2022CVPR_humanml3d} and M2T~\cite{petrovich2024stmc},"
SMPL: A parametric 3D human body model that provides pose and shape parameters for motion representation. "using SMPL~\cite{smpl} pose parameters, joint positions, velocities and angular velocities:"
Stochastic masking: Randomly dropping conditioning inputs during training to improve robustness to sparse or missing labels. "This stochastic masking enhances robustness to incomplete conditioning and improves generalization under sparse supervision~\cite{Liu2019BetaDropout}."
Transformer-based diffusion model: A diffusion architecture that uses transformer layers to model spatiotemporal dependencies. "Our model is a transformer-based diffusion model that can be input conditioned on a) sequence level prompt, b) action-level prompt and c) part-level prompt."
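ViT-B/32: The CLIP model variant named after its image encoder, a base-size Vision Transformer with 32×32-pixel patches; the paper uses the frozen text encoder that belongs to this variant, not the vision backbone itself. "we adopt the frozen text encoder from CLIP (ViT-B/32)~\cite{radford2021learning}."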
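
To make the "Cosine noise schedule" entry concrete, here is a minimal sketch of a cosine schedule with 100 diffusion steps. It assumes the widely used formulation in which the cumulative signal level follows a squared cosine with a small offset `s`; the paper cites a different reference for its schedule, so the exact variant and parameters it uses may differ. The function names and the `s` and `max_beta` defaults are illustrative choices, not taken from the paper.

```python
import math


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps.

    Assumed form: alpha_bar(t) = f(t) / f(0), with
    f(t) = cos^2(((t / num_steps) + s) / (1 + s) * pi / 2).
    """
    def f(t):
        return math.cos(((t / num_steps) + s) / (1 + s) * math.pi / 2) ** 2

    f0 = f(0)
    return [f(t) / f0 for t in range(num_steps + 1)]


def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Per-step noise variances beta_t = 1 - alpha_bar[t] / alpha_bar[t-1].

    Values are clipped at max_beta so the final step stays well defined.
    """
    alpha_bar = cosine_alpha_bar(num_steps, s)
    return [
        min(1.0 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
        for t in range(1, num_steps + 1)
    ]


# 100 steps, matching the step count quoted in the glossary entry.
alpha_bar = cosine_alpha_bar(100)
betas = cosine_betas(100)

print(f"first beta: {betas[0]:.6f}, last beta: {betas[-1]:.3f}")
print(f"signal level at t=50: {alpha_bar[50]:.4f}")
```

The signal level starts at 1 and decays smoothly toward 0, so noise is added slowly at the start and end of the process rather than at a constant rate.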
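
The "Gwet’s AC1 coefficient" entry can be illustrated with a small worked example. The sketch below assumes the simplest setting of two raters and a known list of categories; the number of raters in the paper's human study is not stated in this glossary, so this is not a reproduction of its evaluation. Cohen's kappa is computed alongside for comparison, and all function names are hypothetical.

```python
from collections import Counter


def _marginals(ratings, categories):
    """Fraction of items assigned to each category by one rater."""
    counts = Counter(ratings)
    n = len(ratings)
    return {c: counts[c] / n for c in categories}


def _observed_agreement(ratings_a, ratings_b):
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty item set")
    return sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)


def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's AC1 for two raters (assumed setting).

    Chance agreement: p_e = 1/(K-1) * sum_k pi_k * (1 - pi_k),
    where pi_k is the average of both raters' marginals for category k.
    """
    if len(categories) < 2:
        raise ValueError("AC1 needs at least two categories")
    p_a = _observed_agreement(ratings_a, ratings_b)
    m_a = _marginals(ratings_a, categories)
    m_b = _marginals(ratings_b, categories)
    pi = {c: (m_a[c] + m_b[c]) / 2 for c in categories}
    p_e = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)
    return (p_a - p_e) / (1 - p_e)


def cohen_kappa(ratings_a, ratings_b, categories):
    """Cohen's kappa for two raters, shown only for comparison."""
    p_a = _observed_agreement(ratings_a, ratings_b)
    m_a = _marginals(ratings_a, categories)
    m_b = _marginals(ratings_b, categories)
    p_e = sum(m_a[c] * m_b[c] for c in categories)
    if p_e == 1.0:
        return float("nan")  # undefined when chance agreement is total
    return (p_a - p_e) / (1 - p_e)


# Ten items where "1" is very common; the raters disagree on two items.
rater_a = [1] * 8 + [1, 0]
rater_b = [1] * 8 + [0, 1]

ac1 = gwet_ac1(rater_a, rater_b, categories=[0, 1])
kappa = cohen_kappa(rater_a, rater_b, categories=[0, 1])
print(f"AC1 = {ac1:.3f}, kappa = {kappa:.3f}")
```

With 80% raw agreement and one dominant category, AC1 comes out near 0.756 while kappa is slightly negative (about -0.111), which is the prevalence effect the glossary entry refers to.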
