Proportion-Based Motion Synthesis

Updated 4 February 2026

Proportion-based motion synthesis is a framework that generates temporally coherent motion by interpolating pre-learned motion primitives and latent vectors in controlled proportions.
It employs convex combinations and hierarchical upper-layer/lower-layer (UL/LL) architectures, integrating techniques such as softmax weighting and latent point clouds to produce robust motion commands.
The approach underpins practical applications in robotics and animation, achieving high precision in motion tasks while adapting to varying morphologies and sparse data settings.

Proportion-based motion synthesis encompasses algorithmic frameworks and neural architectures that generate temporally coherent motion by interpolating or combining pre-learned components—motion primitives, pose exemplars, point clouds, or latent vectors—in strict or adaptive proportions. This paradigm supports the synthesis of plausible, controllable actions for articulated robots, digital characters, or skeleton-agnostic representations. Recent advances have rigorously extended these principles to accommodate arbitrary morphologies, handle cross-domain retargeting, and increase robustness in data-sparse settings (Shu et al., 3 Feb 2026, Mo et al., 27 Jul 2025, Zhao et al., 2023).

1. Fundamental Models and Mathematical Formulation

Core to proportion-based motion synthesis is the direct generation of motion commands as convex combinations of basis motions or primitives, characterized by time-dependent proportion coefficients. For $N$ primitives $\{\varphi_1(t), \dots, \varphi_N(t)\}$, the synthesized command at time $t$ is:

$x(t) = \sum_{i=1}^{N} p_i(t) \cdot \varphi_i(t)$

with $p_i(t) \geq 0$ and $\sum_{i=1}^{N} p_i(t) = 1$. In vectorized form, $x(t) = \Phi(t) p(t)$ where $\Phi(t) \in \mathbb{R}^{d \times N}$ and $p(t) \in \mathbb{R}^{N}$ (Shu et al., 3 Feb 2026). Because $x(t)$ always lies in the convex hull of the primitive outputs at time $t$, the synthesized command stays bounded whenever the primitives are bounded, and the proportions $p_i(t)$ act as the controllable degrees of freedom.
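The composition rule above can be sketched in a few lines of plain Python. This is a minimal illustration for a single time step, not the implementation of any cited work; the function name `blend_primitives` and the example values are assumptions introduced here.

```python
def blend_primitives(primitive_outputs, proportions, tol=1e-9):
    """Return x = sum_i p_i * phi_i for one time step.

    primitive_outputs: list of N command vectors (each a list of d floats).
    proportions: list of N non-negative weights that sum to 1.
    """
    if len(primitive_outputs) != len(proportions):
        raise ValueError("need exactly one proportion per primitive")
    if any(p < 0 for p in proportions):
        raise ValueError("proportions must be non-negative")
    if abs(sum(proportions) - 1.0) > tol:
        raise ValueError("proportions must sum to 1")
    dim = len(primitive_outputs[0])
    # Coordinate-wise weighted sum, i.e. the product Phi(t) p(t).
    return [
        sum(p * phi[k] for p, phi in zip(proportions, primitive_outputs))
        for k in range(dim)
    ]


# Three hypothetical 2-D primitive commands evaluated at the same time step.
phi = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
p = [0.5, 0.25, 0.25]
x = blend_primitives(phi, p)
print(x)
```

Rejecting invalid proportion vectors up front keeps the boundedness property explicit: every accepted input yields a command inside the convex hull of the primitive outputs.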

In skeleton-agnostic settings, as in Temporal Point Cloud (TPC) architectures, a human motion of length $T$ is represented as a sequence of $T$ per-frame point clouds, each a set of 3D points (Mo et al., 27 Jul 2025). Here, spatial composition and sequence reconstruction are performed not over curated primitive sets but over identities parametrized through latent point vectors, which enables anatomy-agnostic proportion-based synthesis.

2. Hierarchical Control and Synthesis Architectures

A predominant approach utilizes hierarchical models with separation of planning and primitive generation. The upper layer (UL) executes long-horizon planning or proportion selection, either by outputting future follower state trajectories or explicit primitive mixture weights. The lower layer (LL) implements a suite of primitive networks (typically MLPs), each generating candidate motor commands for its corresponding motion primitive (Shu et al., 3 Feb 2026). Synthesis proceeds through:

The UL reads the current state and updates its plan or proportions at a fixed interval of steps, using an LSTM or a similar sequence model.
Every LL module produces a candidate primitive-wise command.
The commands are composed as per the current proportion vector.
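The three steps above can be sketched as a control loop. This is a toy illustration under stated assumptions: the upper layer is a hand-written stand-in for an LSTM, the primitives are constant one-dimensional commands rather than trained MLPs, and the names `run_hierarchy`, `replan_every`, and `step_env` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the returned weights sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_hierarchy(state, upper_layer, primitives, step_env, n_steps, replan_every):
    """Roll out the UL/LL loop and return the list of visited states."""
    proportions = None
    trajectory = [state]
    for t in range(n_steps):
        # UL: refresh the proportion vector only every `replan_every` steps.
        if t % replan_every == 0:
            proportions = softmax(upper_layer(state))
        # LL: every primitive proposes its own candidate command.
        candidates = [primitive(state) for primitive in primitives]
        # Composition: convex combination of the candidate commands.
        command = sum(p * c for p, c in zip(proportions, candidates))
        state = step_env(state, command)
        trajectory.append(state)
    return trajectory


def push_left(state):
    return -0.1


def push_right(state):
    return 0.1


def toy_upper_layer(state):
    """Stand-in for an LSTM: favours `push_right` while the state is below 1.0."""
    return [0.0, 5.0 * (1.0 - state)]


def toy_dynamics(state, command):
    """Stand-in for the robot: the command is applied as a displacement."""
    return state + command


traj = run_hierarchy(0.0, toy_upper_layer, [push_left, push_right],
                     toy_dynamics, n_steps=20, replan_every=5)
print(traj[-1])
```

Holding the proportions fixed between replanning steps mirrors the separation of time scales between the two layers: the LL runs at every step, while the UL only intervenes periodically.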

This architecture supports three major proportion-model variants:

Model Type	Proportion Generation	Upper Layer Learning	Characteristics
Learning-based	Softmax over LSTM outputs	End-to-end trainable	Full flexibility, limited scalability
Sampling-based	Weighted MC samples (MPC)	Plans future follower states	Adaptable, MC-MPC integration
Playback-based	From stored demos	No UL learning	Fast, task-invariant, less reactive

(Shu et al., 3 Feb 2026)
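For the sampling-based variant in the table, one simplified way to turn Monte Carlo samples into a proportion vector is to weight each sample by a softmax over its negative cost. The sketch below samples candidate proportion vectors directly and scores them with a hypothetical one-dimensional tracking cost; the cited work plans over future states, so the cost function, the `temperature` parameter, and all function names here are assumptions for illustration.

```python
import math
import random


def cost_softmax_weights(costs, temperature=1.0):
    """Weights proportional to exp(-cost / temperature), normalised to sum to 1."""
    lowest = min(costs)  # subtracting the minimum avoids overflow in exp
    scores = [math.exp(-(c - lowest) / temperature) for c in costs]
    total = sum(scores)
    return [s / total for s in scores]


def sample_proportions(n_primitives, rng):
    """Draw one point uniformly from the probability simplex."""
    raw = [rng.expovariate(1.0) for _ in range(n_primitives)]
    total = sum(raw)
    return [r / total for r in raw]


def plan_proportions(cost_fn, n_primitives, n_samples=256, temperature=0.1, seed=0):
    """Cost-weighted average of sampled proportion vectors."""
    rng = random.Random(seed)
    samples = [sample_proportions(n_primitives, rng) for _ in range(n_samples)]
    weights = cost_softmax_weights([cost_fn(s) for s in samples], temperature)
    # A weighted average of simplex points is itself a point on the simplex.
    return [
        sum(w * s[i] for w, s in zip(weights, samples))
        for i in range(n_primitives)
    ]


# Hypothetical task: blend three scalar primitives to approach a target value.
PRIMITIVE_VALUES = [-1.0, 0.0, 1.0]
TARGET = 0.5


def tracking_cost(proportions):
    blended = sum(p * v for p, v in zip(proportions, PRIMITIVE_VALUES))
    return (blended - TARGET) ** 2


planned = plan_proportions(tracking_cost, n_primitives=3)
print(planned)
```

A lower temperature concentrates the weight on the cheapest samples, while a higher one averages more broadly across them.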

By contrast, PUMPS (Mo et al., 27 Jul 2025) employs an encoder-decoder architecture over TPCs and leverages masked modeling in latent space, while Pose-to-Motion (Zhao et al., 2023) deploys skeleton-aware GANs in which per-bone lengths directly modulate kernel responses, enabling on-the-fly adaptation to novel proportions.

3. Representational Methods for Proportion Robustness

Proportion-based synthesis requires representations allowing straightforward interpolation or recombination irrespective of underlying kinematic topology or bone lengths.

Motion primitives/commands: Defined as time sequences in joint, velocity, or torque space; suited for robotics contexts (Shu et al., 3 Feb 2026).
Temporal point clouds (TPCs): Frame-wise sets of 3D points (optionally grouped by body part), with unstructured sampling to support arbitrary morphological variation. TPCs permit latent factorization and network-based decoding for any skeleton (Mo et al., 27 Jul 2025).
Proportion-aware latent vectors: Key for models transferring motion to skeletons of drastically different proportions, achieved by introducing kinematic features (e.g., chain/bone lengths) to network layers (Zhao et al., 2023).

Skeleton normalization (e.g., scaling by global height) and root-relative coordinate frames are critical preprocessing steps to ensure that synthesized motions are metrically consistent across characters of different sizes (Zhao et al., 2023).
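A minimal sketch of this preprocessing is given below. It assumes that height can be measured as the extent of the joints along a vertical axis in the given frame; an actual pipeline may instead use a rest pose or summed bone lengths. The function name `normalize_pose` and its parameters are hypothetical.

```python
def normalize_pose(joints, root_index=0, up_axis=1):
    """Return root-relative joints scaled so the skeleton has unit height.

    joints: list of (x, y, z) tuples for one frame.
    Height is taken as the joint extent along `up_axis` (an assumption).
    """
    root = joints[root_index]
    ups = [joint[up_axis] for joint in joints]
    height = max(ups) - min(ups)
    if height <= 0:
        raise ValueError("degenerate pose: zero height")
    # Subtract the root position, then divide by the height.
    return [
        tuple((c - r) / height for c, r in zip(joint, root))
        for joint in joints
    ]


# The same pose on two characters, one exactly half the size of the other.
tall = [(0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (0.5, 0.0, 0.0)]
small = [(0.0, 0.5, 0.0), (0.0, 1.0, 0.0), (0.25, 0.0, 0.0)]
print(normalize_pose(tall) == normalize_pose(small))
```

After normalization the two characters map to identical coordinates, which is the sense in which the representation becomes independent of overall size.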

4. Training Objectives and Optimization Protocols

Training in proportion-based frameworks bifurcates into primitive network optimization and global composition model learning:

LL primitives are first fitted independently, each by imitation of its own demonstration segments (Shu et al., 3 Feb 2026).
UL optimization for learning-based methods minimizes the discrepancy between the composed output and the demonstrated trajectory.
For sampling-based/MC-MPC models, cross-entropy weighting via cost-based softmax over sampled trajectories stabilizes synthesis under uncertainty.
In TPC latent-space models, linear assignment (Hungarian algorithm) is used for strict pointwise pairing in loss computation, preventing point collapse and guaranteeing identifiability across varying proportions and topologies (Mo et al., 27 Jul 2025).
Adversarial cycles, pose- or motion-level GAN losses, and end-effector contact/consistency regularizers are applied in complex cross-domain setups to enforce plausibility and proportional accuracy (Zhao et al., 2023).
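As an illustration of the first two objectives in the list above, one plausible form uses squared-error imitation losses. The choice of loss is an assumption, as are the symbols introduced here: $\theta_i$ for the parameters of primitive $i$, $\psi$ for the UL parameters, $\mathcal{D}_i$ for the demonstration segment assigned to primitive $i$, and $\hat{x}(t)$ for the demonstrated command.

$\min_{\theta_i} \sum_{t \in \mathcal{D}_i} \left\| \varphi_i(t; \theta_i) - \hat{x}(t) \right\|^2, \quad i = 1, \dots, N$

$\min_{\psi} \sum_{t} \left\| \sum_{i=1}^{N} p_i(t; \psi) \, \varphi_i(t) - \hat{x}(t) \right\|^2 \quad \text{with } p_i(t; \psi) \geq 0, \ \sum_{i=1}^{N} p_i(t; \psi) = 1$

In the second objective the primitives are held fixed, so only the proportions are learned; producing them through a softmax satisfies both constraints by construction.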

Network structures employ multi-layer LSTMs/MLPs (UL/LL), Adam or AdamW for optimization, and batchwise latent masking for generalization.
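The strict pointwise pairing mentioned for TPC models can be illustrated with a small matching loss. The sketch below finds the optimal one-to-one assignment by brute force over permutations, which is a stand-in that is only practical for a handful of points; the Hungarian algorithm solves the same assignment problem efficiently (for example via `scipy.optimize.linear_sum_assignment`). The function name and the Euclidean cost are assumptions.

```python
import itertools
import math


def matched_point_loss(predicted, target):
    """Mean distance under the best one-to-one pairing of two point sets.

    Returns (loss, assignment), where assignment[i] is the index of the
    target point paired with predicted point i.
    """
    if not predicted or len(predicted) != len(target):
        raise ValueError("point sets must be non-empty and of equal size")
    n = len(predicted)
    best_cost, best_perm = math.inf, None
    # Brute-force search: O(n!) and therefore only suitable for tiny sets.
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(predicted[i], target[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost / n, best_perm


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# A correct prediction in a different point order incurs no loss.
shuffled = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.0)]
shuffled_loss, assignment = matched_point_loss(shuffled, target)

# A collapsed prediction is penalised, because each target point must be
# matched by a distinct predicted point.
collapsed = [(0.0, 0.0, 0.0)] * 3
collapsed_loss, _ = matched_point_loss(collapsed, target)
print(shuffled_loss, assignment, collapsed_loss)
```

A nearest-neighbour loss measured from the collapsed prediction to the target would be zero in this example, which is why a one-to-one assignment is the stricter criterion.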

5. Practical Applications and Evaluation

Proportion-based synthesis has been empirically validated in robotic manipulation and animation domains.

In a dual-object pick-and-place robot task, the method with 50 motion primitives (spanning spatial directions and segments) achieved a 100% success rate on in-set motions for both sampling- and playback-based models. For complex out-of-set tasks, success rates were 70% (sampling) and 90% (playback), compared to 60% for baseline hierarchical IL (Shu et al., 3 Feb 2026). Placement errors on challenging tasks were on the order of 10 cm on intermediate grasps and on the order of 3 cm at final placement, limitations attributed to primitive coverage.

In skeleton-agnostic human motion, PUMPS achieves mean per joint position errors (MPJPE) of 38–73 mm and mean per joint velocity errors of 11–16 mm/s across categories, surpassing MixSTE, PoseFormer, and MotionBERT, and equalling MHFormer when finetuned (Mo et al., 27 Jul 2025). For motion denoising, it reduces MPJPE by roughly 25% over HuMoR and Pose-NDF and matches Laplacian smoothing and Transformer baselines.
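For reference, the two metrics quoted above can be computed as follows. This is a generic sketch: evaluation protocols differ between papers (for example in root alignment or frame rate), and the finite-difference velocity and the `fps` parameter used here are assumptions rather than the protocol of the cited work.

```python
import math


def mpjpe(pred, truth):
    """Mean per-joint position error over all frames and joints.

    pred, truth: motion[frame][joint] = (x, y, z), in the same unit (e.g. mm).
    """
    errors = [
        math.dist(p, q)
        for pred_frame, truth_frame in zip(pred, truth)
        for p, q in zip(pred_frame, truth_frame)
    ]
    return sum(errors) / len(errors)


def velocities(motion, fps):
    """Finite-difference joint velocities, in position units per second."""
    return [
        [tuple((b - a) * fps for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(motion, motion[1:])
    ]


def mpjve(pred, truth, fps):
    """Mean per-joint velocity error on finite-difference velocities."""
    return mpjpe(velocities(pred, fps), velocities(truth, fps))


# One joint moving along x; the prediction has a constant 5 mm offset.
truth = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
pred = [[(0.0, 3.0, 4.0)], [(10.0, 3.0, 4.0)], [(20.0, 3.0, 4.0)]]
print(mpjpe(pred, truth), mpjve(pred, truth, fps=30))
```

The example shows why both metrics are reported together: a constant offset produces a nonzero position error but no velocity error, so the velocity metric isolates temporal jitter from static misplacement.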

The Pose-to-Motion framework demonstrated that as few as 60 static target poses suffice to obtain high precision and recall in retargeted motion. In user studies, 78%–82% of participants preferred these results over established baselines for perceptual realism and artifact reduction (Zhao et al., 2023).

6. Comparative Characteristics and Trade-offs

Each composition and proportion-selection method exhibits distinct trade-offs:

Model Type	Flexibility	Scalability	Adaptability	Reactivity to Perturbations
Learning-based	High	Poor (large $N$)	Low (task-specific)	Moderate
Sampling-based	Moderate–High	Good	High	High
Playback-based	Low (copying)	Excellent	Low (demonstration)	Low

(Shu et al., 3 Feb 2026)

Common limitations include restricted extrapolation when the target motion or pose lies outside the convex hull of the primitive/preset space, which leads to accumulation of positional errors. Enriching the diversity of primitives or point samples, and incorporating global scene/world models (e.g., VAEs, Transformers), are anticipated to improve coverage and generalization.
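The convex-hull limitation can be made concrete with a quick reachability check. The sketch below tests only a necessary condition: each coordinate of a convex combination must lie between the smallest and largest value of that coordinate among the primitives. A target that passes this test may still lie outside the hull; an exact membership test would require solving a linear program. The function names are hypothetical.

```python
def blend_bounds(primitive_outputs):
    """Per-coordinate (min, max) that every convex combination must respect."""
    return [(min(column), max(column)) for column in zip(*primitive_outputs)]


def violates_bounds(target, primitive_outputs, tol=1e-9):
    """True if `target` is certainly unreachable by any convex combination.

    This is a necessary-condition check only: a False result does not
    guarantee that the target lies inside the convex hull.
    """
    return any(
        value < low - tol or value > high + tol
        for value, (low, high) in zip(target, blend_bounds(primitive_outputs))
    )


# Three hypothetical 2-D primitive outputs forming a triangle.
primitives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(violates_bounds([0.25, 0.25], primitives))  # inside the triangle
print(violates_bounds([1.5, 0.0], primitives))    # beyond every primitive in x
print(violates_bounds([0.9, 0.9], primitives))    # in the box, outside the hull
```

The third query shows the gap between the bounding box and the hull: no proportions on the simplex reach that target, yet the coordinate-wise check cannot detect it.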

7. Future Directions and Extensions

The literature points to several promising directions:

Scene/world models: Adding variational autoencoders or Transformers at the UL level for compositional reasoning and global context modeling (Shu et al., 3 Feb 2026).
Universal motion priors: Masked latent pre-training (e.g., PUMPS paradigm) combined with strict pointwise assignment can provide generalizable bases for downstream tasks, even when little domain-specific motion exists (Mo et al., 27 Jul 2025).
Lightweight supervision: Pose-to-Motion demonstrates that proportional adaptation from static pose datasets is feasible for cross-domain retargeting, suggesting significant reduction in data requirements for motion synthesis in new morphologies (Zhao et al., 2023).
Robustness to topology changes: Unstructured representations (TPCs) and skeleton-aware differentiable architectures are key enablers for generalizing across arbitrary articulations.

A plausible implication is that as motion synthesis matures, future research may focus on hybrid models that combine primitive modulation, unstructured latent reasoning, and adaptive control, thereby achieving robust, universal motion strategies under extreme morphological and environmental variability.