Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction

Published 25 Dec 2025 in cs.CV | (2512.21707v1)

Abstract: Comprehensively and flexibly capturing the complex spatio-temporal dependencies of human motion is critical for multi-person motion prediction. Existing methods grapple with two primary limitations: i) Inflexible spatiotemporal representation due to reliance on positional encodings for capturing spatiotemporal information. ii) High computational costs stemming from the quadratic time complexity of conventional attention mechanisms. To overcome these limitations, we propose the Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE), which flexibly explores complex spatio-temporal dependencies in human motion and significantly reduces computational cost. To adaptively mine complex spatio-temporal patterns from human motion, our model incorporates four distinct types of spatiotemporal experts, each specializing in capturing different spatial or temporal dependencies. To reduce the potential computational overhead while integrating multiple experts, we introduce bidirectional spatiotemporal Mamba as experts, each sharing bidirectional temporal and spatial Mamba in distinct combinations to achieve model efficiency and parameter economy. Extensive experiments on four multi-person benchmark datasets demonstrate that our approach not only outperforms the state of the art in accuracy but also reduces model parameters by 41.38% and achieves a 3.6x speedup in training. The code is available at https://github.com/alanyz106/ST-MoE.

Summary

The paper introduces an adaptive MoE architecture that integrates diverse bidirectional spatiotemporal Mamba modules for motion prediction.
It achieves a 41.38% parameter reduction and 3.6x training speed-up, outperforming state-of-the-art models on multiple datasets.
The approach demonstrates robust scalability and real-world applicability by effectively modeling complex multi-person motion dynamics.

Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE) for Multi-Person Motion Prediction

Motivation and Limitations of Prior Work

Multi-person motion prediction demands not only accuracy in forecasting future joint positions but also the ability to model intricate and heterogeneous spatiotemporal dependencies. Prior models predominantly leverage either fixed positional encoding or attention-based mechanisms for spatiotemporal representation. Techniques such as temporal/spatial positional encoding (TPE/SPE), trajectory-aware relative position encoding (TRPE), and self-attention improve temporal and spatial correlation learning, yet remain fundamentally rigid and restricted in expressive power. Furthermore, self-attention and concatenation operations impose quadratic computational complexity, severely impacting scalability and efficiency.

Figure 1: Overview of the ST-MoE design and its performance against conventional attention and encoding approaches on CMU-Mocap (UMPM); with expert routing, ST-MoE achieves the best efficiency-accuracy trade-off among the compared methods.

ST-MoE: Architecture Overview

The ST-MoE framework introduces an adaptive Mixture-of-Experts (MoE) architecture that circumvents the aforementioned limitations. Motion sequences, mapped from pose space to feature space via Discrete Cosine Transform (DCT) and a Multi-Pose Encoder, are input to an MoE layer equipped with dynamic, sparse expert selection. Crucially, each expert specializes in unique combinations of bidirectional spatial and temporal Mamba modules—state-space models designed to replace high-cost attention computations with linear-complexity selective scanning.
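To make the DCT step concrete, here is a minimal sketch of how one joint coordinate could be mapped to frequency coefficients and back. It assumes an orthonormal DCT-II applied along the time axis; the summary does not state which DCT variant or sequence length ST-MoE uses, and the `trajectory` values are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (row k = frequency k)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append(
            [scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n)]
        )
    return rows


def dct(signal):
    """Project a time series onto the DCT basis."""
    basis = dct_matrix(len(signal))
    return [sum(b * x for b, x in zip(row, signal)) for row in basis]


def idct(coeffs):
    """Invert the orthonormal DCT-II (the inverse is the transposed basis)."""
    n = len(coeffs)
    basis = dct_matrix(n)
    return [sum(basis[k][t] * coeffs[k] for k in range(n)) for t in range(n)]


# Hypothetical x-coordinate of a single joint over 8 observed frames.
trajectory = [0.00, 0.10, 0.19, 0.27, 0.34, 0.40, 0.45, 0.49]
coeffs = dct(trajectory)
recovered = idct(coeffs)
```

For a smooth trajectory like this one, the low-frequency coefficients dominate, which is the usual reason for working in DCT space.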

The routing mechanism, implemented via an MLP-based gating network, adaptively weights and aggregates the expert outputs based on the input features. For parameter economy, the experts share the same bidirectional temporal and spatial Mamba modules and differ only in how they combine them. The aggregated representation is then decoded back into pose space to predict future motion.
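The gating step can be sketched as follows. This assumes a two-layer MLP with a ReLU followed by a softmax over four experts; the layer sizes, the activation, and all weight values are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gate(x, w1, b1, w2, b2):
    """Hypothetical two-layer MLP gate: feature vector -> one weight per expert."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
    return softmax(linear(hidden, w2, b2))


def aggregate(expert_outputs, gate_weights):
    """Weighted sum of the expert output vectors."""
    dim = len(expert_outputs[0])
    return [
        sum(g * out[i] for g, out in zip(gate_weights, expert_outputs))
        for i in range(dim)
    ]


# Toy input feature (dim 2), hidden size 3, four experts; all values are made up.
x = [0.5, -1.0]
w1 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]
b1 = [0.0, 0.1, 0.0]
w2 = [[0.3, -0.2, 0.1], [0.0, 0.5, -0.4], [-0.1, 0.2, 0.2], [0.6, 0.0, -0.3]]
b2 = [0.0, 0.0, 0.0, 0.0]

weights = gate(x, w1, b1, w2, b2)
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
fused = aggregate(expert_outputs, weights)
```

Because the gate weights depend on `x`, different inputs blend the experts differently, which is the adaptive behaviour described above.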

Figure 2: ST-MoE system: adaptive input routing to diverse spatiotemporal experts constructed from bidirectional temporal and spatial Mamba modules, followed by prediction decoding.

Spatiotemporal Expert Design

Four distinct expert types (ST, TS, SS, and TT) are instantiated by varying which bidirectional spatial and temporal Mamba operations are applied and in what order, so each expert captures a different dependency configuration. This avoids the rigidity of prior positional encoding methods and enables comprehensive modeling of complex motion dynamics. Each Mamba module scans its sequence in both the forward and backward directions, which lets it capture global dependencies along the spatial and temporal axes.
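The way four experts can be built from just two shared modules is sketched below. This is a toy under stated assumptions, not the paper's implementation: a fixed linear recurrence stands in for Mamba's selective scan, features are scalars on a frames-by-joints grid, "ST" is read as spatial-then-temporal, and a `tanh` is applied after each module. The nonlinearity matters for the toy, because purely linear scans along different axes commute and would make ST and TS identical.

```python
import math


def scan(seq, decay):
    """Toy recurrence h_t = decay * h_{t-1} + x_t (stand-in for a selective scan)."""
    out, h = [], 0.0
    for v in seq:
        h = decay * h + v
        out.append(h)
    return out


def bidirectional_scan(seq, decay):
    """Sum of a forward scan and a backward scan over the same sequence."""
    fwd = scan(seq, decay)
    bwd = scan(seq[::-1], decay)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def spatial_module(grid, decay=0.5):
    """Scan across joints within each frame; grid[t][j] is frame t, joint j."""
    return [bidirectional_scan(row, decay) for row in grid]


def temporal_module(grid, decay=0.5):
    """Scan across frames separately for each joint."""
    columns = [bidirectional_scan(list(col), decay) for col in zip(*grid)]
    return [list(row) for row in zip(*columns)]


def activate(grid):
    """Elementwise tanh, an assumed nonlinearity between modules."""
    return [[math.tanh(v) for v in row] for row in grid]


# The same two module functions are reused by every expert (shared parameters).
S, T = spatial_module, temporal_module
ORDERS = {"ST": (S, T), "TS": (T, S), "SS": (S, S), "TT": (T, T)}


def run_expert(name, grid):
    """Apply the two modules of the named expert in order."""
    first, second = ORDERS[name]
    return activate(second(activate(first(grid))))


# Hypothetical features: 3 frames x 2 joints.
features = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.0]]
outputs = {name: run_expert(name, features) for name in ORDERS}
```

Each entry of `outputs` has the same shape as `features`, so the four results can be blended directly by a gating network.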

Empirical ablation studies confirm that heterogeneous expert fusion outperforms homogeneous expert selection, with optimal predictive accuracy achieved when all expert types are simultaneously activated.

Figure 3: Left: Prediction accuracy vs expert activation count; Right: Influence of MoE layer stacking depth—single-layer design yields best generalization.

Numerical Results and Efficiency

Extensive evaluation on the CMU-Mocap (UMPM), Mix1, Mix2, and CHI3D datasets shows consistently better performance than baseline models including IAFormer, TBIFormer, and JRFormer, as measured by both Joint Position Error (JPE) and Aligned Position Error (APE). ST-MoE also generalizes well as the number of individuals in the scene increases, which indicates robust scalability in complex, real-world scenes.
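The summary names the two metrics without defining them. The sketch below assumes the definitions commonly used in multi-person motion prediction: JPE as the mean Euclidean distance between predicted and ground-truth joints in global coordinates, and APE as the same distance after subtracting each person's root joint, which removes global translation. The root index and the sample poses are hypothetical.

```python
import math


def jpe(pred, target):
    """Mean Euclidean joint distance over all persons and joints (one frame)."""
    dists = [
        math.dist(p, t)
        for pred_person, target_person in zip(pred, target)
        for p, t in zip(pred_person, target_person)
    ]
    return sum(dists) / len(dists)


def remove_root(person, root=0):
    """Express every joint relative to the person's root joint."""
    origin = person[root]
    return [tuple(c - o for c, o in zip(joint, origin)) for joint in person]


def ape(pred, target, root=0):
    """Joint error after removing each person's global translation (assumed definition)."""
    aligned_pred = [remove_root(person, root) for person in pred]
    aligned_target = [remove_root(person, root) for person in target]
    return jpe(aligned_pred, aligned_target)


# One person with two joints; the prediction has the right pose but is
# shifted by 0.3 along x, so only the global position is wrong.
target = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
pred = [[(0.3, 0.0, 0.0), (0.3, 1.0, 0.0)]]

global_error = jpe(pred, target)
pose_error = ape(pred, target)
```

In this example the global error is nonzero while the aligned error vanishes, which shows why the two metrics are reported together: one penalizes trajectory drift, the other isolates pose quality.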

ST-MoE achieves a 41.38% reduction in parameter count and a 3.6x acceleration in training speed over the state-of-the-art IAFormer, attesting to the computational efficiency of linear-time Mamba blocks. These gains do not come at the cost of accuracy: average JPE and APE are consistently lower across all datasets and time horizons, including the real multi-person sequences from CHI3D.
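As a quick reading aid, the two headline figures can be converted into fractions of the IAFormer baseline. The only inputs are the reported 41.38% and 3.6x; treating the speedup as a ratio of training times is an assumption.

```python
param_reduction = 0.4138  # reported: 41.38% fewer parameters than IAFormer
train_speedup = 3.6       # reported: 3.6x faster training

# Fraction of the baseline's parameters that ST-MoE keeps.
remaining_params = 1.0 - param_reduction

# Fraction of the baseline's training time, if speedup is a time ratio.
relative_train_time = 1.0 / train_speedup

print(f"parameters kept: {remaining_params:.2%}")    # 58.62%
print(f"training time:   {relative_train_time:.2%}")  # 27.78%
```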

Qualitative Analysis of Spatiotemporal Representation

Comparative visualizations highlight ST-MoE’s ability to model nontrivial transitions between dynamic and static motion with minimal drift. Attention-based methods such as IAFormer, by contrast, exhibit error accumulation and model these transitions poorly.

Figure 4: ST-MoE recovers complex, multi-person movements with high fidelity, closely matching ground truth sequences on CMU-Mocap (UMPM).

Further t-SNE analyses of expert feature spaces reveal distinctly separated clusters, evidencing the specialization and complementary nature of expert representations.

Figure 5: t-SNE visualization of expert-learned features: four distinct experts yield highly separable spatiotemporal embeddings on CHI3D and Mix2.

Adaptive gating weight visualizations demonstrate dynamic expert selection consistent with spatiotemporal complexity: static scenes primarily activate ST/TT experts, while spatially dynamic actions such as running allocate higher weights to SS/TS experts.

Figure 6: Gating weights are adaptively modulated based on spatiotemporal context, effecting specialized motion prediction across scenes.

Theoretical and Practical Implications

ST-MoE reconceptualizes multi-person motion prediction as an adaptive, expert-driven spatiotemporal modeling problem, leveraging state-space models for scalability and efficiency. It fundamentally addresses limitations in representational flexibility and computational cost found in prior attention/encoding paradigms.

This approach signals a trajectory toward lightweight, modular architectures—highly relevant for edge deployment, real-time inference, and cross-modal applications where low-latency is critical. The fusion of MoE and state-space modeling also opens prospects for scalable spatiotemporal understanding in domains beyond motion prediction, such as video understanding, behavior forecasting, and interactive autonomous systems.

Future Directions

ST-MoE currently focuses on deterministic prediction. A natural next step is stochastic multi-person motion prediction, with generative components that synthesize diverse future trajectories. Incorporating uncertainty quantification, conditional dynamics, and hierarchical temporal abstraction within the MoE framework could further improve adaptability to unpredictable, real-world motion patterns.

Figure 7: Additional qualitative comparisons highlight ST-MoE’s ability to capture fine-grained spatiotemporal details under modality and trajectory complexity.

Conclusion

ST-MoE introduces a formal, lightweight Mixture-of-Experts model equipped with bidirectional spatiotemporal Mamba modules for multi-person motion prediction. It achieves a strong balance between predictive accuracy and computational efficiency, validated across multiple standardized benchmarks. The model’s modular expert architecture, adaptive routing, and efficient state-space backbone have significant implications for spatiotemporal sequence modeling in applied and theoretical contexts. Further research into probabilistic modeling and broader domain adaptation is warranted.