
H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers

Published 8 Sep 2025 in cs.CV, cs.AI, and cs.LG | (2509.06956v1)

Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.

Summary

  • The paper introduces H2OT, a novel hierarchical framework that prunes and recovers pose tokens in video transformers for efficient 3D human pose estimation.
  • It employs a pyramidal token pruning strategy, pairing selection methods such as TPS with recovery methods such as TRI, achieving up to 66 FPS with negligible accuracy loss.
  • The approach significantly reduces computational cost by lowering FLOPs by up to 57.4% while maintaining state-of-the-art performance.

Hierarchical Hourglass Tokenizer (H$_{2}$OT): Efficient Pruning and Recovery for Video Pose Transformers

The paper introduces H$_{2}$OT, a hierarchical, plug-and-play pruning-and-recovering framework designed to improve the efficiency of transformer-based 3D human pose estimation from videos. The approach systematically reduces computational redundancy in Video Pose Transformers (VPTs) by dynamically pruning pose tokens and subsequently recovering full-length sequences, enabling significant acceleration without compromising estimation accuracy.

Motivation and Background

Transformer-based architectures have established state-of-the-art performance in video-based 3D human pose estimation due to their capacity for modeling long-range dependencies. However, the quadratic complexity of self-attention with respect to the number of tokens (i.e., video frames) results in prohibitive computational costs, especially for long sequences (e.g., 243–351 frames). This computational burden limits the deployment of VPTs in resource-constrained environments.

Existing VPTs typically retain the full-length sequence throughout all transformer blocks, leading to redundant computation, as adjacent frames often contain highly similar pose information. Prior work (e.g., HoT) introduced a non-hierarchical hourglass paradigm for token pruning and recovery, but lacked a pyramidal, hierarchical design and imposed additional inference overhead.

H$_{2}$OT Framework Overview

H$_{2}$OT extends the hourglass paradigm by introducing a hierarchical, pyramidal pruning strategy, forming a "trophy-shaped" token flow through the network. The framework consists of two core modules:

  • Token Pruning Module (TPM): Progressively prunes pose tokens at multiple stages, selecting a small set of representative tokens to reduce redundancy.
  • Token Recovering Module (TRM): Recovers the full-length sequence from the pruned tokens, enabling dense 3D pose estimation for all frames.

    Figure 1: Comparison of VPT paradigms: (a) rectangle (full sequence), (b) hourglass (single-stage pruning), (c) H$_{2}$OT's hierarchical pyramidal pruning.


    Figure 2: H$_{2}$OT architecture overview, showing the integration of TPM and TRM within a standard VPT pipeline.
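The prune-then-recover token flow described above can be sketched as a minimal PyTorch module. This is an illustrative reconstruction, not the authors' implementation: the class name `HourglassVPT`, the tiny default dimensions, and the use of uniform sampling as the pruning step are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class HourglassVPT(nn.Module):
    """Minimal sketch of the hierarchical prune-then-recover token flow.
    Names and sizes are illustrative, not the authors' implementation."""

    def __init__(self, dim=32, depth=4, prune_after=(0, 1), keep=(9, 5)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=64,
                                       batch_first=True)
            for _ in range(depth)
        )
        # block index -> number of tokens kept after that block
        self.schedule = dict(zip(prune_after, keep))

    def forward(self, tokens):                        # tokens: (B, T, C)
        kept = torch.arange(tokens.shape[1])          # original frame indices
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if i in self.schedule:                    # TPM step (uniform sampling)
                n = self.schedule[i]
                sel = torch.linspace(0, tokens.shape[1] - 1, n).round().long()
                tokens, kept = tokens[:, sel], kept[sel]
        return tokens, kept                           # TRM restores full length later
```

Later blocks thus attend over progressively fewer tokens, which is where the savings come from; the returned `kept` indices let a recovery module map the surviving tokens back onto the original timeline.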

Token Pruning Module (TPM)

TPM is responsible for reducing the number of tokens in the intermediate transformer blocks. The hierarchical design prunes tokens at multiple depths, creating a pyramidal feature hierarchy. Four token selection strategies are explored:

  • Token Pruning Cluster (TPC): Clusters spatially pooled pose tokens using DPC-kNN and selects cluster centers as representatives.

    Figure 3: TPC architecture: spatial pooling, clustering, and selection of cluster centers as representative tokens.

  • Token Pruning Attention (TPA): Utilizes attention scores from the transformer to select tokens with the highest aggregate attention.
  • Token Pruning Motion (TPMo): Selects frames with the largest pose motion, targeting keyframes with significant movement.
  • Token Pruning Sampler (TPS): Uniformly samples tokens at fixed intervals, exploiting temporal redundancy for efficiency.

Empirical analysis demonstrates that TPS, when combined with efficient recovery, achieves the best trade-off between speed and accuracy: it is parameter-free and preserves the temporal order of the selected tokens, which enables fast interpolation-based recovery.
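As a contrast to the parameter-free TPS, a learned-free attention-based selection step (in the spirit of TPA) might look like the following. This is our reading of the strategy, with assumed tensor shapes: each token is scored by the attention it receives, summed over heads and queries, and the top-scoring tokens are kept.

```python
import torch

def tpa_select(tokens, attn, n_keep):
    """Sketch of attention-score token selection (TPA-style, our reading).
    tokens: (B, T, C); attn: (B, H, T, T) attention weights from a block."""
    scores = attn.sum(dim=(1, 2))                     # (B, T): attention received
    idx = scores.topk(n_keep, dim=1).indices.sort(dim=1).values
    batch = torch.arange(tokens.shape[0]).unsqueeze(1)
    return tokens[batch, idx], idx                    # keep top-n, in time order
```

Note that unlike TPS, the selected indices here are data-dependent and generally non-uniform, so order-preserving interpolation is no longer a natural recovery choice; this is consistent with the paper's observation that TPS pairs best with interpolation-based recovery.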

Token Recovering Module (TRM)

TRM restores the full-length sequence required for dense 3D pose estimation. Two recovery strategies are proposed:

  • Token Recovering Attention (TRA): Employs a multi-head cross-attention layer, where learnable zero-initialized tokens attend to the representative tokens to reconstruct the full sequence.

    Figure 4: TRA mechanism: learnable queries attend to representative tokens to recover full-length pose tokens.

  • Token Recovering Interpolation (TRI): Applies linear interpolation to the regressed 3D poses of the representative tokens, reconstructing the full sequence efficiently. TRI is particularly effective when used with TPS due to the ordered nature of sampled tokens.
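The TRI strategy amounts to one-dimensional linear interpolation along time. A minimal sketch, assuming poses are regressed from uniformly sampled (ordered) frames and using PyTorch's built-in interpolation; the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def tri_recover(poses, full_len):
    """Sketch of interpolation-based recovery (TRI-style).
    poses: (B, n, J, 3) 3D poses regressed from the n kept frames,
    assumed to be in temporal order. Returns (B, full_len, J, 3)."""
    B, n, J, C = poses.shape
    x = poses.permute(0, 2, 3, 1).reshape(B, J * C, n)   # (B, J*3, n)
    x = F.interpolate(x, size=full_len, mode="linear", align_corners=True)
    return x.reshape(B, J, C, full_len).permute(0, 3, 1, 2)
```

Because interpolation has no parameters and runs in a single vectorized call, its cost is negligible next to the transformer blocks, which is why TPS+TRI dominates the speed/accuracy trade-off in the ablations.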

Integration with VPT Pipelines

H$_{2}$OT is designed to be model-agnostic and can be integrated into both seq2seq and seq2frame VPT pipelines:

  • Seq2seq: TPM prunes tokens after initial transformer blocks; TRM recovers the full sequence before the regression head, enabling efficient, dense 3D pose estimation.
  • Seq2frame: Only TPM is used; the center frame's token is always retained to ensure accurate regression for the target frame.

    Figure 5: Standard VPT architecture, highlighting the insertion points for TPM and TRM.


    Figure 6: H$_{2}$OT applied to the seq2frame pipeline, with TPM selecting representative tokens and regression performed on the center frame.
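For the seq2frame case, the center-frame constraint can be folded into the sampling step. This is a hypothetical sketch of one way to enforce it (the nearest-neighbor replacement rule is our choice, not specified by the source):

```python
import torch

def tps_select_center(tokens, n_keep):
    """Uniformly sample n_keep tokens along time, forcing the center frame's
    token into the kept set (seq2frame sketch; replacement rule is ours)."""
    T = tokens.shape[1]                               # tokens: (B, T, C)
    idx = torch.linspace(0, T - 1, n_keep).round().long()
    center = T // 2
    if center not in idx:
        # swap the sampled index nearest to the center for the center itself
        j = (idx - center).abs().argmin()
        idx[j] = center
    idx = idx.sort().values
    return tokens[:, idx], idx
```

The rest of the pipeline is unchanged: only pruned tokens flow through the remaining blocks, and the regression head reads off the (always present) center token.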

Empirical Evaluation

Ablation Studies

Comprehensive ablations on Human3.6M demonstrate:

  • Token Pruning/Recovery: TPS+TRI achieves the highest efficiency (up to 66 FPS, with essentially zero time spent on the pruning/recovery steps themselves) with negligible accuracy loss (MPJPE increase <0.2mm).
  • Hierarchical Pruning: Multi-stage pruning (e.g., $r=[121,81]$, $b=[0,3]$) yields the best trade-off, reducing FLOPs by 57.4% and doubling inference speed, with a 0.5mm improvement in MPJPE over the baseline.
  • Comparison with Flat Pruning: Hierarchical pruning outperforms single-stage (flat) pruning in both efficiency and accuracy.

    Figure 7: Visualization of token selection patterns for different pruning strategies across video sequences.
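A back-of-envelope calculation shows why pyramidal pruning cuts FLOPs so sharply: per-block self-attention cost grows roughly quadratically in the token count. The sketch below assumes an 8-block VPT on a 243-frame sequence and reads $b=[0,3]$ as "prune after blocks 0 and 3" (our interpretation); it also ignores the token-linear FFN cost, so it only approximates the reported 57.4% total reduction.

```python
def relative_attention_cost(full_len, depth, prune_after, keep):
    """Ratio of pruned to full-sequence self-attention cost, modeling each
    block's attention as O(T^2) in its token count T (rough approximation)."""
    lens, t = [], full_len
    for i in range(depth):
        lens.append(t)                      # token count seen by block i
        if i in prune_after:                # prune after this block
            t = keep[prune_after.index(i)]
    baseline = depth * full_len ** 2        # no pruning: T tokens everywhere
    return sum(T ** 2 for T in lens) / baseline

# 243 frames, 8 blocks, pruning to 121 then 81 tokens:
# relative_attention_cost(243, 8, [0, 3], [121, 81]) is roughly 0.27,
# i.e. most of the attention compute is removed.
```

The exact savings reported in the paper also depend on the FFN layers and the recovery module, but the quadratic attention term dominates for long sequences.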

Comparison with State-of-the-Art

H$_{2}$OT, when integrated with leading VPTs (MHFormer, MixSTE, MotionBERT, MotionAGFormer), consistently reduces FLOPs by 36–63% and increases FPS by 44–88%, with no significant degradation in MPJPE. In some cases, accuracy is improved due to reduced overfitting to redundant frames.

  • Human3.6M: H$_{2}$OT with MixSTE achieves 40.5mm MPJPE (vs. 40.9mm for baseline) with 57.4% fewer FLOPs.
  • MPI-INF-3DHP: Comparable or improved PCK/AUC/MPJPE across all tested models.

Robustness and Generalization

  • Low-FPS Scenarios: H$_{2}$OT maintains competitive performance even as input FPS is reduced, confirming its robustness to varying temporal resolutions.
  • Diffusion-based VPTs: H$_{2}$OT is compatible with diffusion-based pose estimators, yielding similar efficiency gains.

Qualitative Analysis


Figure 8: Qualitative results on challenging in-the-wild videos, demonstrating accurate 3D pose estimation.


Figure 9: Failure cases in scenarios with occlusion, rare poses, or 2D detector errors.


Figure 10: Visualization of token selection and recovery: red frames/tokens are selected, gray are pruned, and 3D pose recovery is highlighted.

Implementation Considerations

  • Computational Requirements: H$_{2}$OT is compatible with standard PyTorch-based VPT implementations. The hierarchical pruning and recovery modules are lightweight and introduce minimal overhead.
  • Hyperparameter Selection: The number and placement of pruning stages ($r$, $b$) can be tuned to balance speed and accuracy for specific deployment constraints.
  • Deployment: The framework is suitable for real-time applications and resource-constrained devices, given its substantial reduction in memory and compute requirements.

Theoretical and Practical Implications

H$_{2}$OT demonstrates that full-length token retention in VPTs is unnecessary for accurate 3D pose estimation. Hierarchical token pruning, combined with efficient recovery, enables significant acceleration and memory savings. This paradigm is generalizable to other sequence modeling tasks where temporal redundancy is prevalent.

The findings challenge the prevailing assumption that longer input sequences always yield better performance in transformer-based video models. Instead, careful token selection and recovery can yield equivalent or superior results with a fraction of the computational cost.

Future Directions

  • Adaptive Pruning Policies: Learning data-dependent, task-specific pruning schedules could further optimize the trade-off between efficiency and accuracy.
  • Extension to Other Modalities: The hierarchical pruning-recovery paradigm may be applicable to video action recognition, video captioning, or other dense prediction tasks.
  • Integration with Efficient Attention Mechanisms: Combining H$_{2}$OT with sparse or linear attention variants could yield further efficiency gains.

Conclusion

H$_{2}$OT provides a principled, hierarchical approach to token pruning and recovery in video pose transformers, enabling substantial efficiency improvements without sacrificing accuracy. The framework is model-agnostic, robust across datasets and pipelines, and offers a practical solution for deploying high-performance 3D human pose estimation in real-world, resource-constrained environments.
