
CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset

Published 19 Jun 2025 in cs.CV, cs.AI, and cs.LG | (2506.16385v1)

Abstract: Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.

Summary

  • The paper introduces CLIP-MG, a framework that fuses skeletal pose and RGB data to achieve fine-grained micro-gesture recognition via pose-guided semantic attention.
  • It employs a CLIP visual encoder and a 3D CNN-based pose encoder to generate semantic queries and gate multimodal features, boosting Top-1 accuracy to 61.82% on iMiGUE.
  • Ablation studies demonstrate that integrating pose guidance and cross-attention significantly improves performance, underscoring the value of modality-adaptive fusion in gesture analysis.

CLIP-MG: Pose-Guided Semantic Attention for Micro-Gesture Recognition

The paper "CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset" (2506.16385) investigates the problem of fine-grained micro-gesture classification in affective computing, targeting the subtle and involuntary nature of micro-gestures using multimodal data from the iMiGUE benchmark. The authors introduce CLIP-MG, a principled approach for integrating skeletal pose information with RGB frames to enhance the semantic focus and discriminative power of CLIP-based architectures for this challenging domain.

Overview of Methodology

The proposed system leverages both visual (RGB) and skeleton (pose) modalities using a combination of architectural elements:

  1. CLIP Visual Encoder: A ViT-B/16 model, with most layers frozen, processes the RGB frames. Only the last two transformer blocks are fine-tuned.
  2. Skeleton Encoder: Pose features are extracted using OpenPose, aggregated temporally, and rasterized as Gaussian heatmaps, followed by a 3D convolutional encoder.
  3. Pose-Guided Semantic Query Generation: The skeleton features are employed to identify and weight visual tokens spatially and temporally, effectively highlighting regions close to active joints where micro-gestures are likely to occur.
  4. Gated Multi-modal Fusion: Inspired by gated multimodal units, learned gates modulate both the semantic query and visual features, enabling adaptive reliance on either modality based on signal reliability.
  5. Cross-Attention: A pose-guided semantic query is injected into the transformer’s last layer, enabling focused pooling over visual features informed by pose.
  6. Classification Head: The final embedding is classified via a two-layer MLP.
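Components 2 and 3 above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the heatmap resolution, patch grid size, and function names are all assumptions chosen to match a ViT-B/16 token layout. Keypoints are rasterized into per-joint Gaussian heatmaps, then pooled to the ViT patch grid to produce one attention weight per visual token.

```python
import numpy as np

def gaussian_heatmap(keypoints, h=56, w=56, sigma=2.0):
    """Rasterize (x, y) joint coordinates into per-joint Gaussian heatmaps."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.stack([
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        for x, y in keypoints
    ])
    return maps  # shape: (num_joints, h, w)

def pose_guided_token_weights(heatmaps, grid=14):
    """Pool joint heatmaps down to the ViT patch grid and normalize them
    into a weight per visual token (higher near active joints)."""
    j, h, w = heatmaps.shape
    ph, pw = h // grid, w // grid
    pooled = heatmaps.reshape(j, grid, ph, grid, pw).mean(axis=(0, 2, 4))
    weights = pooled.flatten()
    return weights / weights.sum()

kps = [(10.0, 12.0), (30.0, 40.0)]  # two illustrative joints
w = pose_guided_token_weights(gaussian_heatmap(kps))
print(w.shape)  # (196,) -- one weight per ViT-B/16 token on a 14x14 grid
```

In the full model these weights would bias which visual tokens contribute to the semantic query, concentrating attention on regions near active joints.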

Schematic of the Pipeline

[RGB Frames] ----------> [CLIP ViT-B/16 Visual Encoder] --------|
                                                                |--> [Pose-Guided Semantic Query Gen] --> [Gated Fusion] --> [Cross-Attention over CLIP tokens] --> [MLP Classifier]
[Skeleton Sequences] --> [OpenPose Extraction + 3D CNN Encoder] |
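The last two stages of the pipeline, gated fusion and cross-attention pooling, can be illustrated with a minimal single-head NumPy sketch. All dimensions are made up, and the gate here is a fixed parameter vector for demonstration; in the paper the gates are learned end-to-end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(query_pose, query_vis, gate_param):
    """A sigmoid gate blends the pose-derived query with the visual query,
    letting the model lean on whichever modality is more reliable."""
    g = 1.0 / (1.0 + np.exp(-gate_param))  # elementwise gate in (0, 1)
    return g * query_pose + (1.0 - g) * query_vis

def cross_attention_pool(query, tokens):
    """Single-head cross-attention: the fused query attends over the CLIP
    visual tokens, producing one pooled embedding for the classifier."""
    scores = tokens @ query / np.sqrt(query.size)
    attn = softmax(scores)
    return attn @ tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 512))  # 14x14 ViT tokens, dim 512
q = gated_fusion(rng.standard_normal(512),  # pose-guided query (illustrative)
                 tokens.mean(axis=0),       # visual query (illustrative)
                 rng.standard_normal(512))  # gate parameters
emb = cross_attention_pool(q, tokens)
print(emb.shape)  # (512,)
```

The pooled embedding then feeds the two-layer MLP classification head described above.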

Empirical Results

On the iMiGUE dataset (33-class micro-gesture recognition), CLIP-MG delivers a Top-1 accuracy of 61.82%. This performance surpasses conventional single-modality baselines, such as ST-GCN (46.38%) and TRN (55.24%), and is competitive with Vision Transformer and 3D-CNN baselines operating on single modalities. However, CLIP-MG trails behind state-of-the-art multimodal fusion strategies from recent MiGA challenges, including prototype-based learning (70.25%) and advanced ensemble/fusion models (68.90–70.19%).

The ablation study demonstrates the critical impact of each model component:

  • Removing the pose branch leads to a drastic performance decrease (from 61.82% to 45.30%).
  • Excluding pose guidance from semantic query generation reduces accuracy by more than 10 percentage points.
  • Disabling cross-attention or gating mechanisms also causes notable drops, although gating contributes a relatively smaller gain.

Implications and Theoretical Significance

CLIP-MG’s framework offers a modular and interpretable method for tackling a core problem of action and affect recognition: extracting semantically relevant, localized representations for subtle motion patterns that are often lost in global pooling or simple feature concatenation. By using pose signals as soft spatiotemporal priors, CLIP-MG addresses the ambiguity inherent in global visual embeddings for fine-grained gestures. The introduction of learnable, modality-adaptive gating is particularly practical, as it enables the model to fall back on the stronger modality under the noise or occlusion scenarios typical of real-world deployment.

From a theoretical standpoint, the integration of semantic query generation inspired by SC-CLIP and gated fusion mechanisms provides a compelling approach for other domains with sparse, localized, or low-amplitude discriminative cues (e.g., low-level affective signals or pre-symptomatic movement disorders).

Limitations

Despite its interpretability and efficiency, CLIP-MG does not advance the state of the art in terms of raw performance. There are several contributing factors:

  • The cross-modal approach, while semantically focused, may under-exploit temporal dynamics, relying on static or averaged features.
  • The reliance on pre-trained and mostly frozen CLIP weights may bottleneck domain adaptation, given the domain gap between CLIP’s training data and micro-gesture distributions.
  • Skeleton feature extraction is dependent on OpenPose, which may introduce noise under occlusion or profile views, partially mitigated by the gating mechanism but not eliminated.

Broader Impact and Future Directions

The CLIP-MG methodology serves as a blueprint for future multimodal, explainable gesture and affective recognition systems, suggesting several directions:

  • Enhanced Temporal Modeling: Incorporation of sequence models (temporal transformers, recurrent networks) post-cross-attention could further exploit temporal cues.
  • Motion Magnification and Data Augmentation: Techniques such as Eulerian motion magnification may amplify subtle cues, supporting both skeleton estimation and visual attention.
  • Hybrid Training Paradigms: Leveraging weakly-supervised or contrastive pretraining with diverse gesture datasets could improve generalization and robustness.
  • Class Imbalance Mitigation: Leveraging prototype-based loss functions or class-balanced sampling may address long-tailed micro-gesture distributions.
  • Efficiency and Deployment: Exploring token-pruning, multimodal compression, and dynamic attention (e.g., TokenCarve and MADTP) are relevant for edge or real-time settings.

Conclusion

CLIP-MG demonstrates that pose-guided semantic attention can be effectively integrated with vision-language models for the challenging problem of micro-gesture recognition, providing a path towards more modular, interpretable, and domain-adaptive architectures. While not yet surpassing ensemble-based or heavily-tuned multimodal baselines, CLIP-MG's design choices clarify the relative value of semantic guidance, pose priors, and adaptive fusion—insights valuable for both practical application and further research in multimodal, fine-grained action understanding.
