
Yoga-VI: View-Independent Pose Classification

Updated 22 December 2025
  • Yoga-VI is a view-independent framework that combines two-stage keypoint extraction with robust classification for precise yoga pose recognition.
  • It leverages a Random Forest classifier and fine-tuned contrastive language-image models to achieve high accuracy across diverse viewpoints.
  • The model demonstrates efficient training and inference, with scalable performance even in low-data regimes and under stringent evaluation protocols.

Yoga-VI refers to “View-Independent Yoga Classification,” denoting a series of frameworks aimed at robust, generalizable yoga asana (posture) recognition in unconstrained visual scenarios. These frameworks leverage advances in human pose estimation, transfer learning, and, more recently, contrastive multimodal representation learning to address both view invariance and classification efficiency. Early Yoga-VI formulations deploy a modular decoupling of pose estimation and classification, while contemporary incarnations incorporate fine-tuned language-image models for scalable, data-frugal asana recognition (Chasmai et al., 2022, Dobrzycki et al., 13 Jan 2025).

1. Architectural Foundations

Two-Stage Keypoint-to-Class Pipeline

The initial formulation of Yoga-VI employs a two-stage pipeline: body keypoint extraction followed by multiclass asana classification (Chasmai et al., 2022). The body keypoints are extracted using a top-down pose estimation model (AlphaPose pre-trained on Halpe Full-Body), yielding 136 anatomical points including face, hands, and body landmarks. Post-extraction, dimensionality reduction is achieved by forming summary statistics over face and hands (top-10 confidences: mean, min, max) while retaining all 26 body keypoints, plus a normalized bounding-box aspect ratio. The resulting feature vector $\mathbf{x} \in \mathbb{R}^{71}$ undergoes normalization with respect to the human bounding box.
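The two feature-engineering steps described above can be sketched as follows. This is a minimal illustration, not the paper's exact 71-dimensional layout; the helper names and the toy input are assumptions for demonstration.

```python
import numpy as np

def normalize_keypoints(kpts, bbox):
    """Normalize (x, y) keypoints to the human bounding box.

    kpts: (K, 2) array of pixel coordinates.
    bbox: (x0, y0, w, h) of the detected person.
    """
    x0, y0, w, h = bbox
    return (kpts - np.array([x0, y0])) / np.array([w, h])

def confidence_summary(confs, k=10):
    """Mean/min/max over the top-k keypoint confidences (used for face/hands)."""
    top = np.sort(confs)[-k:]
    return np.array([top.mean(), top.min(), top.max()])

# Toy example: 26 body keypoints plus a face confidence summary and aspect ratio.
rng = np.random.default_rng(0)
body = normalize_keypoints(rng.uniform(0, 100, (26, 2)), (10, 20, 80, 160)).ravel()
face_stats = confidence_summary(rng.uniform(0, 1, 68))   # 68 face keypoints
aspect_ratio = np.array([80 / 160])                      # normalized bbox aspect ratio
features = np.concatenate([body, face_stats, aspect_ratio])
```

Hand confidences would be summarized the same way before concatenation into the final feature vector.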

Classifier Model

The classifier $f : \mathbb{R}^{71} \to \{1, \dots, C\}$ is instantiated as a Random Forest comprising 500 fully grown decision trees, with splits chosen by the Gini impurity criterion and predictions made by majority vote. Alternative classifiers (AdaBoost, LightGBM, Bagging) were empirically compared, but the Random Forest backbone sustained a favorable balance of classification accuracy and inference latency.
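A classifier with this configuration can be set up in a few lines with scikit-learn; the synthetic data below is a stand-in for the 71-dimensional pose features, not the Yoga-VI dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in for the 71-dim pose features; labels mimic asana classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 71))
y = (X[:, 0] > 0).astype(int)  # toy 2-class target

# 500 fully grown trees (max_depth=None), Gini split criterion, majority vote.
clf = RandomForestClassifier(
    n_estimators=500, criterion="gini", max_depth=None, random_state=0
)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

`max_depth=None` lets each tree grow until leaves are pure, matching the "fully grown" description.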

Multimodal Deep Architectures

The updated Yoga-VI frameworks utilize a contrastive language-image pre-training (CLIP) backbone, structuring the model as a two-stream architecture: $f_{\mathrm{img}}$ encodes RGB images, $f_{\mathrm{txt}}$ encodes pose-descriptive text prompts (“Image of a person doing the yoga pose <class>”), and cosine similarity in embedding space operationalizes class assignment (Dobrzycki et al., 13 Jan 2025). Fine-tuning is performed jointly on both encoders using a symmetric InfoNCE loss:

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{\mathrm{img}} + \mathcal{L}_{\mathrm{txt}}\right)$$

where

$$\mathcal{L}_{\mathrm{img}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\operatorname{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\operatorname{sim}(v_i, t_j)/\tau)}$$

and

$$\mathcal{L}_{\mathrm{txt}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\operatorname{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\operatorname{sim}(v_j, t_i)/\tau)}$$

with $\operatorname{sim}$ denoting cosine similarity and $\tau$ the learned temperature.
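The symmetric loss can be computed directly from a batch of paired embeddings; this NumPy sketch mirrors the two softmax directions (image-to-text over rows, text-to-image over columns) and is an illustration, not the authors' training code.

```python
import numpy as np

def symmetric_info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE over paired image (v) and text (t) embeddings.

    v, t: (N, D) arrays; row i of v is paired with row i of t.
    """
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / tau  # (N, N) cosine similarities scaled by 1/tau

    def xent_on_diagonal(m):
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    l_img = xent_on_diagonal(logits)    # L_img: softmax over text candidates j
    l_txt = xent_on_diagonal(logits.T)  # L_txt: softmax over image candidates j
    return 0.5 * (l_img + l_txt)
```

With perfectly matched pairs and a small temperature, the loss approaches zero; mismatched pairs drive it up, which is what pulls the two encoders into a shared embedding space during fine-tuning.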

2. Datasets and Preprocessing

In-House Yoga-VI Dataset

The foundational Yoga-VI dataset comprises synchronized video recordings of 51 subjects executing 20 asanas plus a “still” class, tracked with four static room-mounted cameras. This quadriview design facilitates analysis of view invariance, yielding 3,532 successful video captures and 72,000 uniformly sampled annotated frames across 12 final classes. Bilateral poses are split into left/right classes. Keypoint annotation is automated via AlphaPose, though ~2–3% potential mislabeling from transition frames is noted.

Public Yoga Pose Benchmarks

Benchmarking extends to:

  • Yadav et al.: 6 asanas, 15 subjects, single-view.
  • Jain et al.: 10 asanas, 27 subjects, single-view video; baseline 3D-CNN end-to-end recognition.
  • Yoga-82: 28,000 images representing 82 poses, highly imbalanced and variable in backgrounds and viewpoints (used for CLIP-based Yoga-VI (Dobrzycki et al., 13 Jan 2025)).

Multimodal Preprocessing

For CLIP models, preprocessing includes resizing, center cropping (224×224), and normalization using the ImageNet mean and standard deviation. No geometric augmentations are applied to avoid distorting pose configuration.
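The crop-and-normalize steps can be written directly in NumPy; real pipelines typically use the model's bundled transform, and the resize step (shorter side to 224) is omitted here for brevity. The ImageNet channel statistics are the standard published values.

```python
import numpy as np

# Standard ImageNet channel statistics used for normalization.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def center_crop(img, size=224):
    """Center-crop an (H, W, 3) image to (size, size, 3)."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def preprocess(img):
    """Crop and normalize a uint8 image; no geometric augmentation is applied."""
    img = center_crop(img).astype(np.float64) / 255.0
    return (img - IMAGENET_MEAN) / IMAGENET_STD
```

Note the deliberate absence of flips or rotations, consistent with the goal of preserving the pose configuration.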

3. Evaluation Protocols and Generalization

Yoga-VI’s principal innovation lies in its rigorous three-way evaluation to quantify generalization (Chasmai et al., 2022):

  1. Frame-wise (Random Split): Classical cross-validation across frames, prone to overestimated generalization due to temporal correlations (“target leakage”).
  2. Subject-wise: Entire subjects held out from training; assesses cross-identity robustness.
  3. Camera-wise (View Independence): Split by camera angle—train on $K-1$ viewpoints, test on the remainder, directly measuring view invariance.
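The camera-wise protocol amounts to a leave-one-group-out split keyed on camera identity; a minimal sketch (with a toy frame index, not the Yoga-VI data):

```python
import numpy as np

def leave_one_camera_out(camera_ids, test_camera):
    """Camera-wise split: train on K-1 viewpoints, test on the held-out one."""
    camera_ids = np.asarray(camera_ids)
    test_mask = camera_ids == test_camera
    return np.where(~test_mask)[0], np.where(test_mask)[0]

# Toy frame index: 8 frames captured by 4 static cameras.
cams = [0, 1, 2, 3, 0, 1, 2, 3]
train_idx, test_idx = leave_one_camera_out(cams, test_camera=3)
```

Subject-wise splitting is identical in form, with subject identity as the grouping key; frame-wise splitting ignores the grouping entirely, which is what lets temporal correlation leak into the test set.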

Classification metrics include per-class Precision, Recall, and F1:

$$\mathrm{Precision}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c},\qquad \mathrm{Recall}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c},\qquad F1_c = \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}$$

Global accuracy aggregates all classwise true/false counts.
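All of these quantities fall out of a single confusion matrix; a compact sketch (the zero-denominator case is ignored here for brevity):

```python
import numpy as np

def per_class_metrics(conf):
    """Per-class precision/recall/F1 and global accuracy from a confusion
    matrix `conf`, where conf[i, j] counts true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp  # predicted as c, but true class differs
    fn = conf.sum(axis=1) - tp  # true class c, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / conf.sum()  # global accuracy over all counts
    return precision, recall, f1, accuracy
```

Aggregating true/false counts before dividing, as in the accuracy line, is what distinguishes global accuracy from a simple average of per-class scores.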

High accuracy under the camera-wise split operationally defines view-independence; significant drops indicate residual view dependence. Monotonic F1 degradation with fewer training-camera views (e.g., train on one, test on three: F1 ≈ 65.9%) substantiates the protocol’s rigor.

4. Comparative Analysis and Results

Experimental results delineate the strengths of each architectural generation:

Method                   | Frame-wise F1 | Subject-wise F1 | Camera-wise F1     | Top-1 Accuracy
Yoga-VI RF + AlphaPose   | 99.7%         | 97.99%          | 81.1% (3/1 split)  | –
Kinect keypoints         | 69.5%         | 57.9%           | 35.2%              | –
CLIP (Yoga-82)           | –             | –               | –                  | 85.9%
CLIP (6-class, best)     | –             | –               | –                  | 99.1%
YOLOv8x-cls baseline     | –             | –               | –                  | 87.8%

Yoga-VI Random Forest with AlphaPose outperforms direct Kinect keypoints, highlighting the impact of deep transfer learning. CLIP-based Yoga-VI attains 85.9% top-1 on 82-class Yoga-82, surpassing DenseNet’s prior SOTA by ≈6%, and achieves near-perfect accuracy (99.1%) on six-class, low-data subsets. Training with as few as 20 images per pose yields ~90% accuracy. Compared to YOLOv8x-cls, CLIP matches or exceeds accuracy with 3.5x lower training time (≈14 minutes for CLIP, 48 for YOLOv8), and inference latency of ≈7.1 ms/image supports real-time deployment (Dobrzycki et al., 13 Jan 2025).

5. Strengths, Limitations, and Implementation Realities

Yoga-VI’s principal strengths are modularity, the ability to leverage transfer learning for limited asana data, and stringent, leakage-averse evaluation. By decoupling pose feature extraction from classification, the system can readily absorb advances in either subdomain. CLIP-based Yoga-VI demonstrates scalability to large-scale or few-shot tasks with minimal domain-specific data.

Key limitations persist. No explicit view-invariance loss is used; performance on unseen viewpoints drops by ~20% relative to seen views, especially for highly atypical or self-occluding asanas. The pose keypoint ground truth is automatically generated and unverified, introducing a minor error rate from ambiguous transition frames. For CLIP-based models, some architectural modifications (extra LayerNorm, attention pooling in ResNet variants) enhance stability but do not guarantee cross-domain robustness.

6. Extensions and Future Directions

Future avenues for Yoga-VI include:

  • Incorporating 3D pose reconstruction or domain adaptation to close the view invariance gap.
  • Expanding training datasets with more complex or rare asanas, leveraging synthetic data or generative models for augmentation.
  • Exploring end-to-end fine-tuning with view-independence regularizers.
  • Adapting prompt engineering and LoRA for rapid redeployment to novel posture classification domains (e.g., workplace ergonomics, dance, rehabilitation).
  • Hierarchical inference exploiting Yoga-82’s class structure for coarse-to-fine recognition.

The demonstrated correspondence between dense pose keypoints and SOTA asana recognition, along with CLIP’s competitive speed and training efficiency, positions Yoga-VI as an extensible, baseline-setting framework for multimodal human posture analysis in both research and applied contexts (Chasmai et al., 2022, Dobrzycki et al., 13 Jan 2025).
