Gym FC: Form Correction Pipeline
- Gym FC is a pipeline for automated exercise form correction that uses domain-specific self-supervised pretraining and dual modalities to assess workout errors.
- It employs pose synchronization and harmonic motion disentangling to overcome challenges like occlusion, viewpoint changes, and subtle form anomalies.
- The approach achieves state-of-the-art performance on the Fitness-AQA dataset and generalizes across various exercises with minimal expert annotations.
Gym FC—here referring to Gym Form Correction pipelines for workout error detection—integrates domain knowledge-informed self-supervised learning, robust multi-modal representations, and a curated dataset for automated assessment of exercise form. The paradigm advances beyond conventional pose-based error detection by leveraging synchronized, exercise-specific video features and harmonically structured priors to address the challenges posed by occlusion, viewpoint changes, and subtle form anomalies in real gym environments (Parmar et al., 2022).
1. Self-Supervised Pretraining with Domain Knowledge
The Gym FC approach is predicated on exploiting two exercise-specific domain priors during pretraining:
- Barbell “elevation” for pose synchronization: Synchronization is achieved by detecting and tracking the vertical coordinate of the barbell (or weight) in each frame using a YOLOv3 detector, yielding a normalized parabolic trajectory. This enables frame-to-frame correspondence across videos, views, and subjects at any given elevation.
- Harmonic half-cycles for motion disentangling: The cyclical nature of most resistance exercises (e.g., squats, presses) is used to partition repetitions into two half-cycles, allowing separation of global (expected) from local (anomalous) motion.
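The elevation-synchronization prior can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the YOLOv3 detector is stubbed out as a list of per-frame barbell pixel y-coordinates, and `frames_at_elevation` is an assumed helper name.

```python
# Hypothetical sketch of elevation-based synchronization. Detector outputs
# (e.g., from YOLOv3) are stubbed as per-frame vertical barbell coordinates.

def normalize_elevation(bar_y):
    """Min-max normalize barbell height so 0 = lowest, 1 = highest point.

    Image y grows downward, so flip the sign before normalizing.
    """
    heights = [-y for y in bar_y]
    lo, hi = min(heights), max(heights)
    return [(h - lo) / (hi - lo) for h in heights]

def frames_at_elevation(bar_y, target, tol=0.05):
    """Indices of frames whose normalized elevation lies within tol of target."""
    return [i for i, e in enumerate(normalize_elevation(bar_y))
            if abs(e - target) <= tol]

# Two squat repetitions filmed from different views / subjects:
video_a = [100, 140, 180, 220, 180, 140, 100]   # barbell pixel y per frame
video_b = [300, 360, 420, 360, 300]

# Frames at the same point of the lift can now be paired across videos,
# regardless of camera distance, viewpoint, or subject.
pairs = (frames_at_elevation(video_a, 1.0), frames_at_elevation(video_b, 1.0))
```

Because both trajectories are normalized to [0, 1], a frame at elevation 0.5 in one video corresponds to the same phase of the lift in any other video, which is exactly the correspondence the contrastive pretraining below exploits.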
Two self-supervised tasks operationalize these priors:
1.1 Cross-View Cross-Subject Pose Contrastive Learning (CVCSPC)
- Triplet mining: For an anchor frame at normalized barbell elevation $e$, construct:
- Positive: a frame from any other video at elevation $e_p$ with $|e_p - e| \le \delta$ ($\delta$ small).
- Negative: a frame at elevation $e_n$ with $|e_n - e|$ substantially larger than $\delta$.
- Objective: Train ResNet-18 (ImageNet-initialized) with the “distance-ratio” contrastive loss:

$$\mathcal{L}_{\text{CVCSPC}} = -\log \frac{e^{-\lVert f_a - f_p \rVert_2}}{e^{-\lVert f_a - f_p \rVert_2} + e^{-\lVert f_a - f_n \rVert_2}},$$

where $f_a$, $f_p$, $f_n$ are the feature vectors extracted from the anchor, positive, and negative frames.
This self-supervision encourages pose-sensitive and nuisance-invariant representations, highly robust to viewpoint, clothing, and illumination variance (Parmar et al., 2022).
1.2 Self-Supervised Motion Disentangling (MD)
- Triplet mining: From one normalized repetition, half-cycle $h_1$ is the anchor, half-cycle $h_2$ is the negative, and the positive is an augmented version of $h_1$ (random spatial, appearance, and temporal changes).
- Time-reversal: Random reversal ensures all triplet members retain identical global harmonic structure; only local motion anomalies distinguish them.
- Objective: Use an R(2+1)D-18 backbone (Kinetics-initialized) trained with the distance-ratio contrastive loss over half-cycle clip features:

$$\mathcal{L}_{\text{MD}} = -\log \frac{e^{-\lVert g_a - g_p \rVert_2}}{e^{-\lVert g_a - g_p \rVert_2} + e^{-\lVert g_a - g_n \rVert_2}},$$

where $g_a$, $g_p$, $g_n$ are the clip-level features of the anchor, positive, and negative half-cycles.
This loss isolates subtle joint-level deviations (e.g., knee valgus) by discounting global body motion (Parmar et al., 2022).
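The triplet construction for motion disentangling can be illustrated on a toy 1-D "motion signal"; this is a loose sketch under stated assumptions, with a simple brightness-style jitter standing in for the paper's spatial/appearance/temporal augmentations, and the reversal policy simplified:

```python
import random

def md_triplet(repetition, seed=0):
    """Build a motion-disentangling triplet from one normalized repetition.

    Assumed simplification: the anchor is the first half-cycle, the negative
    is the time-reversed second half-cycle (so its global arc matches the
    anchor's), and the positive is a perturbed copy of the anchor. A shared
    random time reversal is then applied to all three, so global harmonic
    structure stays identical and only local motion distinguishes members.
    """
    rng = random.Random(seed)
    mid = len(repetition) // 2
    anchor = repetition[:mid]
    negative = repetition[mid:][::-1]          # reversed second half-cycle
    jitter = rng.uniform(0.9, 1.1)             # toy appearance augmentation
    positive = [frame * jitter for frame in anchor]
    if rng.random() < 0.5:                     # shared random time reversal
        anchor, positive, negative = anchor[::-1], positive[::-1], negative[::-1]
    return anchor, positive, negative
```

Since the anchor and negative share the same global trajectory by construction, any representation that separates them must encode local deviations, which is the disentangling effect the loss exploits.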
2. Network Architectures and Training Protocols
The system employs a two-branch architecture, enabling both image-based and video-based representations:
| Branch | Backbone | Input | Finetuning Strategy |
|---|---|---|---|
| CVCSPC | ResNet-18 | Single frame (224×224) | Extract a per-frame feature for each frame, aggregate with a 1D ResNet over 200 frames, classify with an FC or 1×1 conv head |
| MD | R(2+1)D-18 | 16-frame clip per half-cycle | End-to-end on two half-cycles (32 frames), MLP head (512→256→classes) |
Finetuning uses a weighted cross-entropy loss to mitigate class imbalance. Pretraining uses the Adam optimizer with batch sizes of 25 (CVCSPC) and 5 (MD), trained for 100 (CVCSPC) and 20 (MD) epochs (Parmar et al., 2022).
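The weighted cross-entropy used at finetuning time can be sketched for the binary error-detection case. The weighting scheme here (upweighting the rare positive class) is a common convention assumed for illustration, not a detail from the paper:

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Weighted binary cross-entropy for imbalanced error detection.

    pos_weight upweights the rare "error present" class; a common choice
    (assumed here) is n_negative / n_positive on the training split.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# One positive among four samples, upweighted 3x so a missed error costs
# roughly as much as the three correct negatives combined.
loss = weighted_bce([0.9, 0.1, 0.1, 0.1], [1, 0, 0, 0], pos_weight=3.0)
```

Without the weighting term, a classifier that always predicts "no error" achieves deceptively low loss on heavily imbalanced error classes.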
3. Fitness-AQA Dataset and Error Taxonomy
The Fitness-AQA dataset underpins this pipeline:
- Exercises: BackSquat (BS), OverheadPress (OHP), BarbellRow (BR).
- Unlabeled subset: ≈10,000 in-the-wild video clips per exercise (YouTube/Instagram), each auto-trimmed to a single repetition.
- Labeled subset: Expert annotations (two certified trainers), 70/15/15% train/val/test splits.
- Taxonomy:
- BackSquat: Knees Inward (KIE), Knees Forward (KFE), Convex Rounded Back (CVRB), Concave Rounded Back (CCRB), Shallow Squat (SS).
- OverheadPress: Elbow Error, Knees Error.
- BarbellRow: Lumbar Angle Error, Torso-Back Angle Error.
Each error label is encoded as an independent binary detection task, so a single repetition may exhibit multiple errors simultaneously.
4. Supervised Error Detection and Evaluation
Finetuning and evaluation leverage the learned representations for expert-labeled error detection:
- BackSquat (controlled environment): CVCSPC (image-based) achieves 95.92% accuracy on KIE, outperforming 3D-pose models such as HMR-TDM (89.80%) (Parmar et al., 2022).
- BackSquat (in-the-wild, imbalanced): CVCSPC (image) achieves F₁=0.5195 (KIE), surpassing baselines such as OpenPose-TDM (F₁=0.4143). The MD branch (video) attains F₁=0.8338 (KFE), best among comparable video SSL methods. Ensemble of CVCSPC and MD further increases F₁ to 0.5263 (KIE) and 0.8468 (KFE).
- OverheadPress: MD achieves F₁=0.4552 (Elbow) and 0.8452 (Knee), state-of-the-art by available metrics.
- Static single-frame errors (e.g., Shallow-Squat): CVCSPC yields F₁=0.8694; cross-exercise transfer (e.g., CVCSPC (SQ+OHP→BR)) yields F₁=0.6338 (Lumbar).
Weighted cross-entropy loss and explicit class balancing are essential for effective training under imbalanced error distributions.
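The reliance on F₁ rather than accuracy in the in-the-wild results follows directly from this imbalance; a small sketch makes the distinction concrete:

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts; a robust substitute for accuracy when the
    "error present" class is rare."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A degenerate "always no-error" classifier on 100 reps with 12 true errors:
# accuracy is 88%, yet F1 is 0 because it never detects a single error.
degenerate = f1_score(tp=0, fp=0, fn=12)
```

This is why a seemingly modest F₁ such as 0.5195 on rare KIE errors can still represent a substantial improvement over pose-based baselines.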
5. Generalization and Transfer Across Domains
Representations learned via CVCSPC and MD demonstrate effective transfer:
- Within Fitness-AQA, CVCSPC generalizes from BackSquat and OverheadPress pretraining to BarbellRow error detection (F₁ increases from 0.5760 to 0.6338 with multi-exercise source).
- Cross-domain transfer is demonstrated on Olympic diving quality assessment (MTL-AQA), where MD achieves a Spearman rank correlation outperforming prior self-supervised approaches and supervised baselines (Parmar et al., 2022).
A plausible implication is that harmonically-structured priors and pose-elevation synchronization may extend to other rhythmic, skill-based motion domains beyond strength training.
6. Significance Within Automated Human Movement Assessment
Gym FC, as instantiated by the domain knowledge-guided pipeline, yields:
- Substantial improvements over off-the-shelf 2D- and 3D-pose pipelines in cluttered, uncontrolled gym environments.
- Robustness to viewpoint, clothing, and environmental factors due to task-aligned self-supervision.
- Low annotation requirements, supporting small expert-annotated datasets with extensive unlabeled video for pretraining.
- Cross-modal (image, video) and cross-exercise transfer, enabling a single architecture to support multiple exercise and domain error detection tasks.
By centering pretext tasks on exercise-specific physical priors, the framework shifts automated form-correction from generic pose tracking to context-sensitive, sensorimotor error recognition (Parmar et al., 2022).