Gym FC: Form Correction Pipeline
- Gym FC is a pipeline for automated exercise form correction that uses domain-specific self-supervised pretraining and dual modalities to assess workout errors.
- It employs pose synchronization and harmonic motion disentangling to overcome challenges like occlusion, viewpoint changes, and subtle form anomalies.
- The approach achieves state-of-the-art performance on the Fitness-AQA dataset and generalizes across various exercises with minimal expert annotations.
Gym FC—here referring to Gym Form Correction pipelines for workout error detection—integrates domain knowledge-informed self-supervised learning, robust multi-modal representations, and a curated dataset for automated assessment of exercise form. The paradigm advances beyond conventional pose-based error detection by leveraging synchronized, exercise-specific video features and harmonically structured priors to address the challenges posed by occlusion, viewpoint changes, and subtle form anomalies in real gym environments (Parmar et al., 2022).
1. Self-Supervised Pretraining with Domain Knowledge
The Gym FC approach is predicated on exploiting two exercise-specific domain priors during pretraining:
- Barbell “elevation” for pose synchronization: Synchronization is achieved by detecting and tracking the vertical coordinate of the barbell (or weight) in each frame using a YOLOv3 detector, yielding a normalized parabolic trajectory. This enables frame-to-frame correspondence across videos, views, and subjects at any given elevation.
- Harmonic half-cycles for motion disentangling: The cyclical nature of most resistance exercises (e.g., squats, presses) is used to partition repetitions into two half-cycles, allowing separation of global (expected) from local (anomalous) motion.
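The elevation-synchronization prior can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the YOLOv3 detector is stubbed out as a list of per-frame barbell pixel y-coordinates, and `frames_at_elevation` is an assumed helper name.

```python
# Hypothetical sketch of elevation-based synchronization. Detector outputs
# (e.g., from YOLOv3) are stubbed as per-frame vertical barbell coordinates.

def normalize_elevation(bar_y):
    """Min-max normalize barbell height so 0 = lowest, 1 = highest point.

    Image y grows downward, so flip the sign before normalizing.
    """
    heights = [-y for y in bar_y]
    lo, hi = min(heights), max(heights)
    return [(h - lo) / (hi - lo) for h in heights]

def frames_at_elevation(bar_y, target, tol=0.05):
    """Indices of frames whose normalized elevation lies within tol of target."""
    return [i for i, e in enumerate(normalize_elevation(bar_y))
            if abs(e - target) <= tol]

# Two squat repetitions filmed from different views / subjects:
video_a = [100, 140, 180, 220, 180, 140, 100]   # barbell pixel y per frame
video_b = [300, 360, 420, 360, 300]

# Frames at the same point of the lift can now be paired across videos,
# regardless of camera distance, viewpoint, or subject.
pairs = (frames_at_elevation(video_a, 1.0), frames_at_elevation(video_b, 1.0))
```

Because both trajectories are normalized to [0, 1], a frame at elevation 0.5 in one video corresponds to the same phase of the lift in any other video, which is exactly the correspondence the contrastive pretraining below exploits.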
Two self-supervised tasks operationalize these priors:
1.1 Cross-View Cross-Subject Pose Contrastive Learning (CVCSPC)
- Triplet mining: For an anchor frame at normalized barbell elevation $e$, construct:
- Positive: a frame from any other video at elevation $e_p$ with $|e_p - e| \le \delta$ ($\delta$ small).
- Negative: a frame at elevation $e_n$ with $|e_n - e|$ substantially larger than $\delta$.
- Objective: Train ResNet-18 (ImageNet-initialized) with the “distance-ratio” contrastive loss:

$$\mathcal{L}_{\text{CVCSPC}} = -\log \frac{e^{-\lVert f_a - f_p \rVert_2}}{e^{-\lVert f_a - f_p \rVert_2} + e^{-\lVert f_a - f_n \rVert_2}},$$

where $f_a$, $f_p$, $f_n$ are the feature vectors extracted from the anchor, positive, and negative frames.
This self-supervision encourages pose-sensitive and nuisance-invariant representations, highly robust to viewpoint, clothing, and illumination variance (Parmar et al., 2022).
1.2 Self-Supervised Motion Disentangling (MD)
- Triplet mining: From one normalized repetition, half-cycle $h_1$ is the anchor, half-cycle $h_2$ is the negative, and the positive is an augmented version of $h_1$ (random spatial, appearance, and temporal changes).
- Time-reversal: Random reversal ensures all triplet members retain identical global harmonic structure; only local motion anomalies distinguish them.
- Objective: Use an R(2+1)D-18 backbone (Kinetics-initialized) trained with the distance-ratio contrastive loss over half-cycle clip features:

$$\mathcal{L}_{\text{MD}} = -\log \frac{e^{-\lVert g_a - g_p \rVert_2}}{e^{-\lVert g_a - g_p \rVert_2} + e^{-\lVert g_a - g_n \rVert_2}},$$

where $g_a$, $g_p$, $g_n$ are the clip-level features of the anchor, positive, and negative half-cycles.
This loss isolates subtle joint-level deviations (e.g., knee valgus) by discounting global body motion (Parmar et al., 2022).
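The triplet construction for motion disentangling can be illustrated on a toy 1-D "motion signal"; this is a loose sketch under stated assumptions, with a simple brightness-style jitter standing in for the paper's spatial/appearance/temporal augmentations, and the reversal policy simplified:

```python
import random

def md_triplet(repetition, seed=0):
    """Build a motion-disentangling triplet from one normalized repetition.

    Assumed simplification: the anchor is the first half-cycle, the negative
    is the time-reversed second half-cycle (so its global arc matches the
    anchor's), and the positive is a perturbed copy of the anchor. A shared
    random time reversal is then applied to all three, so global harmonic
    structure stays identical and only local motion distinguishes members.
    """
    rng = random.Random(seed)
    mid = len(repetition) // 2
    anchor = repetition[:mid]
    negative = repetition[mid:][::-1]          # reversed second half-cycle
    jitter = rng.uniform(0.9, 1.1)             # toy appearance augmentation
    positive = [frame * jitter for frame in anchor]
    if rng.random() < 0.5:                     # shared random time reversal
        anchor, positive, negative = anchor[::-1], positive[::-1], negative[::-1]
    return anchor, positive, negative
```

Since the anchor and negative share the same global trajectory by construction, any representation that separates them must encode local deviations, which is the disentangling effect the loss exploits.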
2. Network Architectures and Training Protocols
The system employs a two-branch architecture, enabling both image-based and video-based representations:
| Branch | Backbone | Input | Finetuning Strategy |
|---|---|---|---|
| CVCSPC | ResNet-18 | Single frame (224×224) | Extract a per-frame feature for each frame, aggregate with a 1D ResNet over 200 frames, classify with an FC or 1×1 conv head |
| MD | R(2+1)D-18 | 16-frame clip per half-cycle | End-to-end on two half-cycles (32 frames), MLP head (512→256→classes) |
Finetuning uses a weighted cross-entropy loss to mitigate class imbalance. Pretraining uses the Adam optimizer with batch sizes of 25 (CVCSPC) and 5 (MD), trained for 100 (CVCSPC) and 20 (MD) epochs (Parmar et al., 2022).
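The weighted cross-entropy used at finetuning time can be sketched for the binary error-detection case. The weighting scheme here (upweighting the rare positive class) is a common convention assumed for illustration, not a detail from the paper:

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Weighted binary cross-entropy for imbalanced error detection.

    pos_weight upweights the rare "error present" class; a common choice
    (assumed here) is n_negative / n_positive on the training split.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# One positive among four samples, upweighted 3x so a missed error costs
# roughly as much as the three correct negatives combined.
loss = weighted_bce([0.9, 0.1, 0.1, 0.1], [1, 0, 0, 0], pos_weight=3.0)
```

Without the weighting term, a classifier that always predicts "no error" achieves deceptively low loss on heavily imbalanced error classes.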
3. Fitness-AQA Dataset and Error Taxonomy
The Fitness-AQA dataset underpins this pipeline:
- Exercises: BackSquat (BS), OverheadPress (OHP), BarbellRow (BR).
- Unlabeled subset: ≈10,000 in-the-wild video clips per exercise (YouTube/Instagram), each auto-trimmed to a single repetition.
- Labeled subset: Expert annotations (two certified trainers), 70/15/15% train/val/test splits.
- Taxonomy:
- BackSquat: Knees Inward (KIE), Knees Forward (KFE), Convex Rounded Back (CVRB), Concave Rounded Back (CCRB), Shallow Squat (SS).
- OverheadPress: Elbow Error, Knees Error.
- BarbellRow: Lumbar Angle Error, Torso-Back Angle Error.
Each error label is encoded as an independent binary detection task, so a single repetition may exhibit multiple errors simultaneously.
4. Supervised Error Detection and Evaluation
Finetuning and evaluation leverage the learned representations for expert-labeled error detection:
- BackSquat (controlled environment): CVCSPC (image-based) achieves 95.92% accuracy on KIE, outperforming 3D-pose models such as HMR-TDM (89.80%) (Parmar et al., 2022).
- BackSquat (in-the-wild, imbalanced): CVCSPC (image) achieves F₁=0.5195 (KIE), surpassing baselines such as OpenPose-TDM (F₁=0.4143). The MD branch (video) attains F₁=0.8338 (KFE), best among comparable video SSL methods. Ensemble of CVCSPC and MD further increases F₁ to 0.5263 (KIE) and 0.8468 (KFE).
- OverheadPress: MD achieves F₁=0.4552 (Elbow) and 0.8452 (Knee), state-of-the-art by available metrics.
- Static single-frame errors (e.g., Shallow-Squat): CVCSPC yields F₁=0.8694; cross-exercise transfer (e.g., CVCSPC (SQ+OHP→BR)) yields F₁=0.6338 (Lumbar).
Weighted cross-entropy loss and explicit class balancing are essential for effective training under imbalanced error distributions.
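The reliance on F₁ rather than accuracy in the in-the-wild results follows directly from this imbalance; a small sketch makes the distinction concrete:

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts; a robust substitute for accuracy when the
    "error present" class is rare."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A degenerate "always no-error" classifier on 100 reps with 12 true errors:
# accuracy is 88%, yet F1 is 0 because it never detects a single error.
degenerate = f1_score(tp=0, fp=0, fn=12)
```

This is why a seemingly modest F₁ such as 0.5195 on rare KIE errors can still represent a substantial improvement over pose-based baselines.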
5. Generalization and Transfer Across Domains
Representations learned via CVCSPC and MD demonstrate effective transfer:
- Within Fitness-AQA, CVCSPC generalizes from BackSquat and OverheadPress pretraining to BarbellRow error detection (F₁ increases from 0.5760 to 0.6338 with multi-exercise source).
- Cross-domain transfer is demonstrated on Olympic diving quality assessment (MTL-AQA), where MD achieves a Spearman rank correlation outperforming prior self-supervised approaches and supervised baselines (Parmar et al., 2022).
A plausible implication is that harmonically-structured priors and pose-elevation synchronization may extend to other rhythmic, skill-based motion domains beyond strength training.
6. Significance Within Automated Human Movement Assessment
Gym FC, as instantiated by the domain knowledge-guided pipeline, yields:
- Substantial improvements over off-the-shelf 2D- and 3D-pose pipelines in cluttered, uncontrolled gym environments.
- Robustness to viewpoint, clothing, and environmental factors due to task-aligned self-supervision.
- Low annotation requirements, supporting small expert-annotated datasets with extensive unlabeled video for pretraining.
- Cross-modal (image, video) and cross-exercise transfer, enabling a single architecture to support multiple exercise and domain error detection tasks.
By centering pretext tasks on exercise-specific physical priors, the framework shifts automated form-correction from generic pose tracking to context-sensitive, sensorimotor error recognition (Parmar et al., 2022).