
Gym FC: Form Correction Pipeline

Updated 7 February 2026
  • Gym FC is a pipeline for automated exercise form correction that uses domain-specific self-supervised pretraining and dual modalities to assess workout errors.
  • It employs pose synchronization and harmonic motion disentangling to overcome challenges like occlusion, viewpoint changes, and subtle form anomalies.
  • The approach achieves state-of-the-art performance on the Fitness-AQA dataset and generalizes across various exercises with minimal expert annotations.

Gym FC—here referring to Gym Form Correction pipelines for workout error detection—integrates domain knowledge-informed self-supervised learning, robust multi-modal representations, and a curated dataset for automated assessment of exercise form. The paradigm advances beyond conventional pose-based error detection by leveraging synchronized, exercise-specific video features and harmonically structured priors to address the challenges posed by occlusion, viewpoint changes, and subtle form anomalies in real gym environments (Parmar et al., 2022).

1. Self-Supervised Pretraining with Domain Knowledge

The Gym FC approach is predicated on exploiting two exercise-specific domain priors during pretraining:

  • Barbell “elevation” for pose synchronization: Synchronization is achieved by detecting and tracking the vertical coordinate of the barbell (or weight) in each frame using a YOLOv3 detector, yielding a normalized parabolic trajectory. This enables frame-to-frame correspondence across videos, views, and subjects at any given elevation.
  • Harmonic half-cycles for motion disentangling: The cyclical nature of most resistance exercises (e.g., squats, presses) is used to partition repetitions into two half-cycles, allowing separation of global (expected) from local (anomalous) motion.
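The elevation-based synchronization prior can be sketched in a few lines. This is an illustrative helper, not the paper's implementation: it assumes per-frame barbell vertical coordinates are already available from a detector (the paper uses YOLOv3, which is not shown here), and it only handles normalizing one repetition's trajectory so frames from different videos can be matched at the same elevation.

```python
import numpy as np

def normalize_elevation(bar_y):
    """Map per-frame barbell vertical coordinates (image pixels, where y
    grows downward) to a normalized elevation in [0, 1] over one repetition.
    Detection itself (e.g., via YOLOv3) is assumed to have happened upstream."""
    bar_y = np.asarray(bar_y, dtype=float)
    # Invert image coordinates so larger values mean higher elevation.
    height = bar_y.max() - bar_y
    rng = height.max() - height.min()
    return (height - height.min()) / rng if rng > 0 else np.zeros_like(height)

def frames_at_elevation(elev, target, tol=0.05):
    """Indices of frames whose normalized elevation lies within `tol` of
    `target` -- usable as cross-video, cross-subject correspondences."""
    return np.where(np.abs(np.asarray(elev) - target) <= tol)[0]
```

Two squat videos shot from different cameras and subjects can then be put into frame-to-frame correspondence by querying both at the same normalized elevation.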

Two self-supervised tasks operationalize these priors:

1.1 Cross-View Cross-Subject Pose Contrastive Learning (CVCSPC)

  • Triplet mining: For an anchor frame $I_a$ at elevation $\epsilon(I_a)$, construct:
    • Positive: $I_+$ from any other video with $|\epsilon(I_+) - \epsilon(I_a)| \leq \Delta$ (with $\Delta$ small).
    • Negative: $I_-$ with $|\epsilon(I_-) - \epsilon(I_a)| \geq \delta$, typically $\delta = 30^\circ$.
  • Objective: Train a ResNet-18 (ImageNet-initialized) with the "distance-ratio" contrastive loss:

$$\mathcal{L}_{\text{pose}} = -\log \frac{\exp(-\|\phi_a-\phi_+\|_2)}{\exp(-\|\phi_a-\phi_+\|_2)+\exp(-\|\phi_a-\phi_-\|_2)}$$

where $\phi_a$, $\phi_+$, $\phi_-$ are the extracted feature vectors of the anchor, positive, and negative frames.

This self-supervision encourages pose-sensitive and nuisance-invariant representations, highly robust to viewpoint, clothing, and illumination variance (Parmar et al., 2022).
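The mining rule and the distance-ratio loss above can be sketched directly. This is a minimal NumPy illustration: the loss follows the formula in Section 1.1, while the threshold values in `mine_triplet` are illustrative stand-ins (the paper states its thresholds in elevation-angle terms, e.g. $\delta = 30^\circ$, not in the normalized units used here).

```python
import numpy as np

def distance_ratio_loss(phi_a, phi_p, phi_n):
    """Distance-ratio contrastive loss:
    -log( exp(-d(a,p)) / (exp(-d(a,p)) + exp(-d(a,n))) ),
    with d the L2 distance between feature vectors."""
    d_ap = np.linalg.norm(phi_a - phi_p)
    d_an = np.linalg.norm(phi_a - phi_n)
    return -np.log(np.exp(-d_ap) / (np.exp(-d_ap) + np.exp(-d_an)))

def mine_triplet(elev_a, elevs_other, delta=0.03, min_gap=0.3):
    """Pick from another video a positive (elevation within `delta` of the
    anchor's) and a negative (elevation at least `min_gap` away).
    Threshold values here are illustrative, not the paper's exact settings."""
    elevs_other = np.asarray(elevs_other)
    pos = np.where(np.abs(elevs_other - elev_a) <= delta)[0]
    neg = np.where(np.abs(elevs_other - elev_a) >= min_gap)[0]
    return (pos[0] if len(pos) else None, neg[0] if len(neg) else None)
```

Note that when the positive is closer to the anchor than the negative, the loss falls below $\log 2$; when both are equidistant, it equals $\log 2$ exactly, so minimizing it pulls same-pose frames together in feature space.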

1.2 Self-Supervised Motion Disentangling (MD)

  • Triplet mining: From one normalized repetition, half-cycle $H_1$ is the anchor, $H_2$ the negative, and $H_+$ an augmented version of $H_1$ (random spatial, appearance, and temporal changes).
  • Time-reversal: Random reversal ensures all triplet members retain identical global harmonic structure; only local motion anomalies distinguish them.
  • Objective: Train an R(2+1)D-18 backbone (Kinetics-initialized) with the analogous loss:

$$\mathcal{L}_{\text{motion}} = -\log \frac{\exp(-\|\psi_1-\psi_+\|_2)}{\exp(-\|\psi_1-\psi_+\|_2)+\exp(-\|\psi_1-\psi_2\|_2)}$$

This loss isolates subtle joint-level deviations (e.g., knee valgus) by discounting global body motion (Parmar et al., 2022).
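The half-cycle triplet construction can be sketched as follows. This is a toy illustration under stated simplifications: a circular temporal shift stands in for the paper's full set of spatial, appearance, and temporal augmentations, and the repetition is assumed to be already trimmed and normalized.

```python
import numpy as np

def make_md_triplet(rep_frames, rng=None):
    """Build a motion-disentangling triplet from one normalized repetition.

    `rep_frames` has shape (T, ...) and spans a full cycle.  The anchor is
    the first half-cycle, the negative the second, and the positive an
    augmented copy of the anchor.  A random time reversal of the whole
    repetition preserves the shared global harmonic structure, so only
    local motion anomalies distinguish the triplet members."""
    rng = np.random.default_rng(0) if rng is None else rng
    rep_frames = np.asarray(rep_frames)
    if rng.random() < 0.5:          # random time reversal of the repetition
        rep_frames = rep_frames[::-1]
    half = len(rep_frames) // 2
    anchor = rep_frames[:half]
    negative = rep_frames[half:2 * half]
    # Positive: circular temporal shift as a stand-in augmentation.
    positive = np.roll(anchor, shift=1, axis=0)
    return anchor, positive, negative
```

The resulting anchor/positive/negative clips feed the $\mathcal{L}_{\text{motion}}$ loss from Section 1.2.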

2. Network Architectures and Training Protocols

The system employs a two-branch architecture, enabling both image-based and video-based representations:

  • CVCSPC branch: ResNet-18 backbone; input: single frame (224×224); output feature: $\phi \in \mathbb{R}^{512}$; finetuning: extract $\phi_t$ for each frame $t$, aggregate with a 1D ResNet over 200 frames, classify with an FC or 1×1 conv head.
  • MD branch: R(2+1)D-18 backbone; input: 16-frame clip per half-cycle; output feature: $\psi \in \mathbb{R}^{512}$; finetuning: end-to-end on two half-cycles (32 frames), MLP head (512→256→classes).
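The CVCSPC finetuning path, per-frame features aggregated temporally before classification, can be sketched in miniature. This is a toy stand-in for the paper's 1D-ResNet aggregator: a single 1D convolution over time, ReLU, global average pooling, and a linear head, with all weights supplied by the caller.

```python
import numpy as np

def aggregate_frame_features(phis, kernel, w_out):
    """Toy temporal aggregator over per-frame features.

    phis   : (T, D) per-frame feature vectors (e.g., T=200, D=512)
    kernel : (K, D, H) 1D convolution weights over the time axis
    w_out  : (H, C) linear classifier weights
    Returns class logits of shape (C,)."""
    T, D = phis.shape
    K, _, H = kernel.shape
    # Valid 1D convolution along time: one (H,) activation per window.
    conv = np.stack([np.tensordot(phis[t:t + K], kernel, axes=([0, 1], [0, 1]))
                     for t in range(T - K + 1)])      # (T-K+1, H)
    pooled = np.maximum(conv, 0).mean(axis=0)         # ReLU + global pool
    return pooled @ w_out                             # class logits (C,)
```

A real 1D ResNet adds residual blocks and learned normalization, but the shape bookkeeping (time axis convolved, feature axis contracted, pooling before the head) is the same.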

Finetuning uses a weighted cross-entropy loss with class weights $w_c \propto 1/N_c$ to mitigate class imbalance. Pretraining uses the Adam optimizer (learning rate $1\times10^{-4}$) with batch sizes of 25 (CVCSPC) and 5 (MD), for 100 (CVCSPC) and 20 (MD) epochs (Parmar et al., 2022).
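The inverse-frequency weighting $w_c \propto 1/N_c$ and the weighted cross-entropy it feeds can be written out explicitly. This is a minimal NumPy sketch of the stated scheme; the normalization of the weights is a common convention, not something the source specifies.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Class weights w_c proportional to 1/N_c, normalized so that they
    sum to num_classes (normalization choice is illustrative)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    w = 1.0 / np.maximum(counts, 1)        # guard against empty classes
    return w * num_classes / w.sum()

def weighted_cross_entropy(logits, labels, weights):
    """Mean weighted cross-entropy over a batch of logits (N, C)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * per_sample))
```

With a 3:1 imbalance, the minority class receives three times the weight of the majority class, so rare form errors are not drowned out during finetuning.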

3. Fitness-AQA Dataset and Error Taxonomy

The Fitness-AQA dataset underpins this pipeline:

  • Exercises: BackSquat (BS), OverheadPress (OHP), BarbellRow (BR).
  • Unlabeled subset: ≈10,000 in-the-wild video clips per exercise (YouTube/Instagram), each auto-trimmed to a single repetition.
  • Labeled subset: Expert annotations (two certified trainers), 70/15/15% train/val/test splits.
  • Taxonomy:
    • BackSquat: Knees Inward (KIE), Knees Forward (KFE), Convex Rounded Back (CVRB), Concave Rounded Back (CCRB), Shallow Squat (SS).
    • OverheadPress: Elbow Error, Knees Error.
    • BarbellRow: Lumbar Angle Error, Torso-Back Angle Error.

Each error label is encoded as an independent binary detection target.
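The taxonomy-to-target mapping is a straightforward multi-label binarization; the sketch below mirrors the error lists in Section 3 (the dictionary keys and helper name are illustrative).

```python
# Each exercise has its own error vocabulary; every label becomes an
# independent binary target (taxonomy from the Fitness-AQA description).
ERROR_TAXONOMY = {
    "BackSquat": ["KIE", "KFE", "CVRB", "CCRB", "SS"],
    "OverheadPress": ["Elbow", "Knees"],
    "BarbellRow": ["LumbarAngle", "TorsoBackAngle"],
}

def encode_errors(exercise, present_errors):
    """One independent binary indicator per error type for the exercise."""
    present = set(present_errors)
    return [1 if e in present else 0 for e in ERROR_TAXONOMY[exercise]]
```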

4. Supervised Error Detection and Evaluation

Finetuning and evaluation leverage the learned representations for expert-labeled error detection:

  • BackSquat (controlled environment): CVCSPC (image-based) achieves 95.92% accuracy on KIE, outperforming 3D-pose baselines such as HMR-TDM (89.80%) (Parmar et al., 2022).
  • BackSquat (in-the-wild, imbalanced): CVCSPC (image) achieves F₁=0.5195 (KIE), surpassing baselines such as OpenPose-TDM (F₁=0.4143). The MD branch (video) attains F₁=0.8338 (KFE), best among comparable video SSL methods. Ensemble of CVCSPC and MD further increases F₁ to 0.5263 (KIE) and 0.8468 (KFE).
  • OverheadPress: MD achieves F₁=0.4552 (Elbow) and 0.8452 (Knee), state-of-the-art by available metrics.
  • Static single-frame errors (e.g., Shallow-Squat): CVCSPC yields F₁=0.8694; cross-exercise transfer (e.g., CVCSPC (SQ+OHP→BR)) yields F₁=0.6338 (Lumbar).

Weighted cross-entropy loss and explicit class balancing are essential for effective training under imbalanced error distributions.

5. Generalization and Transfer Across Domains

Representations learned via CVCSPC and MD demonstrate effective transfer:

  • Within Fitness-AQA, CVCSPC generalizes from BackSquat and OverheadPress pretraining to BarbellRow error detection (F₁ increases from 0.5760 to 0.6338 with multi-exercise source).
  • Cross-domain transfer is demonstrated on Olympic diving quality assessment (MTL-AQA), where MD achieves Spearman $\rho = 0.7763$, outperforming prior self-supervised approaches and supervised baselines (Parmar et al., 2022).

A plausible implication is that harmonically-structured priors and pose-elevation synchronization may extend to other rhythmic, skill-based motion domains beyond strength training.

6. Significance Within Automated Human Movement Assessment

Gym FC, as instantiated by the domain knowledge-guided pipeline, yields:

  • Substantial improvements over open set 2D- or 3D-pose pipelines in cluttered, uncontrolled gym environments.
  • Robustness to viewpoint, clothing, and environmental factors due to task-aligned self-supervision.
  • Low annotation requirements, supporting small expert-annotated datasets with extensive unlabeled video for pretraining.
  • Cross-modal (image, video) and cross-exercise transfer, enabling a single architecture to support multiple exercise and domain error detection tasks.

By centering pretext tasks on exercise-specific physical priors, the framework shifts automated form-correction from generic pose tracking to context-sensitive, sensorimotor error recognition (Parmar et al., 2022).
