
Learning to score the figure skating sports videos

Published 8 Feb 2018 in cs.MM and cs.CV (arXiv:1802.02774v3)

Abstract: This paper targets learning to score figure skating sports videos. To address this task, we propose a deep architecture with two complementary components, i.e., a Self-Attentive LSTM and a Multi-scale Convolutional Skip LSTM, which efficiently learn the local and global sequential information in each video. Furthermore, we present a large-scale figure skating sports video dataset -- the Fis-V dataset -- comprising 500 figure skating videos with an average length of 2 minutes and 50 seconds. Each video is annotated with two scores given by nine different referees, i.e., the Total Element Score (TES) and the Total Program Component Score (PCS). Our proposed model is validated on the Fis-V and MIT-skate datasets, and the experimental results show the effectiveness of our models in learning to score figure skating videos.

Citations (96)

Summary

  • The paper introduces the Fis-V dataset and two deep architectures (S-LSTM and M-LSTM) for automatically scoring professional figure skating videos.
  • The S-LSTM model uses a self-attention mechanism to extract key clip-level features, while the M-LSTM model applies multi-scale convolutions with a skip LSTM to capture both local and global temporal dependencies.
  • Experiments demonstrate state-of-the-art performance with improved Spearman correlation and lower MSE over baselines, validating the effectiveness of handling long video sequences.

This paper addresses the task of automatically scoring figure skating sports videos. Unlike standard action recognition, this task requires understanding the entire video performance, which can be long (averaging 2 minutes and 50 seconds, around 4400 frames in their dataset), and predicting continuous scores based on expert judgment (referees). The scores reflect technical execution (Total Element Score - TES) and overall performance/interpretation (Total Program Component Score - PCS).

The authors highlight several challenges: the long duration of professional sports videos compared to typical action recognition clips, the need for expert-level scoring rather than crowdsourced labels, and the fact that scores are influenced by specific technical movements and the overall program presentation, requiring models to handle both local and global temporal information.

To tackle these challenges, the paper makes three main contributions:

  1. Fis-V Dataset: A new, large-scale dataset for figure skating video analysis. It comprises 500 professional figure skating videos from high-standard international competitions (ISU Championships, Grand Prix, Olympics between 2012 and 2017). The videos are carefully selected to capture only the performance duration, pruning unrelated segments. Each video is annotated with both TES and PCS scores given by nine different international referees. The authors perform data analysis showing relatively weak correlations between TES and PCS across matches and players, suggesting they capture different aspects of performance. Fis-V is significantly larger than the existing MIT-skate dataset and provides richer annotations (TES and PCS separately vs. a single total score in MIT-skate). This dataset is intended to boost research in scoring professional sports videos.
  2. Self-Attentive LSTM (S-LSTM): This model focuses on capturing local sequential information and identifying important video clips. Given a sequence of clip-level features (extracted using a pre-trained C3D model in this work), a self-attention mechanism (implemented as a 2-layer MLP) computes a weight matrix A that assigns attention scores to each clip. This matrix is then used to compute a weighted sum of the clip features, resulting in a compact, fixed-length feature embedding M. This embedding is passed through a standard LSTM and a fully connected layer to predict the scores. The model is trained using a Mean Square Error (MSE) loss augmented with a penalty term P = ||AA^T - I||_F^2 to encourage diversity among the learned attention patterns (i.e., ideally different rows of A focus on different important aspects). The S-LSTM helps address the computational burden of processing very long sequences by summarizing the important information.
  3. Multi-scale Convolutional Skip LSTM (M-LSTM): This model is designed to capture both local and global sequential information efficiently from long videos. It applies multiple parallel 1D convolutional layers with different kernel sizes to the sequence of clip features: small kernels capture short-term dependencies (local actions), while large kernels capture long-term dependencies (the global performance flow). The outputs of these convolutional layers are then processed by LSTMs. To handle the long video sequences efficiently, the M-LSTM incorporates a revised skip LSTM structure, which uses a binary update gate u_t ∈ {0, 1} to control whether the cell state and hidden state are updated at each time step, effectively skipping less significant information. The authors modify the update rules for c_t and h_t relative to standard skip LSTMs to avoid exposing outdated memory. The final hidden states of the LSTMs at the different scales are concatenated and fed into a fully connected layer for score regression.
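The self-attentive pooling in the S-LSTM (item 2 above) can be sketched in PyTorch as follows. The hidden width and the number of attention rows are illustrative assumptions, not values reported in the paper; only the overall structure (2-layer MLP attention, weighted sum, diversity penalty) follows the description:

```python
import torch
import torch.nn as nn


class SelfAttentivePooling(nn.Module):
    """Sketch of the S-LSTM attention: a 2-layer MLP scores each clip,
    a softmax over time yields the weight matrix A, and M = A @ features
    gives a fixed-length summary of the clip sequence."""

    def __init__(self, feat_dim=4096, hidden=256, rows=20):  # hidden/rows assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, rows),
        )

    def forward(self, x):                                  # x: (batch, T, feat_dim)
        scores = self.mlp(x)                               # (batch, T, rows)
        A = torch.softmax(scores.transpose(1, 2), dim=-1)  # (batch, rows, T)
        M = torch.bmm(A, x)                                # (batch, rows, feat_dim)
        return M, A


def diversity_penalty(A):
    """P = ||A A^T - I||_F^2, pushing different attention rows apart."""
    I = torch.eye(A.size(1), device=A.device).expand(A.size(0), -1, -1)
    return ((torch.bmm(A, A.transpose(1, 2)) - I) ** 2).sum(dim=(1, 2)).mean()
```

The penalty is simply added to the MSE regression loss during training, weighted by a trade-off hyperparameter.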

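The two building blocks of the M-LSTM (item 3 above) can likewise be sketched in PyTorch. The kernel sizes, gate threshold, and state dimensions are assumptions for illustration; the paper's exact revised update rules (and the gradient estimator for the binary gate) are not reproduced here:

```python
import torch
import torch.nn as nn


class MultiScaleConv(nn.Module):
    """Parallel 1D convolutions with different kernel sizes over the clip
    sequence: small kernels see local actions, large kernels see longer
    spans of the program. Kernel sizes here are assumed, not the paper's."""

    def __init__(self, in_dim, out_dim, kernels=(3, 7, 15)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in kernels
        )

    def forward(self, x):                 # x: (batch, T, in_dim)
        x = x.transpose(1, 2)             # (batch, in_dim, T) for Conv1d
        return [c(x).transpose(1, 2) for c in self.convs]


class SkipLSTMStep(nn.Module):
    """Illustrative single step of a skip LSTM: a binary gate u_t decides
    whether the cell/hidden states are updated or carried over unchanged.
    The hard threshold below ignores the gradient estimator a trainable
    binary gate would need (e.g. straight-through) -- a simplification."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, x_t, h, c):
        u_t = (torch.sigmoid(self.gate(h)) > 0.5).float()  # u_t in {0, 1}
        h_new, c_new = self.cell(x_t, (h, c))
        # keep the new state only when the gate fires; otherwise skip the clip
        h = u_t * h_new + (1 - u_t) * h
        c = u_t * c_new + (1 - u_t) * c
        return h, c, u_t
```

In the full model, one skip LSTM runs over each convolutional scale and the final hidden states are concatenated for the regression head.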
The framework predicts TES and PCS as two independent regression tasks, formulated as weakly labeled regression since only final scores are available, not scores for individual technical movements. The video features used are 4096-dimensional C3D features from the fc6 layer, pre-trained on Sports-1M. Clip-level features are extracted using a sliding window of 16 frames with a stride of 8.
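The sliding-window clip extraction described above is straightforward to write down; this helper only computes the start indices of full windows, since the paper does not specify how a partial final window is handled:

```python
def clip_windows(num_frames, window=16, stride=8):
    """Start frames of the 16-frame clips (stride 8) fed to C3D.
    Only complete windows are kept; tail padding is unspecified in the paper."""
    return list(range(0, num_frames - window + 1, stride))


# A 2:50 performance at ~25 fps gives roughly 4250 frames, i.e. ~530 clips.
starts = clip_windows(4250)
```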

For implementation, the models are built in PyTorch and optimized with Adam. They are trained for 250 epochs on one NVIDIA 1080Ti GPU, taking about 20 minutes per model. Data augmentation consists of horizontal flipping, and dropout and batch normalization are used. The C3D features are used off-the-shelf without finetuning on the target datasets, both because of the computational cost and because the limited dataset size could lead to overfitting.
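The training setup reduces to a standard regression loop; a minimal sketch, where the Adam optimizer and 250-epoch budget follow the paper, but the regression head, learning rate, batch size, and toy data below are all assumptions:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Toy stand-in for the data loader: one batch of pooled C3D features with
# TES/PCS-style score labels (hypothetical values, for illustration only).
loader = [(torch.randn(8, 4096), torch.rand(8) * 100)]

# Hypothetical regression head standing in for either model's final layers.
model = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(256, 1))
optimizer = optim.Adam(model.parameters(), lr=1e-4)  # lr assumed
criterion = nn.MSELoss()

for epoch in range(250):
    for feats, score in loader:
        optimizer.zero_grad()
        loss = criterion(model(feats).squeeze(-1), score)
        loss.backward()
        optimizer.step()
```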

Experiments are conducted on both the proposed Fis-V dataset and the existing MIT-skate dataset. The evaluation metrics are the Spearman correlation (ρ) and the Mean Square Error (MSE). The proposed S-LSTM and M-LSTM models, both individually and combined (S-LSTM + M-LSTM), are compared against several baselines:

  • SVR with linear or RBF kernels applied to video features obtained via max or average pooling.
  • Standard LSTM and bi-directional LSTM models applied to C3D or SENet features.
  • Results from prior works [quality_action, hierarchical, parmar2017learning].
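The two evaluation metrics are easy to compute from predicted and ground-truth scores; a minimal NumPy sketch (this simple ranking ignores ties, and the paper's exact implementation is not specified):

```python
import numpy as np


def rank(a):
    """Ordinal ranks 1..n (ties not averaged -- a simplification)."""
    a = np.asarray(a, dtype=float)
    ranks = np.empty(len(a))
    ranks[a.argsort()] = np.arange(1, len(a) + 1)
    return ranks


def spearman(pred, gold):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rp, rg = rank(pred), rank(gold)
    rp, rg = rp - rp.mean(), rg - rg.mean()
    return float((rp @ rg) / np.sqrt((rp @ rp) * (rg @ rg)))


def mse(pred, gold):
    d = np.asarray(pred, dtype=float) - np.asarray(gold, dtype=float)
    return float(np.mean(d ** 2))
```

Higher ρ and lower MSE are better; ρ rewards getting the ranking of skaters right even when absolute scores are off.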

The results show that the proposed S-LSTM and M-LSTM models significantly outperform all baselines on both datasets, achieving the highest Spearman correlation and lowest MSE. The M-LSTM generally performs very well on its own. Combining S-LSTM and M-LSTM provides a further performance boost on MIT-skate and PCS prediction on Fis-V, suggesting the complementarity of modeling local (S-LSTM) and multi-scale local/global (M-LSTM) information. For TES prediction on Fis-V, the combination shows less improvement, which the authors hypothesize is because TES relies more on specific technical movements already well-captured by M-LSTM or S-LSTM alone, and combining might introduce redundancy.

The paper also includes an ablation study on the self-attention mechanism of S-LSTM, visualizing attention weights. Clips showing challenging technical movements received higher attention weights compared to less significant movements, validating the mechanism's ability to identify important segments.

In summary, the paper contributes a valuable new dataset and effective deep learning architectures (S-LSTM and M-LSTM) for the novel task of learning to score professional figure skating videos, demonstrating the importance of handling long video sequences and capturing both local technical details and global performance flow.
