Event SkiTB: Benchmark for Alpine Sports Tracking
- Event SkiTB is a dual-modality benchmark and synthetic dataset designed to evaluate skier tracking using both event-based vision and traditional RGB methods.
- The dataset is derived from multi-camera SkiTB broadcasts, simulates challenging alpine conditions via physically grounded event synthesis, and is evaluated with metrics such as mIoU, CLE, and F1 score.
- It supports advanced neuromorphic and transformer-based tracking approaches, demonstrating significant gains in handling rapid motion and static broadcast clutter.
Event SkiTB (eSkiTB) denotes both a standardized challenge for benchmarking visual tracking in alpine sports and a synthetic event-based dataset designed expressly for evaluating skier tracking under broadcast conditions featuring substantial visual clutter, abrupt motion, and multi-camera scenarios. The eSkiTB corpus, tooling, and protocol address limitations of traditional RGB tracking by leveraging event-based vision—enabling robust evaluation and development of both neuromorphic and transformer-style tracking algorithms (Vinod et al., 10 Jan 2026). The eSkiTB initiative includes the original SkiTB challenge, additional synthetic datasets, and a suite of evaluation metrics, serving as a critical benchmark for the winter-sport tracking community (Penta et al., 26 Feb 2025, Li et al., 28 Feb 2025).
1. Origins and Motivation
Visual tracking in alpine sports is impeded by characteristics such as rapid subject motion (velocities exceeding 100 km/h), occlusion, frequent scale changes, camera switches, and static broadcast overlays (e.g., scoreboards, sponsor banners). Conventional RGB-based trackers are often brittle under these conditions, given motion blur, environmental variability, and visually homogeneous backgrounds. Event-based cameras, which asynchronously capture only changes in brightness at microsecond precision, present advantages for isolating ballistic motion amidst static visual clutter.
The eSkiTB (Event SkiTB) project was conceived to furnish the first controlled, iso-informational benchmark for evaluating skier tracking using both conventional and event-based modalities. The challenge complements earlier competitions focused on RGB tracking and seeks to catalyze advances in both hardware (event cameras) and algorithmic methodologies (transformers, spiking networks).
2. Dataset Generation and Composition
2.1. Original and Synthetic Data
The underlying source material is the SkiTB RGB broadcast collection, originally containing 300 sequences (alpine, freestyle, ski jumping), resolutions up to 1280×720 px, frame rates from 25–60 Hz, totaling approximately 235 minutes. Each video may comprise up to three synchronized camera views with challenging multi-view transitions and environments (Vinod et al., 10 Jan 2026).
Synthetic event streams are generated directly from RGB frames using a v2e simulation pipeline that applies the physical event-generation model: a pixel at $(x, y)$ emits an event of polarity $p \in \{+1, -1\}$ at time $t$ whenever the log-intensity change since that pixel's last event crosses the contrast threshold,

$$\left|\log I(x, y, t) - \log I(x, y, t_{\mathrm{ref}})\right| \geq C,$$

with contrast thresholds $C_{\mathrm{ON}} = C_{\mathrm{OFF}} = 0.2$, ensuring triggering only on significant temporal luminance changes at the pixel level. This iso-informational constraint deliberately excludes neural frame interpolation, maintaining a strict correspondence of information across modalities.
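The thresholding rule above can be sketched per pixel as a simplified, frame-based approximation (real v2e additionally models sensor noise, bandwidth, and refractory effects; the function name here is illustrative):

```python
import numpy as np

def frames_to_events(frames, timestamps, c_on=0.2, c_off=0.2, eps=1e-6):
    """Emit (t, x, y, polarity) events when per-pixel log intensity
    changes by more than the contrast threshold (simplified v2e model)."""
    log_ref = np.log(frames[0].astype(np.float64) + eps)
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_f = np.log(frame.astype(np.float64) + eps)
        delta = log_f - log_ref
        # ON events: log brightness increased by at least c_on
        ys, xs = np.nonzero(delta >= c_on)
        events += [(t, x, y, +1) for x, y in zip(xs, ys)]
        # OFF events: log brightness decreased by at least c_off
        ys, xs = np.nonzero(delta <= -c_off)
        events += [(t, x, y, -1) for x, y in zip(xs, ys)]
        # advance the per-pixel reference only where events fired
        fired = np.abs(delta) >= np.where(delta >= 0, c_on, c_off)
        log_ref[fired] = log_f[fired]
    return events
```

Note that pixels whose intensity never changes produce no events at all, which is exactly the property that suppresses static broadcast overlays.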
2.2. Dataset Properties
- 300 sequences (240 train, 30 validation, 30 test), average sequence length ≈1,176 frames (≈20s at 60 Hz)
- Peak event rate above 10×10⁶ events/s; box area spanning five orders of magnitude
- Annotations: Coarse (single RGB box per frame); Dense (ms-level interpolation via cubic splines) for continuous trajectories
- Broadcast artifacts (scoreboards, text overlays, fences) are preserved in the event data, but static elements naturally result in zero events—filtering clutter without explicit pre-processing
- Motion regimes include static pan/tilt, aggressive camera panning, and multi-view segments
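The coarse-to-dense annotation step can be illustrated by cubically interpolating per-frame boxes to millisecond timestamps. A minimal sketch using a Catmull-Rom spline (the paper's exact cubic-spline formulation may differ; function names are illustrative):

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """Cubic Catmull-Rom interpolation between p1 and p2 for t in [0, 1]."""
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)

def densify_track(boxes, frame_dt_ms=16.7, step_ms=1.0):
    """Interpolate per-frame boxes (N x 4: x, y, w, h) to ms-level
    trajectories; endpoints are clamped by repeating boundary boxes."""
    boxes = np.asarray(boxes, dtype=np.float64)
    padded = np.vstack([boxes[:1], boxes, boxes[-1:]])  # clamp endpoints
    dense = []
    for i in range(len(boxes) - 1):
        p0, p1, p2, p3 = padded[i], padded[i + 1], padded[i + 2], padded[i + 3]
        for t in np.arange(0.0, frame_dt_ms, step_ms) / frame_dt_ms:
            dense.append(catmull_rom(p0, p1, p2, p3, t))
    dense.append(boxes[-1])
    return np.array(dense)
```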
3. Benchmarking Protocol and Evaluation Metrics
3.1. Tracking Tasks
Three main challenges are codified:
- Multi-camera integration: Maintaining object identity across abrupt view switches and large positional jumps
- Scale variation handling: Robustness to rapid changes in perceived skier size and orientation relative to cameras
- Rapid motion: Accurate box localization during accelerations, jumps, and complex maneuvers
3.2. Metrics
Explicit evaluation metrics, applied per sequence and discipline, include:
- Mean Intersection-over-Union (mIoU): $\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \frac{|B_i^{\mathrm{pred}} \cap B_i^{\mathrm{gt}}|}{|B_i^{\mathrm{pred}} \cup B_i^{\mathrm{gt}}|}$
- Center Location Error (CLE): $\mathrm{CLE}_i = \lVert c_i^{\mathrm{pred}} - c_i^{\mathrm{gt}} \rVert_2$, the Euclidean distance between predicted and ground-truth box centers
- CLEAR MOT metrics (adapted to single-object tracking): IDF1, MOTA, MOTP
- F₁ score as $F_1 = \frac{2PR}{P + R}$, with P (precision) and R (recall) determined at IoU = 0.5
- OPE (One Pass Evaluation): mean IoU, Precision@20 px, and Success@0.5 IoU, reported across the test split and clutter-specific subsets
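The per-frame metrics can be computed directly from box coordinates. A minimal sketch, assuming axis-aligned `(x, y, w, h)` boxes:

```python
def iou(a, b):
    """Intersection-over-Union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def cle(a, b):
    """Center Location Error: Euclidean distance between box centers."""
    dx = (a[0] + a[2] / 2) - (b[0] + b[2] / 2)
    dy = (a[1] + a[3] / 2) - (b[1] + b[3] / 2)
    return (dx * dx + dy * dy) ** 0.5

def f1_at_iou(preds, gts, thresh=0.5):
    """F1 with a prediction counted correct at IoU >= thresh; in the
    single-object setting precision and recall coincide per frame."""
    tp = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    p = r = tp / len(preds)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```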
3.3. Submission Protocol
Participants submit inference code (Docker/Conda environment), produce per-frame CSV outputs of predicted boxes and confidences, and are required to register and abide by the challenge rules, including exclusive use of SkiTB data and strict server-side test evaluation (Penta et al., 26 Feb 2025).
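A per-frame CSV output in the required shape might look like the following sketch (the column names are illustrative; the challenge specification defines the actual schema):

```python
import csv
import io

def write_predictions(rows):
    """Serialize per-frame predictions as CSV: frame id, box, confidence."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["frame", "x", "y", "w", "h", "confidence"])
    for frame, (x, y, w, h), conf in rows:
        writer.writerow([frame, x, y, w, h, f"{conf:.4f}"])
    return buf.getvalue()
```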
4. Algorithmic Baselines and Innovations
4.1. RGB-based Transformer Trackers
STARK (Spatio-Temporal Transformer for Visual Tracking) serves as a primary baseline. Its pipeline comprises a ResNet50 backbone for feature extraction, dual template embedding (initial and dynamic), a concatenated transformer encoder with sinusoidal positional encoding, and a cross-attention decoder that outputs bounding box offsets and confidences. Losses combine box regression (GIoU and ℓ₁ terms) with confidence scoring (cross-entropy); hyperparameters such as the dynamic search factor and the incremental template update are controlled by thresholds on score and box area.
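The GIoU term of the regression loss can be sketched per box pair as follows (plain Python for clarity; STARK's actual implementation operates on batched tensors):

```python
def giou_loss(a, b):
    """Generalized IoU loss, 1 - GIoU, for (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # smallest axis-aligned box enclosing both a and b
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c = cw * ch
    giou = iou - (c - union) / c if c > 0 else iou
    return 1.0 - giou
```

Unlike plain IoU, the enclosing-box penalty keeps the gradient informative even when predicted and ground-truth boxes do not overlap.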
Fine-tuned and ski-specific versions of STARK achieve mean IoU up to 0.829 (ski-specific) and an overall F1 up to 0.805. The most significant performance gains are observed on the Alpine and Jumping disciplines (+28–30% F1 over generic baselines). Ablations show incremental benefits from hierarchical score heads, dynamic search, and template update mechanisms (Penta et al., 26 Feb 2025).
4.2. Event-based Neuromorphic Trackers
SDTrack is a spiking transformer model for event data, employing a global trajectory prompt for object permanence, spiking MLP layers sensitive to temporal contrast, and voxel-grid (128×128) discretization of events. With domain-specific fine-tuning on eSkiTB event sequences, SDTrack achieves mean IoU = 0.711, outperforming generic RGB baselines especially under broadcast clutter (+20 points IoU in high-static-overlay scenarios) (Vinod et al., 10 Jan 2026).
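Voxel-grid discretization of the kind SDTrack consumes can be sketched with a simple polarity-signed count grid (SDTrack's exact binning scheme may differ; function names are illustrative):

```python
import numpy as np

def events_to_voxel_grid(events, n_bins=5, height=128, width=128):
    """Accumulate polarity-signed events (t, x, y, p) into a
    (n_bins, height, width) voxel grid over the events' time span."""
    grid = np.zeros((n_bins, height, width), dtype=np.float32)
    if not events:
        return grid
    t0, t1 = events[0][0], events[-1][0]
    span = max(t1 - t0, 1)
    for t, x, y, p in events:
        # assign each event to a temporal bin, clamping the last timestamp
        b = min(int((t - t0) * n_bins / span), n_bins - 1)
        grid[b, y, x] += p
    return grid
```

The coarse 128×128 resolution is also the source of the aliasing failure mode noted below for freestyle sequences.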
4.3. Hybrid and Re-Identification-Enhanced Trackers
ReID-SAM integrates the SAMURAI (Segment Anything Model for zero-shot tracking) pipeline with an OSNet-based person re-identification branch and equipment-aware detection (YOLOv11, with a STARK detector for multi-skier cases). Identity switches are corrected via cosine similarity of appearance embeddings, and equipment tracking is smoothed with Kalman filtering. ReID-SAM achieves F1 = 0.870 overall, with per-discipline highs of 0.903 (Alpine) and 0.919 (Jumping), surpassing both RGB and prior hybrid baselines (Li et al., 28 Feb 2025).
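The identity-correction step can be sketched as cosine-similarity matching between the tracked skier's appearance embedding and candidate detections (names and the threshold are illustrative; OSNet embeddings are assumed to be precomputed vectors):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def correct_identity(track_emb, detection_embs, sim_thresh=0.6):
    """Return the index of the detection best matching the tracked
    skier's embedding, or None if no match clears the threshold."""
    sims = [cosine_sim(track_emb, d) for d in detection_embs]
    best = int(np.argmax(sims))
    return best if sims[best] >= sim_thresh else None
```

Returning `None` lets the tracker coast on its motion model (e.g. the Kalman filter) until a confident re-identification occurs.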
5. Comparative Performance and Analysis
The following table summarizes mean IoU (and key metrics where available) across representative trackers and datasets:
| Tracker | Mean IoU | Precision@20 px | Success@0.5 IoU |
|---|---|---|---|
| STARK (generic) | 0.512 | 0.567 | 0.568 |
| STARK (fine-tuned) | 0.795 | 0.847 | 0.904 |
| STARK (ski-specific) | 0.829 | 0.887 | 0.935 |
| SDTrack (pretrained) | 0.312 | 0.354 | 0.418 |
| SDTrack (fine-tuned) | 0.711 | 0.720 | 0.873 |
In high-clutter (static overlay ≥50%) subsets, SDTrack outperforms RGB-based STARK by +20 IoU points (~0.685 vs. ~0.485 IoU). By discipline, SDTrack attains up to 0.974 IoU in Ski Jumping (exceeding STARK_ski by +5.9 points), indicating the value of event-based tracking for ballistic, rapidly moving targets.
Failure modes include (i) domain misadaptation for SDTrack in non-tuned settings (IoU = 0.312), (ii) aliasing in low-resolution voxel grids affecting freestyle sequences (SDTrack: 0.569 vs. STARK_ft: 0.728), and (iii) continued identity-switches in RGB-only trackers during heavy occlusion or multi-skier overlaps (Vinod et al., 10 Jan 2026, Penta et al., 26 Feb 2025, Li et al., 28 Feb 2025).
6. Impact, Insights, and Future Developments
eSkiTB establishes the first controlled, iso-informational cross-modality benchmark for winter sport tracking and demonstrates that event-based approaches fundamentally improve robustness to visual clutter. Key insights include:
- Static overlay filtering: Event streams eliminate fixed background artifacts without explicit masking, unlike RGB pipelines sensitive to false positives from such overlays.
- Temporal contrast exploitation: SNN layers and spiking transformers are intrinsically robust to rapid subject motion and scene panning.
- Equipment and identity-aware hybrids: Integrating segmentation, ReID, and motion-prior components yields state-of-the-art F1 and resilience to identity switching in challenging disciplines, especially Freestyle.
Recommendations for further progress include collecting real DVS datasets, developing adaptive voxelization strategies for event data, augmenting annotations with camera ID and coarse trajectory priors, and exploring end-to-end training of hybrid architectures that merge appearance, motion, and semantic cues.
eSkiTB thus catalyzes research into neuromorphic vision for sports analytics, with implications for autonomous drone tracking, performance analysis, and robust computer vision under difficult environmental conditions (Vinod et al., 10 Jan 2026, Penta et al., 26 Feb 2025, Li et al., 28 Feb 2025).