
Audio Game Music Dataset Overview

Updated 27 January 2026
  • Audio game music datasets are curated corpora including audio tracks, symbolic scores, annotations, and metadata that enable multifaceted research.
  • They support diverse tasks such as symbolic composition, expressive performance mapping, emotion recognition, structural segmentation, and video-conditioned generation.
  • Challenges include hardware constraints, quantization effects, and subjective annotation practices that can impact reproducibility and model performance.

The term "audio game music dataset" refers to curated corpora comprising audio tracks, symbolic representations, annotations, and related metadata from video game soundtracks, specifically prepared for tasks in automatic music analysis, generation, segmentation, and cross-modal modeling. Such datasets enable rigorous benchmarking of machine learning systems in symbolic composition, expressive performance modeling, emotion recognition, structural segmentation, and video-conditioned music generation. Recent releases expand beyond symbolic formats to include human-annotated structure and emotion, paired video-audio clips, and synthesizer register traces, supporting multifaceted research on game music semantics.

1. Dataset Composition and Modalities

Audio game music datasets are characterized by their diversity of content modalities (audio, symbolic, annotation), corpus size, instrumentation, and associated metadata:

  • Symbolic/Expressive Datasets:
    • NES-MDB contains 5,278 multi-instrumental tracks from 397 NES titles, preserving both symbolic scores and per-voice expressive controls over four monophonic channels: Pulse 1 (P1), Pulse 2 (P2), Triangle (TR), and Noise (NO). Each track's data encodes note, velocity (4-bit integer [0–15]), and timbre (integer [0–3] or [0–1]), with a 24 Hz time grid yielding granular reconstruction of performance attributes (Donahue et al., 2018).
    • YM2413-MDB records 669 tracks from 1980s Sega/MSX games, capturing FM synthesis commands, rendered audio, and event-based MIDI representations (16 discrete instruments + drums, 48 time-shift steps/bar). Each piece is multi-labeled with 19 fine-grained emotion tags, curated via multi-stage expert and verifier protocols (Choi et al., 2022).
  • Audio-Annotation Datasets:
    • SAVGM comprises 309 post-2000 game music tracks with detailed structural segmentation annotations, covering functional, phrase-motive, and section boundaries. Boundaries are marked with perceptual change cues (e.g., repetition, timbre, rhythm), segment IDs, and functional labels; annotations are formatted as time-ordered CSV aligned with the corresponding 44.1 kHz WAV (Luo et al., 19 Jan 2026).
  • Paired Video-Audio Datasets:
    • NES-VMDB scales up to 98,940 clips (474 hours) from 389 NES games, pairing 15-second gameplay videos with matched NES-MDB symbolic soundtracks via Shazam-style fingerprinting. Each clip is linked to its source via metadata, enabling tasks in cross-modal alignment and genre-conditioned generation (Cardoso et al., 2024).
| Dataset | Tracks (Games) | Modalities | Key Annotation |
|---|---|---|---|
| NES-MDB | 5,278 (397) | Score, Expressive, Audio | Dynamics, Timbre |
| YM2413-MDB | 669 (n/a) | VGM, MIDI, Audio | Emotion (19 tags) |
| SAVGM | 309 (n/a) | Audio, Structure | Segmentation boundaries |
| NES-VMDB | 98,940 (389) | Audio, Video, MIDI | Genre, Video mapping |

The breadth of formats enables research spanning symbolic composition (modeling the score distribution P(c)), expressive mapping (modeling performance controls given a score, P(e|c)), supervised segmentation, emotion recognition, and video-conditioned music generation.
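
NES-MDB's prediction targets operate over heavily quantized per-voice state. As a concrete illustration, a single 24 Hz time step for a mutable voice might be represented and validated as below; the class and field names are assumptions for this sketch, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Per-channel value ranges from NES-MDB (Donahue et al., 2018).
VELOCITY_RANGE = range(16)  # 4-bit velocity [0-15] on P1, P2, NO
TIMBRE_RANGE = range(4)     # timbre is [0-3] on P1/P2, [0-1] on NO

@dataclass(frozen=True)
class ExpressiveFrame:
    """One 24 Hz time step for a mutable voice (P1, P2, or NO)."""
    note: int      # channel-dependent pitch/period index
    velocity: int  # quantized dynamics, 0-15
    timbre: int    # pulse duty (0-3) or noise mode (0-1)

    def __post_init__(self):
        if self.velocity not in VELOCITY_RANGE:
            raise ValueError(f"velocity {self.velocity} outside 4-bit range")
        if self.timbre not in TIMBRE_RANGE:
            raise ValueError(f"timbre {self.timbre} outside range")
```

A sequence of such frames per voice is what the expressive-mapping models in Section 4 predict given the symbolic score.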

2. Annotation Protocols and Feature Extraction

Annotation protocols vary in their dimensional focus, supporting multifactor segmentation, expressive mapping, and high-level semantics:

  • Structural Annotations (SAVGM):
    • Annotations at three levels: fine (phrases/motives), coarse (sections), and functional (game-specific tags: intro, main theme, transition, etc.).
    • Boundaries are assigned using Gestalt-inspired perceptual rules, coded into seven categories (rhythm, dynamic, timbre, pitch, harmony, regularity, repetition) and recorded as timestamped CSV entries. Segments are defined as intervals Sᵢ = [tᵢ, tᵢ₊₁), with overlapping cues for richness (Luo et al., 19 Jan 2026).
    • Feature extraction from audio uses librosa to resample to 22,050 Hz and compute 13-dimensional MFCCs, CQT magnitudes, and an onset-strength envelope per 2048-sample frame. Model inputs are the concatenated, Z-scored feature matrices (T × 98).
  • Emotion Annotations (YM2413-MDB):
    • Two annotators freely described each track; descriptions were later mapped to a fixed 19-adjective vocabulary (e.g., "tense," "cheerful," "depressed"). Each track is assigned 1–5 tags and a single "top tag," with three independent verifiers ensuring tag reliability; majority agreement was required (pairwise κ ≈ 0.72).
    • Metadata is stored in annotations.csv, mapping track IDs to all assigned tags and top tags (Choi et al., 2022).
  • Expressive Controls (NES-MDB):
    • For mutable voices (P1, P2, NO), each time step includes (note, velocity, timbre). The triangle channel (TR) carries only note number and on/off status.
    • Discrete values permit cycle-accurate audio resynthesis via NES APU emulation, with state-space ≈ 2⁴⁰ at each symbolic step.
    • File formats include raw VGM, expressive scores in JSON/CSV, separated/blended MIDI, and 44.1 kHz rendered audio (Donahue et al., 2018).
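
The SAVGM feature pipeline above can be sketched as follows. The random matrices stand in for librosa outputs (13-dim MFCC, 84 CQT bins — implied by the 98-dimensional total, 13 + 84 + 1 — and a 1-dim onset envelope); only the concatenation and per-dimension Z-scoring are shown concretely, and the function name is illustrative:

```python
import numpy as np

def assemble_features(mfcc, cqt_mag, onset_env, eps=1e-8):
    """Concatenate framewise features and Z-score each dimension.

    mfcc:      (13, T) MFCCs
    cqt_mag:   (84, T) CQT magnitudes
    onset_env: (T,)    onset strength envelope
    Returns a (T, 98) matrix: 13 + 84 + 1 = 98 dimensions.
    """
    T = min(mfcc.shape[1], cqt_mag.shape[1], onset_env.shape[0])
    feats = np.vstack([mfcc[:, :T], cqt_mag[:, :T], onset_env[None, :T]]).T
    mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

# In practice the inputs would come from librosa at sr=22050, e.g.:
#   mfcc  = librosa.feature.mfcc(y=y, sr=22050, n_mfcc=13)
#   cqt   = np.abs(librosa.cqt(y, sr=22050))
#   onset = librosa.onset.onset_strength(y=y, sr=22050)
rng = np.random.default_rng(0)
X = assemble_features(rng.normal(size=(13, 500)),
                      rng.random(size=(84, 500)),
                      rng.random(size=500))
```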

3. Data Alignment and Synchronization Methodologies

Audio game music datasets employ alignment methods ensuring rigorous cross-modal correspondence:

  • NES-VMDB Video/Audio Alignment:
    • Long-play videos for each NES game are segmented into fixed-length 15 s clips.
    • Audio extraction via FFmpeg (mono WAV, 44.1 kHz).
    • Shazam-like fingerprinting (Dejavu): spectrograms, local peak detection, time-window hash formation, and track identification. Query accuracy observed at ≈96% for music-containing clips, with false positives mainly due to sound-effect-only segments (Cardoso et al., 2024).
  • Expressive-to-Audio Alignment (NES-MDB):
    • All NES ROM audio is first logged as VGM—a register-level representation permitting exact state restoration.
    • Custom toolboxes parse VGM into symbolic and expressive scores, downsampled to 24 Hz.
    • Resynthesis is achieved by re-encoding expressive scores into register traces and emulating the APU; output WAV matches original hardware response (Donahue et al., 2018).
  • Emotion Labelling Alignment (YM2413-MDB):
    • VGM-to-MIDI and VGM-to-WAV conversions are quantified per track; tempo maps align each MIDI file to a detected downbeat grid using Madmom library (Choi et al., 2022).
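
The Shazam-style identification step used for NES-VMDB can be illustrated with a toy landmark-hashing sketch. This is not Dejavu's implementation; all parameter values (FFT size, hop, peak threshold, fan-out) are illustrative:

```python
import hashlib
import numpy as np

def spectrogram(y, n_fft=512, hop=256):
    """Magnitude spectrogram via framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * win for i in range(0, len(y) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def peak_landmarks(S, thresh=0.5):
    """Interior local maxima above a relative threshold -> (bin, frame)."""
    cutoff = thresh * S.max()
    peaks = []
    for t in range(1, S.shape[1] - 1):
        for f in range(1, S.shape[0] - 1):
            v = S[f, t]
            if v > cutoff and v == S[f - 1:f + 2, t - 1:t + 2].max():
                peaks.append((f, t))
    return peaks

def hashes(peaks, fan_out=5):
    """Pair each peak with the next few peaks; hash (f1, f2, dt)."""
    out = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1:i + 1 + fan_out]:
            h = hashlib.sha1(f"{f1}|{f2}|{t2 - t1}".encode()).hexdigest()[:20]
            out.append((h, t1))  # keep anchor time for offset voting
    return out

# Toy usage on a synthetic 440 Hz tone (8 kHz sample rate).
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
pk = peak_landmarks(spectrogram(tone))
fp = hashes(pk)
```

In a full system the `(hash, anchor_time)` pairs are stored in a database, and a query clip is identified by the track whose matched hashes agree on a consistent time offset.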

4. Baseline Models, Tasks, and Evaluation Metrics

Dataset creators provide baseline models, evaluation protocols, and quantitative results to establish performance standards:

  • Symbolic Composition (NES-MDB):
    • Models: Per-voice Unigram/Bigram (MLE), RNN Soloists, LSTM Quartet (joint prediction), DeepBach (bidirectional LSTM + Gibbs).
    • Metrics: Negative log-likelihood (NLL), accuracy, POI-based framewise evaluation (points of interest, i.e., note on/offs only).
    • DeepBach achieves NLL ≈ 0.75 bits/frame (aggregate) and POI accuracy ≈ 94%, but is slower to sample than LSTM Quartet (Donahue et al., 2018).
  • Expressive Performance Mapping (NES-MDB):
    • Models: Multinomial regression (MultiReg), LSTM Note, LSTM Note+Auto (mix of bidirectional LSTM on pitches and forward LSTM on expressive history).
    • Metrics: NLL per output stream, POI accuracy, overall accuracy.
    • LSTM Note+Auto yields NLL ≈ 3.42 bits/frame, overall accuracy ≈ 77% (Donahue et al., 2018).
  • Emotion Recognition and Generation (YM2413-MDB):
    • Emotion classification: Logistic Regression and LSTM-Attention on symbolic features reach accuracy up to 0.48 on the 4-quadrant (4Q) task and up to 0.76 on arousal; an audio-based ResNet reaches 0.65 on the 4Q task.
    • Emotion-conditioned symbolic generation via GPT-2 Transformer shows trend fidelity: "cheerful" samples exhibit denser, shorter notes, "depressed" samples sparser, longer notes; quantitative matches to training statistics (Choi et al., 2022).
  • Structural Segmentation (SAVGM):
    • Supervised CNN–RNN segmentation model trained on SAVGM and SALAMI achieves Precision = 0.512, Recall = 0.667, F₁ = 0.537 at 3 s tolerance windows, comparable to unsupervised state-of-the-art (Luo et al., 19 Jan 2026).
  • Video-Conditioned Music Generation (NES-VMDB):
    • Controllable Music Transformer (CMT): sequence model conditioned on rhythmic features extracted from input video.
    • Metrics: Grooving pattern similarity, pitch-class entropy, pitch range, polyphony.
    • CMT (conditioned) approaches human music statistics more closely in all metrics than unconditional models (e.g., groove similarity of 0.821 vs. 0.694 unconditioned and 0.999 human).
    • Genre classifier achieves 29% test accuracy (human ground truth) vs. 22% (conditioned CMT), with random at ≈ 9% (Cardoso et al., 2024).
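
The tolerance-window boundary metrics reported for SAVGM can be illustrated with a minimal sketch. Standard tooling such as mir_eval uses an optimal bipartite matching, so the greedy one-to-one matching here is a simplification:

```python
def boundary_prf(est, ref, tol=3.0):
    """Precision/recall/F1 for estimated vs. reference boundaries (seconds).

    An estimated boundary counts as a hit if it can claim an unclaimed
    reference boundary within +/- tol seconds (greedy, one-to-one).
    """
    est, ref = sorted(est), sorted(ref)
    matched = set()
    hits = 0
    for e in est:
        for j, r in enumerate(ref):
            if j not in matched and abs(e - r) <= tol:
                matched.add(j)
                hits += 1
                break
    p = hits / len(est) if est else 0.0
    r = hits / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For example, estimates at 0, 10.5, and 20 s against references at 0, 12, 30, and 40 s yield two hits at a 3 s tolerance, giving precision 2/3 and recall 1/2.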

5. Applications, Licensing, and Resource Access

Audio game music datasets address a broad spectrum of use cases and adhere to diverse licensing models.

6. Limitations and Challenges

Dataset design is shaped by practical and historical constraints:

  • Hardware-Imposed Orchestration Limits:
    • NES-MDB and NES-VMDB material is confined to the NES APU's four monophonic channels (P1, P2, TR, NO), so textures never exceed three pitched voices plus noise percussion (Donahue et al., 2018).
  • Discrete versus Continuous Control:
    • Expressive controls in NES-MDB are quantized (4-bit velocity, 2-bit timbre), which differs from acoustic instrument phrasing (Donahue et al., 2018).
  • Annotation Subjectivity and Coverage:
    • Emotion tags in YM2413-MDB rely on verifier agreement (average κ ≈ 0.72); structural segmentations in SAVGM reflect annotator perceptions, so both studies report inter-annotator reliability (Luo et al., 19 Jan 2026, Choi et al., 2022).
  • Temporal and Genre Coverage:
    • Each corpus is era-bound: NES-MDB and NES-VMDB cover 8-bit NES titles, YM2413-MDB covers 1980s Sega/MSX FM music, and SAVGM covers post-2000 tracks, so no single dataset spans the full stylistic range of game music.
  • Redundant Frames and Tempo Inference:
    • NES-MDB's fixed 24 Hz rate introduces ≈83% redundancy (most frames contain no change events); POI-based evaluation mitigates the resulting bias (Donahue et al., 2018).
    • Absence of explicit tempo data in NES datasets requires tempo inference via audio analysis or sampling at fixed grid.
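
The pairwise κ values cited for annotation agreement are Cohen's kappa. A minimal two-annotator implementation is shown below; this is illustrative, not the datasets' exact verification protocol:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if each annotator labelled independently
    # according to their own marginal label frequencies.
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    if expected == 1:
        return 1.0  # both annotators constant and identical
    return (observed - expected) / (1 - expected)
```

Values near 0 indicate chance-level agreement, 1 indicates perfect agreement; κ ≈ 0.72 is commonly read as substantial agreement.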

A plausible implication is that research requiring richer orchestration or naturalistic phrasing may need to integrate datasets beyond the current scope, and that annotation subjectivity underscores the need for reproducible labeling protocols. Increasing corpus diversity and producing open-access datasets with standardized splits would further benefit evaluation and downstream deployment.
