
AIDEV POP Dataset Overview

Updated 4 February 2026
  • AIDEV POP Dataset comprises two distinct resources: POP909 for music arrangement and So2Sat POP for urban population estimation.
  • POP909 provides aligned MIDI and audio data with detailed annotations like beat, chord, key, and tempo for deep learning in music generation.
  • So2Sat POP integrates multispectral satellite imagery, DEM, and OSM features to benchmark urban population estimation using robust machine learning methods.

The term “AIDEV POP Dataset” encompasses two unrelated resources in academic literature, each denoted as “POP” within their respective communities: POP909—a benchmark for data-driven music arrangement research—and So2Sat POP—a multisource geospatial dataset for urban population estimation. Each has established significance for deep learning and data-centric research, but they address entirely different modalities and domains.

1. Overview and Scope

POP909: Music Arrangement Dataset

POP909 is a curated symbolic and audio resource for automatic music arrangement generation. It comprises 909 popular songs spanning ca. 1950–2010 and provides multi-version professional arrangements (final qualified and intermediate variants), fully aligned MIDI and audio, and detailed beat, key, chord, and tempo annotations. The unit of the dataset is the song: "POP" here denotes pop music, not population (Wang et al., 2020).

So2Sat POP: Population Estimation Dataset

So2Sat POP is an urban population estimation benchmark for 98 European cities. It integrates Sentinel-2 satellite imagery, a digital elevation model, land use, local climate zones, nighttime lights, and OpenStreetMap-derived features for over 1.1 million 1×1 km grid patches. Each patch is labeled with GEOSTAT-derived census population counts and discretized classes (Doda et al., 2022).

The term “AIDEV POP Dataset” therefore refers either to a dataset for symbolic music processing (POP909) or to a remote sensing–driven spatial population mapping resource (So2Sat POP), depending on context.

2. Data Composition and Modalities

POP909

  • MIDI Modalities: Each song contains tracks for MELODY (vocal lead), BRIDGE (secondary/countermelody), and PIANO (accompaniment textures), with both final and intermediate arrangement stages available.
  • Audio Modality: Studio recordings aligned with MIDI via hand-labeled, piecewise-linear tempo curves $T(\tau)$ mapping MIDI ticks $\tau$ to wall-clock time.
  • Annotations: Text files denoting beat and downbeat positions, chord labels (triad, seventh, suspended, sixth, and their inversions), and key segmentations (e.g., {C:maj, G:min}).
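
As an illustration of the piecewise-linear tempo curve described above, the following minimal sketch maps a MIDI tick to wall-clock seconds by interpolating between annotated breakpoints. The function name `ticks_to_seconds`, the `(tick, seconds)` breakpoint layout, and the sample values are assumptions made for this example; they are not the dataset's actual file format.

```python
from bisect import bisect_right


def ticks_to_seconds(tick, breakpoints):
    """Evaluate a piecewise-linear tempo curve T(tau) at one MIDI tick.

    breakpoints: list of (tick, seconds) pairs sorted by tick
    (hypothetical layout, at least two entries with distinct ticks).
    """
    ticks = [t for t, _ in breakpoints]
    if not ticks[0] <= tick <= ticks[-1]:
        raise ValueError("tick lies outside the annotated range")
    # Index of the right-hand breakpoint of the segment containing `tick`.
    i = min(bisect_right(ticks, tick), len(ticks) - 1)
    t0, s0 = breakpoints[i - 1]
    t1, s1 = breakpoints[i]
    # Linear interpolation inside the segment [t0, t1].
    return s0 + (tick - t0) * (s1 - s0) / (t1 - t0)


# Made-up curve: the second segment is slower than the first.
curve = [(0, 0.0), (480, 0.5), (960, 1.1)]
midpoint_seconds = ticks_to_seconds(720, curve)
```

Because each segment has its own slope, a tempo change shows up as a kink in the curve rather than a jump, which keeps the tick-to-time mapping continuous.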

So2Sat POP

  • Remote Sensing: Sentinel-2 (13-band, four seasonal mosaics, resampled to 10 m), TanDEM-X DEM (up-sampled to 10 m), VIIRS nighttime lights.
  • Thematic Data: Local Climate Zones (17 classes), OSM land use proportions (commercial, industrial, residential, other).
  • Raw and Aggregated OSM: Low-level features (node/way/relation counts by OSM tag group) and rasterized high-level building function.
  • Population Labels: GEOSTAT 1 km grid; each patch carries a population count $P_\text{cell}$ and a class $C_\text{cell}$, where $C_\text{cell} = 0$ if $P_\text{cell} = 0$, and otherwise $C_\text{cell} = k+1$ for $2^k \leq P_\text{cell} < 2^{k+1}$, with $k_\text{max} = 16$.
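
The class rule in the Population Labels item can be sketched in a few lines. This is a minimal example assuming integer population counts; the function name `population_class` is hypothetical, and clamping counts at or above $2^{17}$ to the top class is an assumption, since the text only states $k_\text{max} = 16$.

```python
K_MAX = 16  # largest exponent k stated in the text


def population_class(p_cell: int) -> int:
    """Map an integer population count to its power-of-two class."""
    if p_cell < 0:
        raise ValueError("population count must be non-negative")
    if p_cell == 0:
        return 0
    # For positive integers, bit_length() - 1 equals floor(log2(p_cell)),
    # i.e. the k with 2**k <= p_cell < 2**(k + 1).
    k = p_cell.bit_length() - 1
    # Assumption: counts beyond the last bin are clamped to class K_MAX + 1.
    return min(k, K_MAX) + 1
```

Under this rule each class covers twice the population range of the previous one, so a count of 3 falls in class 2 and a count of 1024 in class 11.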
Domain     | Principal Modalities                       | Label Type
POP909     | MIDI, audio, beat/chord/key annotation     | Song/structural
So2Sat POP | Multispectral imagery, DEM, OSM, LCZ       | Population count

3. Annotation, Alignment, and Preprocessing

POP909

  • Tempo Curve $T(\tau)$: Manually labeled, enabling sub-100 ms alignment between symbolic and audio modalities.
  • Beat Extraction: For MIDI, $T(\tau)$ combined with autocorrelation-based measure estimation and an optimal phase search; for audio, an RNN-based joint beat/downbeat tracker.
  • Chord and Key: Chord labels by template matching (MIDI) and CQT-based transcription (audio), with >75% root-note match across >800 songs; key via framewise estimation plus median-filter smoothing.

So2Sat POP

  • Coregistration: All raster layers reprojected to ETRS89-LAEA Europe (EPSG:3035), with spatial upsampling to 10 m from source resolutions as needed.
  • Feature Extraction: OSM via Osmosis and OSMnx, raster summaries of building functions, LCZ resampled to 10 m.
  • Patch Filtering: Border grid cells with an area below 900,000 m² are discarded; cells absent from the census are assigned $P_\text{cell} = 0$.
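
The patch-filtering rule above can be sketched as follows. This is a minimal example under stated assumptions: the function name `label_cells`, the dictionary inputs keyed by cell identifier, and the sample identifiers are hypothetical and do not reflect the dataset's actual tooling.

```python
MIN_CELL_AREA_M2 = 900_000  # area threshold stated in the text


def label_cells(cell_areas, census_counts):
    """Return {cell_id: population} for the cells that survive filtering.

    cell_areas:    {cell_id: area in square metres inside the city boundary}
    census_counts: {cell_id: census population}; may lack some cells
    """
    labels = {}
    for cell_id, area_m2 in cell_areas.items():
        if area_m2 < MIN_CELL_AREA_M2:
            continue  # incomplete border cell: discard
        # Cells missing from the census are treated as unpopulated.
        labels[cell_id] = census_counts.get(cell_id, 0)
    return labels


labels = label_cells(
    {"cell_a": 1_000_000, "cell_b": 850_000, "cell_c": 1_000_000},
    {"cell_a": 1200, "cell_b": 30},
)
```

In this example `cell_b` is dropped for being too small even though it has a census count, while `cell_c` is kept with a population of zero.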

4. Data Structure, Distribution, and Licensing

POP909

  • Directory Schema: Per-song folders (“001” to “909”) each containing MIDI versions, aligned audio, annotation files (beat, chord, key), and metadata index (index.csv).
  • File Formats: MIDI, WAV, plain-UTF8 tab-separated annotation files.
  • Licensing: Research-use license; access via GitHub (https://github.com/music-x-lab/POP909-Dataset); non-commercial only, must cite Wang et al., ISMIR 2020 (Wang et al., 2020).

So2Sat POP

  • Data Distribution: Two parts (main data/ancillaries); ∼98 GB and ∼5.2 GB, hosted on TUM MediaTUM.
  • Folder Organization: City-based, each with up to seven subfolders (seasonal imagery, LCZ, land use, VIIRS, OSM features) and a master CSV for $P_\text{cell}$ and $C_\text{cell}$; DEM and raw OSM in Part 2.
  • Licensing: CC BY 4.0 (main), CC BY-SA 4.0 (DEM/OSM); code on GitHub (https://github.com/zhu-xlab/So2Sat-POP) (Doda et al., 2022).

5. Applications and Benchmarking

POP909

Enables:

  • Unconditional symbolic composition and arrangement generation using deep models
  • Conditional stimulus-response tasks, e.g., piano accompaniment generation conditioned on melody
  • Score inpainting, harmonization, and expressive performance rendering
  • Baseline system: GPT-2–style Transformer (6-layer, rel. positional encoding, event vocabulary for note-on/off/time-shift/velocity), reporting train/test cross-entropy (2.0898/2.3812), accuracy (0.6202/0.5453) in polyphonic music modeling.
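
To make the event vocabulary of the baseline concrete, the sketch below converts a list of notes into note-on, note-off, time-shift, and velocity events. The source names only these four event types; the 10 ms quantization step, the maximum shift length, the tuple-based token layout, and the function name `notes_to_events` are all assumptions for illustration, not the authors' tokenization.

```python
def notes_to_events(notes, step=0.01, max_shift=100):
    """Encode notes as a flat event sequence.

    notes: iterable of (onset_s, offset_s, pitch, velocity) tuples.
    step:  quantization step in seconds (assumed value).
    max_shift: longest single time-shift event, in steps (assumed value).
    """
    boundaries = []
    for onset, offset, pitch, velocity in notes:
        boundaries.append((round(onset / step), 1, pitch, velocity))
        boundaries.append((round(offset / step), 0, pitch, velocity))
    # At equal times, note-offs (flag 0) sort before note-ons (flag 1).
    boundaries.sort()

    events, now, current_velocity = [], 0, None
    for time, is_on, pitch, velocity in boundaries:
        gap = time - now
        while gap > 0:  # split long gaps into several time-shift events
            shift = min(gap, max_shift)
            events.append(("time_shift", shift))
            gap -= shift
        now = time
        if is_on:
            if velocity != current_velocity:
                events.append(("velocity", velocity))
                current_velocity = velocity
            events.append(("note_on", pitch))
        else:
            events.append(("note_off", pitch))
    return events


events = notes_to_events([(0.0, 0.5, 60, 80), (0.5, 1.0, 64, 80)])
```

A sequence model then predicts the next event given the previous ones, which is the setting in which the reported cross-entropy and accuracy are measured.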

So2Sat POP

Enables:

  • Urban population estimation using multimodal supervised ML (regression or classification)
  • Baseline results: Random Forest regression on the 18-city test set: out-of-bag $R^2 \simeq 0.864$, RMSE $= 1276.26$, MAE $= 463.35$; classification (17 classes): accuracy $= 0.5913$, balanced accuracy $= 0.3795$, macro F1 $= 0.3833$
  • Top features in baseline: OSM densities, VIIRS max radiance, LCZ majority class
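
The gap between plain accuracy and balanced accuracy in the baseline results is typical of imbalanced class distributions. The following sketch computes the three classification metrics from scratch; the function name `classification_metrics` and the toy labels are made up for illustration and are unrelated to the reported numbers. Balanced accuracy is taken as the mean per-class recall over classes present in the ground truth, and macro F1 as the unweighted mean of per-class F1 over all classes seen in either list.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, balanced_accuracy, macro_f1) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1_scores = [], []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if tp + fn:  # recall is only defined for classes present in y_true
            recalls.append(recall)
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)

    balanced_accuracy = sum(recalls) / len(recalls)
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, balanced_accuracy, macro_f1


# Toy example: the majority class dominates and one minority class is missed.
acc, bal_acc, f1 = classification_metrics([0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 0, 2])
```

In the toy example accuracy is 5/6 while balanced accuracy is only 2/3, because the single missed minority class counts as much as the well-predicted majority class.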

Recommended split (train: 80 cities, test: 18) ensures standardized benchmarking.

6. Usage Recommendations, Best Practices, and Limitations

  • POP909: Use hand-labeled tempo for alignment; leverage multi-version arrangements for tasks involving arrangement refinement; cite the dataset for research use.
  • So2Sat POP: Standardize all raster inputs to EPSG:3035, use full grid cropping, discard incomplete border tiles, take class imbalance into account (non-urban cell dominance), and align ancillary data to 10 m grid.
  • Related Limitations: So2Sat POP faces temporal mismatch (2011 census, 2016–17 imagery), granularity limits (1 km grid), and variable census accuracy. POP909’s representational focus is Western pop piano arrangements, which may not generalize to other genres or instrumentation.

The two POP datasets provide precisely aligned, richly annotated, and professionally curated data in two unrelated domains: symbolic music modeling and geospatial population mapping. Both support the development, standardization, and benchmarking of deep learning models, from arrangement generation and harmonization to urban analytics and population estimation (Wang et al., 2020; Doda et al., 2022). Their design choices in annotation, coregistration, and scope support reproducible research and consistent model evaluation in music informatics and spatial socio-economic modeling.
