AIDEV POP Dataset Overview
- AIDEV POP Dataset comprises two distinct resources: POP909 for music arrangement and So2Sat POP for urban population estimation.
- POP909 provides aligned MIDI and audio data with detailed annotations like beat, chord, key, and tempo for deep learning in music generation.
- So2Sat POP integrates multispectral satellite imagery, DEM, and OSM features to benchmark urban population estimation using robust machine learning methods.
The term “AIDEV POP Dataset” encompasses two unrelated resources in academic literature, each denoted as “POP” within their respective communities: POP909—a benchmark for data-driven music arrangement research—and So2Sat POP—a multisource geospatial dataset for urban population estimation. Each has established significance for deep learning and data-centric research, but they address entirely different modalities and domains.
1. Overview and Scope
POP909: Music Arrangement Dataset
POP909 is a curated symbolic and audio resource targeting automatic music arrangement generation. Comprising 909 popular songs spanning ca. 1950–2010, it features multi-version professional arrangements (final qualified and intermediate variants), fully aligned MIDI and audio, as well as detailed beat, key, chord, and tempo annotations. The population of the dataset refers to songs, not people (Wang et al., 2020).
So2Sat POP: Population Estimation Dataset
So2Sat POP is an urban population estimation benchmark for 98 European cities. It integrates Sentinel-2 satellite imagery, a digital elevation model, land use, local climate zones, nighttime lights, and OpenStreetMap-derived features for over 1.1 million 1×1 km grid patches. Each patch is labeled with GEOSTAT-derived census population counts and discretized classes (Doda et al., 2022).
The term “AIDEV POP Dataset” therefore refers either to a dataset for symbolic music processing (POP909) or to a remote sensing–driven spatial population mapping resource (So2Sat POP), depending on context.
2. Data Composition and Modalities
POP909
- MIDI Modalities: Each song contains tracks for MELODY (vocal lead), BRIDGE (secondary/countermelody), and PIANO (accompaniment textures), with both final and intermediate arrangement stages available.
- Audio Modality: Studio recordings aligned with MIDI via hand-labeled, piecewise-linear tempo curves mapping MIDI ticks to wall-clock time.
- Annotations: Text files denoting beat and downbeat positions, chord labels (triad, seventh, suspended, sixth, and their inversions), and key segmentations (e.g., {C:maj, G:min}).
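The tick-to-time alignment described above can be sketched as linear interpolation between labeled anchor points. This is a minimal illustration, not the dataset's own tooling: the `(tick, seconds)` anchor format, the function name `ticks_to_seconds`, and the example values are all assumptions.

```python
from bisect import bisect_right

def ticks_to_seconds(tick, anchors):
    """Map a MIDI tick to seconds by linear interpolation between anchors.

    `anchors` is a list of (tick, seconds) pairs sorted by tick. This format
    is an assumption for illustration, not the dataset's file layout.
    """
    ticks = [t for t, _ in anchors]
    # Index of the segment whose left anchor is at or before `tick`,
    # clamped so out-of-range ticks extrapolate from the nearest segment.
    i = min(max(bisect_right(ticks, tick) - 1, 0), len(anchors) - 2)
    (t0, s0), (t1, s1) = anchors[i], anchors[i + 1]
    return s0 + (tick - t0) * (s1 - s0) / (t1 - t0)

# Hypothetical anchors: the tempo slows down after tick 960.
anchors = [(0, 0.0), (960, 1.0), (1920, 2.5)]
print(ticks_to_seconds(480, anchors))   # halfway through the first segment
print(ticks_to_seconds(1440, anchors))  # halfway through the second segment
```

Because each segment has its own slope, a tempo change only affects the ticks inside that segment, which is what makes a piecewise-linear map a natural fit for hand-labeled tempo changes.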
So2Sat POP
- Remote Sensing: Sentinel-2 (13-band, four seasonal mosaics, resampled to 10 m), TanDEM-X DEM (up-sampled to 10 m), VIIRS nighttime lights.
- Thematic Data: Local Climate Zones (17 classes), OSM land use proportions (commercial, industrial, residential, other).
- Raw and Aggregated OSM: Low-level features (node/way/relation counts by OSM tag group) and rasterized high-level building function.
- Population Labels: GEOSTAT 1 km grid; each patch carries a census population count and a discretized population class derived from that count.
| Domain | Principal Modalities | Label Type |
|---|---|---|
| POP909 | MIDI, audio, beat/chord/key annotation | Song/structural |
| So2Sat POP | Multispectral imagery, DEM, OSM, LCZ | Population count |
3. Annotation, Alignment, and Preprocessing
POP909
- Tempo Curve T(τ): Manually labeled, enabling sub-100 ms alignment between symbolic and audio modalities.
- Beat Extraction: For MIDI, the tempo curve T(τ) is combined with autocorrelation-based measure estimation and an optimal phase search; for audio, an RNN-based joint beat/downbeat tracker is used.
- Chord and Key: Chord labels by template matching (MIDI) and CQT-based transcription (audio), with >75% root-note match across >800 songs; key via framewise estimation plus median-filter smoothing.
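The root-note agreement between MIDI-derived and audio-derived chord labels can be checked with a few lines of code. The sketch below assumes both sequences are already aligned to the same beat grid and use a `Root:quality` label style (with an optional `/bass` suffix for inversions); the function names and sample labels are hypothetical, and enharmonic spellings (e.g., C# vs. Db) are not reconciled.

```python
def root_of(label):
    """Return the root of a 'Root:quality' chord label, e.g. 'C:maj' -> 'C'."""
    return label.split(":", 1)[0].split("/", 1)[0]

def root_match_rate(midi_chords, audio_chords):
    """Fraction of aligned positions whose chord roots agree."""
    if len(midi_chords) != len(audio_chords):
        raise ValueError("sequences must be aligned to the same beat grid")
    if not midi_chords:
        return 0.0
    hits = sum(root_of(a) == root_of(b)
               for a, b in zip(midi_chords, audio_chords))
    return hits / len(midi_chords)

# Hypothetical beat-aligned labels from the two annotation sources.
midi = ["C:maj", "A:min", "F:maj", "G:7"]
audio = ["C:maj7", "A:min", "D:min", "G:maj"]
print(root_match_rate(midi, audio))  # 3 of 4 roots agree
```

Comparing roots only, rather than full labels, tolerates differences in chord quality between the two extraction methods.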
So2Sat POP
- Coregistration: All raster layers reprojected to ETRS89-LAEA Europe (EPSG:3035), with spatial upsampling to 10 m from source resolutions as needed.
- Feature Extraction: OSM via Osmosis and OSMnx, raster summaries of building functions, LCZ resampled to 10 m.
- Patch Filtering: Border grid cells smaller than 900,000 m² are discarded; cells missing from the census are assigned a default population value.
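The border-cell filter can be illustrated with a simple area threshold. This is a sketch only: the dictionary keys `id` and `area_m2` and the sample cells are hypothetical, and only the 900,000 m² threshold comes from the text.

```python
MIN_AREA_M2 = 900_000  # threshold stated in the text

def keep_full_cells(cells):
    """Keep grid cells whose area reaches the threshold.

    Each cell is a dict with hypothetical keys 'id' and 'area_m2'.
    """
    return [c for c in cells if c["area_m2"] >= MIN_AREA_M2]

cells = [
    {"id": "a", "area_m2": 1_000_000},  # full 1 km x 1 km cell
    {"id": "b", "area_m2": 450_000},    # clipped by the city border
    {"id": "c", "area_m2": 900_000},    # exactly at the threshold
]
kept = keep_full_cells(cells)
print([c["id"] for c in kept])
```

Cells below the threshold are dropped rather than padded, so every retained patch covers nearly the full 1 km² that its census label describes.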
4. Data Structure, Distribution, and Licensing
POP909
- Directory Schema: Per-song folders (“001” to “909”) each containing MIDI versions, aligned audio, annotation files (beat, chord, key), and metadata index (index.csv).
- File Formats: MIDI, WAV, plain-UTF8 tab-separated annotation files.
- Licensing: Research-use license; access via GitHub (https://github.com/music-x-lab/POP909-Dataset); non-commercial use only, and users must cite the ISMIR 2020 paper (Wang et al., 2020).
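Given the per-song folder naming described above, the expected folder paths can be enumerated as follows. The root path `"POP909"` is a placeholder, and the sketch deliberately does not assume any file names inside the folders.

```python
from pathlib import Path

def song_dirs(root):
    """Yield the expected per-song folders '001' .. '909' under `root`."""
    for i in range(1, 910):
        yield Path(root) / f"{i:03d}"

dirs = list(song_dirs("POP909"))  # "POP909" is a placeholder root path
print(len(dirs), dirs[0].name, dirs[-1].name)
```

In practice one would check `path.is_dir()` on each result before reading, since a local copy may be incomplete.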
So2Sat POP
- Data Distribution: Two parts (main data/ancillaries); ∼98 GB and ∼5.2 GB, hosted on TUM MediaTUM.
- Folder Organization: City-based, each city with up to seven subfolders (seasonal imagery, LCZ, land use, VIIRS, OSM features) and a master CSV holding the population count and population class of each patch; DEM and raw OSM are in Part 2.
- Licensing: CC BY 4.0 (main), CC BY-SA 4.0 (DEM/OSM); code on GitHub (https://github.com/zhu-xlab/So2Sat-POP) (Doda et al., 2022).
5. Applications and Benchmarking
POP909
Enables:
- Unconditional symbolic composition and arrangement generation using deep models
- Conditional stimulus-response tasks, e.g., piano accompaniment generation conditioned on melody
- Score inpainting, harmonization, and expressive performance rendering
- Baseline system: GPT-2–style Transformer (6-layer, rel. positional encoding, event vocabulary for note-on/off/time-shift/velocity), reporting train/test cross-entropy (2.0898/2.3812), accuracy (0.6202/0.5453) in polyphonic music modeling.
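An event vocabulary of the kind mentioned for the baseline (note-on, note-off, time-shift, velocity) can be sketched as below. This is an illustrative encoding in the general style of performance-event representations, not the baseline's actual tokenizer: the token names, the 10 ms step, and the cap on shift length are assumptions.

```python
def notes_to_events(notes, step=0.01, max_shift_steps=100):
    """Convert (start_s, end_s, pitch, velocity) notes into event tokens."""
    # Build time-stamped boundaries; at equal times, note-offs (flag 0)
    # sort before note-ons (flag 1).
    boundaries = []
    for start, end, pitch, vel in notes:
        boundaries.append((round(start / step), 1, pitch, vel))
        boundaries.append((round(end / step), 0, pitch, vel))
    boundaries.sort()

    events, now, last_vel = [], 0, None
    for t, is_on, pitch, vel in boundaries:
        gap = t - now
        while gap > 0:  # split long gaps into bounded time-shift tokens
            shift = min(gap, max_shift_steps)
            events.append(f"TIME_SHIFT_{shift}")
            gap -= shift
        now = t
        if is_on:
            if vel != last_vel:  # emit velocity only when it changes
                events.append(f"VELOCITY_{vel}")
                last_vel = vel
            events.append(f"NOTE_ON_{pitch}")
        else:
            events.append(f"NOTE_OFF_{pitch}")
    return events

# Two consecutive half-second notes at the same velocity.
notes = [(0.0, 0.5, 60, 80), (0.5, 1.0, 64, 80)]
print(notes_to_events(notes))
```

Flattening polyphonic music into a single token stream like this is what lets a standard autoregressive Transformer model it with a cross-entropy objective.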
So2Sat POP
Enables:
- Urban population estimation using multimodal supervised ML (regression or classification)
- Baseline results: Random Forest regression on the 18-city test set, reported as out-of-bag score, RMSE, and MAE; classification (17 classes), reported as accuracy, balanced accuracy, and macro F1 (see Doda et al., 2022 for the reported values)
- Top features in baseline: OSM densities, VIIRS max radiance, LCZ majority class
Recommended split (train: 80 cities, test: 18) ensures standardized benchmarking.
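The evaluation metrics named for the baseline can be computed as in the following sketch, written with the standard library only. The sample values are made up for illustration and are not dataset results; balanced accuracy is taken as the mean per-class recall over classes present in the ground truth, and macro F1 as the unweighted mean F1 over classes present in either the ground truth or the predictions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that occur in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(t == c for t in y_true)
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(tp / support)
    return sum(recalls) / len(recalls)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes in y_true or y_pred."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up population counts and class labels, for illustration only.
counts_true, counts_pred = [100, 200, 400], [110, 190, 430]
cls_true, cls_pred = [0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 0]
print(rmse(counts_true, counts_pred), mae(counts_true, counts_pred))
print(balanced_accuracy(cls_true, cls_pred), macro_f1(cls_true, cls_pred))
```

Balanced accuracy and macro F1 weight every class equally, which matters here because sparsely populated cells dominate the label distribution.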
6. Usage Recommendations, Best Practices, and Limitations
- POP909: Use hand-labeled tempo for alignment; leverage multi-version arrangements for tasks involving arrangement refinement; cite the dataset for research use.
- So2Sat POP: Standardize all raster inputs to EPSG:3035, use full grid cropping, discard incomplete border tiles, take class imbalance into account (non-urban cell dominance), and align ancillary data to 10 m grid.
- Related Limitations: So2Sat POP faces temporal mismatch (2011 census, 2016–17 imagery), granularity limits (1 km grid), and variable census accuracy. POP909’s representational focus is Western pop piano arrangements, which may not generalize to other genres or instrumentation.
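One common way to account for class imbalance is to weight each class inversely to its frequency. The sketch below is a generic heuristic, not a procedure prescribed by the dataset authors; the function name and the toy label list are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

# Toy example: class 0 (e.g., non-urban cells) dominates.
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights, every class contributes the same total weight to a weighted loss, so the dominant class no longer drives training on its own.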
7. Broader Significance and Related Work
By providing precisely aligned, richly annotated, and professionally curated data across distinct but critical domains—symbolic music modeling and geospatial population mapping—the POP datasets have become cornerstones in their respective research communities. Both facilitate the development, standardization, and benchmarking of deep learning models for high-level semantic tasks, from arrangement generation and harmonization to urban analytics and population estimation (Wang et al., 2020, Doda et al., 2022). The deliberate design choices in annotation, coregistration, and scope allow reproducible research and robust model evaluation, supporting ongoing progress in music informatics and spatial socio-economic modeling.