Road Extraction Dataset Overview
- A road extraction dataset is a curated collection of georeferenced imagery paired with machine-readable road network annotations that support automated extraction and mapping.
- It encompasses diverse geographic regions and scales with various annotation protocols, including polygonal outlines and graph-based representations for robust benchmarking.
- Evaluation metrics focus on both geometric fidelity and topological correctness, driving improvements in model performance for tasks such as map updates and navigation.
A road extraction dataset is a curated collection of georeferenced imagery paired with precise, machine-readable annotations of road networks (typically represented as polygons, graphs, or centerlines), intended for training, validating, and benchmarking computational methods for automated road extraction, network reconstruction, map update, and related tasks. These datasets vary in spatial scale, annotation granularity, and coverage diversity, and are evaluated using metrics that capture both geometric fidelity and topological correctness. The landscape encompasses datasets specialized for urban, rural, off-road, and global contexts, each with documented protocols to support reproducibility, generalizability, and utility across research and application domains.
1. Dataset Taxonomy and Scope
Road extraction datasets span diverse geographic, semantic, and technical axes. Typical categorization includes:
- Urban, Rural, Off-Road, and Global-Scale: Datasets such as Map2ImLas target high-resolution urban and suburban contexts with polygonized road surfaces, while Global-Scale and WildRoad introduce graph-based labels for both urban and rural/off-road regions, the latter extending coverage to unpaved, ambiguous, or occluded tracks (Jiao et al., 29 Apr 2025, Yin et al., 2024, Guan et al., 11 Dec 2025).
- Imagery Source and Resolution: Inputs range from orthophoto tiles at 7.5 cm GSD (Map2ImLas: 4000×4000 px) through 0.3 m/pixel high-resolution off-road images (WildRoad: 8192×4096 px) and 1 m/pixel multispectral satellite scenes (Global-Scale: 2048×2048 px) to compact binary segmentation masks (Toulouse: 64×64 px) (Belli et al., 2019).
- Region Diversity: Coverage varies from regional (Deventer, Enschede, Giethoorn in Map2ImLas), urban US cities (MUNO21), to continent-spanning (WildRoad) or global (Global-Scale) repositories, with corresponding attention to generalization across domains (Bastani et al., 2021).
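As a rough illustration of how the image sizes and resolutions quoted above translate into ground coverage, the sketch below multiplies pixel dimensions by ground sampling distance. It assumes square pixels and ignores projection distortion; the function name `footprint_m` is hypothetical and not part of any dataset's tooling.

```python
def footprint_m(width_px: int, height_px: int, gsd_m: float) -> tuple[float, float]:
    """Return the (width, height) ground footprint of an image in meters.

    Assumes square pixels with the same ground sampling distance on both axes.
    """
    return width_px * gsd_m, height_px * gsd_m


# Image sizes and resolutions as quoted in the list above (GSD in meters/pixel).
tiles = {
    "Map2ImLas": (4000, 4000, 0.075),
    "Global-Scale": (2048, 2048, 1.0),
    "WildRoad": (8192, 4096, 0.3),
}

for name, (w, h, gsd) in tiles.items():
    fw, fh = footprint_m(w, h, gsd)
    print(f"{name}: {fw:.1f} m x {fh:.1f} m per image")
```

A single Map2ImLas orthophoto therefore covers roughly 300 m on a side, while a Global-Scale scene spans about 2 km.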
2. Annotation Protocols and Representations
The structure and semantics of road annotations depend on dataset design and the intended modeling paradigm.
- Polygonal Outlines: Map2ImLas annotates full road surfaces per the Dutch BGT standard as polygons (or multipolygons) with strict boundary adherence, including internal holes for non-road islands and explicit junction handling. All vector data are stored in GeoJSON (EPSG:28992) and partitioned by input images (Jiao et al., 29 Apr 2025).
- Graph-Based Representations: Global-Scale and WildRoad express road networks as graphs $G = (V, E)$, where $V$ includes junctions, endpoints, and keypoints and $E$ encodes connectivity. Attributes may include degree, edge length, and geocoordinates, often serialized as GeoJSON or binary adjacency matrices plus coordinate arrays (Yin et al., 2024, Guan et al., 11 Dec 2025, Belli et al., 2019).
- Change-Centric Annotation: MUNO21 uniquely provides pre- and post-change OSM-derived graphs per scenario for the map update task, augmented with fine-grained change tags (e.g., Constructed, Was-Missing, Deconstructed) and time series imagery (Bastani et al., 2021).
- Quality Control: Dual-pass or multi-round manual/expert annotation, automated topological validation (removal of self-intersections, dangling edges), and targeted visual review (10–50% of samples) are standard for high-fidelity labeling (Jiao et al., 29 Apr 2025, Guan et al., 11 Dec 2025).
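To make the graph-based representation above more concrete, the following sketch builds a binary adjacency matrix plus a coordinate array for a toy network and derives the node attributes mentioned in the list (degree and edge length). The degree-based labels (1 = endpoint, 2 = keypoint, 3 or more = junction) are an assumption made for illustration, and all function names are hypothetical; none of the datasets is confirmed to use this exact scheme.

```python
import math


def adjacency_from_edges(n: int, edges: list[tuple[int, int]]) -> list[list[int]]:
    """Build a symmetric binary adjacency matrix for an undirected graph."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj


def classify_nodes(adj: list[list[int]]) -> list[str]:
    """Label each node by degree (assumed convention, see lead-in)."""
    labels = []
    for row in adj:
        degree = sum(row)
        if degree == 0:
            labels.append("isolated")
        elif degree == 1:
            labels.append("endpoint")
        elif degree == 2:
            labels.append("keypoint")
        else:
            labels.append("junction")
    return labels


def total_length(adj: list[list[int]], coords: list[tuple[float, float]]) -> float:
    """Sum the Euclidean length of every undirected edge, counted once."""
    n = len(adj)
    return sum(
        math.dist(coords[i], coords[j])
        for i in range(n)
        for j in range(i + 1, n)
        if adj[i][j]
    )


# Toy network: a T-junction at node 1 with one extra segment 3-4.
coords = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (10.0, 10.0), (10.0, 20.0)]
adj = adjacency_from_edges(5, [(0, 1), (1, 2), (1, 3), (3, 4)])
labels = classify_nodes(adj)
length = total_length(adj, coords)
print(labels, length)
```

An automated topological check of the kind listed under quality control could start from the same structure, for example by flagging isolated nodes or very short dangling edges.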
3. Quantitative Characteristics and Splits
Each dataset reports its coverage, image counts, split sizes, and annotation statistics, which supports reproducible benchmarking and statistically sound comparisons.
| Dataset | Area/Size | #Tiles/Images | Split (Train/Val/Test/OOD) | Road Poly/Graph Stats |
|---|---|---|---|---|
| Map2ImLas | ≈ 850 km² (NL) | 303 images | 3584/—/448 (Deventer), 400, 416 | ~18k/2.2k/2.3k polygons; mean 48 vertices/poly; width: 22% <3 m, 57% 3–6 m, 21% >6 m |
| Global-Scale | ~13,800 km² | 3,468 tiles | 2375/339/624/130 | Vector graph from OSM; tens of thousands of km of centerline |
| WildRoad | ~2,100 km² | 221 images | 6448/1493/1333 patches | 4,000+ km road, >11k junctions, 35k endpoints in train |
| Toulouse | ~110 m²/tile | 111,034 tiles | 80,357/11,679/18,998 | 4–9 nodes & 5–15 edges per graph; 64×64 px segmentation masks |
| MUNO21 | 6,052 km² | 39 tiles | 726/568 scenarios (train/test) | 514 change windows, 780 no-change; 948 km changed roads |
Annotation splits are typically region- or city-stratified (to avoid spatial overfitting), with explicit OOD sets (e.g., Hong Kong, Shenzhen, Lucerne for Global-Scale). Patching or tiling (e.g., 256×256 px for Map2ImLas, 1,024×1,024 px for WildRoad) enables fine-grained input to models and local performance analysis (Jiao et al., 29 Apr 2025, Guan et al., 11 Dec 2025).
4. Evaluation Metrics and Benchmarks
Road extraction evaluation encompasses geometric, structural, and operational criteria:
- Polygon Metrics (Map2ImLas):
  - Simplicity: balances IoU against vertex minimization.
  - Boundary Smoothness: promotes regular, non-jagged polygons (Jiao et al., 29 Apr 2025).
- Graph Metrics:
  - TOPO (Biagioni & Eriksson): Precision, recall, and F₁ on matched reachable vertex pairs.
  - APLS (SpaceNet): Average Path Length Similarity, computed over sampled vertex pairs; sensitive to topological continuity.
  - StreetMover (Toulouse): Entropy-regularized Earth-Mover’s distance between graphs' sampled point clouds; invariant to permutation, translation, rotation (Belli et al., 2019).
  - Edge Precision/Recall: Fraction of matched edges within spatial buffers; Edge-F₁ is critical for off-road evaluation (WildRoad, Global-Scale).
- Change and Map Update Metrics (MUNO21):
  - Improvement Score: Normalized gain in core metric (APLS or pixel-F₁) between pre- and post-change graphs.
  - Scenario-Level Precision: Fraction of no-change scenarios where the map remains unaltered post-inference (Bastani et al., 2021).
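The path-based idea behind APLS can be illustrated with a simplified, one-directional sketch. For each vertex pair (a, b) with shortest-path length L(a, b) in the ground truth, the penalty is min(1, |L(a, b) − L(a′, b′)| / L(a, b)), a missing route in the proposal counts as a penalty of 1, and the score is one minus the mean penalty. The sketch assumes that corresponding nodes share identifiers in both graphs and evaluates all pairs rather than a sample; the full metric also matches nodes spatially between the two graphs and combines both directions, which is omitted here. Function names are hypothetical.

```python
import heapq
from itertools import combinations

# node -> {neighbor: edge length}; undirected edges are stored in both directions.
Graph = dict[str, dict[str, float]]


def shortest_lengths(graph: Graph, source: str) -> dict[str, float]:
    """Dijkstra shortest-path lengths from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def apls_one_way(truth: Graph, proposal: Graph) -> float:
    """Simplified one-directional APLS over all routable ground-truth pairs."""
    penalties = []
    for a, b in combinations(sorted(truth), 2):
        l_true = shortest_lengths(truth, a).get(b)
        if not l_true:
            continue  # pair not connected in the ground truth: skipped
        l_prop = shortest_lengths(proposal, a).get(b)
        if l_prop is None:
            penalties.append(1.0)  # route missing in the proposal
        else:
            penalties.append(min(1.0, abs(l_true - l_prop) / l_true))
    if not penalties:
        raise ValueError("ground truth has no routable vertex pairs")
    return 1.0 - sum(penalties) / len(penalties)


truth = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
broken = {"A": {"B": 1.0}, "B": {"A": 1.0}, "C": {}}  # edge B-C is missing

score_same = apls_one_way(truth, truth)
score_broken = apls_one_way(truth, broken)
print(score_same, round(score_broken, 3))
```

In the toy example a single missing edge breaks two of the three routes, so the score drops to about one third even though most of the road length is present. This is the sensitivity to topological continuity noted above.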
5. Notable Datasets and Access
Map2ImLas (Jiao et al., 29 Apr 2025)
- Domain: Urban and peri-urban Netherlands; 4000×4000 px aerial orthophotos at 7.5 cm GSD.
- Annotation: Polygons/multipolygons for contiguous paved and unpaved surfaces (motorways, paths, railways, etc.), strict adherence to Dutch BGT definitions.
- Quality: Expert dual-pass, topology-checked, 10% visual review.
- Access: University of Twente open portal; CC BY 4.0.
Global-Scale (Yin et al., 2024)
- Domain: Truly global; 13,800 km², 3,468 scenes spanning urban, rural, mountainous terrain.
- Annotation: OSM-derived centerline graphs, quality-controlled, snapped to roads in satellite imagery.
- Access: https://github.com/earth-insights/samroadplus; open-access (see repository).
WildRoad (Guan et al., 11 Dec 2025)
- Domain: Off-road, unpaved, and wild environments across six continents; 8,192×4,096 px images at 0.3 m/pixel.
- Annotation: Graphs with explicit topology, produced with an interactive human-in-the-loop pipeline that is bootstrapped for efficiency and accuracy; >4,000 km of road.
- Access: https://github.com/xiaofei-guan/attorch_copy; CC BY 4.0.
Toulouse Road Network (Belli et al., 2019)
- Domain: Dense urban, compact 64×64 binary masks; road graphs in tiles ~110 m².
- Annotation: OSM-based segmentation masks and adjacency matrices; filtered for trivial/irregular graphs.
- Access: https://github.com/davide-belli/toulouse-road-network-dataset; MIT License.
MUNO21 (Bastani et al., 2021)
- Domain: Map update for 21 US cities; multi-temporal NAIP imagery and OSM graphs (2012–2019).
- Annotation: Per-scenario pre- and post-change graphs, segment-wise change tags, time-stamped.
- Access: https://favyen.com/muno21/; BSD-style open source.
6. Task Definitions and Research Benchmarks
- Polygonal Extraction: Precise reconstruction of road footprints as polygons emphasizing vertex economy and smoothness (Map2ImLas), evaluated on coverage, efficiency, and regularity.
- Graph Extraction: Direct vectorization of centerline networks, suited for navigation/generalization tasks (Global-Scale, WildRoad, Toulouse). Emphasis on graph connectivity, shortest paths, and robust handling of occlusion and discontinuity (Yin et al., 2024, Guan et al., 11 Dec 2025, Belli et al., 2019).
- Map Update: Incorporates historical context for minimum-edit updates to existing maps; the challenge centers on high-precision insertion, removal, and correction with scenario-level consistency (MUNO21).
- Model Benchmarks: Comparative studies utilize both pixel/IoU and advanced graph/path-based metrics, with baseline and state-of-the-art methods (e.g., LDPoly, SAM-Road++, MaGRoad, GGT) recorded for each dataset (Jiao et al., 29 Apr 2025, Yin et al., 2024, Guan et al., 11 Dec 2025, Belli et al., 2019).
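The scenario-level consistency requirement of the map update task can be illustrated with a small sketch: for scenarios in which nothing changed on the ground, it counts how often a method returns the existing map untouched. Representing each map as a set of undirected edges and requiring exact equality is an assumption made for illustration; the benchmark's own tooling may compare maps differently, and the function names are hypothetical.

```python
def normalize_edges(edges):
    """Treat (u, v) and (v, u) as the same undirected edge."""
    return {frozenset(edge) for edge in edges}


def scenario_level_precision(no_change_scenarios):
    """Fraction of no-change scenarios whose inferred map equals the existing map.

    Each scenario is a pair (existing_edges, inferred_edges).
    """
    scenarios = list(no_change_scenarios)
    if not scenarios:
        raise ValueError("at least one no-change scenario is required")
    unaltered = sum(
        1
        for existing, inferred in scenarios
        if normalize_edges(existing) == normalize_edges(inferred)
    )
    return unaltered / len(scenarios)


base = [("a", "b"), ("b", "c")]
scenarios = [
    (base, [("a", "b"), ("b", "c")]),              # unchanged
    (base, [("b", "a"), ("c", "b")]),              # same edges, reversed order
    (base, [("a", "b"), ("b", "c"), ("c", "d")]),  # spurious road inserted
    (base, [("a", "b"), ("b", "c")]),              # unchanged
]
precision = scenario_level_precision(scenarios)
print(precision)  # 0.75
```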
7. Practical Access, Licensing, and Recommendations
- Preprocessing: Standardized input normalization, patching/tiling scripts, and coordinate reprojection guidance are included with major dataset releases (e.g., normalize RGB values to [−1, 1] for diffusion models in Map2ImLas).
- Licensing: Public datasets are largely under permissive licenses (CC BY 4.0, MIT, BSD); OSM-based annotations comply with ODbL, and NAIP imagery is U.S. public domain (Bastani et al., 2021, Jiao et al., 29 Apr 2025).
- Supplementary Resources: Most datasets provide comprehensive splits, demonstration code (including model conditioning, baseline pipelines), and detailed manifests. Some (e.g., WildRoad) package interactive annotation tools alongside imagery and vector data.
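The normalization mentioned under preprocessing can be sketched as a linear rescaling of 8-bit RGB values to [−1, 1]. The exact mapping used by any given release is not stated here, so the formula below (value / 127.5 − 1) is an assumption, and the function names are hypothetical. The sketch uses plain Python for clarity; a real pipeline would typically apply the same arithmetic to whole arrays.

```python
def normalize_pixel(rgb: tuple[int, int, int]) -> tuple[float, ...]:
    """Map 8-bit channel values in [0, 255] linearly to [-1.0, 1.0]."""
    for channel in rgb:
        if not 0 <= channel <= 255:
            raise ValueError("channel values must lie in [0, 255]")
    return tuple(channel / 127.5 - 1.0 for channel in rgb)


def denormalize_pixel(values: tuple[float, ...]) -> tuple[int, ...]:
    """Inverse mapping back to 8-bit values, clamped to [0, 255]."""
    return tuple(min(255, max(0, round((v + 1.0) * 127.5))) for v in values)


print(normalize_pixel((0, 255, 255)))                     # (-1.0, 1.0, 1.0)
print(denormalize_pixel(normalize_pixel((12, 128, 240))))  # (12, 128, 240)
```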
Carefully labeled, geographically and semantically diverse road extraction datasets are central to advancing computational approaches to mapping, autonomy, and infrastructure analytics: they set rigorous baselines and provide the foundation for algorithms that generalize across domains and regions.