TartanAir: A Synthetic Dataset for Visual SLAM
- TartanAir is a large-scale, photorealistic synthetic dataset designed to benchmark and train visual SLAM algorithms with pixel-accurate, multi-modal sensor data.
- It leverages Unreal Engine environments to simulate diverse conditions such as dynamic weather, lighting changes, and complex 6-DoF motions that overcome the limitations of traditional benchmarks.
- Researchers can utilize its extensive annotations and rigorous verification pipeline to evaluate both classical geometric and deep-learning based SLAM approaches under challenging real-world scenarios.
TartanAir is a large-scale, photorealistic synthetic dataset designed to challenge and advance the development of visual SLAM (Simultaneous Localization and Mapping) and related data-driven methods in robotics. Collected entirely in Unreal Engine–based environments, TartanAir offers multi-modal sensor streams, pixel-accurate ground truth labels, and extreme scene diversity across dynamic, weather-rich, and illumination-varying conditions. Its extensible annotations and active-stereo supplement provide a comprehensive benchmark and training corpus for both classical geometric and deep-learned perception algorithms (Wang et al., 2020, Warburg et al., 2021).
1. Motivation and Dataset Scope
The primary motivation for TartanAir stems from the limitations in established SLAM benchmarks (e.g., KITTI, TUM RGB-D, EuRoC), where scenarios are predominantly static, motion is simplistic (mainly ground vehicles), and environmental dynamics such as weather, lighting, and moving objects are limited. Data acquisition in these benchmarks is further hindered by cost, logistics, and restricted scene diversity. Synthetic datasets, while helpful for scale and annotation, often suffer from low diversity and sim-to-real transfer issues.
TartanAir seeks to overcome these shortcomings by offering:
- A massive, multi-modal synthetic dataset
- Rich diversity in scene type, layout, appearance, and dynamic events
- Precise, pixel-level ground truth, including pose, depth, and dense/semantic information
- Support for extreme and rare phenomena (dynamic weather, moving obstacles, varied lighting)
- Utility as both a performance benchmark and a large-scale training set
TartanAir consists of 1,037 trajectories and over 1,000,000 frames spanning 30 distinct simulated environments. This scale is an order of magnitude larger than datasets such as KITTI (≈22 sequences, ∼39,000 stereo frames) (Wang et al., 2020, Warburg et al., 2021).
2. Simulation Environments, Motion Patterns, and Variations
TartanAir encompasses 30 Unreal Engine environments grouped into six semantic categories: urban, rural, nature, domestic, public, and sci-fi. Environments feature indoor (corridors, houses, factories) and outdoor (streets, forests, underwater) scenes, with both structured and unstructured layouts (e.g., engineered roads versus rocky trails).
Key environmental and dynamic variations include:
- Moving objects (human avatars, vehicles, machinery, animals, natural motion such as foliage)
- Lighting conditions (day/night cycles, lens flares, dynamic spotlights, area lighting, auto-exposure)
- Weather and seasonal changes (rain, fog, snow, storms)
- Diverse camera motion: three difficulty modes (Easy: yaw/forward; Medium/Hard: full 6-DoF, roll/pitch, high-speed segments)
This high variability enables evaluation and training for algorithms under motion, appearance, and photometric conditions that are infeasible in existing physical datasets.
| Environment Properties | Example Values | Configurable Dynamics |
|---|---|---|
| Number of scenes | 30 | Moving objects, weather, lighting |
| Terrain types | Flat, stairs, hills | Day/night, fog, rain, snow |
| Motion patterns | Yaw/forward, 6-DoF | Loop closures, roll/pitch, high speed |
3. Sensor Modalities and Ground-Truth Labeling
TartanAir provides synchronized streams at each time step, including:
- Stereo RGB images (left/right)
- Depth images (per-pixel, linear depth, left camera)
- Semantic segmentation masks
- Dense optical flow
- Stereo disparity maps
- Simulated LiDAR point clouds (e.g., 32-beam)
- 6-DoF camera pose (ground truth, left camera-centric)
- Optional: occupancy maps, instance masks
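The synchronized streams above amount to one multi-modal sample per time step. A minimal schema sketch (field names, dtypes, and shapes are illustrative, not the dataset's on-disk layout):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TartanAirFrame:
    """One synchronized time step. Names/shapes are illustrative only."""
    rgb_left: np.ndarray    # (H, W, 3) uint8
    rgb_right: np.ndarray   # (H, W, 3) uint8
    depth_left: np.ndarray  # (H, W) float32, metres, left camera
    seg_left: np.ndarray    # (H, W) int32 semantic labels
    flow: np.ndarray        # (H, W, 2) float32, flow to next frame
    disparity: np.ndarray   # (H, W) float32
    pose_left: np.ndarray   # (4, 4) SE(3), left-camera-centric

H, W = 512, 640
frame = TartanAirFrame(
    rgb_left=np.zeros((H, W, 3), np.uint8),
    rgb_right=np.zeros((H, W, 3), np.uint8),
    depth_left=np.ones((H, W), np.float32),
    seg_left=np.zeros((H, W), np.int32),
    flow=np.zeros((H, W, 2), np.float32),
    disparity=np.zeros((H, W), np.float32),
    pose_left=np.eye(4),
)
```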
The sensor rig is defined by:
- Stereo cameras: 640×512 px, 30 Hz, calibrated intrinsics and extrinsics, baseline ≈ 0.1–0.2 m (Warburg et al., 2021)
- Monocular (left) camera co-located with left stereo view
- IMU (200 Hz, accelerometer and gyroscope), rigidly mounted to the camera
Mathematical models:
- Pose transforms use SE(3): $T = \begin{bmatrix} R & t \\ \mathbf{0}^\top & 1 \end{bmatrix}$, with rotation $R \in SO(3)$ and translation $t \in \mathbb{R}^3$
- Stereo pinhole projection for pixel mapping: $u = f_x\,x/z + c_x$, $\;v = f_y\,y/z + c_y$
- Disparity–depth relationship: $d = u_L - u_R = f_x\,b/z$, where $b$ is the stereo baseline
- Optical flow and LiDAR are derived from rendered depth and camera pose changes
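As a concrete check of the projection and disparity relations, a small NumPy sketch (the intrinsics and baseline are illustrative values, not TartanAir's published calibration):

```python
import numpy as np

def project(points_cam, fx, fy, cx, cy):
    """Project 3-D camera-frame points to pixels (pinhole model)."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def disparity_to_depth(disparity, fx, baseline):
    """Invert the stereo relation d = fx * b / z."""
    return fx * baseline / disparity

# Illustrative calibration (assumed, for the sketch only).
fx = fy = 320.0
cx, cy = 320.0, 256.0
baseline = 0.25  # metres, assumed

pts = np.array([[0.5, -0.2, 2.0], [1.0, 0.4, 4.0]])
uv = project(pts, fx, fy, cx, cy)

depth = pts[:, 2]
disp = fx * baseline / depth                      # forward relation
depth_back = disparity_to_depth(disp, fx, baseline)
assert np.allclose(depth, depth_back)             # round trip is exact
```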
The “Active TartanAir” supplement (Warburg et al., 2021) extends TartanAir with:
- Projected-pattern IR images in “interleaved” mode (D435i-style), for active-stereo research
- Semi-dense depth maps (OpenCV SGM output) on pattern-on IR pairs
- SLAM-based 3D landmarks and per-frame sparse projections
- Extended calibration (including projector–camera transforms)
- Complete time synchronization across all streams and sensor types
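The semi-dense depths come from OpenCV's semi-global matcher; as a self-contained stand-in for the underlying disparity search, here is a toy per-row SAD block matcher in NumPy (no smoothness term or sub-pixel refinement, so it is not SGM itself, only the matching idea):

```python
import numpy as np

def block_match_row(left_row, right_row, block=5, max_disp=16):
    """Toy per-row disparity by SAD over a 1-D block.
    Stand-in for SGM: brute-force cost search, no regularization."""
    half = block // 2
    w = left_row.shape[0]
    disp = np.zeros(w, np.float32)
    for u in range(half + max_disp, w - half):
        patch = left_row[u - half:u + half + 1]
        costs = [np.abs(patch - right_row[u - d - half:u - d + half + 1]).sum()
                 for d in range(max_disp)]
        disp[u] = int(np.argmin(costs))
    return disp

# Synthetic pair: the right view sees the scene shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.integers(0, 255, 200).astype(np.float32)
right = np.roll(left, -3)            # right[u] = left[u + 3]
disp = block_match_row(left, right, max_disp=8)
```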
4. Data Generation, Processing, and Verification Pipeline
TartanAir’s data pipeline is fully automated, encompassing:
a) Incremental Mapping:
Octree-based occupancy grids (0.25 m) are built using simulated depth and pose. Autonomous exploration uses frontiers and RRT* planning for efficient coverage and collision-free mapping. Approximately one hour is required per 100×100×10 m scene.
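The frontier step can be illustrated on a 2-D grid (the actual pipeline operates on a 3-D octree at 0.25 m resolution; this is a simplified sketch):

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def find_frontiers(grid):
    """Frontier cells: known-free cells with at least one 4-connected
    unknown neighbour. Exploration targets are sampled from these."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    near_unknown = np.zeros_like(free)
    near_unknown[1:, :] |= unknown[:-1, :]   # unknown above
    near_unknown[:-1, :] |= unknown[1:, :]   # unknown below
    near_unknown[:, 1:] |= unknown[:, :-1]   # unknown left
    near_unknown[:, :-1] |= unknown[:, 1:]   # unknown right
    return free & near_unknown

grid = np.full((6, 6), UNKNOWN)
grid[:3, :] = FREE          # explored upper half
grid[2, 4] = OCCUPIED       # one wall cell on the boundary
frontiers = find_frontiers(grid)
```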
b) Trajectory Sampling:
Trajectories are sampled in free space, connected via RRT*, refined to yield loop closures, and smoothed to ensure physically plausible, diverse motion (translation ± [0.2–0.5] m, rotation ± [3–10]°). Difficulty levels are assigned by allowed DoF and velocity.
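The per-step motion bound can be enforced by densifying sampled waypoints, sketched here for translation only (rotation bounds would be handled analogously; 0.5 m is the upper end of the stated translation range):

```python
import numpy as np

def densify(waypoints, max_step=0.5):
    """Insert intermediate points so consecutive positions differ by at
    most max_step metres, mirroring the per-step translation bound."""
    out = [waypoints[0]]
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        dist = np.linalg.norm(b - a)
        n = max(1, int(np.ceil(dist / max_step)))
        for k in range(1, n + 1):
            out.append(a + (b - a) * k / n)
    return np.array(out)

wps = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 1.0, 0.0]])
traj = densify(wps, max_step=0.5)
steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
assert steps.max() <= 0.5 + 1e-9   # every step respects the bound
```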
c) Data Processing:
Sensor streams are recorded alongside AirSim API calls. Derived streams (optical flow, disparity, LiDAR) are computed post-hoc. Occlusion/out-of-FOV masks are generated for dense data integrity.
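Deriving optical flow from depth and pose amounts to unproject–transform–reproject; a NumPy sketch under a static-scene assumption (intrinsics are illustrative, and moving objects would need their own motion model):

```python
import numpy as np

def flow_from_depth_pose(depth, K, T_rel):
    """Dense flow induced by camera motion: unproject each pixel with
    its depth, transform by the relative pose T_rel (frame2-from-frame1),
    reproject, and subtract the original pixel grid."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # homogeneous
    pts2 = pts @ T_rel.T
    u2 = fx * pts2[..., 0] / pts2[..., 2] + cx
    v2 = fy * pts2[..., 1] / pts2[..., 2] + cy
    return np.stack([u2 - u, v2 - v], axis=-1)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)            # fronto-parallel plane at 2 m
T = np.eye(4)
T[0, 3] = -0.1   # camera translates +0.1 m along x => points shift -x
flow = flow_from_depth_pose(depth, K, T)
```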
d) Data Verification:
Automated checks enforce time synchronization and consistency (e.g., by pausing the simulator). Photometric consistency is tested via flow-warped error measures; corrupted or out-of-sync recordings are excluded. Spatial collision (penetration of obstacles) is also automatically detected and rejected.
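A flow-warped photometric check can be sketched as follows: warp the second image back onto the first using the flow and measure the residual; a high residual flags out-of-sync or corrupted frames (nearest-neighbour warping for brevity):

```python
import numpy as np

def photometric_error(img1, img2, flow):
    """Mean absolute intensity error after warping img2 back onto img1
    via the flow. Pixels landing outside the image are skipped,
    mimicking out-of-FOV masking."""
    H, W = img1.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    u2 = np.rint(u + flow[..., 0]).astype(int)
    v2 = np.rint(v + flow[..., 1]).astype(int)
    valid = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    return np.abs(img1[valid] - img2[v2[valid], u2[valid]]).mean()

rng = np.random.default_rng(1)
img1 = rng.random((32, 32))
img2 = np.roll(img1, -2, axis=1)     # content moves 2 px to the left
flow = np.zeros((32, 32, 2))
flow[..., 0] = -2.0                  # flow consistent with that motion
good = photometric_error(img1, img2, flow)                 # ~0: accept
bad = photometric_error(img1, img2, np.zeros_like(flow))   # large: reject
```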
For the “Active TartanAir” augmentation, IR dot patterns (photographed from a real D435i sensor) are rendered using Unity, with interleaved on/off acquisition at 30 Hz, and per-frame ground truth, SGM, and SLAM-based sparse map files are produced (Warburg et al., 2021).
5. Benchmarking, Evaluation, and Findings
TartanAir serves as both a benchmark and a training source for SLAM, VO, and multi-modal scene analysis. Algorithmic evaluations include ORB-SLAM (mono/stereo) and DSO. Standard metrics are:
- Absolute Trajectory Error (ATE)
- Relative Pose Error (translation/rotation)
- Success Rate (SR): fraction of runs that maintain tracking for at least 200 frames
Empirical findings from (Wang et al., 2020):
- On TartanAir ‘Easy’ mode, SR is often <90%, whereas on KITTI, SR is typically >95%
- Under ‘Hard’ (6-DoF, high-speed) settings, SR drops below 10% for monocular ORB-SLAM and DSO
- Both monocular and stereo methods degrade under rain, storm, or dynamic-object conditions, with monocular SR dropping by more than 50%
- Feature-based SLAM approaches fail during low-light (day ↔ night); direct methods (e.g., DSO) can occasionally outperform in these extremes
- Motion diversity is far higher in TartanAir (≈0.95) than in KITTI (≈0.005), underscoring its coverage of challenging 6-DoF maneuvers
This highlights the limitations of present SLAM algorithms in realistic, highly dynamic, and adverse scenarios, revealing substantial performance gaps that are not exposed by simpler benchmarks.
6. Access, Integration, and Limitations
TartanAir and its active-stereo supplement are publicly available for academic use at http://theairlab.org/tartanair-dataset (Wang et al., 2020, Warburg et al., 2021). Active TartanAir (≈8 GB) and a PyTorch DataLoader are distributed through associated project repositories.
Researchers are advised to:
- Avoid mixing pattern-off SGM depths with pattern-on inputs; always pair depth_sgm_on with image_ir_on for active-stereo tasks
- Pre-train on synthetic data, then fine-tune on real D435i captures for best transfer
- Be aware that some real sensor artifacts (e.g., noise, rolling shutter) are not simulated; the projected-dot pattern is an approximation
- Recognize that dynamic moving objects may not be fully addressed by self-supervised photometric loss or single-frame annotation
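The pattern-on pairing advice can be enforced mechanically; a sketch that matches pattern-on IR frames with pattern-on SGM depths by frame index (the file-name scheme here is illustrative, not the dataset's actual layout):

```python
import re

def paired_frames(ir_on_files, sgm_on_files):
    """Pair pattern-on IR images with pattern-on SGM depths by frame
    index, so pattern-on and pattern-off modes are never mixed."""
    idx = lambda name: re.search(r"(\d+)", name).group(1)
    sgm_by_idx = {idx(f): f for f in sgm_on_files}
    return [(f, sgm_by_idx[idx(f)])
            for f in ir_on_files if idx(f) in sgm_by_idx]

# Hypothetical file names for illustration.
irs = ["image_ir_on_000001.png", "image_ir_on_000002.png"]
sgms = ["depth_sgm_on_000002.npy", "depth_sgm_on_000001.npy"]
pairs = paired_frames(irs, sgms)   # matched regardless of listing order
```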
A plausible implication is that TartanAir is best used for benchmarking in-the-wild visual SLAM robustness and pre-training deep models prior to real-world deployment.
7. Research Significance and Applications
TartanAir fills a unique gap in SLAM and visual perception research: it provides a high-fidelity synthetic environment for robust benchmarking and pre-training, with rare coverage of extreme dynamics, weather, lighting, and semantic variation. Its multi-modal sensor suite and pixel-perfect ground truth serve algorithmic evaluation (SLAM, VO, segmentation, depth completion, and scene flow) and learning-based method development. The active-stereo supplement further enables research in self-supervised depth completion with dense and sparse map supervision (Warburg et al., 2021). This constitutes a reference corpus for robotics researchers aiming to extend SLAM capabilities beyond the limitations of legacy datasets and controlled environments.