TartanAir: A Synthetic Dataset for Visual SLAM
- TartanAir is a large-scale, photorealistic synthetic dataset designed to benchmark and train visual SLAM algorithms with pixel-accurate, multi-modal sensor data.
- It leverages Unreal Engine environments to simulate diverse conditions such as dynamic weather, lighting changes, and complex 6-DoF motions that overcome the limitations of traditional benchmarks.
- Researchers can utilize its extensive annotations and rigorous verification pipeline to evaluate both classical geometric and deep-learning based SLAM approaches under challenging real-world scenarios.
TartanAir is a large-scale, photorealistic synthetic dataset designed to challenge and advance the development of visual SLAM (Simultaneous Localization and Mapping) and related data-driven methods in robotics. Collected entirely in Unreal Engine–based environments, TartanAir offers multi-modal sensor streams, pixel-accurate ground truth labels, and extreme scene diversity across dynamic, weather-rich, and illumination-varying conditions. Its extensible annotations and active-stereo supplement provide a comprehensive benchmark and training corpus for both classical geometric and deep-learned perception algorithms (Wang et al., 2020, Warburg et al., 2021).
1. Motivation and Dataset Scope
The primary motivation for TartanAir stems from the limitations in established SLAM benchmarks (e.g., KITTI, TUM RGB-D, EuRoC), where scenarios are predominantly static, motion is simplistic (mainly ground vehicles), and environmental dynamics such as weather, lighting, and moving objects are limited. Data acquisition in these benchmarks is further hindered by cost, logistics, and restricted scene diversity. Synthetic datasets, while helpful for scale and annotation, often suffer from low diversity and sim-to-real transfer issues.
TartanAir seeks to overcome these shortcomings by offering:
- A massive, multi-modal synthetic dataset
- Rich diversity in scene type, layout, appearance, and dynamic events
- Precise, pixel-level ground truth, including pose, depth, and dense/semantic information
- Support for extreme and rare phenomena (dynamic weather, moving obstacles, varied lighting)
- Utility as both a performance benchmark and a large-scale training set
TartanAir consists of 1,037 trajectories and over 1,000,000 frames spanning 30 distinct simulated environments. This scale is an order of magnitude larger than datasets such as KITTI (≈22 sequences, ∼39,000 stereo frames) (Wang et al., 2020, Warburg et al., 2021).
2. Simulation Environments, Motion Patterns, and Variations
TartanAir encompasses 30 Unreal Engine environments grouped into six semantic categories: urban, rural, nature, domestic, public, and sci-fi. Environments feature indoor (corridors, houses, factories) and outdoor (streets, forests, underwater) scenes, with both structured and unstructured layouts (e.g., engineered roads versus rocky trails).
Key environmental and dynamic variations include:
- Moving objects (human avatars, vehicles, machinery, animals, natural motion such as foliage)
- Lighting conditions (day/night cycles, lens flares, dynamic spotlights, area lighting, auto-exposure)
- Weather and seasonal changes (rain, fog, snow, storms)
- Diverse camera motion: three difficulty modes (Easy: yaw/forward; Medium/Hard: full 6-DoF, roll/pitch, high-speed segments)
This high variability enables evaluation and training for algorithms under motion, appearance, and photometric conditions that are infeasible in existing physical datasets.
| Environment Properties | Example Values | Configurable Dynamics |
|---|---|---|
| Number of scenes | 30 | Moving objects, weather, lighting |
| Terrain types | Flat, stairs, hills | Day/night, fog, rain, snow |
| Motion patterns | Yaw/forward, 6-DoF | Loop closures, roll/pitch, high speed |
3. Sensor Modalities and Ground-Truth Labeling
TartanAir provides synchronized streams at each time step, including:
- Stereo RGB images (left/right)
- Depth images (per-pixel, linear depth, left camera)
- Semantic segmentation masks
- Dense optical flow
- Stereo disparity maps
- Simulated LiDAR point clouds (e.g., 32-beam)
- 6-DoF camera pose (ground truth, left camera-centric)
- Optional: occupancy maps, instance masks
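The synchronized streams above amount to one multi-modal sample per time step. A minimal schema sketch (field names, dtypes, and shapes are illustrative, not the dataset's on-disk layout):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TartanAirFrame:
    """One synchronized time step. Names/shapes are illustrative only."""
    rgb_left: np.ndarray    # (H, W, 3) uint8
    rgb_right: np.ndarray   # (H, W, 3) uint8
    depth_left: np.ndarray  # (H, W) float32, metres, left camera
    seg_left: np.ndarray    # (H, W) int32 semantic labels
    flow: np.ndarray        # (H, W, 2) float32, flow to next frame
    disparity: np.ndarray   # (H, W) float32
    pose_left: np.ndarray   # (4, 4) SE(3), left-camera-centric

H, W = 512, 640
frame = TartanAirFrame(
    rgb_left=np.zeros((H, W, 3), np.uint8),
    rgb_right=np.zeros((H, W, 3), np.uint8),
    depth_left=np.ones((H, W), np.float32),
    seg_left=np.zeros((H, W), np.int32),
    flow=np.zeros((H, W, 2), np.float32),
    disparity=np.zeros((H, W), np.float32),
    pose_left=np.eye(4),
)
```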
The sensor rig is defined by:
- Stereo cameras: 640×512 px, 30 Hz, calibrated intrinsics and extrinsics, baseline ≈ 0.1–0.2 m (Warburg et al., 2021)
- Monocular (left) camera co-located with left stereo view
- IMU (200 Hz, accelerometer and gyroscope), rigidly mounted to the camera
Mathematical models:
- Pose transforms use SE(3): $T = \begin{bmatrix} R & t \\ \mathbf{0}^\top & 1 \end{bmatrix}$, with rotation $R \in SO(3)$ and translation $t \in \mathbb{R}^3$
- Stereo pinhole projection for pixel mapping: $u = f_x\,x/z + c_x$, $\;v = f_y\,y/z + c_y$
- Disparity–depth relationship: $d = u_L - u_R = f_x\,b/z$, where $b$ is the stereo baseline
- Optical flow and LiDAR are derived from rendered depth and camera pose changes
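As a concrete check of the projection and disparity relations, a small NumPy sketch (the intrinsics and baseline are illustrative values, not TartanAir's published calibration):

```python
import numpy as np

def project(points_cam, fx, fy, cx, cy):
    """Project 3-D camera-frame points to pixels (pinhole model)."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def disparity_to_depth(disparity, fx, baseline):
    """Invert the stereo relation d = fx * b / z."""
    return fx * baseline / disparity

# Illustrative calibration (assumed, for the sketch only).
fx = fy = 320.0
cx, cy = 320.0, 256.0
baseline = 0.25  # metres, assumed

pts = np.array([[0.5, -0.2, 2.0], [1.0, 0.4, 4.0]])
uv = project(pts, fx, fy, cx, cy)

depth = pts[:, 2]
disp = fx * baseline / depth                      # forward relation
depth_back = disparity_to_depth(disp, fx, baseline)
assert np.allclose(depth, depth_back)             # round trip is exact
```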
The “Active TartanAir” supplement (Warburg et al., 2021) extends TartanAir with:
- Projected-pattern IR images in “interleaved” mode (D435i-style), for active-stereo research
- Semi-dense depth maps (OpenCV SGM output) on pattern-on IR pairs
- SLAM-based 3D landmarks and per-frame sparse projections
- Extended calibration (including projector–camera transforms)
- Complete time synchronization across all streams and sensor types
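The semi-dense depths come from OpenCV's semi-global matcher; as a self-contained stand-in for the underlying disparity search, here is a toy per-row SAD block matcher in NumPy (no smoothness term or sub-pixel refinement, so it is not SGM itself, only the matching idea):

```python
import numpy as np

def block_match_row(left_row, right_row, block=5, max_disp=16):
    """Toy per-row disparity by SAD over a 1-D block.
    Stand-in for SGM: brute-force cost search, no regularization."""
    half = block // 2
    w = left_row.shape[0]
    disp = np.zeros(w, np.float32)
    for u in range(half + max_disp, w - half):
        patch = left_row[u - half:u + half + 1]
        costs = [np.abs(patch - right_row[u - d - half:u - d + half + 1]).sum()
                 for d in range(max_disp)]
        disp[u] = int(np.argmin(costs))
    return disp

# Synthetic pair: the right view sees the scene shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.integers(0, 255, 200).astype(np.float32)
right = np.roll(left, -3)            # right[u] = left[u + 3]
disp = block_match_row(left, right, max_disp=8)
```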
4. Data Generation, Processing, and Verification Pipeline
TartanAir’s data pipeline is fully automated, encompassing:
a) Incremental Mapping:
Octree-based occupancy grids (0.25 m) are built using simulated depth and pose. Autonomous exploration uses frontiers and RRT* planning for efficient coverage and collision-free mapping. Approximately one hour is required per 100×100×10 m scene.
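The frontier step can be illustrated on a 2-D grid (the actual pipeline operates on a 3-D octree at 0.25 m resolution; this is a simplified sketch):

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def find_frontiers(grid):
    """Frontier cells: known-free cells with at least one 4-connected
    unknown neighbour. Exploration targets are sampled from these."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    near_unknown = np.zeros_like(free)
    near_unknown[1:, :] |= unknown[:-1, :]   # unknown above
    near_unknown[:-1, :] |= unknown[1:, :]   # unknown below
    near_unknown[:, 1:] |= unknown[:, :-1]   # unknown left
    near_unknown[:, :-1] |= unknown[:, 1:]   # unknown right
    return free & near_unknown

grid = np.full((6, 6), UNKNOWN)
grid[:3, :] = FREE          # explored upper half
grid[2, 4] = OCCUPIED       # one wall cell on the boundary
frontiers = find_frontiers(grid)
```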
b) Trajectory Sampling:
Trajectories are sampled in free space, connected via RRT*, refined to yield loop closures, and smoothed to ensure physically plausible, diverse motion (translation ± [0.2–0.5] m, rotation ± [3–10]°). Difficulty levels are assigned by allowed DoF and velocity.
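The per-step motion bound can be enforced by densifying sampled waypoints, sketched here for translation only (rotation bounds would be handled analogously; 0.5 m is the upper end of the stated translation range):

```python
import numpy as np

def densify(waypoints, max_step=0.5):
    """Insert intermediate points so consecutive positions differ by at
    most max_step metres, mirroring the per-step translation bound."""
    out = [waypoints[0]]
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        dist = np.linalg.norm(b - a)
        n = max(1, int(np.ceil(dist / max_step)))
        for k in range(1, n + 1):
            out.append(a + (b - a) * k / n)
    return np.array(out)

wps = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 1.0, 0.0]])
traj = densify(wps, max_step=0.5)
steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
assert steps.max() <= 0.5 + 1e-9   # every step respects the bound
```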
c) Data Processing:
Sensor streams are recorded alongside AirSim API calls. Derived streams (optical flow, disparity, LiDAR) are computed post-hoc. Occlusion/out-of-FOV masks are generated for dense data integrity.
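Deriving optical flow from depth and pose amounts to unproject–transform–reproject; a NumPy sketch under a static-scene assumption (intrinsics are illustrative, and moving objects would need their own motion model):

```python
import numpy as np

def flow_from_depth_pose(depth, K, T_rel):
    """Dense flow induced by camera motion: unproject each pixel with
    its depth, transform by the relative pose T_rel (frame2-from-frame1),
    reproject, and subtract the original pixel grid."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # homogeneous
    pts2 = pts @ T_rel.T
    u2 = fx * pts2[..., 0] / pts2[..., 2] + cx
    v2 = fy * pts2[..., 1] / pts2[..., 2] + cy
    return np.stack([u2 - u, v2 - v], axis=-1)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)            # fronto-parallel plane at 2 m
T = np.eye(4)
T[0, 3] = -0.1   # camera translates +0.1 m along x => points shift -x
flow = flow_from_depth_pose(depth, K, T)
```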
d) Data Verification:
Automated checks enforce time synchronization and consistency (e.g., by pausing the simulator). Photometric consistency is tested via flow-warped error measures; corrupted or out-of-sync recordings are excluded. Spatial collision (penetration of obstacles) is also automatically detected and rejected.
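A flow-warped photometric check can be sketched as follows: warp the second image back onto the first using the flow and measure the residual; a high residual flags out-of-sync or corrupted frames (nearest-neighbour warping for brevity):

```python
import numpy as np

def photometric_error(img1, img2, flow):
    """Mean absolute intensity error after warping img2 back onto img1
    via the flow. Pixels landing outside the image are skipped,
    mimicking out-of-FOV masking."""
    H, W = img1.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    u2 = np.rint(u + flow[..., 0]).astype(int)
    v2 = np.rint(v + flow[..., 1]).astype(int)
    valid = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    return np.abs(img1[valid] - img2[v2[valid], u2[valid]]).mean()

rng = np.random.default_rng(1)
img1 = rng.random((32, 32))
img2 = np.roll(img1, -2, axis=1)     # content moves 2 px to the left
flow = np.zeros((32, 32, 2))
flow[..., 0] = -2.0                  # flow consistent with that motion
good = photometric_error(img1, img2, flow)                 # ~0: accept
bad = photometric_error(img1, img2, np.zeros_like(flow))   # large: reject
```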
For the “Active TartanAir” augmentation, IR dot patterns (photographed from a real D435i sensor) are rendered using Unity, with interleaved on/off acquisition at 30 Hz, and per-frame ground truth, SGM, and SLAM-based sparse map files are produced (Warburg et al., 2021).
5. Benchmarking, Evaluation, and Findings
TartanAir serves as both a benchmark and a training source for SLAM, VO, and multi-modal scene analysis. Algorithmic evaluations include ORB-SLAM (mono/stereo) and DSO. Standard metrics are:
- Absolute Trajectory Error (ATE)
- Relative Pose Error (translation/rotation)
- Success Rate (SR): fraction of runs that maintain tracking for at least 200 frames
Empirical findings from (Wang et al., 2020):
- On TartanAir ‘Easy’ mode, SR is often <90%, whereas on KITTI, SR is typically >95%
- Under ‘Hard’ (6-DoF, high-speed) settings, SR drops below 10% for monocular ORB-SLAM and DSO
- Both monocular and stereo methods degrade under rain, storm, or dynamic-object conditions, with monocular SR dropping by more than 50%
- Feature-based SLAM approaches fail during low-light (day ↔ night); direct methods (e.g., DSO) can occasionally outperform in these extremes
- Motion diversity is far higher in TartanAir (≈0.95) than in KITTI (≈0.005), underscoring its coverage of challenging 6-DoF maneuvers
This highlights the limitations of present SLAM algorithms in realistic, highly dynamic, and adverse scenarios, revealing substantial performance gaps that are not exposed by simpler benchmarks.
6. Access, Integration, and Limitations
TartanAir and its active-stereo supplement are publicly available for academic use at http://theairlab.org/tartanair-dataset (Wang et al., 2020, Warburg et al., 2021). Active TartanAir (≈8 GB) and a PyTorch DataLoader are distributed through associated project repositories.
Researchers are advised to:
- Avoid mixing pattern-off SGM depths with pattern-on inputs; always pair depth_sgm_on with image_ir_on for active-stereo tasks
- Pre-train on synthetic data, then fine-tune on real D435i captures for best transfer
- Be aware that some real sensor artifacts (e.g., noise, rolling shutter) are not simulated; the projected-dot pattern is an approximation
- Recognize that dynamic moving objects may not be fully addressed by self-supervised photometric loss or single-frame annotation
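The pattern-on pairing advice can be enforced mechanically; a sketch that matches pattern-on IR frames with pattern-on SGM depths by frame index (the file-name scheme here is illustrative, not the dataset's actual layout):

```python
import re

def paired_frames(ir_on_files, sgm_on_files):
    """Pair pattern-on IR images with pattern-on SGM depths by frame
    index, so pattern-on and pattern-off modes are never mixed."""
    idx = lambda name: re.search(r"(\d+)", name).group(1)
    sgm_by_idx = {idx(f): f for f in sgm_on_files}
    return [(f, sgm_by_idx[idx(f)])
            for f in ir_on_files if idx(f) in sgm_by_idx]

# Hypothetical file names for illustration.
irs = ["image_ir_on_000001.png", "image_ir_on_000002.png"]
sgms = ["depth_sgm_on_000002.npy", "depth_sgm_on_000001.npy"]
pairs = paired_frames(irs, sgms)   # matched regardless of listing order
```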
A plausible implication is that TartanAir is best used for benchmarking in-the-wild visual SLAM robustness and pre-training deep models prior to real-world deployment.
7. Research Significance and Applications
TartanAir fills a unique gap in SLAM and visual perception research: it provides a high-fidelity synthetic environment for robust benchmarking and pre-training, with rare coverage of extreme dynamics, weather, lighting, and semantic variation. Its multi-modal sensor suite and pixel-perfect ground truth serve algorithmic evaluation (SLAM, VO, segmentation, depth completion, and scene flow) and learning-based method development. The active-stereo supplement further enables research in self-supervised depth completion with dense and sparse map supervision (Warburg et al., 2021). This constitutes a reference corpus for robotics researchers aiming to extend SLAM capabilities beyond the limitations of legacy datasets and controlled environments.