CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

Published 30 Mar 2026 in cs.RO, cs.AI, cs.CV, and cs.HC | (2603.28032v1)

Abstract: The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir

Authors (4)

Summary

The paper presents CARLA-Air, a unified simulation platform that integrates aerial and ground dynamics in a single process.
It achieves strict spatial-temporal consistency with synchronized capture of up to 18 sensor modalities and zero alignment interpolation.
Empirical results demonstrate sub-millisecond state query latency, stable throughput of roughly 20 FPS under a moderate joint workload, and API compatibility with both CARLA and AirSim.

CARLA-Air: A Unified Simulation Infrastructure for Air-Ground Embodied Intelligence

Motivation and Problem Statement

CARLA-Air addresses a persistent gap in embodied AI and autonomous systems simulation: the inability to jointly simulate aerial (multirotor UAV) and ground (vehicular, pedestrian) agents within a physically coherent, single-process environment. Previous open-source platforms have been segregated by domain focus—CARLA excels at urban driving scenarios but lacks aerial dynamics, while AirSim offers physics-accurate multirotor simulation without realistic ground scenes. Bridge-based co-simulation introduces spatial-temporal inconsistencies and significant synchronization overhead, as quantified by per-frame data transfer latency scaling with sensor count. None of the prior systems simultaneously provide realistic urban traffic, socially-aware pedestrians, physics-based UAV flight, preserved native APIs, and strict single-process execution.

Architectural Contributions

CARLA-Air achieves air-ground integration through a principled engine-level composition, resolving the Unreal Engine single-GameMode constraint. The architecture leverages inheritance from CARLA's game mode for ground subsystems, and composition of AirSim's aerial flight actor as a UE4 world entity (Figure 1).

Figure 1: Coordinate mapping between the CARLA (UE4) and AirSim (NED) frames, showing only a Z-axis sign flip and scale factor conversion; forward/right axes are aligned.
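
The mapping in Figure 1 can be sketched in a few lines. The example assumes UE4 world positions in centimeters with Z up and AirSim NED positions in meters with Z down, and applies only the Z sign flip and scale factor named in the caption. The function names are illustrative and not part of either API, and any origin offset between the two frames is ignored.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]

CM_PER_M = 100.0  # assumption: UE4 world units are centimeters


def ue4_to_ned(p_cm: Vec3) -> Vec3:
    """Map a UE4 position (cm, Z up) to NED (m, Z down)."""
    x, y, z = p_cm
    # Forward/right axes are aligned; only scale them. Flip the sign of Z.
    return (x / CM_PER_M, y / CM_PER_M, -z / CM_PER_M)


def ned_to_ue4(p_m: Vec3) -> Vec3:
    """Inverse mapping: NED (m, Z down) back to UE4 (cm, Z up)."""
    x, y, z = p_m
    return (x * CM_PER_M, y * CM_PER_M, -z * CM_PER_M)


if __name__ == "__main__":
    # A point 20 m above the UE4 origin has a negative NED Z (down is positive).
    print(ue4_to_ned((1250.0, -300.0, 2000.0)))  # (12.5, -3.0, -20.0)
```

Keeping the conversion in one pair of functions makes it easy to verify in mixed CARLA/AirSim scripts that a round trip returns the original point.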

Both CARLA and AirSim native APIs are preserved, enabling seamless code migration and zero modification of existing research pipelines. The single-process design eliminates inter-process synchronization overhead and provides strict spatial-temporal consistency across sensor streams. Dual independent RPC servers support concurrent connections, and extensibility is enabled via an asset pipeline supporting custom robot platforms, UAV configurations, vehicles, and environment maps.

CARLA-Air delivers synchronized capture of up to 18 sensor modalities per tick (including RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, barometry), supporting both aerial and ground platforms. The joint environment enables heterogeneous workloads spanning air-ground cooperation, embodied navigation grounded in vision-language reasoning, multi-modal dataset construction, and reinforcement learning for cooperative policy training.
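
One way to sanity-check per-tick capture in a recorded dataset is to group records by tick index and confirm that every tick contains every expected modality. The record layout below, `(tick, platform, modality)` tuples, is a hypothetical logging format chosen for illustration, not CARLA-Air's export format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[int, str, str]  # (tick index, platform, modality): hypothetical layout
Stream = Tuple[str, str]       # (platform, modality)


def incomplete_ticks(records: Iterable[Record],
                     expected: Set[Stream]) -> Dict[int, List[Stream]]:
    """Map each incomplete tick to the (platform, modality) pairs it is missing."""
    seen: Dict[int, Set[Stream]] = defaultdict(set)
    for tick, platform, modality in records:
        seen[tick].add((platform, modality))
    # Keep only ticks where at least one expected stream is absent.
    return {tick: sorted(expected - got)
            for tick, got in sorted(seen.items())
            if expected - got}


if __name__ == "__main__":
    expected = {("drone", "rgb"), ("drone", "depth"), ("car", "rgb"), ("car", "lidar")}
    log = [
        (0, "drone", "rgb"), (0, "drone", "depth"), (0, "car", "rgb"), (0, "car", "lidar"),
        (1, "drone", "rgb"), (1, "car", "rgb"), (1, "car", "lidar"),
    ]
    print(incomplete_ticks(log, expected))  # {1: [('drone', 'depth')]}
```

An empty result means every tick carries the full sensor set, which is the property a tick-synchronous capture is expected to give.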

Benchmark experiments, conducted on RTX A4000 hardware, demonstrate stable throughput of approximately 20 FPS under moderate air-ground workloads, with negligible VRAM growth across a 3-hour endurance test (357 actor lifecycle cycles; a linear fit of VRAM against cycle count gives $R^2 = 0.11$). Single-process communication achieves sub-millisecond latency for state queries, substantially outperforming bridge-based approaches that require cross-process serialization.
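
To make the regression figures concrete, the sketch below fits an ordinary least-squares line to VRAM samples indexed by lifecycle cycle and reports the slope (MiB per cycle) and $R^2$. The sample values are synthetic placeholders, not the paper's measurements, and `linear_fit` is an illustrative helper.

```python
from typing import Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (slope, intercept, r_squared) of an ordinary least-squares line."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two or more paired samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With constant y there is no variance to explain; report 0.0 by convention.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


if __name__ == "__main__":
    cycles = [0, 1, 2, 3, 4]
    vram_mib = [1000.0, 1002.0, 1001.0, 1003.0, 1002.0]  # synthetic values
    slope, _, r2 = linear_fit(cycles, vram_mib)
    print(f"slope = {slope:.2f} MiB/cycle, R^2 = {r2:.2f}")
```

A low $R^2$ means the cycle count explains little of the variance in VRAM use, so sample-to-sample fluctuation dominates any steady trend.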

Representative Workflows and Applications

The paper validates CARLA-Air through five core workflows:

W1: Air-Ground Cooperative Precision Landing: Real-time UAV landing on a moving vehicle with $<0.5$ m terminal error, enabled by tick-synchronous control and cross-domain pose fusion.
Figure 2: Time-lapse of UAV precision landing execution; 3D trajectory convergence and smooth altitude descent profile illustrated along with horizontal error diminishing to below 0.5 m.
W2: Embodied Navigation and VLN/VLA Dataset Generation: Cross-view collection for vision-language instruction grounding, using paired bird's-eye and street-level perspectives.
Figure 3: Drone tracks pedestrian in urban environment, with visual observations and annotated reasoning chains per frame; persistent target tracking achieved via aerial overview.
W3: Synchronized Multi-Modal Dataset Collection: Simultaneous high-fidelity sensor capture across air and ground in a single tick, with measured tick-index alignment error below one tick.
Figure 4: Twelve modalities (six ground, six aerial) captured synchronously per tick; unified rendering and weather pipeline ensures spatial-temporal correspondence.
W4: Air-Ground Cross-View Perception: Demonstration of co-registered aerial and ground sensor streams throughout diverse urban environments and weather presets, achieving perfect temporal and rendering alignment.
Figure 5: Rows show aerial RGB samples from six CARLA maps; columns traverse weather presets, confirming rendering consistency and spatial co-registration.
W5: RL Training Environment: Stable closed-loop joint air-ground RL scenario with zero crash or error across hundreds of resets; reward design based on physically consistent inter-agent positioning.
Figure 6: RL agent learns to maintain optimal observation over moving vehicle; pipeline diagram details synchronous observations and action execution on both air and ground.
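
W5 describes its reward only as based on inter-agent positioning. The function below is a hypothetical shaping in that spirit, not the paper's reward: it scores how close the drone is to a point at a fixed height above the vehicle. Both positions are assumed to be in one common frame, in meters, with Z up; `hover_height_m` and `scale_m` are illustrative parameters.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def tracking_reward(drone: Vec3, vehicle: Vec3,
                    hover_height_m: float = 10.0,
                    scale_m: float = 5.0) -> float:
    """Return at most 1.0, reached when the drone is exactly hover_height_m above the vehicle."""
    target = (vehicle[0], vehicle[1], vehicle[2] + hover_height_m)
    error_m = math.dist(drone, target)  # Euclidean distance to the observation point
    return math.exp(-error_m / scale_m)  # decays smoothly as the drone drifts away


if __name__ == "__main__":
    print(tracking_reward((0.0, 0.0, 10.0), (0.0, 0.0, 0.0)))  # 1.0
    print(tracking_reward((5.0, 0.0, 10.0), (0.0, 0.0, 0.0)))  # exp(-1), about 0.37
```

Because both agents share one world state per tick, the two positions passed in can be read at the same tick without interpolation.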

Extensibility and Asset Integration

Through an extensible asset import pipeline, CARLA-Air supports integration of custom robots, vehicles, UAVs, and environment maps as spawnable actors within the unified world. All imported assets participate in the shared physics tick, rendering, and sensor visibility. This capability is crucial for evaluating domain-adapted platforms and hardware innovations within realistic multi-agent scenarios.

Figure 7: Custom robotic assets (mobile robot, sport car) imported and operated within CARLA-Air's shared simulation world alongside built-in traffic and aerial agents.

Strong Empirical Results and Claims

The moderate joint workload (3 vehicles, 2 pedestrians, 1 drone, 8 sensors) achieves $19.8 \pm 1.1$ FPS.
Integration overhead relative to ground-only baseline is quantified at 8.6 FPS (30.3%), mostly CPU-bound.
Memory stability endurance shows negligible VRAM accumulation ($0.49$ MiB/cycle, $R^2 = 0.11$), confirming suitability for RL training and long-run operation.
Synchronous multi-modal data collection yields zero alignment interpolation, outclassing bridge-based architectures.
API compatibility maintained: 89/89 CARLA tests pass, AirSim full sensor access verified, and 63 ROS 2 topics are published.
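
The overhead figures above can be cross-checked with a line of arithmetic. The sketch assumes the 8.6 FPS drop is measured from a ground-only baseline down to the 19.8 FPS joint workload; the baseline is not stated directly here, so the 28.4 FPS value is inferred, not reported.

```python
joint_fps = 19.8     # reported joint air-ground throughput
overhead_fps = 8.6   # reported integration overhead

baseline_fps = joint_fps + overhead_fps             # inferred ground-only baseline
overhead_pct = 100.0 * overhead_fps / baseline_fps  # overhead as a share of the baseline

print(f"baseline = {baseline_fps:.1f} FPS, overhead = {overhead_pct:.1f}%")
# baseline = 28.4 FPS, overhead = 30.3%
```

Under that assumption the two reported numbers agree: 8.6 FPS is 30.3% of a 28.4 FPS baseline.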

Theoretical and Practical Implications

CARLA-Air's architectural resolution of the GameMode conflict establishes a scalable, physically consistent foundation for embodied AI research involving heterogeneous agent coordination. Practical implications extend to scalable low-altitude urban mobility, dataset generation for cross-view perception, and RL policy training in mixed-domain environments.

Theoretically, the unified tick-synchronous world state and rendering pipeline enable new paradigms in multi-agent reasoning, visual grounding, and cross-modal fusion. CARLA-Air's guarantee of strict spatial-temporal consistency and native API preservation supports reproducibility and sub-tick-level evaluation unavailable in prior platforms.

Limitations and Future Directions

Current limitations are predominantly related to actor density scaling, map switching (requiring process restarts), and formal validation of multi-drone scenarios. Future work targets physics-state synchronization across engines and broader ecosystem integration (e.g., ROS 2 topic bridges). The repository replaces AirSim's archived upstream as the active maintenance locus for aerial simulation.

Conclusion

CARLA-Air fills a crucial infrastructural gap in air-ground embodied intelligence simulation by delivering physically coherent, high-fidelity urban scenarios with paired aerial and ground modalities in a single Unreal Engine process. Its strict spatial-temporal sensor alignment, preserved native RPC interfaces, and extensible asset pipeline unlock previously inaccessible research workflows in cooperative multi-agent robotics, vision-language navigation, and RL policy development. The platform's empirical stability and API compatibility make CARLA-Air a practical foundation for both theoretical advancement and large-scale dataset-driven applications in low-altitude autonomy and cross-domain embodied AI.

Explain it Like I'm 14

What this paper is about

The paper introduces CARLA‑Air, a free (open‑source) computer simulator that lets flying robots (drones) and ground robots/cars live and act together in the same realistic 3D city. It combines the strengths of two popular simulators:

CARLA (great for cars, roads, traffic, and pedestrians)
AirSim (great for accurate drone flight and drone sensors)

Instead of running them as two separate programs that “talk” to each other, CARLA‑Air carefully puts them inside one shared world that runs on a single engine. That makes everything line up in space and time, which is crucial for robots that learn from synchronized camera, LiDAR, and other sensors.

What questions the authors asked

The authors set out to answer simple but important questions:

Can we build one simulator where cars, pedestrians, and drones all move realistically in the same world, on the same clock?
Can we keep the original programming interfaces (APIs) from CARLA and AirSim so researchers don’t have to rewrite their code?
Can we capture many kinds of sensors (like cameras and laser scans) for both air and ground robots at exactly the same moments?
Can we avoid the lag and confusion that happen when two separate simulators try to stay in sync?
Will this run fast and stable enough to train and test robot behaviors for hours?

How they built it (in everyday terms)

Think of two clubs trying to use the same gym at the same time. Most people might try to split time or use walkie‑talkies to coordinate. That’s the “two separate simulators with a bridge” approach—it’s workable but clunky and often out of sync.

CARLA‑Air instead makes one club the host (CARLA runs the city, roads, traffic, and pedestrians) and invites the other club’s activity (AirSim’s drone flight) into the same gym as a normal participant.

Here’s what that means in practice:

One shared world and clock: Both the city and the drone run inside a single Unreal Engine process, so every camera frame and sensor reading happens at the same tick (moment in time). This keeps all data perfectly lined up.
Keep existing code working: The original CARLA and AirSim Python/ROS 2 APIs still work. If you already have code for either, you can run it without changes.
Solve the “one boss only” rule: Unreal Engine only allows one “game mode” (the boss that runs the world). CARLA‑Air makes CARLA the boss, then spawns the drone flight system as a regular world actor. Result: both systems work together without fighting over control.
Make coordinates match: Cars and drones used different ways of measuring direction and units (e.g., centimeters vs. meters; “up” vs. “down”). CARLA‑Air adds a simple conversion (flip the up/down axis and convert cm to m) so positions and orientations agree.
Many sensor types, synchronized: Up to 18 kinds of sensors (RGB camera, depth, semantic segmentation, LiDAR, radar, IMU, GPS‑like GNSS, barometer, and more) can be recorded at the same tick for ground and air platforms.
Easy to add your own robots and maps: There’s an asset import pipeline so you can bring custom drones, vehicles, or environments into the shared world.
Careful testing: They measured speed (frames per second), memory use over time (to catch leaks), and how quickly commands and data round‑trip between your script and the simulator (latency).

What they found and why it matters

Smooth, unified air‑ground simulation: With typical city scenes, a drone, traffic, and multiple sensors, CARLA‑Air runs around 20 frames per second—fast enough for most training and testing loops.
Stable for long runs: Over 3 hours of repeated resets (like in reinforcement learning), performance stayed steady with no crashes and no meaningful memory growth.
Low communication delay: Simple queries return in well under a millisecond, and even image requests are fast enough to keep up with the simulation clock—much quicker than sending data between two separate programs.
Better timing consistency: Because everything lives in one process, sensor data across drones and cars is tightly synchronized. That’s important when training AI models that depend on correctly aligned camera/LiDAR/GPS data.
No code rewrites: Researchers can reuse existing CARLA or AirSim code, lowering the barrier to trying air‑ground tasks.
Keeps AirSim alive: Since AirSim’s original development has been archived, CARLA‑Air gives that flight stack a modern, actively maintained home.

Why this is important

Having drones and ground robots share the same realistic city simulation opens doors to many useful projects:

Air‑ground teamwork: For example, a drone scouting traffic while a ground robot responds, or cooperative search‑and‑rescue.
Better navigation and decision‑making: Combine the drone’s bird’s‑eye view with ground‑level details for smarter planning.
High‑quality datasets: Collect perfectly matched sensor data from both views to improve 3D mapping, cross‑view recognition, and scene understanding.
Faster robot learning: Train policies with reinforcement learning in a safe, repeatable, and physically consistent world.
Real‑world relevance: Helps develop tech for the “low‑altitude economy,” like urban air mobility, package delivery, and infrastructure inspection.

The bottom line

CARLA‑Air fills a gap: it’s a single, easy‑to‑use, open‑source simulator where drones, cars, and people share one realistic world with synchronized sensors and preserved APIs. That makes it a practical foundation for next‑generation research in air‑ground robotics, safer city operations, and embodied AI—and it’s available now for the community to build on.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.

Quantitative validation of spatial–temporal consistency: no measurement of cross-sensor timestamp skew, pose consistency, or per-tick jitter across aerial and ground sensors under load (e.g., many sensors/agents, high resolutions).
Physics fidelity of aerial dynamics: no benchmarking of multirotor models (e.g., SimpleFlight) against real flight logs (attitude, trajectory tracking, wind disturbances, ground effect, downwash).
Weather and wind modeling: absence of experiments validating wind fields, turbulence, gusts, and their effects on UAV dynamics and sensor outputs; unclear how to configure or simulate realistic aerodynamics-weather coupling.
Sensor realism and calibration: no validation of camera rolling shutter, motion blur, lens distortion, LiDAR scan timing, radar propagation artifacts, IMU noise/bias drift, GNSS multipath; lacks procedures to calibrate and verify sensor models.
Cross-modal registration accuracy: missing quantitative evaluation of extrinsic alignment between aerial and ground sensor suites (e.g., reprojection error, LiDAR–camera consistency) and tools for automatic cross-view calibration.
Scaling to multi-drone, high-density scenes: only one-drone experiments reported; no analysis of performance, scheduling, or stability with many drones (each at ~1 kHz physics), large traffic populations, and hundreds of sensors.
Pedestrian/traffic behavior in joint scenarios: authors note high-density behavior is an “active engineering target”; no evaluation of how pedestrians/vehicles react in the presence of drones or how performance degrades with dense actors.
Determinism and reproducibility: no evidence that synchronous mode yields bitwise- or numerically-deterministic runs across seeds/hardware; repeatability under multi-threaded aerial physics remains untested.
Long-duration stability beyond 3 hours: endurance test is limited (3 hours, 357 cycles); unknown behavior over multi-day training runs (memory fragmentation, VRAM drift, resource leaks, RPC stability).
Upper bounds on sensing throughput: while “up to 18 sensor modalities” is claimed, there is no characterization of maximum sustainable sensor counts per agent/world, high-FPS (e.g., 120 Hz) cameras, or 4K/8K resolutions.
Dataset construction specifics: missing details on ground-truth formats (3D boxes, instance IDs, dense semantics), synchronization/timestamps, calibration file export, and guarantees for cross-view correspondence correctness at each tick.
Sim2Real transfer evidence: no case studies or metrics demonstrating that policies/detectors trained in CARLA-Air transfer to real UAV/UGV systems; lacks domain randomization knobs and recommended ranges for reducing reality gap.
Physical interaction across domains: not evaluated whether drone downwash affects nearby actors/particles/vegetation, nor whether collisions and contact dynamics between aerial and ground agents are realistic and numerically stable.
Networking beyond loopback: latency/throughput measured only on localhost; no evaluation for remote clients, multi-user sessions, or LAN/WAN scenarios (packet loss, jitter) common in distributed training and teleoperation.
Headless/cloud execution: no results for offscreen rendering, containerized/cloud deployments, or multi-GPU servers; unclear performance and stability without a display, and under virtualization.
Cross-platform support: performance and compatibility are only reported for Ubuntu 20.04/UE4; Windows/macOS, diverse GPU drivers, and UE5 migration viability are not assessed.
Version compatibility matrix: ROS 2 and Python API compatibility is claimed but not enumerated; lacks a tested matrix of CARLA/AirSim/ROS 2/Python versions and guarantees across future upstream updates.
Concurrency and scheduling: aerial physics at ~1 kHz is described as “on a dedicated thread,” but no analysis of contention with UE render/physics threads, priority inversion risks, or how shared-tick semantics interact with high-rate sensors.
Real-time control budgets: no end-to-end latency accounting from sensor capture to control actuation at typical control rates (e.g., 100–400 Hz) to confirm feasibility for tight PID/MPC/VO pipelines.
Asset pipeline rigor: procedures to specify mass/inertia, propulsion constants, and controller gains for imported UAVs/UGVs are not detailed; lacks validation tools to catch inconsistent collision meshes or ill-conditioned dynamics.
Benchmark suite definition: the table claims a test suite, but the paper does not detail standardized scenarios, tasks, and metrics for air–ground cooperation, navigation, or perception—hindering reproducible comparisons.
GNSS/georeferencing realism: unknown support for geo-anchored environments, realistic GNSS errors (bias, multipath, urban canyons), and alignment with real-world coordinates/maps for outdoor robotics.
Sensor–weather coupling: untested whether rain/fog/snow affect LiDAR/radar/camera realistically (attenuation, backscatter, glare) and how parameters map to physically plausible conditions.
Resource sharing with on-box learning: although VRAM headroom is reported, no experiments co-running policy training (GPU-heavy) with the simulator to quantify interference, throughput, and stability.
Extending beyond multirotors: no support or roadmap for fixed-wing/VTOL/heli dynamics and sensors; unclear how easily aerial models can be generalized.
Failure handling and recovery: lack of evaluation of robustness to client crashes, network disconnects, or sensor/actor misconfigurations, and how the simulator recovers without manual intervention.
Evaluation fairness vs. co-simulation: Figure 1 reports IPC savings, but lacks a controlled, end-to-end comparison of task performance/accuracy (e.g., perception alignment, RL learning curves) between single-process and bridge-based setups.

Practical Applications

Immediate Applications

The following use cases can be deployed now, leveraging CARLA-Air’s single-process, synchronized air–ground simulation, preserved CARLA/AirSim Python APIs and ROS 2 interfaces, multi-modal sensing (up to 18 modalities), and extensible asset pipeline.

Sector: Robotics/Autonomy (industry, academia)
- Application: Rapid prototyping and evaluation of air–ground cooperative algorithms (e.g., cooperative surveillance, escort, search-and-rescue)
- Workflow/Tools: Reuse existing CARLA or AirSim codebases without modification; run closed-loop tests in synchronous mode; log strictly time-aligned aerial/ground sensor streams; parameter-sweep PIDs and planners; iterate via ROS 2
- Assumptions/Dependencies: Mid-range GPU/CPU to sustain ~20 FPS in joint workloads; sim-to-real gap requires domain randomization and field validation; pedestrian behavior under very high densities may need tuning
Sector: Software/AI (perception, mapping)
- Application: Synthetic multi-modal, cross-view dataset generation (paired aerial–ground RGB/depth/segmentation/LiDAR/radar/IMU/GNSS) for cross-view matching, 3D reconstruction, scene understanding
- Workflow/Tools: Configure synchronized sensor rigs on drones and vehicles; export auto-labeled data; vary weather/time-of-day; import custom maps/assets for target domains; create a “dataset factory” pipeline
- Assumptions/Dependencies: Asset licensing and realism of imported environments; manage domain gap (use style and photometric augmentations); storage/compression pipeline for large multi-sensor streams
Sector: AI/ML (reinforcement learning, multi-agent)
- Application: Cooperative RL training for aerial–ground policies in a single, shared-tick environment
- Workflow/Tools: Wrap CARLA-Air in Gym-like interfaces; use domain randomization for robustness; co-train policies for pursuit-evasion, target handoff, shared exploration; integrate with distributed RL frameworks
- Assumptions/Dependencies: Compute budget to maintain stable tick rates; careful seed control for reproducibility; curriculum design to mitigate exploration burden
Sector: Vision–Language–Action (academia, applied AI)
- Application: Vision-language navigation and action with complementary aerial overview + ground detail
- Workflow/Tools: Record synchronized video–state–language tuples; evaluate VLM/VLA agents on language-conditioned navigation, search, and manipulation proxies; build benchmarks with consistent ground truths
- Assumptions/Dependencies: External LLM/VLM stack integration; annotation pipelines for high-level instructions; evaluate generalization beyond synthetic visuals
Sector: Logistics/UAM (industry)
- Application: Drone delivery and landing-site prototyping amid realistic urban traffic and pedestrians
- Workflow/Tools: Model delivery routes, curbside interactions, and emergency landing strategies; stress-test perception and landing policies under varied weather and density; iterate with ROS 2 and Python APIs
- Assumptions/Dependencies: No built-in airspace/UTM module—operational rules and geofencing must be scripted; wind/turbulence beyond defaults may require plugins; results need real-world correlation
Sector: Software Engineering (DevOps/QA for autonomy stacks)
- Application: CI/CD regression testing of autonomy software (ROS 2 nodes, planning, perception)
- Workflow/Tools: Synchronous mode for determinism; headless runs in fixed scenarios; replayable seeds; assert on KPIs (collisions, waypoint latency, success rate); prebuilt binaries for standardized test runners
- Assumptions/Dependencies: Test harness integration; machine reproducibility (driver/OS versions); coverage of edge cases still curated by engineers
Sector: Education/Training (universities, bootcamps)
- Application: Dual-domain robotics labs (flight + driving) in one environment
- Workflow/Tools: Course modules on sensor fusion, localization, multirotor control, and multi-agent planning; students reuse CARLA/AirSim examples without code changes; evaluate policies with synchronized sensors
- Assumptions/Dependencies: Lab machines with discrete GPUs; onboarding to Unreal-based simulator; instructor-provided scenarios and rubrics
Sector: Geomatics/SLAM (academia, startups)
- Application: Benchmarking cross-view localization, SLAM, and 3D mapping with guaranteed pose ground truth
- Workflow/Tools: Collect paired aerial/ground sequences with LiDAR + vision; export pose graphs and map priors; test cross-view place recognition and aerial-assisted localization
- Assumptions/Dependencies: Synthetic geometry/materials vs. target domain; careful sensor noise modeling to match real hardware
Sector: Energy/Infrastructure Inspection (industry)
- Application: Simulated inspection missions using drone + UGV collaboration (e.g., substation, bridge)
- Workflow/Tools: Import CAD/meshed assets; set waypoint plans; evaluate viewpoint planning and defect detection models with controllable lighting and occlusion
- Assumptions/Dependencies: Quality of imported asset geometry and materials; specialized sensors (e.g., thermal) may need emulation extensions
Sector: HCI/Teleoperation (industry, UX research)
- Application: Operator-in-the-loop evaluations of multi-vehicle supervision (switching aerial/ground views)
- Workflow/Tools: Integrate custom UIs; record operator performance; evaluate camera placement, alerting, and autonomy handoff policies
- Assumptions/Dependencies: External UI stack; latency budget shaped by rendering + image transfer settings; human-subject protocols for studies
Sector: Public Safety (municipal agencies, vendors)
- Application: Procedural rehearsal for joint drone–ground emergency response (routing, deconfliction, crowd-aware navigation)
- Workflow/Tools: Script incidents, crowds, and traffic; measure response time and safety metrics; iterate tactics before field drills
- Assumptions/Dependencies: Ethical constraints; policy approximations for crowd behavior; scenario variability vs. doctrine alignment
Sector: Platform Migration (all sectors with AirSim/CARLA legacy)
- Application: Zero-modification migration of existing AirSim or CARLA projects into a unified air–ground setup
- Workflow/Tools: Keep native Python/ROS 2 APIs; validate synchronization and sensor timing; extend scenes incrementally with asset pipeline
- Assumptions/Dependencies: Pin toolchain/driver versions; verify coordinate-frame conversions for mixed stacks
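
The "wrap CARLA-Air in Gym-like interfaces" workflow listed above might look like the skeleton below. To stay runnable without a simulator, it drives a toy stand-in: `ToyBackend` and all of its methods are hypothetical placeholders for calls into the CARLA and AirSim clients, and the 0.05 s step is an assumed value matching roughly 20 ticks per second. The wrapper only mirrors the reset/step convention and imports no RL library.

```python
from typing import Dict, Tuple

Obs = Dict[str, float]


class ToyBackend:
    """Hypothetical stand-in for a synchronous-mode simulator session."""

    def __init__(self) -> None:
        self.tick_index = 0
        self.drone_x = 0.0
        self.car_x = 0.0

    def reset(self) -> None:
        self.tick_index, self.drone_x, self.car_x = 0, 0.0, 0.0

    def tick(self, drone_vx: float, dt: float) -> None:
        # One shared step: both agents advance on the same tick.
        self.car_x += 2.0 * dt        # toy car drives at a fixed 2 m/s
        self.drone_x += drone_vx * dt
        self.tick_index += 1


class AirGroundEnv:
    """Minimal reset/step wrapper; one step() call equals one simulator tick."""

    def __init__(self, backend: ToyBackend, dt: float = 0.05, max_ticks: int = 200) -> None:
        self.backend, self.dt, self.max_ticks = backend, dt, max_ticks

    def _observe(self) -> Obs:
        return {"offset_m": self.backend.drone_x - self.backend.car_x}

    def reset(self) -> Obs:
        self.backend.reset()
        return self._observe()

    def step(self, drone_vx: float) -> Tuple[Obs, float, bool]:
        self.backend.tick(drone_vx, self.dt)
        obs = self._observe()
        reward = -abs(obs["offset_m"])  # penalize drifting away from the car
        done = self.backend.tick_index >= self.max_ticks
        return obs, reward, done


if __name__ == "__main__":
    env = AirGroundEnv(ToyBackend())
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(drone_vx=2.0)  # match the car's speed
        total += reward
    print(env.backend.tick_index, total)
```

Keeping the simulator behind a small backend object lets the same environment class be exercised in CI with the toy backend and pointed at a live session later.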

Long-Term Applications

These use cases are enabled by CARLA-Air’s architecture but require additional research, scaling, validation, or ecosystem integrations.

Sector: Urban Air Mobility / UTM (industry, regulators, policy)
- Application: City-scale digital twins for low-altitude + ground robotics, with UTM integration and policy sandboxing
- Potential Tools/Products: “Urban Low-Altitude Ops Studio” combining CARLA-Air with traffic simulators and UTM services; batch what-if studies for corridor design, geofencing, and contingency handling
- Assumptions/Dependencies: Integration with UTM APIs and airspace rules engines; scalable multi-machine orchestration; validated behavior models for large crowds/traffic; regulator-accepted fidelity studies
Sector: Certification/Safety (policy, industry)
- Application: Certification-grade virtual validation of cooperative air–ground systems
- Potential Tools/Products: Scenario coverage libraries, hazard injection toolkits, formal requirement checkers, and traceability dashboards
- Assumptions/Dependencies: Statistically significant correlation with on-road/flight data; scenario taxonomies (e.g., PEGASUS-like) covering air–ground edge cases; V&V standards adoption by regulators
Sector: Logistics/Retail (industry)
- Application: End-to-end orchestration of last-mile air–ground delivery (drones + sidewalk robots) with dynamic dispatch and curb management
- Potential Tools/Products: “Air–Ground Delivery Orchestrator” for fleet simulation, scheduling, and micro-fulfillment layout optimization
- Assumptions/Dependencies: High-fidelity curb and pedestrian models; integration with mapping, inventory, and customer systems; real-world pilots for calibration
Sector: AI/Foundational Models (academia, tech)
- Application: Pretraining datasets for cross-view, multi-modal foundation models (aerial–ground video, depth, semantics, language)
- Potential Tools/Products: Pipelines that generate large-scale, time-synchronized corpora with automatic annotations and scripted language descriptions of events
- Assumptions/Dependencies: Photorealism and physics fidelity sufficient for transfer; scalable rendering farms; principled sim-to-real adaptation strategies
Sector: Emergency Management (public sector)
- Application: Live-data–assisted digital twins for disaster response (SAR, wildfire, flood) coordinating air and ground assets
- Potential Tools/Products: “SAR Ops Simulator” that ingests real-time GIS/weather feeds and predicts asset allocation strategies
- Assumptions/Dependencies: Environmental dynamics (smoke, wind, water) modeling; data assimilation pipelines; cross-agency interoperability
Sector: Telecom (industry)
- Application: Joint evaluation of 5G/6G connectivity for aerial–ground fleets (coverage, handoff, QoS under mobility)
- Potential Tools/Products: Coupled CARLA-Air + network simulators (e.g., ns-3) to study network-aware planning and adaptive bitrate for perception streams
- Assumptions/Dependencies: Tight co-simulation with network stack; timing fidelity at scale; realistic RF propagation models in urban canyons
Sector: Healthcare (hospitals, logistics)
- Application: Integrated hospital-campus autonomy (specimen/drug delivery by drones with AGV handoff)
- Potential Tools/Products: Workflow simulators for routing, scheduling, and sterile-chain compliance; capacity planning under peak demand
- Assumptions/Dependencies: Hospital IT integration; regulatory approvals; modeling of indoor–outdoor transitions and secure landing/pickup zones
Sector: Insurance/Finance (risk, underwriting)
- Application: Simulation-driven risk scoring for air–ground autonomous operations
- Potential Tools/Products: Scenario portfolios estimating incident probabilities under different policies, environments, and fleet mixes
- Assumptions/Dependencies: Calibrated incident models; access to claims/incident data for validation; acceptance by underwriting stakeholders
Sector: Sustainability/Energy (utilities, EPC)
- Application: Planning and optimizing inspection and maintenance of distributed assets (power lines, PV farms, pipelines) with joint air–ground teams
- Potential Tools/Products: Route planners that minimize time and carbon footprint while maintaining coverage and risk constraints
- Assumptions/Dependencies: Detailed asset libraries and terrain; integration with enterprise asset management; emissions models
Sector: Training/Certification (consumer, enterprise)
- Application: Operator training suites for supervising heterogeneous fleets in dense urban settings
- Potential Tools/Products: Scenario-based training with performance analytics and certifications; progressive difficulty and failure injection
- Assumptions/Dependencies: Training content development; human factors validation; hardware-in-the-loop or device integration for realism
Sector: Benchmark Ecosystem (academia, community)
- Application: Standardized, community-maintained benchmarks for air–ground embodied intelligence
- Potential Tools/Products: Public leaderboards for cooperative navigation, perception, and VLA tasks with synchronized metrics and datasets
- Assumptions/Dependencies: Governance and maintenance; agreed task definitions and metrics; compute resource sponsorship

Cross-Cutting Assumptions and Dependencies

Hardware/OS: Linux workstations with discrete GPUs (e.g., RTX-series) are needed for smooth joint workloads; GPU driver and Unreal Engine versions should be pinned for compatibility.
Fidelity and Transfer: Synthetic-to-real transfer remains a limiting factor; employ domain randomization, sensor-noise modeling, and targeted real-world validation.
Scaling: Current single-process design eliminates IPC overhead but may require architectural extensions for city-scale, multi-machine simulations.
Ecosystem Integrations: UTM, RF/network, aeroacoustics, and advanced weather/hazard dynamics require coupling with specialized simulators.
Maintenance: CARLA-Air actively extends AirSim within a modern infrastructure; sustained community and maintainer support will underpin long-term viability.

Glossary

Air-ground cooperation: Coordinated operation between aerial and ground robots to achieve joint tasks. "Air-ground cooperation---heterogeneous aerial and ground agents coordinate within a shared environment for tasks such as cooperative surveillance, escort, and search-and-rescue."
air-ground embodied intelligence: Research area focused on agents that perceive and act jointly across air and ground domains in a shared physical world. "CARLA-Air, a unified simulation infrastructure for air-ground embodied intelligence."
asset pipeline: The set of tools and processes to import and integrate custom robots, vehicles, and maps into the simulator. "An extensible asset pipeline further allows researchers to integrate custom robot platforms, UAV configurations, and environment maps into the shared simulation world."
autopilot: Automated control mode for vehicles that drives them without manual input. "8 autopilot vehicles + 1 drone; 1 aerial RGB @ $1920 \times 1080$"
barometry: Measurement of atmospheric pressure for altitude estimation in robotics. "RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, and barometry"
BeginPlay: An Unreal Engine lifecycle event triggered when gameplay starts, commonly used to initialize actors. "composed in BeginPlay"
bridge-based co-simulation: An approach that links separate simulators via communication bridges to run together. "Bridge-based co-simulation can connect heterogeneous backends, yet introduces synchronization overhead"
closed-loop interaction: A control setting where agents act, receive feedback, and update actions continuously within the environment. "agents learn cooperative or individual policies through closed-loop interaction in physically consistent air-ground environments."
cross-process serialization: Packaging and transferring data across process boundaries, often adding latency. "Bridge-based co-simulation \cite{transimhub} exhibits near-linear growth with sensor count due to cross-process serialization"
decoupled execution: Running subsystems without a shared synchronization tick, allowing independent timing. "Sync Mode: Msg. = message passing, Decpl. = decoupled execution, Shared = shared-tick within one process."
flight pawn: An Unreal Engine controllable entity representing the aircraft for physics and control. "flight pawn"
GNSS: Global Navigation Satellite System, providing global positioning and timing (e.g., GPS, GLONASS). "RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, and barometry"
GPU-accelerated reinforcement learning: RL training that exploits GPU parallelism for faster simulation or learning. "Isaac Lab \cite{isaaclab} and Isaac Gym \cite{isaacgym} emphasize massively parallel GPU-accelerated reinforcement learning"
GPU memory bandwidth saturation: Performance limit reached when GPU memory transfer capacity is the bottleneck. "Sensor rendering dominates at high resolution due to GPU memory bandwidth saturation"
harmonic mean: The reciprocal of the arithmetic mean of the reciprocals; suited to averaging rates because small values dominate and the impact of large values is reduced. "All reported frame rates are the harmonic mean of $\mathbf{f}$"
handedness (of coordinate frames): The orientation convention (left- or right-handed) used to define axes in 3D space. "negating $q_z$ accounts for the Z-axis reversal and the associated change of frame handedness."
IMU: Inertial Measurement Unit, providing acceleration and angular velocity for state estimation. "RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, and barometry"
interquartile range (IQR): A robust measure of statistical dispersion between the 25th and 75th percentiles. "Round-trip API call latency on the loopback interface (median $\pm$ IQR; 5,000 calls after 500 warm-up; RTX A4000; idle scene)."
left-handed system: A coordinate convention where axes follow left-hand orientation (as used by UE4). "CARLA inherits UE4's left-handed system with X forward, Y right, and Z up, in centimeters."
LiDAR: Light Detection and Ranging sensor that measures distances using laser pulses. "RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, and barometry"
loopback interface: A network interface that routes traffic back to the same machine (e.g., 127.0.0.1). "Both simulation APIs operate within the same process on the loopback interface, eliminating inter-process serialization overhead."
message-passing middleware: Software layer for exchanging messages between processes or systems. "message-passing middleware across independent processes."
multirotor: A UAV with multiple rotors for lift and control (e.g., quadrotor). "physics-accurate multirotor flight"
North-East-Down (NED) frame: A right-handed geographic coordinate frame with axes pointing North, East, and downward. "AirSim adopts a right-handed North-East-Down (NED) frame with X north, Y east, and Z down, in meters."
photogrammetry: Technique to reconstruct scenes from images, often used to build realistic environments. "FlightGoggles \cite{flightgoggles} provides photogrammetry-based environments"
photorealistic: Rendering that closely resembles real-world appearance. "CARLA \cite{carla}, built on Unreal Engine 4 \cite{ue4}, has become the de facto standard for urban autonomous driving research, offering photorealistic environments"
physics tick: The discrete timestep at which physics simulation updates occur. "Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic urban and natural environments"
PID gains: Proportional–Integral–Derivative controller parameters tuning system response. "All aerial experiments use the built-in SimpleFlight controller with default PID gains."
pose transform: The mathematical mapping between positions and orientations across coordinate frames. "Eqs. (\ref{eq:pos_transform}) and (\ref{eq:rot_transform}) together fully specify the pose transform"
unit quaternion: A normalized quaternion representing 3D orientation without singularities. "Let $q = (w, q_x, q_y, q_z)$ denote a unit quaternion in the UE4 frame."
render-target caching: Reuse of GPU render buffers to avoid reallocation costs across frames. "the negligible early-to-late drift ... is attributable to residual render-target caching rather than lifecycle leakage."
rendering pipeline: The sequence of GPU stages that produce images from 3D scene data. "Shared UE4 Rendering Pipeline"
reinforcement-learning-based policy training: Learning control policies via reward signals through interaction with the environment. "Reinforcement-learning-based policy training---agents learn cooperative or individual policies through closed-loop interaction in physically consistent air-ground environments."
right-handed frame: A coordinate convention where axes follow right-hand orientation. "AirSim adopts a right-handed North-East-Down (NED) frame with X north, Y east, and Z down, in meters."
ROS 2: Robot Operating System 2, a middleware framework for robotic communication and control. "ROS 2 interfaces"
RPC server: Remote Procedure Call server that handles requests from clients to invoke functions across processes or over a network. "Two independent RPC servers run concurrently within the single process---one per simulator---allowing the native Python clients of each platform to connect without modification."
semantic segmentation: Vision task assigning a class label to each pixel in an image. "RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, and barometry"
sensor modalities: Distinct types of sensor data streams (e.g., RGB, depth, LiDAR). "up to 18 sensor modalities---including RGB, depth, semantic segmentation, LiDAR, radar, IMU, GNSS, and barometry---across all aerial and ground platforms at each simulation tick."
shared-tick: A synchronization mode where systems advance using the same discrete time step. "Shared\,=\,shared-tick within one process."
single-process architecture: Design where all components run within one OS process, avoiding IPC overheads. "CARLA-Air remains effectively constant ($<0.5$ ms) owing to its single-process architecture."
spatial-temporal consistency: Strict alignment in space and time across sensors and subsystems. "cannot guarantee the strict spatial-temporal consistency required by modern perception and learning pipelines."
synchronous mode: Simulation mode where time advances in discrete, externally triggered steps for determinism. "Under synchronous-mode operation, per-tick wall time is bounded by the slowest of three concurrent contributors"
synchronization overhead: Extra time or resources required to keep different systems aligned in time. "Bridge-based co-simulation can connect heterogeneous backends, yet introduces synchronization overhead"
UAV: Unmanned Aerial Vehicle, commonly referred to as a drone. "aerodynamically consistent UAV dynamics"
UE4: Unreal Engine 4, a game engine used as the simulation backend. "UE4 enforces a strict invariant: each world may have exactly one active game mode."
UE4 Game Mode Slot: The single per-world slot in UE4 that hosts the active game mode class. "UE4 Game Mode Slot"
vision-language action: Control of agents using combined visual and natural language inputs to specify tasks. "Embodied navigation and vision-language action---agents navigate and act grounded in visual and linguistic input"
VRAM: Video RAM on the GPU used for textures, buffers, and render targets. "VRAM is sampled every 60 s."
wire protocols: Defined formats and rules for encoding data transmitted between systems. "The two RPC servers use distinct wire protocols and port assignments."
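
Two of the glossary entries above describe small computations that can be sketched in a few lines of Python: the harmonic mean used for reported frame rates, and the position part of the mapping from the UE4 frame (left-handed, X forward, Y right, Z up, centimeters) to the AirSim NED frame (right-handed, X north, Y east, Z down, meters). This is only an illustrative sketch, not the paper's implementation. The function names and the `origin_ue4_cm` parameter are hypothetical, and the sketch assumes the UE4 X and Y axes are already aligned with north and east. The rotation (quaternion) part of the pose transform is deliberately left out, because the paper's equations are not reproduced in this summary.

```python
from typing import Sequence, Tuple

CM_PER_M = 100.0  # UE4 positions are in centimeters, NED positions in meters


def harmonic_mean_fps(frame_rates: Sequence[float]) -> float:
    """Harmonic mean of frame rates: n divided by the sum of reciprocals."""
    if not frame_rates or any(f <= 0 for f in frame_rates):
        raise ValueError("frame_rates must be non-empty and strictly positive")
    return len(frame_rates) / sum(1.0 / f for f in frame_rates)


def ue4_to_ned_position(
    p_ue4_cm: Tuple[float, float, float],
    origin_ue4_cm: Tuple[float, float, float] = (0.0, 0.0, 0.0),  # assumed shared origin
) -> Tuple[float, float, float]:
    """Map a UE4 position (cm, Z up) to NED (m, Z down).

    Assumption: UE4 X/Y are aligned with north/east, so only the unit
    scale and the sign of Z change.
    """
    x, y, z = (p - o for p, o in zip(p_ue4_cm, origin_ue4_cm))
    return (x / CM_PER_M, y / CM_PER_M, -z / CM_PER_M)


if __name__ == "__main__":
    print(harmonic_mean_fps([20.0, 30.0, 60.0]))
    print(ue4_to_ned_position((100.0, 200.0, 300.0)))
```

For frame rates of 20, 30, and 60 FPS the harmonic mean is about 30 FPS, lower than the arithmetic mean of roughly 36.7 FPS, which shows how slow runs dominate the reported figure. In the position example, a point 3 m above the UE4 origin maps to a negative Z value in NED, since Z points down in that frame.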

Open Problems

We found no open problems mentioned in this paper.

GitHub

louiszengCN/CarlaAir: Fly Drones Inside a CARLA World!! A Unified Infrastructure for Air-Ground Embodied Intelligence (195 stars)
