AI2 2.0 Simulation Environment
- AI2-THOR is an interactive 3D simulation environment featuring photo-realistic indoor scenes for navigation, object interaction, and embodied AI research.
- The platform integrates a Unity-based physics engine with a Python API to support extensive sensor modalities and customizable scene configurations.
- Its modular and extensible design facilitates research in deep reinforcement learning, imitation learning, and complex multi-agent tasks.
AI2 2.0 Simulation Environment (AI2-THOR) is an interactive 3D simulation platform designed for research in visual AI, providing agents with photo-realistic indoor environments for navigation and object interaction. The system supports a wide range of research areas, including deep reinforcement learning, imitation learning, visual question answering, and embodied AI, with extensibility for custom scenes, tasks, and integration into machine learning pipelines. The environment combines a Python API and a Unity-based physics simulation with extensive sensor modalities and a comprehensive catalog of interactive objects and scenes (Kolve et al., 2017).
1. System Design and Architecture
AI2-THOR integrates a physics-based 3D Unity backend (Unity 2020.3 LTS, NVIDIA PhysX 3.4) with a Python frontend distributed as the ai2thor pip package. Communication is mediated locally via HTTP/ZeroMQ JSON messaging between the Python controller process and the Unity environment. For large-scale and headless deployments, Unity’s batch renderer and xvfb are supported on Linux.
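An action crossing this boundary travels as a small JSON document; the payload below is illustrative only (the field names mirror the `step()` parameters used later in this article, and the actual wire schema is internal to ai2thor):

```python
import json

# Illustrative controller -> Unity action payload; the real wire schema is
# internal to ai2thor, so the exact field set here is an assumption.
msg = {"action": "MoveAhead", "moveMagnitude": 0.25, "agentId": 0}
payload = json.dumps(msg)

decoded = json.loads(payload)
assert decoded["action"] == "MoveAhead"
```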
The environment organizes its simulation assets into four major scene collections, each tailored to distinct experimental needs:
| Collection | #Scenes | Purpose |
|---|---|---|
| iTHOR | 120 | Hand-modeled common rooms (kitchen, etc.) |
| RoboTHOR | 89 | Sim2real transfer, lab-like apartments |
| ProcTHOR-10K | 10,000 | Large-scale procedurally-generated homes |
| ArchitecTHOR | 10 | Evaluation (5 validation, 5 test) |
Scenes contain a total of approximately 3,578 interactive object models across 117+ categories (e.g., mugs, toasters, doors). Internal physics operates at a fixed 1/90 s time step, using convex mesh colliders and continuous collision detection; gravity uses Unity's default of −9.81 m/s² along the y-axis, with default PhysX friction and restitution. Object interactions (e.g., open, cook, slice) employ Unity components that orchestrate kinematic state changes and animation.
2. Agent Interface: Sensing and Actuation
AI2-THOR agents interact through a multi-modal sensor suite (per camera):
- RGB: Configurable image resolution (default 300×300, 8-bit per channel)
- Depth: 16-bit float per-pixel depth maps (in meters)
- Semantic Segmentation: Per-pixel class identifier (0 = background)
- Instance Segmentation: Per-pixel unique object identifier
- Surface Normals: Per-pixel surface orientation (XYZ float)
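A small helper can gather whichever of these modalities were enabled at initialization; the function below is a sketch, and the mock class stands in for a real ai2thor `Event`, which exposes the attribute names listed above:

```python
import numpy as np

def collect_frames(event):
    """Gather per-camera sensor outputs from an AI2-THOR-style Event.

    Any modality not enabled at Controller initialization is simply
    missing from the event, so it is returned as None here."""
    return {
        "rgb": getattr(event, "frame", None),                         # HxWx3 uint8
        "depth": getattr(event, "depth_frame", None),                 # HxW float, meters
        "semantic": getattr(event, "semantic_segmentation_frame", None),
        "instance": getattr(event, "instance_segmentation_frame", None),
    }

# Mock event for illustration only; a real Event comes from Controller.step().
class MockEvent:
    frame = np.zeros((300, 300, 3), dtype=np.uint8)
    depth_frame = np.ones((300, 300), dtype=np.float32)

frames = collect_frames(MockEvent())
assert frames["rgb"].shape == (300, 300, 3)
assert frames["semantic"] is None  # modality not enabled on the mock
```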
Action spaces are segmented as follows:
- Discrete Navigation: `MoveAhead` (default 0.25 m), `Rotate{Left,Right}` (default 90°), `Look{Up,Down}` (default 30°)
- Continuous Navigation: `Move(float)`, `Rotate(float)`, `Teleport(x, y, z, yaw, pitch)`
- Arm Manipulation (ManipulaTHOR, StretchRE1): via IK-joint targeting and gripper control, restricted to objects within reach and line-of-sight
- Abstracted Interactions (LoCoBot, Abstract, Drone): `Pickup`, `Open`, `Place`, `Drop`, `Push`, `Throw`, allowed for objects within 1 m and the field of view
Agent state and scene context are provided through a structured state representation after every action (an Event), containing:
- `event.frame`: RGB image
- `event.depth_frame`: Depth map
- `event.instance_segmentation_frame`: Instance mask
- `event.semantic_segmentation_frame`: Class mask
- `event.third_party_normals`: Surface normals
- `event.metadata`: Dict covering all visible, open, and moving objects, their transforms, the agent's own pose and action status, and the current scene identity/geometry
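The metadata dict can be queried directly for task logic; the sketch below filters visible objects by type (the sample metadata is illustrative, showing only the fields the function uses):

```python
def visible_object_ids(metadata, object_type=None):
    """Return objectIds of currently visible objects in event.metadata,
    optionally restricted to one objectType."""
    return [
        obj["objectId"]
        for obj in metadata["objects"]
        if obj.get("visible")
        and (object_type is None or obj.get("objectType") == object_type)
    ]

# Hand-written sample metadata, not captured from a real simulator run.
sample = {"objects": [
    {"objectId": "Mug|+01|+00", "objectType": "Mug", "visible": True},
    {"objectId": "Toaster|+02|+00", "objectType": "Toaster", "visible": False},
]}
assert visible_object_ids(sample) == ["Mug|+01|+00"]
assert visible_object_ids(sample, "Toaster") == []
```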
3. API Workflow and Configuration
Installation is available via PyPI (pip install ai2thor). A basic experiment loop consists of initializing a Controller object with custom scene, grid, and rendering options, invoking reset() for an episode start, and repeatedly calling step():
```python
from ai2thor.controller import Controller

c = Controller(
    scene='FloorPlan1_physics',
    gridSize=0.25,
    width=300,
    height=300,
    renderDepthImage=True,
    renderClassImage=True,
    renderInstanceSegmentation=True,
)
ev = c.reset()
ev = c.step(dict(action='MoveAhead', moveMagnitude=0.25))
rgb = ev.frame
depth = ev.depth_frame
```
Scene and object manipulations (addition or removal) are supported at runtime using step actions, e.g.:
```python
c.step(dict(action='AddObject', name='Apple', objectId='apple_01',
            position={'x': 1, 'y': 0, 'z': 1}))
c.step(dict(action='RemoveObject', objectId='apple_01'))
```
Configuration controls include randomSeed, lightingVariation, randomizeMaterials for domain adaptation, and custom timeouts or reward parameters for RL tasks (with AllenAct integration).
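For reproducible domain randomization, one common pattern is to derive a per-episode seed from a single experiment-level `randomSeed` and pass it at each reset; the helper below is our own convention, not part of the ai2thor API:

```python
def episode_seed(base_seed, episode_idx):
    """Derive a reproducible per-episode seed from one experiment-level
    base seed, so material/lighting randomization can be replayed.

    Simple integer mixing; the multiplier is an arbitrary large prime."""
    return (base_seed * 1_000_003 + episode_idx) % (2**31)

s0 = episode_seed(42, 0)
assert episode_seed(42, 0) == s0   # deterministic for a given episode
assert episode_seed(42, 1) != s0   # varies across episodes
```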
4. Supported Tasks, Benchmarks, and Metrics
AI2-THOR ships with preconfigured tasks supporting:
- ObjectNav / Target-Driven Navigation: Navigate to a visually specified target
- Audio-Visual Navigation: Locate sounding object
- Point-Goal Navigation: Move to X,Y,Z coordinate
- Interactive Question Answering (IQA)
- Instruction Following: ALFRED, TEACh, DialFRED
- Pick-and-Place, Rearrangement (RoomR)
- Multi-agent collaboration
- Arm-Based Manipulation (e.g., grasp, open, slice, pour with ManipulaTHOR)
Reward and evaluation metrics include:
- Distance-based reward: $r_t = d_{t-1} - d_t$ (the per-step reduction in distance to the goal), with a success bonus added when $d_t < \varepsilon$
- Success-weighted by inverse Path Length (SPL):
  $$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \frac{\ell_i}{\max(p_i, \ell_i)}$$
  where $S_i \in \{0,1\}$ marks episode success, $p_i$ = path length taken, $\ell_i$ = shortest path length
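SPL can be computed directly from per-episode logs; a minimal helper (function name and argument layout are our own):

```python
def spl(successes, shortest, taken):
    """Success-weighted by inverse Path Length (Anderson et al., 2018).

    successes: per-episode 0/1 outcomes S_i
    shortest:  shortest-path lengths l_i
    taken:     actually traversed path lengths p_i"""
    n = len(successes)
    return sum(
        s * (l / max(p, l)) for s, l, p in zip(successes, shortest, taken)
    ) / n

# Episode 0: success while taking twice the shortest path; episode 1: failure.
assert spl([1, 0], [2.0, 3.0], [4.0, 5.0]) == 0.25
assert spl([1], [2.0], [2.0]) == 1.0  # perfect path, full credit
```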
Performance figures observed (e.g., ObjectNav, 28 processes, 2× RTX 2080):
- ProcTHOR-10K scenes: 145–179 FPS
- API step latency: 3–7ms (single process)
- ObjectNav in iTHOR: success rate 60–70% (SPL 0.45)
5. Extensibility and Customization
AI2-THOR’s architecture facilitates the creation of new environments and research tasks:
- Adding Scenes/Objects:
  - Import FBX models in the Unity Editor
  - Create a prefab with collider, rigidbody, and "Interactable" Unity component
  - Insert into the scene file (`.unity`)
  - Update scene bounds/objects JSON as needed
  - Rebuild asset bundles and restart the Controller
- Custom Tasks/Rewards: Implemented by subclassing `ai2thor.controller.Event` or an AllenAct `Task`, overriding `step()` to parse `ev.metadata`, and integrating with RL libraries (e.g., stable-baselines3, RLlib) via Gym `Env` wrappers.
- Gym-Style Integration:
```python
from ai2thor.contrib.gym import make

env = make('AI2-THOR-Task-v0', config=dict(scene='FloorPlan1_physics', ...))
```
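When the contrib wrapper does not fit a custom task, a minimal Gym-style wrapper can be hand-rolled around the Controller. The sketch below uses a stub controller so it runs stand-alone; the action set, reward, and termination logic are placeholders to be replaced with task-specific code driven by `ev.metadata`:

```python
class ThorGymEnv:
    """Minimal Gym-style wrapper sketch around an AI2-THOR Controller.

    `controller` is anything exposing reset()/step(dict) and returning
    an event with .frame and .metadata; the real ai2thor Controller
    fits, and the stub below stands in for it here."""
    ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight"]

    def __init__(self, controller):
        self.c = controller

    def reset(self):
        return self.c.reset().frame

    def step(self, action_idx):
        ev = self.c.step(dict(action=self.ACTIONS[action_idx]))
        reward = 0.0   # placeholder: compute from ev.metadata for a real task
        done = False   # placeholder: task-specific termination condition
        return ev.frame, reward, done, ev.metadata

# Stubs for illustration; a real run uses ai2thor.controller.Controller.
class _StubEvent:
    frame = "rgb"
    metadata = {"lastActionSuccess": True}

class _StubController:
    def reset(self): return _StubEvent()
    def step(self, action): return _StubEvent()

env = ThorGymEnv(_StubController())
obs = env.reset()
obs, r, done, info = env.step(0)
assert info["lastActionSuccess"] and r == 0.0
```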
6. Deployment Requirements and Limitations
Minimum software dependencies include Python 3.6+, ai2thor ≥ 2.8.0, numpy, Pillow, Unity 2020.3 LTS (for custom asset builds), and optionally AllenAct for RL tasks.
Hardware recommendations:
| Resource | Recommended Minimum |
|---|---|
| GPU | NVIDIA GTX 1080 Ti; 8GB VRAM |
| CPU | Quad-core 2.5 GHz |
| System RAM | 16GB |
| OS | Ubuntu 18.04/20.04, Windows 10, macOS |
On Linux, headless launching via:
```shell
sudo apt install xvfb libxi6 libgconf-2-4
pip install ai2thor
xvfb-run -s "-screen 0 1280x720x24" python your_script.py
```
Known constraints:
- Scene reset incurs 50–100 ms overhead; batching advisable.
- No native simulation of deformable objects (cloth, complex fluids).
- Continuous arm manipulation requires tuning of collision margins for stability.
- Multi-agent synchronization must be handled in user logic; asynchronous multi-agent stepping is not natively supported.
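Given the last constraint, multi-agent experiments typically advance all agents in lock-step from user code, passing `agentId` with each step (an ai2thor action parameter for controllers initialized with `agentCount` > 1). The stub below stands in for the Controller so the sketch runs stand-alone:

```python
def step_all_agents(controller, actions):
    """Advance every agent one action in round-robin lock-step, since
    asynchronous multi-agent stepping is not natively supported.

    actions: one action name per agent; returns the resulting events."""
    return [
        controller.step(dict(action=a, agentId=i))
        for i, a in enumerate(actions)
    ]

# Stub standing in for an ai2thor Controller with agentCount=2.
class _Stub:
    def __init__(self): self.calls = []
    def step(self, action):
        self.calls.append(action)
        return action

c = _Stub()
step_all_agents(c, ["MoveAhead", "RotateLeft"])
assert c.calls[0]["agentId"] == 0
assert c.calls[1]["action"] == "RotateLeft"
```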
7. Context and Significance
AI2-THOR’s comprehensive scene library, tightly coupled Python/Unity stack, and flexible agent/action models have enabled a range of research in deep RL, visual reasoning, navigation, and embodied cognition (Kolve et al., 2017). Its out-of-the-box benchmarks (e.g., ALFRED, RoomR, IQA, AudioNav) and open extensibility have established it as a major platform for reproducible and scalable embodied AI experimentation. The inclusion of procedural scene generation (ProcTHOR-10K) and standard interfaces for RL frameworks (AllenAct, Gym) further supports large-scale, statistically robust evaluation and rapid prototyping. The design choices and known limitations suggest active areas of development, particularly regarding multi-agent support, real-time performance scaling, and sim-to-real transfer capabilities.