AI2 2.0 Simulation Environment
- AI2-THOR is an interactive 3D simulation environment featuring photo-realistic indoor scenes for navigation, object interaction, and embodied AI research.
- The platform integrates a Unity-based physics engine with a Python API to support extensive sensor modalities and customizable scene configurations.
- Its modular and extensible design facilitates research in deep reinforcement learning, imitation learning, and complex multi-agent tasks.
AI2 2.0 Simulation Environment (AI2-THOR) is an interactive 3D simulation platform designed for research in visual AI, providing agents with photo-realistic indoor environments for navigation and object interaction. The system supports a wide range of research areas, including deep reinforcement learning, imitation learning, visual question answering, and embodied AI, with extensibility for custom scenes, tasks, and integration into machine learning pipelines. The environment combines a Python API and a Unity-based physics simulation with extensive sensor modalities and a comprehensive catalog of interactive objects and scenes (Kolve et al., 2017).
1. System Design and Architecture
AI2-THOR integrates a physics-based 3D Unity backend (Unity 2020.3 LTS, NVIDIA PhysX 3.4) with a Python frontend distributed as the ai2thor pip package. Communication is mediated locally via HTTP/ZeroMQ JSON messaging between the Python controller process and the Unity environment. For large-scale and headless deployments, Unity’s batch renderer and xvfb are supported on Linux.
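An action crossing this boundary travels as a small JSON document; the payload below is illustrative only (the field names mirror the `step()` parameters used later in this article, and the actual wire schema is internal to ai2thor):

```python
import json

# Illustrative controller -> Unity action payload; the real wire schema is
# internal to ai2thor, so the exact field set here is an assumption.
msg = {"action": "MoveAhead", "moveMagnitude": 0.25, "agentId": 0}
payload = json.dumps(msg)

decoded = json.loads(payload)
assert decoded["action"] == "MoveAhead"
```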
The environment organizes its simulation assets into four major scene collections, each tailored to distinct experimental needs:
| Collection | #Scenes | Purpose |
|---|---|---|
| iTHOR | 120 | Hand-modeled common rooms (kitchen, etc.) |
| RoboTHOR | 89 | Sim2real transfer, lab-like apartments |
| ProcTHOR-10K | 10,000 | Large-scale procedurally-generated homes |
| ArchitecTHOR | 10 | Evaluation (5 validation, 5 test) |
Scenes contain a total of approximately 3,578 interactive object models across 117+ categories (e.g., mugs, toasters, doors). Internal physics operates at a fixed 1/90 s time step, using convex mesh colliders and continuous collision detection; gravity uses Unity's default of −9.81 m/s² along the y-axis, with default PhysX friction and restitution. Object interactions (e.g., open, cook, slice) employ Unity components that orchestrate kinematic state changes and animation.
2. Agent Interface: Sensing and Actuation
AI2-THOR agents interact through a multi-modal sensor suite (per camera):
- RGB: Configurable image resolution (default 300×300, 8-bit per channel)
- Depth: 16-bit float per-pixel depth maps (in meters)
- Semantic Segmentation: Per-pixel class identifier (0 = background)
- Instance Segmentation: Per-pixel unique object identifier
- Surface Normals: Per-pixel surface orientation (XYZ float)
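A small helper can gather whichever of these modalities were enabled at initialization; the function below is a sketch, and the mock class stands in for a real ai2thor `Event`, which exposes the attribute names listed above:

```python
import numpy as np

def collect_frames(event):
    """Gather per-camera sensor outputs from an AI2-THOR-style Event.

    Any modality not enabled at Controller initialization is simply
    missing from the event, so it is returned as None here."""
    return {
        "rgb": getattr(event, "frame", None),                         # HxWx3 uint8
        "depth": getattr(event, "depth_frame", None),                 # HxW float, meters
        "semantic": getattr(event, "semantic_segmentation_frame", None),
        "instance": getattr(event, "instance_segmentation_frame", None),
    }

# Mock event for illustration only; a real Event comes from Controller.step().
class MockEvent:
    frame = np.zeros((300, 300, 3), dtype=np.uint8)
    depth_frame = np.ones((300, 300), dtype=np.float32)

frames = collect_frames(MockEvent())
assert frames["rgb"].shape == (300, 300, 3)
assert frames["semantic"] is None  # modality not enabled on the mock
```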
Action spaces are segmented as follows:
- Discrete Navigation: `MoveAhead` (default 0.25 m), `Rotate{Left,Right}` (default 90°), `Look{Up,Down}` (default 30°)
- Continuous Navigation: `Move(float)`, `Rotate(float)`, `Teleport(x, y, z, yaw, pitch)`
- Arm Manipulation (ManipulaTHOR, StretchRE1): via IK-joint targeting and gripper control, restricted to objects within reach and line-of-sight
- Abstracted Interactions (LoCoBot, Abstract, Drone): `Pickup`, `Open`, `Place`, `Drop`, `Push`, `Throw`, allowed for objects within 1 m and the field of view
Agent state and scene context are provided through a structured state representation after every action (an Event), containing:
- `event.frame`: RGB image
- `event.depth_frame`: Depth map
- `event.instance_segmentation_frame`: Instance mask
- `event.semantic_segmentation_frame`: Class mask
- `event.third_party_normals`: Surface normals
- `event.metadata`: Dict covering all visible, open, and moving objects, their transforms, the agent's own pose and action status, and the current scene identity/geometry
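The metadata dict can be queried directly for task logic; the sketch below filters visible objects by type (the sample metadata is illustrative, showing only the fields the function uses):

```python
def visible_object_ids(metadata, object_type=None):
    """Return objectIds of currently visible objects in event.metadata,
    optionally restricted to one objectType."""
    return [
        obj["objectId"]
        for obj in metadata["objects"]
        if obj.get("visible")
        and (object_type is None or obj.get("objectType") == object_type)
    ]

# Hand-written sample metadata, not captured from a real simulator run.
sample = {"objects": [
    {"objectId": "Mug|+01|+00", "objectType": "Mug", "visible": True},
    {"objectId": "Toaster|+02|+00", "objectType": "Toaster", "visible": False},
]}
assert visible_object_ids(sample) == ["Mug|+01|+00"]
assert visible_object_ids(sample, "Toaster") == []
```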
3. API Workflow and Configuration
Installation is available via PyPI (pip install ai2thor). A basic experiment loop consists of initializing a Controller object with custom scene, grid, and rendering options, invoking reset() for an episode start, and repeatedly calling step():
```python
from ai2thor.controller import Controller

c = Controller(
    scene='FloorPlan1_physics',
    gridSize=0.25,
    width=300,
    height=300,
    renderDepthImage=True,
    renderClassImage=True,
    renderInstanceSegmentation=True,
)
ev = c.reset()
ev = c.step(dict(action='MoveAhead', moveMagnitude=0.25))
rgb = ev.frame
depth = ev.depth_frame
```
Scene and object manipulations (addition or removal) are supported at runtime using step actions, e.g.:
```python
c.step(dict(action='AddObject', name='Apple', objectId='apple_01',
            position={'x': 1, 'y': 0, 'z': 1}))
c.step(dict(action='RemoveObject', objectId='apple_01'))
```
Configuration controls include randomSeed, lightingVariation, randomizeMaterials for domain adaptation, and custom timeouts or reward parameters for RL tasks (with AllenAct integration).
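For reproducible domain randomization, one common pattern is to derive a per-episode seed from a single experiment-level `randomSeed` and pass it at each reset; the helper below is our own convention, not part of the ai2thor API:

```python
def episode_seed(base_seed, episode_idx):
    """Derive a reproducible per-episode seed from one experiment-level
    base seed, so material/lighting randomization can be replayed.

    Simple integer mixing; the multiplier is an arbitrary large prime."""
    return (base_seed * 1_000_003 + episode_idx) % (2**31)

s0 = episode_seed(42, 0)
assert episode_seed(42, 0) == s0   # deterministic for a given episode
assert episode_seed(42, 1) != s0   # varies across episodes
```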
4. Supported Tasks, Benchmarks, and Metrics
AI2-THOR ships with preconfigured tasks supporting:
- ObjectNav / Target-Driven Navigation: Navigate to a visually specified target
- Audio-Visual Navigation: Locate sounding object
- Point-Goal Navigation: Move to X,Y,Z coordinate
- Interactive Question Answering (IQA)
- Instruction Following: ALFRED, TEACh, DialFRED
- Pick-and-Place, Rearrangement (RoomR)
- Multi-agent collaboration
- Arm-Based Manipulation (e.g., grasp, open, slice, pour with ManipulaTHOR)
Reward and evaluation metrics include:
- Distance-based reward: $r_t = d_{t-1} - d_t$ (the per-step reduction in distance to the goal), with a success bonus added when $d_t < \varepsilon$
- Success-weighted by inverse Path Length (SPL):
  $$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \frac{\ell_i}{\max(p_i, \ell_i)}$$
  where $S_i \in \{0,1\}$ marks episode success, $p_i$ = path length taken, $\ell_i$ = shortest path length
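SPL can be computed directly from per-episode logs; a minimal helper (function name and argument layout are our own):

```python
def spl(successes, shortest, taken):
    """Success-weighted by inverse Path Length (Anderson et al., 2018).

    successes: per-episode 0/1 outcomes S_i
    shortest:  shortest-path lengths l_i
    taken:     actually traversed path lengths p_i"""
    n = len(successes)
    return sum(
        s * (l / max(p, l)) for s, l, p in zip(successes, shortest, taken)
    ) / n

# Episode 0: success while taking twice the shortest path; episode 1: failure.
assert spl([1, 0], [2.0, 3.0], [4.0, 5.0]) == 0.25
assert spl([1], [2.0], [2.0]) == 1.0  # perfect path, full credit
```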
Performance figures observed (e.g., ObjectNav, 28 processes, 2× RTX 2080):
- ProcTHOR-10K scenes: 145–179 FPS
- API step latency: 3–7ms (single process)
- ObjectNav in iTHOR: success rate 60–70% (SPL 0.45)
5. Extensibility and Customization
AI2-THOR’s architecture facilitates the creation of new environments and research tasks:
- Adding Scenes/Objects:
  - Import FBX models in the Unity Editor
  - Create a prefab with collider, rigidbody, and "Interactable" Unity component
  - Insert into the scene file (`.unity`)
  - Update scene bounds/objects JSON as needed
  - Rebuild asset bundles and restart the Controller
- Custom Tasks/Rewards: Implemented by subclassing `ai2thor.controller.Event` or an AllenAct `Task`, overriding `step()` to parse `ev.metadata`, and integrating with RL libraries (e.g., stable-baselines3, RLlib) via Gym `Env` wrappers.
- Gym-Style Integration:
```python
from ai2thor.contrib.gym import make

env = make('AI2-THOR-Task-v0', config=dict(scene='FloorPlan1_physics', ...))
```
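When the contrib wrapper does not fit a custom task, a minimal Gym-style wrapper can be hand-rolled around the Controller. The sketch below uses a stub controller so it runs stand-alone; the action set, reward, and termination logic are placeholders to be replaced with task-specific code driven by `ev.metadata`:

```python
class ThorGymEnv:
    """Minimal Gym-style wrapper sketch around an AI2-THOR Controller.

    `controller` is anything exposing reset()/step(dict) and returning
    an event with .frame and .metadata; the real ai2thor Controller
    fits, and the stub below stands in for it here."""
    ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight"]

    def __init__(self, controller):
        self.c = controller

    def reset(self):
        return self.c.reset().frame

    def step(self, action_idx):
        ev = self.c.step(dict(action=self.ACTIONS[action_idx]))
        reward = 0.0   # placeholder: compute from ev.metadata for a real task
        done = False   # placeholder: task-specific termination condition
        return ev.frame, reward, done, ev.metadata

# Stubs for illustration; a real run uses ai2thor.controller.Controller.
class _StubEvent:
    frame = "rgb"
    metadata = {"lastActionSuccess": True}

class _StubController:
    def reset(self): return _StubEvent()
    def step(self, action): return _StubEvent()

env = ThorGymEnv(_StubController())
obs = env.reset()
obs, r, done, info = env.step(0)
assert info["lastActionSuccess"] and r == 0.0
```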
6. Deployment Requirements and Limitations
Minimum software dependencies include Python 3.6+, ai2thor ≥ 2.8.0, numpy, Pillow, Unity 2020.3 LTS (for custom asset builds), and optionally AllenAct for RL tasks.
Hardware recommendations:
| Resource | Recommended Minimum |
|---|---|
| GPU | NVIDIA GTX 1080 Ti; 8GB VRAM |
| CPU | Quad-core 2.5 GHz |
| System RAM | 16GB |
| OS | Ubuntu 18.04/20.04, Windows 10, macOS |
On Linux, headless launching via:
```shell
sudo apt install xvfb libxi6 libgconf-2-4
pip install ai2thor
xvfb-run -s "-screen 0 1280x720x24" python your_script.py
```
Known constraints:
- Scene reset incurs 50–100 ms overhead; batching advisable.
- No native simulation of deformable objects (cloth, complex fluids).
- Continuous arm manipulation requires tuning of collision margins for stability.
- Multi-agent synchronization must be handled in user logic; asynchronous multi-agent stepping is not natively supported.
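Given the last constraint, multi-agent experiments typically advance all agents in lock-step from user code, passing `agentId` with each step (an ai2thor action parameter for controllers initialized with `agentCount` > 1). The stub below stands in for the Controller so the sketch runs stand-alone:

```python
def step_all_agents(controller, actions):
    """Advance every agent one action in round-robin lock-step, since
    asynchronous multi-agent stepping is not natively supported.

    actions: one action name per agent; returns the resulting events."""
    return [
        controller.step(dict(action=a, agentId=i))
        for i, a in enumerate(actions)
    ]

# Stub standing in for an ai2thor Controller with agentCount=2.
class _Stub:
    def __init__(self): self.calls = []
    def step(self, action):
        self.calls.append(action)
        return action

c = _Stub()
step_all_agents(c, ["MoveAhead", "RotateLeft"])
assert c.calls[0]["agentId"] == 0
assert c.calls[1]["action"] == "RotateLeft"
```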
7. Context and Significance
AI2-THOR’s comprehensive scene library, tightly coupled Python/Unity stack, and flexible agent/action models have enabled a range of research in deep RL, visual reasoning, navigation, and embodied cognition (Kolve et al., 2017). Its out-of-the-box benchmarks (e.g., ALFRED, RoomR, IQA, AudioNav) and open extensibility have established it as a major platform for reproducible and scalable embodied AI experimentation. The inclusion of procedural scene generation (ProcTHOR-10K) and standard interfaces for RL frameworks (AllenAct, Gym) further supports large-scale, statistically robust evaluation and rapid prototyping. The design choices and known limitations suggest active areas of development, particularly regarding multi-agent support, real-time performance scaling, and sim-to-real transfer capabilities.