Scene Representation Transformer (SRT)
- The Scene Representation Transformer (SRT) is a framework that converts multi-view and multi-agent sensory data into geometry-aware, permutation-invariant scene representations.
- It employs advanced self-attention mechanisms, including relative pose encoding and localized attention, to fuse unordered sets of image patches and agent histories efficiently.
- Variants like LVT, OSRT, and ASTRA extend the core design for scalable 3D reconstruction and trajectory forecasting, achieving significant improvements in fidelity and sample efficiency.
The Scene Representation Transformer (SRT) and its descendants constitute a foundational methodological class for learning abstract, geometry-aware, and interaction-aware scene representations directly from sensory data—most commonly for vision (novel view synthesis and 3D reconstruction) and autonomous agent prediction (trajectory forecasting, collaborative decision-making). The architecture generalizes conventional vision transformer designs to operate on unordered sets or graphs of images, agents, or grid-based scene elements, integrating global and local spatial information for downstream rendering, planning, or prediction at interactive rates and with rigorous equivariance properties (Sajjadi et al., 2021, Safin et al., 2023, Imtiaz et al., 29 Sep 2025, Liu et al., 2022, Teeti et al., 16 Jan 2025, Sajjadi et al., 2022, Hu et al., 2024, Ngiam et al., 2021).
1. Core Architectural Principles
Scene Representation Transformers process sets (or graphs) of input entities—images, agents, or grid cells—without explicit geometric priors or volumetric raymarching. In canonical SRT for vision (Sajjadi et al., 2021), a shared CNN extracts patchwise embeddings from each input image (posed or unposed). These patch tokens are fused via multi-head self-attention into a permutation-invariant set-latent scene representation (SLSR), generalizing the standard Vision Transformer (ViT) to unordered multi-view sets. When camera poses are available, per-patch positional encodings are appended, either globally (origin and direction in a reference frame) or, in advanced variants, using pairwise relative pose embeddings (Safin et al., 2023, Imtiaz et al., 29 Sep 2025).
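As a toy illustration of the set-latent fusion step, the NumPy sketch below runs a single self-attention layer over an unordered pool of patch tokens from all views. Permuting the input set merely permutes the output tokens, so the resulting token set is order-independent. Shapes, weight names, and the single-head form are illustrative simplifications, not SRT's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """One (single-head) self-attention layer over an unordered token set (n, d)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return att @ v

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(12, d))            # patch tokens pooled from all views
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)     # fused set-latent tokens, (12, d)
```

Because attention treats the tokens as a set, reordering the input views reorders the latent tokens correspondingly, which is the permutation property the SLSR relies on.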
For novel view synthesis, a Transformer decoder treats query rays as set-valued entities; each decoder block performs cross-attention from ray embedding queries to the scene latent tokens. Output heads parameterize color values directly (light-field) or, in variants like LVT, decode distributions over 3D splats for Gaussians (Imtiaz et al., 29 Sep 2025).
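The light-field decoding path can be sketched as cross-attention from ray-embedding queries to the scene tokens, followed by a small color head. The `Wrgb` head, the sigmoid output mapping, and all dimensions here are illustrative assumptions, not the published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_rays(ray_emb, scene_tokens, Wq, Wk, Wv, Wrgb):
    """Cross-attend ray queries to the set-latent scene representation,
    then map each attended feature to an RGB value (light-field head)."""
    q = ray_emb @ Wq                              # one query per ray
    k, v = scene_tokens @ Wk, scene_tokens @ Wv
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    feat = att @ v
    return 1.0 / (1.0 + np.exp(-(feat @ Wrgb)))   # sigmoid -> RGB in (0, 1)

rng = np.random.default_rng(1)
d = 16
scene = rng.normal(size=(24, d))   # SLSR tokens from the encoder
rays = rng.normal(size=(5, d))     # encoded (origin, direction) per query ray
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wrgb = rng.normal(size=(d, 3))
rgb = decode_rays(rays, scene, Wq, Wk, Wv, Wrgb)  # (5, 3) colors
```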
In agent-centric domains (e.g., autonomous driving, pedestrian forecasting), SRT operates analogously on sets or graphs of agent histories, road features, or interaction maps, employing axis-factored multi-head attention (Ngiam et al., 2021), graph-aware biases (Teeti et al., 16 Jan 2025), and dedicated cross-modality blocks (Liu et al., 2022).
2. Positional Encoding and Geometric Invariance
Positional encoding is critical for enabling spatial reasoning in SRT. The original model (Sajjadi et al., 2021) concatenated sinusoidal ray encodings of camera origin and direction in a chosen reference frame to each patch embedding, introducing explicit dependence on a global scene coordinate system.
This approach, however, proved non-invariant under reference-frame changes, causing instability for large-scale or sweeping scenes (Safin et al., 2023). Relative Pose Attention (RePAST) (Safin et al., 2023) resolves this by injecting pairwise relative camera pose information directly into every attention calculation. Letting T_i and T_j denote the SE(3) extrinsics for views i and j, the relative pose T_ij = T_i^-1 T_j is decomposed into rotation and translation, typically as a unit quaternion and a 3D vector. Sinusoidal positional encodings are computed over T_ij and projected via MLP to produce local context vectors. This ensures coordinate-system invariance and stabilizes multi-view fusion for arbitrarily large or unbounded scenes (Safin et al., 2023, Imtiaz et al., 29 Sep 2025).
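The relative-pose computation and its frame invariance can be sketched as follows. For brevity this sketch sinusoidally encodes the flattened rotation and the translation rather than a quaternion, and the `se3` helper and frequency count are illustrative assumptions:

```python
import numpy as np

def se3(rot_z, t):
    """Build a 4x4 SE(3) matrix from a z-axis rotation angle and a translation."""
    c, s = np.cos(rot_z), np.sin(rot_z)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = t
    return T

def relative_pose(T_i, T_j):
    """Relative transform between views i and j: T_ij = T_i^-1 T_j."""
    return np.linalg.inv(T_i) @ T_j

def sin_encode(x, n_freq=4):
    """Sinusoidal positional encoding of a flat feature vector."""
    freqs = 2.0 ** np.arange(n_freq)
    ang = x[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).ravel()

T_i = se3(0.3, [1.0, 0.0, 2.0])
T_j = se3(1.1, [0.0, -1.0, 0.5])
T_rel = relative_pose(T_i, T_j)
code = sin_encode(np.r_[T_rel[:3, :3].ravel(), T_rel[:3, 3]])

# Re-expressing both extrinsics in a different world frame G leaves the
# relative pose (and hence the attention context) unchanged.
G = se3(-0.7, [5.0, 3.0, -2.0])
T_rel2 = relative_pose(G @ T_i, G @ T_j)
```

The cancellation `(G T_i)^-1 (G T_j) = T_i^-1 T_j` is exactly why pairwise relative encodings are invariant to the choice of global reference frame.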
In LVT (Imtiaz et al., 29 Sep 2025), relative pose encoding is used similarly, but restricted to a local window of nearest-view neighbors, enabling scalable, linearly complex attention.
3. Computational Scaling: Local Attention and Linear Complexity
Standard transformer self-attention scales quadratically in the number of tokens and number of views, becoming intractable for high-resolution, large-scale scenes. LVT (Imtiaz et al., 29 Sep 2025) circumvents this via neighborhood-restricted attention: each view attends only to tokens from its k nearest neighboring views (plus itself) at each layer. This reduces attention cost to O(N k) per layer, linear in the number of views N, making interactive, high-resolution entire-scene reconstruction feasible.
LVT alternates neighborhood-restricted multi-head attention and feedforward layers in deep stacks, gradually propagating global context despite the locality of each operation.
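A minimal sketch of neighborhood-restricted attention, assuming per-view token blocks and Euclidean camera positions for the neighbor lookup; single-head form and all names are illustrative, not LVT's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def neighbor_attention(view_tokens, positions, k, Wq, Wk, Wv):
    """Each view's tokens attend only to the tokens of its k nearest views
    (plus itself), so cost per layer grows linearly in the number of views."""
    n = len(view_tokens)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    out = []
    for i in range(n):
        nbrs = np.argsort(dist[i])[: k + 1]       # self is at distance 0
        keys = np.concatenate([view_tokens[j] for j in nbrs])
        q = view_tokens[i] @ Wq
        kk, v = keys @ Wk, keys @ Wv
        att = softmax(q @ kk.T / np.sqrt(kk.shape[-1]), axis=-1)
        out.append(att @ v)
    return out

rng = np.random.default_rng(3)
d, p, N, k = 8, 6, 10, 2
views = [rng.normal(size=(p, d)) for _ in range(N)]   # p tokens per view
pos = rng.normal(size=(N, 3))                          # camera positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = neighbor_attention(views, pos, k, Wq, Wk, Wv)
```

Each query view touches only (k+1)·p keys rather than N·p, which is the source of the linear scaling; stacking such layers lets context propagate beyond the immediate neighborhood.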
In trajectory and decision-making domains, agent-centric SRTs employ similar axis-factored attention or restrict graph edges by motion proximity for efficient encoding of local social interactions and spatial constraints (Teeti et al., 16 Jan 2025, Hu et al., 2024, Ngiam et al., 2021).
4. Scene Representation Decoding
After fusion, the decoder predicts outputs for specified downstream tasks:
- Novel View Synthesis (Vision):
- The decoded representation parameterizes RGB values for arbitrary query rays by direct (light-field) mapping (Sajjadi et al., 2021, Safin et al., 2023).
- In LVT, tokens are "unpatchified" to pixel regions and output parameters for 3D Gaussian splats: means, covariances, SH color, and SH opacity coefficients (Imtiaz et al., 29 Sep 2025). Rendering aggregates splats via a differentiable front-to-back rasterizer.
- OSRT (Sajjadi et al., 2022) and object-centric models use Slot Attention to decompose the scene latent into object slots, which are then pooled via a Slot Mixer transformer per ray for compositional rendering.
- Agent Behavior Prediction:
- SRTs in trajectory prediction and decision making (ASTRA (Teeti et al., 16 Jan 2025), SRT for RL (Liu et al., 2022), GITSR (Hu et al., 2024)) decode per-agent future trajectories or joint actions by feeding agent–scene fused embeddings into MLP or RL heads. Stochastic variants encapsulate multimodal possible futures via CVAEs.
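The Slot Attention decomposition used by OSRT above can be sketched in simplified form. This omits the full method's GRU/MLP updates and learned projections, and the slot count and iteration count are placeholder values:

```python
import numpy as np

def slot_attention(inputs, n_slots=3, n_iter=3, seed=0):
    """Minimal Slot Attention: slots compete for input tokens via a softmax
    normalized over the slot axis, then update as attention-weighted means."""
    rng = np.random.default_rng(seed)
    d = inputs.shape[-1]
    slots = rng.normal(size=(n_slots, d))
    for _ in range(n_iter):
        logits = slots @ inputs.T / np.sqrt(d)           # (slots, tokens)
        attn = np.exp(logits - logits.max(axis=0))
        attn = attn / attn.sum(axis=0, keepdims=True)    # compete over slots
        attn = attn / attn.sum(axis=1, keepdims=True)    # weighted mean per slot
        slots = attn @ inputs
    return slots

rng = np.random.default_rng(4)
tokens = rng.normal(size=(20, 5))      # SLSR tokens to decompose into objects
slots = slot_attention(tokens)
```

The slot-axis softmax is the key design choice: each input token's attention mass is shared across slots, forcing slots to specialize on disjoint parts of the scene.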
5. Training Objectives and Sample Efficiency
Training is end-to-end via per-pixel or per-agent supervision:
- Novel view synthesis is supervised with L2, PSNR, and perceptual losses, sometimes with regularizers for view-dependent opacity (Imtiaz et al., 29 Sep 2025, Sajjadi et al., 2021).
- Decision-making models employ reward-augmented RL objectives, typically via Soft Actor-Critic (SAC) or DQN heads wrapped around Transformer-encoded scene/agent latents (Liu et al., 2022, Hu et al., 2024).
- SLT (Sequential Latent Transformer) (Liu et al., 2022) imposes self-supervised future-consistency losses to drive latent representations to be maximally predictive over future steps, substantially reducing exploration space and accelerating convergence.
- CVAE heads (ASTRA (Teeti et al., 16 Jan 2025)) enable stochastic trajectory prediction; weighted loss schedules upweight start/end errors to prevent drift.
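The endpoint-upweighted trajectory loss described for ASTRA can be sketched as follows; the weight value and uniform interior weighting are illustrative placeholders, not the paper's actual schedule:

```python
import numpy as np

def weighted_traj_loss(pred, target, endpoint_weight=3.0):
    """Per-step L2 trajectory loss with the first and last steps upweighted,
    discouraging drift at the start and end of the prediction horizon."""
    steps = pred.shape[0]
    w = np.ones(steps)
    w[0] = w[-1] = endpoint_weight          # upweight start/end errors
    per_step = np.linalg.norm(pred - target, axis=-1)
    return float((w * per_step).sum() / w.sum())
```

With this weighting, the same positional error costs more at the trajectory endpoints than mid-horizon, which is the drift-prevention behavior the text describes.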
SRT-based models consistently demonstrate marked improvements in sample efficiency, output fidelity, and generalizability across diverse domains. LVT achieves 3.54 dB PSNR improvement over global attention baselines at linear computational cost (Imtiaz et al., 29 Sep 2025); agent-centric SRTs reduce collision rates and task completion times in urban scenarios (Liu et al., 2022, Hu et al., 2024).
6. Variants and Extensions
Numerous SRT variants adapt the core principle to different modalities and requirements:
| Variant | Domain | Key Innovation |
|---|---|---|
| SRT (Sajjadi et al., 2021) | Novel view synthesis | Set-latent scene encoding, global pose, light-field decoder |
| RePAST (Safin et al., 2023) | View synthesis | Relative pose attention, invariance to global frame |
| LVT (Imtiaz et al., 29 Sep 2025) | Large-scale 3D reconstr. | Neighborhood-restricted attention, 3D Gaussian splats |
| OSRT (Sajjadi et al., 2022) | Object-centrism, comp. | Slot Attention + Slot Mixer compositionality |
| Scene Transformer (Ngiam et al., 2021) | Joint agent motion | Axis-factored (agent/time), mask-based task unification |
| ASTRA (Teeti et al., 16 Jan 2025) | Pedestrian forecasting | U-Net feature extractor, graph bias, stochastic CVAE |
| SRT-RL (Liu et al., 2022) | Driving/policy learning | Multi-stage encoder, SLT future distillation, SAC/RL head |
| GITSR (Hu et al., 2024) | Multi-vehicle RL | Occupation grids + GNN, feasible region via MHA |
These variants demonstrate that the SRT paradigm is readily extensible: local attention schemes for scalable 3D reconstruction, graph-aware modules for social interaction, slot-mixing decoders for compositionality, and multimodal heads for uncertainty.
7. Quantitative Benchmarks and Limitations
Empirical results corroborate SRT's superiority across multiple datasets and metrics:
- SRT achieves state-of-the-art PSNR, SSIM, and LPIPS in view synthesis tasks, robust to pose noise and missing data (Sajjadi et al., 2021, Safin et al., 2023).
- LVT yields high-fidelity renderings in large-scale scenes at linear inference time, outperforming global-attention and per-scene optimized 3DGS pipelines (Imtiaz et al., 29 Sep 2025).
- OSRT provides nearly perfect slot assignment consistency across views and renders at a substantial speedup over volumetric object-centric methods (Sajjadi et al., 2022).
- ASTRA and GITSR outperform baselines in trajectory prediction and multi-agent decision tasks, with substantial gains in data efficiency, safety, and behavioral diversity (Teeti et al., 16 Jan 2025, Hu et al., 2024).
Limitations include residual blurriness from L2 losses, lack of explicit geometry, and occasionally suboptimal compositional boundaries. Future directions entail integrating explicit geometric inductive biases, unsupervised camera calibration, dynamic scene handling, and symbolic conditioning (Safin et al., 2023, Imtiaz et al., 29 Sep 2025, Sajjadi et al., 2022).
Scene Representation Transformers underlie a unifying architectural class that enables scalable, permutation-invariant, and geometry-aware reasoning for rendering, prediction, and planning. With ongoing advances in local attention, relative-pose injection, object-centricity, and multimodal stochasticity, SRT variants are increasingly foundational for next-generation vision, robotics, and agent-centric intelligence (Sajjadi et al., 2021, Safin et al., 2023, Imtiaz et al., 29 Sep 2025, Sajjadi et al., 2022, Hu et al., 2024, Teeti et al., 16 Jan 2025, Ngiam et al., 2021).