Scene Representation Transformer (SRT)
- The Scene Representation Transformer (SRT) is a framework that converts multi-view and multi-agent sensory data into geometry-aware, permutation-invariant scene representations.
- It employs advanced self-attention mechanisms, including relative pose encoding and localized attention, to fuse unordered sets of image patches and agent histories efficiently.
- Variants like LVT, OSRT, and ASTRA extend the core design for scalable 3D reconstruction and trajectory forecasting, achieving significant improvements in fidelity and sample efficiency.
The Scene Representation Transformer (SRT) and its descendants constitute a foundational methodological class for learning abstract, geometry-aware, and interaction-aware scene representations directly from sensory data—most commonly for vision (novel view synthesis and 3D reconstruction) and autonomous agent prediction (trajectory forecasting, collaborative decision-making). The architecture generalizes conventional vision transformer designs to operate on unordered sets or graphs of images, agents, or grid-based scene elements, integrating global and local spatial information for downstream rendering, planning, or prediction at interactive rates and with rigorous equivariance properties (Sajjadi et al., 2021, Safin et al., 2023, Imtiaz et al., 29 Sep 2025, Liu et al., 2022, Teeti et al., 16 Jan 2025, Sajjadi et al., 2022, Hu et al., 2024, Ngiam et al., 2021).
1. Core Architectural Principles
Scene Representation Transformers process sets (or graphs) of input entities—images, agents, or grid cells—without explicit geometric priors or volumetric raymarching. In canonical SRT for vision (Sajjadi et al., 2021), a shared CNN extracts patchwise embeddings from each input image (posed or unposed). These patch tokens are fused via multi-head self-attention into a permutation-invariant set-latent scene representation (SLSR), generalizing the standard Vision Transformer (ViT) to unordered multi-view sets. When camera poses are available, per-patch positional encodings are appended, either globally (origin and direction in a reference frame) or, in advanced variants, using pairwise relative pose embeddings (Safin et al., 2023, Imtiaz et al., 29 Sep 2025).
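As a toy illustration of the set-latent fusion step, the NumPy sketch below runs a single self-attention layer over an unordered pool of patch tokens from all views. Permuting the input set merely permutes the output tokens, so the resulting token set is order-independent. Shapes, weight names, and the single-head form are illustrative simplifications, not SRT's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """One (single-head) self-attention layer over an unordered token set (n, d)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return att @ v

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(12, d))            # patch tokens pooled from all views
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)     # fused set-latent tokens, (12, d)
```

Because attention treats the tokens as a set, reordering the input views reorders the latent tokens correspondingly, which is the permutation property the SLSR relies on.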
For novel view synthesis, a Transformer decoder treats query rays as set-valued entities; each decoder block performs cross-attention from ray embedding queries to the scene latent tokens. Output heads parameterize color values directly (light-field) or, in variants like LVT, decode distributions over 3D splats for Gaussians (Imtiaz et al., 29 Sep 2025).
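The light-field decoding path can be sketched as cross-attention from ray-embedding queries to the scene tokens, followed by a small color head. The `Wrgb` head, the sigmoid output mapping, and all dimensions here are illustrative assumptions, not the published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_rays(ray_emb, scene_tokens, Wq, Wk, Wv, Wrgb):
    """Cross-attend ray queries to the set-latent scene representation,
    then map each attended feature to an RGB value (light-field head)."""
    q = ray_emb @ Wq                              # one query per ray
    k, v = scene_tokens @ Wk, scene_tokens @ Wv
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    feat = att @ v
    return 1.0 / (1.0 + np.exp(-(feat @ Wrgb)))   # sigmoid -> RGB in (0, 1)

rng = np.random.default_rng(1)
d = 16
scene = rng.normal(size=(24, d))   # SLSR tokens from the encoder
rays = rng.normal(size=(5, d))     # encoded (origin, direction) per query ray
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wrgb = rng.normal(size=(d, 3))
rgb = decode_rays(rays, scene, Wq, Wk, Wv, Wrgb)  # (5, 3) colors
```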
In agent-centric domains (e.g., autonomous driving, pedestrian forecasting), SRT operates analogously on sets or graphs of agent histories, road features, or interaction maps, employing axis-factored multi-head attention (Ngiam et al., 2021), graph-aware biases (Teeti et al., 16 Jan 2025), and dedicated cross-modality blocks (Liu et al., 2022).
2. Positional Encoding and Geometric Invariance
Positional encoding is critical for enabling spatial reasoning in SRT. The original model (Sajjadi et al., 2021) concatenated sinusoidal ray encodings of camera origin and direction in a chosen reference frame to each patch embedding, introducing explicit dependence on a global scene coordinate system.
This approach, however, proved non-invariant under reference-frame changes, causing instability for large-scale or sweeping scenes (Safin et al., 2023). Relative Pose Attention (RePAST) (Safin et al., 2023) resolves this by injecting pairwise relative camera pose information directly into every attention calculation. Letting T_i and T_j denote the SE(3) extrinsics for views i and j, the relative pose T_ij = T_i^-1 T_j is decomposed into rotation and translation, typically as a unit quaternion and a 3D vector. Sinusoidal positional encodings are computed over T_ij and projected via MLP to produce local context vectors. This ensures coordinate-system invariance and stabilizes multi-view fusion for arbitrarily large or unbounded scenes (Safin et al., 2023, Imtiaz et al., 29 Sep 2025).
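The relative-pose computation and its frame invariance can be sketched as follows. For brevity this sketch sinusoidally encodes the flattened rotation and the translation rather than a quaternion, and the `se3` helper and frequency count are illustrative assumptions:

```python
import numpy as np

def se3(rot_z, t):
    """Build a 4x4 SE(3) matrix from a z-axis rotation angle and a translation."""
    c, s = np.cos(rot_z), np.sin(rot_z)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = t
    return T

def relative_pose(T_i, T_j):
    """Relative transform between views i and j: T_ij = T_i^-1 T_j."""
    return np.linalg.inv(T_i) @ T_j

def sin_encode(x, n_freq=4):
    """Sinusoidal positional encoding of a flat feature vector."""
    freqs = 2.0 ** np.arange(n_freq)
    ang = x[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).ravel()

T_i = se3(0.3, [1.0, 0.0, 2.0])
T_j = se3(1.1, [0.0, -1.0, 0.5])
T_rel = relative_pose(T_i, T_j)
code = sin_encode(np.r_[T_rel[:3, :3].ravel(), T_rel[:3, 3]])

# Re-expressing both extrinsics in a different world frame G leaves the
# relative pose (and hence the attention context) unchanged.
G = se3(-0.7, [5.0, 3.0, -2.0])
T_rel2 = relative_pose(G @ T_i, G @ T_j)
```

The cancellation `(G T_i)^-1 (G T_j) = T_i^-1 T_j` is exactly why pairwise relative encodings are invariant to the choice of global reference frame.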
In LVT (Imtiaz et al., 29 Sep 2025), relative pose encoding is used similarly, but restricted to a local window of nearest-view neighbors, enabling scalable, linearly complex attention.
3. Computational Scaling: Local Attention and Linear Complexity
Standard transformer self-attention scales quadratically in the number of tokens and number of views, becoming intractable for high-resolution, large-scale scenes. LVT (Imtiaz et al., 29 Sep 2025) circumvents this via neighborhood-restricted attention: each view attends only to tokens from its k nearest neighboring views (plus itself) at each layer. This reduces attention cost to O(N k) per layer, linear in the number of views N, making interactive, high-resolution entire-scene reconstruction feasible.
LVT alternates neighborhood-restricted multi-head attention and feedforward layers in deep stacks, gradually propagating global context despite the locality of each operation.
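A minimal sketch of neighborhood-restricted attention, assuming per-view token blocks and Euclidean camera positions for the neighbor lookup; single-head form and all names are illustrative, not LVT's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def neighbor_attention(view_tokens, positions, k, Wq, Wk, Wv):
    """Each view's tokens attend only to the tokens of its k nearest views
    (plus itself), so cost per layer grows linearly in the number of views."""
    n = len(view_tokens)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    out = []
    for i in range(n):
        nbrs = np.argsort(dist[i])[: k + 1]       # self is at distance 0
        keys = np.concatenate([view_tokens[j] for j in nbrs])
        q = view_tokens[i] @ Wq
        kk, v = keys @ Wk, keys @ Wv
        att = softmax(q @ kk.T / np.sqrt(kk.shape[-1]), axis=-1)
        out.append(att @ v)
    return out

rng = np.random.default_rng(3)
d, p, N, k = 8, 6, 10, 2
views = [rng.normal(size=(p, d)) for _ in range(N)]   # p tokens per view
pos = rng.normal(size=(N, 3))                          # camera positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = neighbor_attention(views, pos, k, Wq, Wk, Wv)
```

Each query view touches only (k+1)·p keys rather than N·p, which is the source of the linear scaling; stacking such layers lets context propagate beyond the immediate neighborhood.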
In trajectory and decision-making domains, agent-centric SRTs employ similar axis-factored attention or restrict graph edges by motion proximity for efficient encoding of local social interactions and spatial constraints (Teeti et al., 16 Jan 2025, Hu et al., 2024, Ngiam et al., 2021).
4. Scene Representation Decoding
After fusion, the decoder predicts outputs for specified downstream tasks:
- Novel View Synthesis (Vision):
- The decoded representation parameterizes RGB values for arbitrary query rays by direct (light-field) mapping (Sajjadi et al., 2021, Safin et al., 2023).
- In LVT, tokens are "unpatchified" to pixel regions and output parameters for 3D Gaussian splats: means, covariances, SH color, and SH opacity coefficients (Imtiaz et al., 29 Sep 2025). Rendering aggregates splats via a differentiable front-to-back rasterizer.
- OSRT (Sajjadi et al., 2022) and object-centric models use Slot Attention to decompose the scene latent into object slots, which are then pooled via a Slot Mixer transformer per ray for compositional rendering.
- Agent Behavior Prediction:
- SRTs in trajectory prediction and decision making (ASTRA (Teeti et al., 16 Jan 2025), SRT for RL (Liu et al., 2022), GITSR (Hu et al., 2024)) decode per-agent future trajectories or joint actions by feeding agent–scene fused embeddings into MLP or RL heads. Stochastic variants encapsulate multimodal possible futures via CVAEs.
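The Slot Attention decomposition used by OSRT above can be sketched in simplified form. This omits the full method's GRU/MLP updates and learned projections, and the slot count and iteration count are placeholder values:

```python
import numpy as np

def slot_attention(inputs, n_slots=3, n_iter=3, seed=0):
    """Minimal Slot Attention: slots compete for input tokens via a softmax
    normalized over the slot axis, then update as attention-weighted means."""
    rng = np.random.default_rng(seed)
    d = inputs.shape[-1]
    slots = rng.normal(size=(n_slots, d))
    for _ in range(n_iter):
        logits = slots @ inputs.T / np.sqrt(d)           # (slots, tokens)
        attn = np.exp(logits - logits.max(axis=0))
        attn = attn / attn.sum(axis=0, keepdims=True)    # compete over slots
        attn = attn / attn.sum(axis=1, keepdims=True)    # weighted mean per slot
        slots = attn @ inputs
    return slots

rng = np.random.default_rng(4)
tokens = rng.normal(size=(20, 5))      # SLSR tokens to decompose into objects
slots = slot_attention(tokens)
```

The slot-axis softmax is the key design choice: each input token's attention mass is shared across slots, forcing slots to specialize on disjoint parts of the scene.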
5. Training Objectives and Sample Efficiency
Training is end-to-end via per-pixel or per-agent supervision:
- Novel view synthesis is supervised with L2, PSNR, and perceptual losses, sometimes with regularizers for view-dependent opacity (Imtiaz et al., 29 Sep 2025, Sajjadi et al., 2021).
- Decision-making models employ reward-augmented RL objectives, typically via Soft Actor-Critic (SAC) or DQN heads wrapped around Transformer-encoded scene/agent latents (Liu et al., 2022, Hu et al., 2024).
- SLT (Sequential Latent Transformer) (Liu et al., 2022) imposes self-supervised future-consistency losses to drive latent representations to be maximally predictive over future steps, substantially reducing exploration space and accelerating convergence.
- CVAE heads (ASTRA (Teeti et al., 16 Jan 2025)) enable stochastic trajectory prediction; weighted loss schedules upweight start/end errors to prevent drift.
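The endpoint-upweighted trajectory loss described for ASTRA can be sketched as follows; the weight value and uniform interior weighting are illustrative placeholders, not the paper's actual schedule:

```python
import numpy as np

def weighted_traj_loss(pred, target, endpoint_weight=3.0):
    """Per-step L2 trajectory loss with the first and last steps upweighted,
    discouraging drift at the start and end of the prediction horizon."""
    steps = pred.shape[0]
    w = np.ones(steps)
    w[0] = w[-1] = endpoint_weight          # upweight start/end errors
    per_step = np.linalg.norm(pred - target, axis=-1)
    return float((w * per_step).sum() / w.sum())
```

With this weighting, the same positional error costs more at the trajectory endpoints than mid-horizon, which is the drift-prevention behavior the text describes.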
SRT-based models consistently demonstrate marked improvements in sample efficiency, output fidelity, and generalizability across diverse domains. LVT achieves 3.54 dB PSNR improvement over global attention baselines at linear computational cost (Imtiaz et al., 29 Sep 2025); agent-centric SRTs reduce collision rates and task completion times in urban scenarios (Liu et al., 2022, Hu et al., 2024).
6. Variants and Extensions
Numerous SRT variants adapt the core principle to different modalities and requirements:
| Variant | Domain | Key Innovation |
|---|---|---|
| SRT (Sajjadi et al., 2021) | Novel view synthesis | Set-latent scene encoding, global pose, light-field decoder |
| RePAST (Safin et al., 2023) | View synthesis | Relative pose attention, invariance to global frame |
| LVT (Imtiaz et al., 29 Sep 2025) | Large-scale 3D reconstr. | Neighborhood-restricted attention, 3D Gaussian splats |
| OSRT (Sajjadi et al., 2022) | Object-centrism, comp. | Slot Attention + Slot Mixer compositionality |
| Scene Transformer (Ngiam et al., 2021) | Joint agent motion | Axis-factored (agent/time), mask-based task unification |
| ASTRA (Teeti et al., 16 Jan 2025) | Pedestrian forecasting | U-Net feature extractor, graph bias, stochastic CVAE |
| SRT-RL (Liu et al., 2022) | Driving/policy learning | Multi-stage encoder, SLT future distillation, SAC/RL head |
| GITSR (Hu et al., 2024) | Multi-vehicle RL | Occupation grids + GNN, feasible region via MHA |
These variants demonstrate that the SRT paradigm is readily extensible: local attention schemes for scalable 3D reconstruction, graph-aware modules for social interaction, slot-mixing decoders for compositionality, and multimodal heads for uncertainty.
7. Quantitative Benchmarks and Limitations
Empirical results corroborate SRT's superiority across multiple datasets and metrics:
- SRT achieves state-of-the-art PSNR, SSIM, and LPIPS in view synthesis tasks, robust to pose noise and missing data (Sajjadi et al., 2021, Safin et al., 2023).
- LVT yields high-fidelity renderings in large-scale scenes at linear inference time, outperforming global-attention and per-scene optimized 3DGS pipelines (Imtiaz et al., 29 Sep 2025).
- OSRT provides nearly perfect slot assignment consistency across views and renders at a substantial speedup over volumetric object-centric methods (Sajjadi et al., 2022).
- ASTRA and GITSR outperform baselines in trajectory prediction and multi-agent decision tasks, with substantial gains in data efficiency, safety, and behavioral diversity (Teeti et al., 16 Jan 2025, Hu et al., 2024).
Limitations include residual blurriness from L2 losses, lack of explicit geometry, and occasionally suboptimal compositional boundaries. Future directions entail integrating explicit geometric inductive biases, unsupervised camera calibration, dynamic scene handling, and symbolic conditioning (Safin et al., 2023, Imtiaz et al., 29 Sep 2025, Sajjadi et al., 2022).
Scene Representation Transformers underlie a unifying architectural class that enables scalable, permutation-invariant, and geometry-aware reasoning for rendering, prediction, and planning. With ongoing advances in local attention, relative-pose injection, object-centricity, and multimodal stochasticity, SRT variants are increasingly foundational for next-generation vision, robotics, and agent-centric intelligence (Sajjadi et al., 2021, Safin et al., 2023, Imtiaz et al., 29 Sep 2025, Sajjadi et al., 2022, Hu et al., 2024, Teeti et al., 16 Jan 2025, Ngiam et al., 2021).