RIR-Former: Grid-Free RIR Reconstruction
- RIR-Former is a coordinate-guided Transformer architecture that reconstructs continuous room impulse responses using sinusoidal positional encoding and a segmented decoder.
- It leverages multi-band sinusoidal embeddings to inject spatial priors, enabling accurate interpolation across sparse and irregular microphone arrays.
- Experimental evaluations show RIR-Former achieves state-of-the-art NMSE and cosine distance metrics while drastically reducing inference time.
RIR-Former is a coordinate-guided Transformer architecture designed for continuous, grid-free reconstruction of room impulse responses (RIRs) at arbitrary spatial microphone locations, enabling robust interpolation across array geometries and configurations. Developed in response to the need for practical, dense RIR field estimation from sparse measurements, RIR-Former introduces a sinusoidal encoding module for position information and a segmented multi-branch decoder to improve modeling of temporal acoustic structure. It achieves state-of-the-art accuracy and inference speed on simulated arrays, substantially outperforming previous neural and physics-informed baselines in normalized mean square error and cosine distance under high missing-data rates (Xu et al., 2 Feb 2026).
1. Coordinate Encoding and Input Representation
Each microphone or spatial query location is represented by a three-dimensional coordinate $\mathbf{p} = (x, y, z)$. RIR-Former applies a multi-band sinusoidal embedding of the form
$$\gamma(p) = \big[\sin(2^{0}\pi p),\ \cos(2^{0}\pi p),\ \ldots,\ \sin(2^{B-1}\pi p),\ \cos(2^{B-1}\pi p)\big]$$
to each coordinate axis, with $B$ frequency bands yielding $2B$ features per axis, for a total of 108 positional encoding dimensions across the three axes (implying $B = 18$). This technique injects spatial priors that enable the network to capture both large-scale and fine-grained position dependencies, which is critical for reconstructing RIRs at positions unobserved during training.
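As a concrete illustration, here is a minimal NumPy sketch of such an encoding. The geometric frequency ladder is the standard convention and an assumption here; `num_bands=18` is inferred from the stated 108-dimensional total rather than given explicitly in the source.

```python
import numpy as np

def sinusoidal_embedding(coords, num_bands=18):
    """Multi-band sinusoidal position encoding (NeRF-style sketch).

    coords: (N, 3) array of microphone/query positions.
    Returns (N, 3 * 2 * num_bands) features; num_bands=18 matches the
    108-dimensional total described above.
    """
    freqs = 2.0 ** np.arange(num_bands)         # geometric frequency ladder (assumed)
    angles = coords[..., None] * freqs * np.pi  # (N, 3, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 3, 2*num_bands)
    return feats.reshape(coords.shape[0], -1)   # (N, 108) for the default
```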
The input token for each microphone is constructed by concatenating its position embedding with a learned acoustic embedding of dimension $D$ (typically 256), computed from the raw RIR waveform via a small MLP.
All input tokens are projected into a common $D$-dimensional space by an initial MLP before entering the Transformer layers (Xu et al., 2 Feb 2026).
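A shape-level sketch of this input pipeline follows. The function name and the use of a single linear projection are illustrative simplifications; the actual acoustic embedding and projection are learned MLPs.

```python
import numpy as np

def make_input_tokens(pos_emb, acoustic_emb, proj_weight):
    """Concatenate per-microphone position and acoustic embeddings,
    then project them to the shared model width D.

    pos_emb:      (N, 108) sinusoidal position features
    acoustic_emb: (N, 256) features computed from the raw RIR waveform
    proj_weight:  (108 + 256, D) weight of the initial projection
                  (bias and nonlinearity omitted for brevity)
    """
    tokens = np.concatenate([pos_emb, acoustic_emb], axis=-1)  # (N, 364)
    return tokens @ proj_weight                                # (N, D)
```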
2. Transformer Encoder Backbone
RIR-Former employs an encoder-only Transformer architecture without learned positional biases or cross-attention. The architecture comprises $L$ stacked layers, each consisting of multi-head self-attention with $H$ heads, hidden dimension $D$ (typically 256), and ReLU-activated feed-forward modules with inner dimension $4D$. Inputs are shaped as $N_{\text{obs}} \times D$, where $N_{\text{obs}}$ is the number of observed microphones.
After processing through the stacked self-attention and feed-forward blocks, each token is refined into a contextual code capturing both local acoustic information and geometric context among observed microphones.
RIR-Former performs self-attention exclusively among observed (i.e., known) points, omitting cross-attention mechanisms used in other transformer-based RIR estimators (e.g., Few-ShotRIR (Majumder et al., 2022)). This design is justified by the explicit sinusoidal coordinate encoding, which suffices to provide geometric relational information across microphone positions (Xu et al., 2 Feb 2026).
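The encoder's core operation can be sketched as plain multi-head self-attention over the observed tokens only. This is a single layer with tied query/key/value for brevity; real layers use separate learned projections, residual connections, and layer normalization.

```python
import numpy as np

def self_attention(tokens, num_heads=8):
    """Minimal multi-head self-attention among observed microphones
    (illustrative: Q = K = V = the head's slice of the tokens)."""
    n, d = tokens.shape
    d_h = d // num_heads
    out = np.empty_like(tokens)
    for h in range(num_heads):
        x = tokens[:, h * d_h:(h + 1) * d_h]      # (n, d_h) head slice
        scores = x @ x.T / np.sqrt(d_h)           # scaled dot-product scores
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[:, h * d_h:(h + 1) * d_h] = attn @ x
    return out
```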
3. Segmented Multi-Branch Decoder
To address the distinct acoustic regimes in an RIR (direct sound, early reflections, and late reverberation), RIR-Former decodes RIRs in temporally segmented branches. Each output RIR of length $T$ samples is split into $S$ segments, and for each segment a dedicated MLP head predicts the corresponding portion of the waveform. A lightweight residual denoising MLP further refines the concatenated waveform to mitigate artifacts. The model is supervised via a single mean squared error (MSE) objective over the entire RIR, although each segment-specific decoder is fine-tuned for 20 epochs after global training to balance errors across time (Xu et al., 2 Feb 2026).
This decoder design prevents the network from over-emphasizing the high-energy early parts of the RIR, producing more accurate reconstructions of both reverberation tails and reflection structure.
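A minimal sketch of the segmented decoding step is below. Linear heads stand in for the per-segment MLPs, and `S=4` is an assumed example value, not one stated in the source.

```python
import numpy as np

def segmented_decode(code, heads, rir_len, num_segments=4):
    """Decode a full-length RIR from one contextual code via
    segment-specific heads (a sketch; real heads are MLPs).

    code:  (D,) contextual code for one query position
    heads: list of num_segments matrices, each (D, rir_len // num_segments)
    """
    segments = [code @ heads[s] for s in range(num_segments)]  # each (seg_len,)
    return np.concatenate(segments)                            # (rir_len,)
```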
4. Learning and Optimization Protocol
Training employs the AdamW optimizer with a batch size of 8 arrays and a two-stage schedule: 200 epochs of main training followed by 20 epochs of branch-specific fine-tuning. During training, a variable masking schedule increases the fraction of missing microphones from 30% to 70% over the first 10 epochs, then holds it at 70%, forcing the model to learn robust spatial interpolation. Input RIRs are normalized per sample; missing observations are zeroed and masked out of the gradient computation.
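The masking curriculum described above reduces to a simple ramp-then-hold schedule:

```python
def missing_ratio(epoch, start=0.30, end=0.70, ramp_epochs=10):
    """Fraction of microphones hidden at a given epoch: linear ramp
    from 30% to 70% over the first 10 epochs, then held at 70%."""
    if epoch >= ramp_epochs:
        return end
    return start + (end - start) * epoch / ramp_epochs
```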
The training objective minimizes the average squared error over the reconstructed missing RIRs,
$$\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \left\lVert \hat{h}_i - h_i \right\rVert_2^2,$$
where $M$ is the number of masked (to-be-reconstructed) microphones per array, $h_i$ is the ground-truth RIR, and $\hat{h}_i$ its reconstruction (Xu et al., 2 Feb 2026).
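In code, the objective is a mean squared error restricted to the hidden microphones (a sketch; `mask` marks the microphones withheld during training):

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MSE averaged over masked (to-be-reconstructed) microphones only.

    pred, target: (N, T) RIR waveforms
    mask:         (N,) boolean, True for microphones whose
                  reconstruction is supervised
    """
    err = (pred[mask] - target[mask]) ** 2
    return err.mean()
```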
5. Experimental Evaluation and Benchmarks
Experiments cover both on-grid uniform linear arrays (ULA) and random-spacing linear arrays (RSLA), each with 64 microphones in total and up to 90% missing data. Key experimental parameters include:
- ULA: room-scale simulated region, array length drawn from Uniform[1.28, 3] m, fixed source position, sampling rate in the kHz range.
- RSLA: room-scale simulated region, random array orientation and source position.
Performance is reported for:
- Normalized Mean Squared Error (NMSE) in dB
- Cosine Distance (CD), which measures shape similarity of the time-domain RIR
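Hedged sketches of the two metrics follow; the exact normalization used in the paper is assumed here to be by target energy, which is the common convention.

```python
import numpy as np

def nmse_db(pred, target):
    """Normalized MSE in dB: residual energy over target energy."""
    return 10 * np.log10(np.sum((pred - target) ** 2) / np.sum(target ** 2))

def cosine_distance(pred, target):
    """1 - cosine similarity between time-domain RIRs (shape match)."""
    num = np.dot(pred, target)
    den = np.linalg.norm(pred) * np.linalg.norm(target)
    return 1.0 - num / den
```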
| Method | NMSE (dB) | CD | Retraining | Inference (s) |
|---|---|---|---|---|
| RIR-Former | –10.44 | 0.051 | N/A | 0.002 |
| PINN | –2.56 | 0.293 | ≥1 h | 0.883 |
| DiffusionRIR | –0.62 | 0.325 | N/A | 128.8 |
| SCI | 2.17 | 0.808 | N/A | 0.178 |
RIR-Former maintains NMSE below –5 dB and CD below 0.2 even with 90% missing data. On random-spacing arrays, RIR-Former reports NMSE –8.76 dB and CD 0.078 at 70% missing; PINN achieves –3.16 dB and 0.319, respectively (Xu et al., 2 Feb 2026). Inference is simultaneous for all locations (0.002 s), facilitating rapid deployment.
6. Ablation Findings
The necessity of RIR-Former's design components is demonstrated via ablation:
- Omitting sinusoidal encoding degrades NMSE from –8.76 dB to –4.78 dB and CD from 0.078 to 0.177.
- Omitting the segmented decoder increases NMSE to –6.52 dB and CD to 0.118.
These results confirm that sinusoidal encodings inject essential geometric information, while segmentation in the decoder is required for accurate modeling of late reverberant tails and to avoid bias to early arrivals (Xu et al., 2 Feb 2026).
7. Limitations, Open Questions, and Future Directions
Current experiments rely on (co)planar, linear microphone arrays in simulated static scenes. Addressing generalization to arbitrary 3D geometries (e.g., sphere, arbitrary volumes), time-varying scenes (moving sources, non-stationary reverberation), and real-world data remains an open research direction. Potential extensions include hierarchical attention for large-scale arrays and temporal modules to model dynamic acoustics. Validation on real-room measurements (with mismatch and noise) is necessary to confirm applicability outside simulation (Xu et al., 2 Feb 2026).
A plausible implication is that RIR-Former's continuous, grid-free, and one-shot inference architecture provides a foundation for robust, scalable acoustic field reconstruction in virtual/augmented reality, audio scene analysis, and robotic audition, particularly when dense sampling is impractical.