Binaural Spatially Adaptive Neural Network
- BSANN is a data-driven audio rendering framework that creates independent binaural sound zones using neural synthesis and acoustic modeling.
- It leverages head tracking, Fourier positional encoding, and MLPs to dynamically control spatial audio for multiple listeners in shared spaces.
- The system integrates explicit crosstalk cancellation and adaptive training protocols to boost isolation, spatial imaging, and robustness in varied environments.
The Binaural Spatially Adaptive Neural Network (BSANN) is a data-driven audio rendering framework designed for personal sound zones (PSZs). Its core function is to deliver fully independent, head-tracked stereo programs to multiple listeners, achieving precise control over the acoustic field at each ear in a shared space. By leveraging neural filter synthesis, physically informed acoustic models, and explicit crosstalk-cancellation strategies, BSANN advances the spatial audio field beyond monophonic rendering limits. This enables independent, binaural experiences for each listener, with improved spatial imaging fidelity, isolation, and robustness in reflective, asymmetric environments (Jiang et al., 10 Jan 2026).
1. Model Architecture and Inputs
BSANN is designed to handle two simultaneously head-tracked listeners, each characterized by a 6-degree-of-freedom (DOF) state (3D position, 3D orientation), succinctly represented as

$$\mathbf{s}_i = (x_i, y_i, z_i, \theta_i, \phi_i, \psi_i), \qquad i \in \{1, 2\}.$$

To enable the neural network to capture high-frequency spatial variations, each scalar component $p$ of $\mathbf{s}_i$ undergoes a Fourier positional encoding

$$\gamma(p) = \big[\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{K-1} \pi p), \cos(2^{K-1} \pi p)\big].$$

The concatenated encoded features are processed by a fully connected Multi-Layer Perceptron (MLP) of $N$ layers, with ReLU activations on the hidden layers:

$$\mathbf{h}_{l+1} = \mathrm{ReLU}(W_l \mathbf{h}_l + \mathbf{b}_l).$$

The final layer outputs, for each of the $M$ loudspeakers and four program channels (L1 left/right, L2 left/right), a complex weight $w_{m,c}(f)$ per frequency bin $f$. Loudspeaker signals in the frequency domain are synthesized as

$$X_m(f) = \sum_{c=1}^{4} w_{m,c}(f)\, S_c(f),$$

where $S_c(f)$ is the spectrum of program channel $c$. The frequency-domain weights are translated to playback FIR filters via inverse FFT.
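The pipeline above can be sketched end to end. Everything below is illustrative: the layer sizes, number of encoding bands, and random weights are toy stand-ins, not the paper's trained parameters.

```python
import numpy as np

def fourier_encode(p, num_bands=6):
    """NeRF-style Fourier encoding of one scalar coordinate p."""
    freqs = (2.0 ** np.arange(num_bands)) * np.pi
    return np.concatenate([np.sin(freqs * p), np.cos(freqs * p)])

def encode_state(state, num_bands=6):
    """Encode a 6-DOF listener state (x, y, z, yaw, pitch, roll)."""
    return np.concatenate([fourier_encode(s, num_bands) for s in state])

def mlp_forward(x, weights, biases):
    """Fully connected MLP: ReLU on hidden layers, linear output head."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)
    return weights[-1] @ x + biases[-1]

# Toy sizes: M speakers, C program channels, F frequency bins, K bands.
M, C, F, K = 4, 4, 128, 6
rng = np.random.default_rng(0)
in_dim = 2 * 6 * 2 * K                 # 2 listeners x 6 DOF x (sin, cos) x K
out_dim = 2 * M * C * F                # real + imaginary part of every weight
dims = [in_dim, 256, 256, out_dim]
weights = [0.01 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
biases = [np.zeros(d) for d in dims[1:]]

feat = np.concatenate([encode_state(rng.uniform(-1, 1, 6), K) for _ in range(2)])
raw = mlp_forward(feat, weights, biases)
w = (raw[: out_dim // 2] + 1j * raw[out_dim // 2 :]).reshape(M, C, F)

# Synthesize loudspeaker spectra X_m(f) = sum_c w_{m,c}(f) S_c(f), then iFFT -> FIRs.
S = rng.standard_normal((C, F)) + 1j * rng.standard_normal((C, F))
X = np.einsum("mcf,cf->mf", w, S)
firs = np.fft.irfft(X, axis=-1)        # real playback FIR taps per loudspeaker
```

Packing the real and imaginary parts of every complex weight into one real output vector, as done here, is a common way to let a real-valued MLP emit complex filter coefficients.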
2. Mathematical Formulation and Loss Functions
The generated filter for loudspeaker $m$ and ear $e$ at frequency $f$ is denoted $g_{m,e}(f)$. The initial training (PSZ pretraining) focuses on optimizing bright-zone/dark-zone ear control. Let $H^B_m(f)$ and $H^D_m(f)$ denote acoustic transfer functions (ATFs) at the bright- and dark-zone ears respectively, with corresponding target pressures $p^B_{\mathrm{tgt}}(f)$ (program audio) and $p^D_{\mathrm{tgt}}(f) = 0$ (silence).
Bright-zone loss (ear index suppressed):

$$\mathcal{L}_B = \sum_f \Big| \sum_m H^B_m(f)\, g_m(f) - p^B_{\mathrm{tgt}}(f) \Big|^2$$

Dark-zone loss:

$$\mathcal{L}_D = \sum_f \Big| \sum_m H^D_m(f)\, g_m(f) \Big|^2$$

Regularization terms penalize excessive gain ($\mathcal{L}_{\mathrm{gain}}$) and non-compactness in time ($\mathcal{L}_{\mathrm{time}}$), forming the pretraining loss

$$\mathcal{L}_{\mathrm{pre}} = \mathcal{L}_B + \lambda_D\, \mathcal{L}_D + \lambda_g\, \mathcal{L}_{\mathrm{gain}} + \lambda_t\, \mathcal{L}_{\mathrm{time}}.$$
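A minimal numerical sketch of these loss terms, assuming random stand-in ATFs and filters, a flat unity bright-zone target, and illustrative regularizer weights (the time-compactness penalty here is one plausible choice: a ramp that weights late FIR energy):

```python
import numpy as np

rng = np.random.default_rng(1)
M, F = 4, 64                          # toy sizes: loudspeakers, frequency bins

H_b = rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))  # bright-zone ATFs
H_d = rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))  # dark-zone ATFs
g = rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))    # synthesized filters
p_target = np.ones(F, dtype=complex)                                  # flat bright target

p_bright = np.sum(H_b * g, axis=0)    # reproduced pressure at the bright ear
p_dark = np.sum(H_d * g, axis=0)      # leakage pressure at the dark ear

loss_bright = np.mean(np.abs(p_bright - p_target) ** 2)
loss_dark = np.mean(np.abs(p_dark) ** 2)
loss_gain = np.mean(np.abs(g) ** 2)                   # effort / gain penalty
h = np.fft.irfft(g, axis=-1)                          # time-domain filter taps
taper = np.linspace(0.0, 1.0, h.shape[-1]) ** 2
loss_time = np.mean(taper * h ** 2)                   # penalize late-time energy

lam_d, lam_g, lam_t = 1.0, 1e-2, 1e-2                 # illustrative weights
loss_pre = loss_bright + lam_d * loss_dark + lam_g * loss_gain + lam_t * loss_time
```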
3. Training Protocol and Acoustic Integration
Training data are generated in randomized shoebox-room simulations with varying dimensions, reverberation times, listener positions, and array geometries. For each scenario, up to $4M$ impulse responses ($M$ loudspeakers to the four tracked ears) are produced and "dressed" with:
- Measured anechoic loudspeaker spectra $A_m(f)$.
- Analytic piston directivity for each loudspeaker: $D(\theta, f) = \dfrac{2 J_1(k a \sin\theta)}{k a \sin\theta}$, with wavenumber $k = 2\pi f / c$ and piston radius $a$.
- Rigid-sphere HRTFs $H_{\mathrm{rs}}(\theta, f)$.
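The piston-directivity term can be evaluated directly; a small sketch, with an assumed 5 cm piston radius and $J_1$ computed from its integral representation to keep the example dependency-free:

```python
import numpy as np

def bessel_j1(x, n=4000):
    """J1 via its integral form J1(x) = (1/pi) * int_0^pi cos(t - x sin t) dt."""
    t = (np.arange(n) + 0.5) * np.pi / n            # midpoint-rule nodes on [0, pi]
    integrand = np.cos(t[None, :] - np.outer(x, np.sin(t)))
    return integrand.mean(axis=-1)                  # equals (1/pi) * integral

def piston_directivity(theta, f, a=0.05, c=343.0):
    """Far-field rigid-piston directivity 2 J1(ka sin(theta)) / (ka sin(theta))."""
    x = np.atleast_1d(2.0 * np.pi * f / c * a * np.sin(theta))
    out = np.ones_like(x)                           # on-axis limit is 1
    nz = np.abs(x) > 1e-9
    out[nz] = 2.0 * bessel_j1(x[nz]) / x[nz]
    return out

# A 5 cm piston beams at high frequency: off-axis response shrinks with f.
d_low = piston_directivity(np.pi / 3, 1000.0)[0]    # 60 degrees off-axis, 1 kHz
d_high = piston_directivity(np.pi / 3, 8000.0)[0]   # 60 degrees off-axis, 8 kHz
```

This frequency-dependent beaming is why the directivity weighting matters when dressing simulated impulse responses: high-frequency energy reaching a listener off-axis is substantially attenuated.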
The final ATF for training sums direct and reflective paths, weighted by the above components:

$$H_m(f) = \sum_{r} A_m(f)\, D(\theta_r, f)\, H_{\mathrm{rs}}(\theta_r, f)\, \frac{e^{-j 2\pi f d_r / c}}{d_r},$$

where $r$ indexes the direct path and image-source reflections, $\theta_r$ the corresponding incidence angles, and $d_r$ the path lengths. Optimization proceeds in two stages: (i) PSZ pretraining using Adam; (ii) active XTC fine-tuning at a reduced learning rate, with teacher-anchoring safeguarding the pretrained solution.
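The path-sum structure of the dressed ATF can be sketched for one loudspeaker-ear pair. The path lengths and per-path spectral weights below are hypothetical stand-ins for the product of speaker spectrum, piston directivity, and rigid-sphere HRTF:

```python
import numpy as np

rng = np.random.default_rng(2)
F, c = 257, 343.0
freqs = np.linspace(20.0, 20000.0, F)

# Hypothetical data: the direct path plus three image-source reflections.
# path_weight stands in for A_m(f) * D(theta_r, f) * H_rs(theta_r, f) per bin.
path_dist = np.array([2.0, 3.5, 4.1, 5.8])              # path lengths in meters
path_weight = rng.uniform(0.2, 1.0, (4, F))

# ATF: sum over paths of weight * exp(-j 2 pi f d / c) / d (delay + spreading loss).
phase = np.exp(-2j * np.pi * freqs[None, :] * path_dist[:, None] / c)
H = np.sum(path_weight * phase / path_dist[:, None], axis=0)
```

The complex exponential encodes each path's propagation delay and the $1/d_r$ factor its spherical spreading loss, so the summed ATF exhibits the comb-filter-like interference a real reflective room would impose.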
4. Active Crosstalk Cancellation Refinement
After pretraining, BSANN undergoes further adaptation to minimize crosstalk between ears (XTC stage) while preserving ipsilateral response integrity. Stacking the ATFs $H$ and learned filters $G$ per frequency bin yields the effective ear-transfer matrix

$$T(f) = H(f)\, G(f),$$

which maps the four program channels to the four ears. The composite XTC loss function is

$$\mathcal{L}_{\mathrm{XTC}} = \mathcal{L}_{\mathrm{off}} + \lambda_d\, \mathcal{L}_{\mathrm{diag}} + \lambda_e\, \mathcal{L}_{\mathrm{eff}},$$

where $\mathcal{L}_{\mathrm{off}}$ penalizes off-diagonal (interaural) leakage of $T(f)$, $\mathcal{L}_{\mathrm{diag}}$ ensures diagonal (ipsilateral) response matching, and $\mathcal{L}_{\mathrm{eff}}$ adds effort regularization sensitive to local conditioning. The final loss combines the XTC terms, PSZ protection, and teacher-anchoring:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{XTC}} + \lambda_{\mathrm{PSZ}}\, \mathcal{L}_{\mathrm{pre}} + \lambda_{\mathrm{TA}}\, \| G - G_{\mathrm{teacher}} \|^2.$$
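A sketch of the XTC terms on a toy ear-transfer matrix. The random $T$, unity diagonal target, conditioning-scaled effort term, and $\lambda$ values are all illustrative choices, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(3)
F = 64
# Toy stand-in for the per-bin ear-transfer matrix T(f) = H(f) G(f):
# rows are the four ears, columns the four program channels.
T = rng.standard_normal((F, 4, 4)) + 1j * rng.standard_normal((F, 4, 4))

eye = np.eye(4)
off = T * (1.0 - eye)                             # interaural / cross-program leakage
loss_off = np.mean(np.abs(off) ** 2)

diag = np.diagonal(T, axis1=1, axis2=2)           # ipsilateral responses
loss_diag = np.mean(np.abs(diag - 1.0) ** 2)      # flat unity target (illustrative)

cond = np.linalg.cond(T)                          # per-bin condition numbers
# Scale effort by local conditioning: ill-conditioned bins are regularized harder.
loss_eff = np.mean((cond / cond.max()) * np.mean(np.abs(T) ** 2, axis=(1, 2)))

lam_d, lam_e = 1.0, 1e-3                          # illustrative weights
loss_xtc = loss_off + lam_d * loss_diag + lam_e * loss_eff
```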
5. Performance Analysis
BSANN achieves marked improvements on frequency-averaged metrics (100–20,000 Hz):
| Metric | Listener 1 | Listener 2 |
|---|---|---|
| Inter-zone isolation (IZI) [dB] | 10.23 | 10.03 |
| Inter-program isolation (IPI) [dB] | 11.11 | 9.16 |
| Crosstalk cancellation (XTC) [dB] | 10.55 | 11.13 |
Relative gains include:
- Substantial improvement over the monophonic SANN baseline in both isolation metrics.
- Superior isolation versus a BSANN ablation without physical acoustic modeling.
- Enhanced XTC with active crosstalk cancellation relative to passive-only (non-XTC) variants (Jiang et al., 10 Jan 2026).
6. Deployment and Implementation Guidance
Practical BSANN deployment includes several calibration, adaptation, and integration steps:
- Calibration: Anechoic impulse responses for each loudspeaker are measured once to form the acoustic basis for ongoing adaptation.
- Head tracking: Real-time updates to the 6-DOF state vector feed directly into the positional encoding and MLP, obviating explicit matrix inversion for every head movement.
- Robustness to asymmetry: Ear-wise training allows bright/dark zone optimization at each ear, inherently adapting to reflective or geometrically asymmetric rooms.
- Embedded deployment: Learned FIR filters may be converted to memory-efficient IIR or multi-rate implementations. Static look-up tables for anechoic responses supplement adaptive, low-footprint MLP updates for environmental variation.
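At playback time, deployment reduces to filtering each program channel through its per-loudspeaker FIR and summing the contributions at every driver. A minimal sketch with toy block and filter sizes (a real system would select FIR sets from the head-tracked MLP output and use overlap-add across blocks):

```python
import numpy as np

def render_block(programs, firs):
    """Mix C program channels through per-(speaker, channel) FIRs.

    programs: (C, N) audio block; firs: (M, C, L) taps; returns (M, N + L - 1).
    """
    C, N = programs.shape
    M, _, L = firs.shape
    out = np.zeros((M, N + L - 1))
    for m in range(M):
        for ch in range(C):
            out[m] += np.convolve(firs[m, ch], programs[ch])
    return out

rng = np.random.default_rng(5)
block = rng.standard_normal((4, 256))           # four program channels, one block
firs = 0.05 * rng.standard_normal((4, 4, 64))   # stand-in learned FIR taps
speaker_feeds = render_block(block, firs)
```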
BSANN thus unifies neural spatial filter design, detailed acoustic simulation, and explicit binaural crosstalk management, providing real-time, independent spatial audio zones for multiple listeners with a high degree of isolation and resilience to complex physical environments (Jiang et al., 10 Jan 2026).