Binaural Spatially Adaptive Neural Network

Updated 14 February 2026

BSANN is a data-driven audio rendering framework that creates independent binaural sound zones using neural synthesis and acoustic modeling.
It leverages head tracking, Fourier positional encoding, and MLPs to dynamically control spatial audio for multiple listeners in shared spaces.
The system integrates explicit crosstalk cancellation and adaptive training protocols to boost isolation, spatial imaging, and robustness in varied environments.

The Binaural Spatially Adaptive Neural Network (BSANN) is a data-driven audio rendering framework designed for personal sound zones (PSZs). Its core function is to deliver fully independent, head-tracked stereo programs to multiple listeners, achieving precise control over the acoustic field at each ear in a shared space. By leveraging neural filter synthesis, physically informed acoustic models, and explicit crosstalk-cancellation strategies, BSANN advances the spatial audio field beyond monophonic rendering limits. This enables independent, binaural experiences for each listener, with improved spatial imaging fidelity, isolation, and robustness in reflective, asymmetric environments (Jiang et al., 10 Jan 2026).

1. Model Architecture and Inputs

BSANN is designed to handle two simultaneously head-tracked listeners, each characterized by a 6-degree-of-freedom (DOF) state (3D position, 3D orientation), succinctly represented as: $\mathbf{s} = [x_{1}, y_{1}, z_{1}, \phi_{1}, \theta_{1}, \psi_{1}, x_{2}, y_{2}, z_{2}, \phi_{2}, \theta_{2}, \psi_{2}]^{\mathsf{T}}$ To enable the neural network to capture high-frequency spatial variations, each scalar component in $\mathbf{s}$ undergoes a Fourier positional encoding: $[\sin(2^{k}\pi s_{i}), \cos(2^{k}\pi s_{i})]_{k=0\ldots K-1}$ The concatenated encoded features are processed by a fully-connected Multi-Layer Perceptron (MLP) of $N$ layers, each with ReLU activation: $\mathbf{h}^{(\ell+1)} = \mathrm{ReLU}(W^{(\ell)}\mathbf{h}^{(\ell)} + b^{(\ell)}), \quad \ell=0\ldots N-1$ The final layer outputs, for each of the $L$ loudspeakers and four program channels (L1 left/right, L2 left/right), a complex weight per frequency bin $\omega_n$ : $\mathbf{g}(\omega_n) = f_{\theta}(\mathrm{PE}(\mathbf{s})) \in \mathbb{C}^{L\times 4}$ Loudspeaker signals in the frequency domain are synthesized as: $X_{\ell}(\omega_n) = \sum_{p=1}^{4} g_{\ell,p}(\omega_{n})\,S_{p}(\omega_{n})$ The frequency-domain weights are translated to playback FIR filters via inverse FFT.

2. Mathematical Formulation and Loss Functions

The generated filter for loudspeaker $k$ and ear $e$ at $\omega$ is denoted: $h_{k,e}(\omega) = N(z_{k}, p_{e}; \theta)$ The initial training (PSZ pretraining) focuses on optimizing bright-zone/dark-zone ear control. Let $\mathbf{H}_{B}(\omega_n)$ and $\mathbf{H}_{D}(\omega_n)$ denote acoustic transfer functions (ATFs) at bright and dark zone ears respectively, with corresponding target pressures $\mathbf{p}_{T,B}(\omega_n)$ .

Bright-zone loss: $\mathcal{L}_{1} = \frac{1}{NM_{B}} \sum_{n=1}^{N} \bigl\| |\mathbf{p}_{T,B}(\omega_n)| - |\mathbf{H}_{B}(\omega_n)\,\mathbf{g}(\omega_n)| \bigr\|_2^2$ Dark-zone loss: $\mathcal{L}_{2} = \frac{1}{NM_{D}} \sum_{n=1}^{N} \bigl\| \mathbf{H}_{D}(\omega_n)\,\mathbf{g}(\omega_n) \bigr\|_2^2$ Regularization terms penalize excessive gain ( $\mathcal{L}_3$ ) and non-compactness in time ( $\mathcal{L}_4$ ), forming the pretraining loss: $\mathcal{L}_{\mathrm{PSZ}} = \alpha\mathcal{L}_{1} + (1-\alpha)\mathcal{L}_{2} + \beta\mathcal{L}_{3} + \gamma\mathcal{L}_{4}$

3. Training Protocol and Acoustic Integration

Training data are generated in randomized shoebox-room simulations, including varying dimensions, reverberation times, listener positions, and array geometries. For each scenario, $L$ -to-$4M$ impulse responses are produced and "dressed" with:

Measured anechoic loudspeaker spectra $A_{\ell}(\omega)$ .
Analytic piston directivity $D_{\ell}(\omega, \theta)$ for each loudspeaker: $D_{\ell}(\omega, \theta) = 2J_1(ka\sin\theta)/(ka\sin\theta)$ .
Rigid-sphere HRTFs $H_{\mathrm{HRTF},e,m}(\omega)$ .

The final ATF for training sums direct and reflective paths, weighted by the above components: $\tilde{H}_{e,m,\ell}(\omega) = H^{\mathrm{dir}}_{e,m,\ell}(\omega)\,A_{\ell}(\omega)\,D_{\ell}(\omega,\theta_{e,m,\ell})\,H_{\mathrm{HRTF},e,m}(\omega) + H^{\mathrm{refl}}_{e,m,\ell}(\omega)\,A_{\ell}(\omega)$ Optimization proceeds in two stages: (i) PSZ pretraining using Adam and a learning rate of $10^{-3}$ ; (ii) active XTC fine-tuning at $10^{-4}$ , with teacher-anchoring safeguarding the pretrained solution.

After pretraining, BSANN undergoes further adaptation to minimize crosstalk between ears (XTC stage), ensuring ipsilateral response integrity. The effective ear-transfer matrix is: $\mathbf{T}_{\mathrm{eff}}(\omega_n) = \mathbf{P}(\omega_n)\,\mathbf{W}(\omega_n)$ The composite XTC loss function is: $\mathcal{L}_{\mathrm{XTC}} = \lambda_{\mathrm{off}}\,\mathcal{L}_{\mathrm{off}} + \lambda_{\mathrm{diag}}\,\mathcal{L}_{\mathrm{diag}} + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{reg}}$ where $\mathcal{L}_{\mathrm{off}}$ penalizes off-diagonal (interaural) leakage, $\mathcal{L}_{\mathrm{diag}}$ ensures diagonal (ipsilateral) response matching, and $\mathcal{L}_{\mathrm{reg}}$ adds effort regularization sensitive to local conditioning. The final loss combines XTC terms, PSZ protection, and teacher-anchoring: $\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{xtc}}\,\mathcal{L}_{\mathrm{XTC}} + w_{\mathrm{BZ}}\,\mathcal{L}_{\mathrm{BZ}} + w_{\mathrm{DZ}}\,\mathcal{L}_{\mathrm{DZ}} + \beta\,\mathcal{L}_{\mathrm{gain}} + \gamma\,\mathcal{L}_{\mathrm{compact}} + \eta\,\mathcal{L}_{\mathrm{teach}}$

5. Performance Analysis

BSANN achieves marked improvements on frequency-averaged metrics (100–20,000 Hz):

Metric	Listener 1	Listener 2
Inter-zone isolation (IZI) [dB]	10.23	10.03
Inter-program isolation (IPI) [dB]	11.11	9.16
Crosstalk cancellation (XTC) [dB]	10.55	11.13

Relative gains include:

Substantial improvement over monophonic SANN ( $\text{IZI} \approx 9.99/5.11$ dB, $\text{XTC} \approx 0$ dB).
Superiority versus BSANN without physical acoustic modeling ( $\text{IZI} \approx 9.54/9.18$ dB, $\text{XTC} \approx 4.97/4.38$ dB).
Enhanced XTC with active crosstalk cancellation relative to passive-only systems ( $\text{XTC} \approx 7.93/8.19$ dB for non-XTC variant) (Jiang et al., 10 Jan 2026).

6. Deployment and Implementation Guidance

Practical BSANN deployment includes several calibration, adaptation, and integration steps:

Calibration: Anechoic impulse responses $a_{\ell}[n]$ for each loudspeaker are measured once to form the acoustic basis for ongoing adaptation.
Head tracking: Real-time updates to the 6-DOF state vector $\mathbf{s}(t)$ feed directly into the positional encoding and MLP, obviating explicit matrix inversion for every head movement.
Robustness to asymmetry: Ear-wise training allows bright/dark zone optimization at each ear, inherently adapting to reflective or geometrically asymmetric rooms.
Embedded deployment: Learned FIR filters may be converted to memory-efficient IIR or multi-rate implementations. Static look-up tables for anechoic responses supplement adaptive, low-footprint MLP updates for environmental variation.

BSANN thus unifies neural spatial filter design, detailed acoustic simulation, and explicit binaural crosstalk management, providing real-time, independent spatial audio zones for multiple listeners with a high degree of isolation and resilience to complex physical environments (Jiang et al., 10 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (1)

Stereo Audio Rendering for Personal Sound Zones Using a Binaural Spatially Adaptive Neural Network (BSANN) (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Binaural Spatially Adaptive Neural Network (BSANN).

Binaural Spatially Adaptive Neural Network

1. Model Architecture and Inputs

2. Mathematical Formulation and Loss Functions

3. Training Protocol and Acoustic Integration

4. Active Crosstalk Cancellation Refinement

5. Performance Analysis

6. Deployment and Implementation Guidance

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Binaural Spatially Adaptive Neural Network

1. Model Architecture and Inputs

2. Mathematical Formulation and Loss Functions

3. Training Protocol and Acoustic Integration

4. Active Crosstalk Cancellation Refinement

5. Performance Analysis

6. Deployment and Implementation Guidance

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research