
Binaural Spatially Adaptive Neural Network

Updated 14 February 2026
  • BSANN is a data-driven audio rendering framework that creates independent binaural sound zones using neural synthesis and acoustic modeling.
  • It leverages head tracking, Fourier positional encoding, and MLPs to dynamically control spatial audio for multiple listeners in shared spaces.
  • The system integrates explicit crosstalk cancellation and adaptive training protocols to boost isolation, spatial imaging, and robustness in varied environments.

The Binaural Spatially Adaptive Neural Network (BSANN) is a data-driven audio rendering framework designed for personal sound zones (PSZs). Its core function is to deliver fully independent, head-tracked stereo programs to multiple listeners, achieving precise control over the acoustic field at each ear in a shared space. By leveraging neural filter synthesis, physically informed acoustic models, and explicit crosstalk-cancellation strategies, BSANN advances the spatial audio field beyond monophonic rendering limits. This enables independent, binaural experiences for each listener, with improved spatial imaging fidelity, isolation, and robustness in reflective, asymmetric environments (Jiang et al., 10 Jan 2026).

1. Model Architecture and Inputs

BSANN is designed to handle two simultaneously head-tracked listeners, each characterized by a 6-degree-of-freedom (DOF) state (3D position, 3D orientation), jointly represented as

$$\mathbf{s} = [x_{1}, y_{1}, z_{1}, \phi_{1}, \theta_{1}, \psi_{1}, x_{2}, y_{2}, z_{2}, \phi_{2}, \theta_{2}, \psi_{2}]^{\mathsf{T}}.$$

To enable the neural network to capture high-frequency spatial variations, each scalar component $s_i$ of $\mathbf{s}$ undergoes a Fourier positional encoding:

$$[\sin(2^{k}\pi s_{i}), \cos(2^{k}\pi s_{i})], \quad k = 0, \ldots, K-1.$$

The concatenated encoded features are processed by a fully connected multi-layer perceptron (MLP) of $N$ layers, each with ReLU activation:

$$\mathbf{h}^{(\ell+1)} = \mathrm{ReLU}(W^{(\ell)}\mathbf{h}^{(\ell)} + b^{(\ell)}), \quad \ell = 0, \ldots, N-1.$$

The final layer outputs, for each of the $L$ loudspeakers and four program channels (listener 1 left/right, listener 2 left/right), a complex weight per frequency bin $\omega_n$:

$$\mathbf{g}(\omega_n) = f_{\theta}(\mathrm{PE}(\mathbf{s})) \in \mathbb{C}^{L\times 4}.$$

Loudspeaker signals are synthesized in the frequency domain as

$$X_{\ell}(\omega_n) = \sum_{p=1}^{4} g_{\ell,p}(\omega_{n})\,S_{p}(\omega_{n}),$$

and the frequency-domain weights are converted to playback FIR filters via the inverse FFT.
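The encoding-plus-MLP pipeline above can be sketched in a few lines of NumPy. This is an illustrative forward pass only: the layer widths, the value of $K$, and the packing of real/imaginary parts into the output layer are assumptions, not details from the paper.

```python
import numpy as np

def fourier_pe(s, K=6):
    """Fourier positional encoding of the 12-D listener state vector:
    each scalar s_i maps to [sin(2^k*pi*s_i), cos(2^k*pi*s_i)], k=0..K-1."""
    freqs = (2.0 ** np.arange(K)) * np.pi           # (K,)
    angles = np.outer(s, freqs)                     # (12, K)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()

def mlp_forward(x, weights, biases):
    """Plain ReLU MLP; the final layer is left linear."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

# Toy dimensions (assumed): L = 8 loudspeakers, 4 program channels, 64 bins.
L, P, N_freq, K = 8, 4, 64, 6
rng = np.random.default_rng(0)
dims = [12 * 2 * K, 128, 128, 2 * L * P * N_freq]   # output packs re+im parts
weights = [rng.normal(0, 0.1, (dims[i + 1], dims[i])) for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]

s = rng.uniform(-1, 1, 12)                          # 6-DOF state, 2 listeners
out = mlp_forward(fourier_pe(s, K), weights, biases)
g = (out[: L * P * N_freq] + 1j * out[L * P * N_freq:]).reshape(L, P, N_freq)
print(g.shape)                                      # (8, 4, 64)
```

The key design point this illustrates is that head movement only changes the cheap MLP input, not a filter-design optimization: every new state $\mathbf{s}$ yields a fresh set of complex weights in a single forward pass.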

2. Mathematical Formulation and Loss Functions

The generated filter for loudspeaker $k$ and ear $e$ at frequency $\omega$ is denoted

$$h_{k,e}(\omega) = N(z_{k}, p_{e}; \theta).$$

The initial training stage (PSZ pretraining) optimizes bright-zone/dark-zone ear control. Let $\mathbf{H}_{B}(\omega_n)$ and $\mathbf{H}_{D}(\omega_n)$ denote the acoustic transfer functions (ATFs) at the bright-zone and dark-zone ears, respectively, with corresponding target pressures $\mathbf{p}_{T,B}(\omega_n)$.

Bright-zone loss:

$$\mathcal{L}_{1} = \frac{1}{NM_{B}} \sum_{n=1}^{N} \bigl\| |\mathbf{p}_{T,B}(\omega_n)| - |\mathbf{H}_{B}(\omega_n)\,\mathbf{g}(\omega_n)| \bigr\|_2^2$$

Dark-zone loss:

$$\mathcal{L}_{2} = \frac{1}{NM_{D}} \sum_{n=1}^{N} \bigl\| \mathbf{H}_{D}(\omega_n)\,\mathbf{g}(\omega_n) \bigr\|_2^2$$

Regularization terms penalize excessive gain ($\mathcal{L}_3$) and temporal non-compactness ($\mathcal{L}_4$), forming the pretraining loss

$$\mathcal{L}_{\mathrm{PSZ}} = \alpha\mathcal{L}_{1} + (1-\alpha)\mathcal{L}_{2} + \beta\mathcal{L}_{3} + \gamma\mathcal{L}_{4}.$$
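A minimal sketch of $\mathcal{L}_1$ and $\mathcal{L}_2$ follows, assuming unit program signals so the four channels can simply be summed into one driving weight per loudspeaker; the tensor shapes and that simplification are illustrative assumptions, not the paper's exact batching.

```python
import numpy as np

def psz_losses(g, H_B, H_D, p_T):
    """Bright/dark-zone losses over N frequency bins.

    g:   (N, L, 4) complex filter weights per bin
    H_B: (N, M_B, L) bright-zone ATFs;  H_D: (N, M_D, L) dark-zone ATFs
    p_T: (N, M_B) target bright-zone pressures
    L1 matches reproduced to target magnitudes in the bright zone;
    L2 drives dark-zone pressure toward zero.
    """
    N, M_B = p_T.shape
    M_D = H_D.shape[1]
    # Assumption: unit program signals, so channels collapse by summation.
    w = g.sum(axis=-1)                               # (N, L)
    p_B = np.einsum('nml,nl->nm', H_B, w)            # bright-zone pressures
    p_D = np.einsum('nml,nl->nm', H_D, w)            # dark-zone pressures
    L1 = np.sum(np.abs(np.abs(p_T) - np.abs(p_B)) ** 2) / (N * M_B)
    L2 = np.sum(np.abs(p_D) ** 2) / (N * M_D)
    return L1, L2

rng = np.random.default_rng(1)
N, L, M = 32, 8, 2
g = rng.normal(size=(N, L, 4)) + 1j * rng.normal(size=(N, L, 4))
H_B = rng.normal(size=(N, M, L)) + 1j * rng.normal(size=(N, M, L))
H_D = rng.normal(size=(N, M, L)) + 1j * rng.normal(size=(N, M, L))
p_T = np.ones((N, M), dtype=complex)

L1, L2 = psz_losses(g, H_B, H_D, p_T)
alpha = 0.7
loss = alpha * L1 + (1 - alpha) * L2                 # L3, L4 omitted here
print(loss >= 0.0)                                   # True
```

Note that $\mathcal{L}_1$ compares magnitudes only, which leaves phase in the bright zone unconstrained at this stage; phase-sensitive control enters later through the XTC refinement.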

3. Training Protocol and Acoustic Integration

Training data are generated in randomized shoebox-room simulations with varying dimensions, reverberation times, listener positions, and array geometries. For each scenario, $L$-to-$4M$ impulse responses are produced and "dressed" with:

  • Measured anechoic loudspeaker spectra $A_{\ell}(\omega)$.
  • Analytic piston directivity for each loudspeaker: $D_{\ell}(\omega, \theta) = 2J_1(ka\sin\theta)/(ka\sin\theta)$.
  • Rigid-sphere HRTFs $H_{\mathrm{HRTF},e,m}(\omega)$.

The final ATF for training sums direct and reflective paths, weighted by the above components:

$$\tilde{H}_{e,m,\ell}(\omega) = H^{\mathrm{dir}}_{e,m,\ell}(\omega)\,A_{\ell}(\omega)\,D_{\ell}(\omega,\theta_{e,m,\ell})\,H_{\mathrm{HRTF},e,m}(\omega) + H^{\mathrm{refl}}_{e,m,\ell}(\omega)\,A_{\ell}(\omega)$$

Optimization proceeds in two stages: (i) PSZ pretraining with Adam at a learning rate of $10^{-3}$; (ii) active XTC fine-tuning at $10^{-4}$, with teacher anchoring safeguarding the pretrained solution.

4. Active Crosstalk Cancellation Refinement

After pretraining, BSANN undergoes further adaptation to minimize crosstalk between ears (the XTC stage) while preserving ipsilateral response integrity. The effective ear-transfer matrix is

$$\mathbf{T}_{\mathrm{eff}}(\omega_n) = \mathbf{P}(\omega_n)\,\mathbf{W}(\omega_n).$$

The composite XTC loss function is

$$\mathcal{L}_{\mathrm{XTC}} = \lambda_{\mathrm{off}}\,\mathcal{L}_{\mathrm{off}} + \lambda_{\mathrm{diag}}\,\mathcal{L}_{\mathrm{diag}} + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{reg}},$$

where $\mathcal{L}_{\mathrm{off}}$ penalizes off-diagonal (interaural) leakage, $\mathcal{L}_{\mathrm{diag}}$ enforces diagonal (ipsilateral) response matching, and $\mathcal{L}_{\mathrm{reg}}$ adds effort regularization sensitive to local conditioning. The final loss combines the XTC terms, PSZ protection, and teacher anchoring:

$$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{xtc}}\,\mathcal{L}_{\mathrm{XTC}} + w_{\mathrm{BZ}}\,\mathcal{L}_{\mathrm{BZ}} + w_{\mathrm{DZ}}\,\mathcal{L}_{\mathrm{DZ}} + \beta\,\mathcal{L}_{\mathrm{gain}} + \gamma\,\mathcal{L}_{\mathrm{compact}} + \eta\,\mathcal{L}_{\mathrm{teach}}$$
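The structure of the XTC loss can be sketched as follows. The shapes, the unit-diagonal target, and the closed-form pseudo-inverse used to sanity-check the loss are assumptions of this sketch (BSANN reaches its filters by gradient fine-tuning, not by inversion):

```python
import numpy as np

def xtc_loss(P, W, lam_off=1.0, lam_diag=1.0, lam_reg=1e-3):
    """Composite XTC loss on the effective ear-transfer matrix T = P @ W.

    P: (N, E, L) plant ATFs (E ears), W: (N, L, E) rendering filters.
    Off-diagonal entries of T are interaural leakage; diagonal entries
    should match a flat unit ipsilateral response (assumed target).
    """
    T = P @ W                                       # (N, E, E) per bin
    E = T.shape[-1]
    eye = np.eye(E)
    off = np.sum(np.abs(T * (1 - eye)) ** 2)        # interaural leakage
    diag = np.sum(np.abs(T * eye - eye) ** 2)       # ipsilateral match
    reg = np.sum(np.abs(W) ** 2)                    # effort regularization
    return lam_off * off + lam_diag * diag + lam_reg * reg

rng = np.random.default_rng(2)
N, E, L = 16, 4, 8                                  # 4 ears: 2 listeners
P = rng.normal(size=(N, E, L)) + 1j * rng.normal(size=(N, E, L))
# A pseudo-inverse renderer drives the loss to zero in this idealized case:
W = np.linalg.pinv(P)
print(round(float(xtc_loss(P, W, lam_reg=0.0)), 6))  # 0.0
```

In practice a direct inverse like this is exactly what the effort regularizer $\mathcal{L}_{\mathrm{reg}}$ guards against: near poorly conditioned frequencies it would demand enormous loudspeaker gains, which the conditioning-sensitive penalty suppresses.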

5. Performance Analysis

BSANN achieves marked improvements on frequency-averaged metrics (100–20,000 Hz):

| Metric | Listener 1 | Listener 2 |
|---|---|---|
| Inter-zone isolation (IZI) [dB] | 10.23 | 10.03 |
| Inter-program isolation (IPI) [dB] | 11.11 | 9.16 |
| Crosstalk cancellation (XTC) [dB] | 10.55 | 11.13 |

Relative gains include:

  • Substantial improvement over monophonic SANN (IZI ≈ 9.99/5.11 dB, XTC ≈ 0 dB).
  • Superiority over BSANN without physical acoustic modeling (IZI ≈ 9.54/9.18 dB, XTC ≈ 4.97/4.38 dB).
  • Enhanced crosstalk cancellation relative to passive-only operation (XTC ≈ 7.93/8.19 dB for the non-XTC variant) (Jiang et al., 10 Jan 2026).

6. Deployment and Implementation Guidance

Practical BSANN deployment includes several calibration, adaptation, and integration steps:

  • Calibration: Anechoic impulse responses $a_{\ell}[n]$ for each loudspeaker are measured once to form the acoustic basis for ongoing adaptation.
  • Head tracking: Real-time updates to the 6-DOF state vector $\mathbf{s}(t)$ feed directly into the positional encoding and MLP, obviating explicit matrix inversion for every head movement.
  • Robustness to asymmetry: Ear-wise training allows bright/dark zone optimization at each ear, inherently adapting to reflective or geometrically asymmetric rooms.
  • Embedded deployment: Learned FIR filters may be converted to memory-efficient IIR or multi-rate implementations. Static look-up tables for anechoic responses supplement adaptive, low-footprint MLP updates for environmental variation.
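The last step of the rendering chain, turning per-bin complex weights into a playable FIR filter, is a half-spectrum inverse FFT. The circular shift and window below are a common post-processing pattern for making the response causal and compact; the exact windowing BSANN uses is not specified, so treat this as a sketch:

```python
import numpy as np

def weights_to_fir(g_half, n_fft):
    """Convert one speaker/channel's complex half-spectrum weights into a
    real FIR filter: inverse real FFT, then a circular shift to center the
    response (making it causal) and a taper to suppress edge artifacts."""
    h = np.fft.irfft(g_half, n=n_fft)               # real impulse response
    h = np.roll(h, n_fft // 2)                      # shift peak to center
    return h * np.hanning(n_fft)                    # taper edge artifacts

n_fft = 256
freqs = np.fft.rfftfreq(n_fft)                      # cycles per sample
g_half = np.exp(-2j * np.pi * freqs * 10)           # a pure 10-sample delay
fir = weights_to_fir(g_half, n_fft)
print(fir.shape, int(np.argmax(np.abs(fir))))       # (256,) 138
```

Here the 10-sample delay lands at index 128 + 10 = 138 after centering, confirming the shift bookkeeping; a real deployment would fold this latency budget into the overall playback delay.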

BSANN thus unifies neural spatial filter design, detailed acoustic simulation, and explicit binaural crosstalk management, providing real-time, independent spatial audio zones for multiple listeners with a high degree of isolation and resilience to complex physical environments (Jiang et al., 10 Jan 2026).
