ACCDOA: Unified Sound Localization & Detection
- ACCDOA is a unified representation that encodes sound event activity and DOA as a 3D Cartesian vector, merging SED and SEL into a single regression task.
- It streamlines model design by replacing dual-branch architectures and their separate losses with a single MSE regression objective, reducing parameters and eliminating manual loss weighting.
- Extensions like multi-ACCDOA and embedding-augmented variants improve localization in polyphonic and zero-/few-shot scenarios, achieving state-of-the-art performance.
Activity-Coupled Cartesian DOA (ACCDOA) is a neural network output representation for sound event localization and detection (SELD) that tightly unifies sound event detection (SED) and direction-of-arrival (DOA) estimation into a single multi-output regression problem. Each class's presence and spatial position are encoded jointly by a three-dimensional Cartesian vector whose length indicates event activity and whose direction corresponds to the unit DOA vector. This approach enables the SELD task to be posed as a single regression to vector targets, sidestepping manual loss-weighting and parameter inefficiencies of conventional two-branch systems. The method has demonstrated state-of-the-art results on SELD benchmarks and has catalyzed further innovations for handling overlapping sources and zero- or few-shot localization.
1. Mathematical Definition and Core Principle
Let $C$ denote the number of event classes and $t$ index the time frame. For each class $c$ at frame $t$, the ACCDOA representation defines:
- $a_{c,t} \in [0, 1]$: event activity (1 if active, 0 if inactive; may be probabilistic/soft)
- $\phi_{c,t}, \theta_{c,t}$: source azimuth and elevation angles

These are mapped to a unit-length Cartesian DOA vector

$$\mathbf{R}_{c,t} = \begin{pmatrix} \cos\theta_{c,t}\cos\phi_{c,t} \\ \cos\theta_{c,t}\sin\phi_{c,t} \\ \sin\theta_{c,t} \end{pmatrix}.$$

The ACCDOA vector is then

$$\mathbf{P}_{c,t} = a_{c,t}\,\mathbf{R}_{c,t},$$

so that $\|\mathbf{P}_{c,t}\| = a_{c,t}$: the magnitude encodes event activity, the direction encodes DOA on the unit sphere. For inactive events, $\mathbf{P}_{c,t} = \mathbf{0}$ (Shimada et al., 2020, Shimada et al., 2020).
This formulation allows both SED and SEL to be inferred from the network output $\hat{\mathbf{P}}_{c,t}$:
- Event presence: $\hat{a}_{c,t} = \|\hat{\mathbf{P}}_{c,t}\| > \tau$ for a detection threshold $\tau$ (e.g., 0.5)
- DOA estimate: $\hat{\mathbf{R}}_{c,t} = \hat{\mathbf{P}}_{c,t} / \|\hat{\mathbf{P}}_{c,t}\|$ for active classes.
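The encoding and decoding steps above can be sketched in NumPy (function names, threshold default, and the angle convention are illustrative, not taken from the original papers):

```python
import numpy as np

def encode_accdoa(active, azimuth, elevation):
    """Map an activity value and angles (radians) to an ACCDOA 3-vector."""
    r = np.array([
        np.cos(elevation) * np.cos(azimuth),  # x
        np.cos(elevation) * np.sin(azimuth),  # y
        np.sin(elevation),                    # z
    ])
    return active * r  # zero vector when the event is inactive

def decode_accdoa(p, threshold=0.5):
    """Recover (activity, unit DOA) from a predicted ACCDOA vector."""
    norm = np.linalg.norm(p)
    active = norm > threshold
    doa = p / norm if active else None
    return active, doa

# A source at azimuth 90 deg, elevation 0 deg maps to the y-axis:
p = encode_accdoa(1.0, np.pi / 2, 0.0)
active, doa = decode_accdoa(p)
```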
2. Unified Regression and Loss Formulation
ACCDOA replaces dual-branch architectures (one for SED, one for SEL) and their respective losses with a single unified regression to vector targets:

$$\mathcal{L} = \frac{1}{CT} \sum_{c=1}^{C} \sum_{t=1}^{T} \left\| \hat{\mathbf{P}}_{c,t} - \mathbf{P}^{*}_{c,t} \right\|^2.$$

When a class is inactive ($a_{c,t} = 0$), the regression target is the zero vector. The network is naturally penalized for spurious nonzero outputs, enforcing both detection and localization sparsity without explicit masking. There is no need for loss-weight balancing between SED and SEL; all outputs are optimized under MSE (Shimada et al., 2020, Shimada et al., 2020).
This approach enables significant parameter savings (by removing one output head and associated layers) and has been empirically shown to outperform two-branch or two-stage SELD pipelines in localization accuracy and F-score (Shimada et al., 2020).
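The single training objective can be sketched as a frame-wise MSE over all class vectors (a minimal NumPy sketch; shapes and names are illustrative):

```python
import numpy as np

def accdoa_mse_loss(pred, target):
    """MSE between predicted and target ACCDOA vectors.

    pred, target: arrays of shape (T, C, 3) -- frames x classes x xyz.
    Inactive classes have all-zero targets, so spurious outputs are
    penalized without any explicit SED/SEL loss weighting.
    """
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

T, C = 100, 12
rng = np.random.default_rng(0)
target = np.zeros((T, C, 3))                 # mostly inactive scene
pred = 0.01 * rng.standard_normal((T, C, 3))  # small spurious outputs
loss = accdoa_mse_loss(pred, target)
```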
3. Network Architectures: RD3Net and Beyond
ACCDOA-based systems are typically instantiated by projecting the output of a temporal backbone (e.g., a CRNN or a densely connected multi-dilated convolutional network such as RD3Net) to $3C$ output dimensions—one 3-vector per class per frame. Key architectural features include:
- RD3Net: an adaptation of D3Net with only encoder blocks (multi-rate dilated convolution, dense connectivity), recurrent bottleneck (e.g., bi-GRU), and final linear projection (Shimada et al., 2020, Shimada et al., 2021).
- Network deconvolution is used in place of batch normalization in RD3Net, and dropout is omitted at the output.
- For further stabilization, post-processing with input signal rotations and averaging is adopted (Shimada et al., 2020).
Parameter counts for ACCDOA-based RD3Net systems range from 1.5M to under 6M depending on configuration, consistently smaller than track-wise or two-branch counterparts (Shimada et al., 2020, Shimada et al., 2021).
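The output head described above reduces to a single linear projection of backbone features to $3C$ dimensions, reshaped into one 3-vector per class per frame. A NumPy sketch (the random weights and the choice of tanh as the bounding nonlinearity are illustrative assumptions, since each Cartesian component lies in $[-1, 1]$):

```python
import numpy as np

# Illustrative shapes: a temporal backbone emits F-dim features per frame;
# a final linear layer projects them to 3*C ACCDOA outputs.
T, F, C = 50, 256, 12
rng = np.random.default_rng(1)
features = rng.standard_normal((T, F))       # backbone output per frame
W = 0.01 * rng.standard_normal((F, 3 * C))   # final linear projection
b = np.zeros(3 * C)

out = np.tanh(features @ W + b)   # bound each component to [-1, 1]
accdoa = out.reshape(T, C, 3)     # one 3-vector per class per frame
```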
4. Extension to Multi-Source and Overlapping Events: Multi-ACCDOA
The original, class-wise ACCDOA can only encode one active DOA per class per frame, a limitation in polyphonic scenes with overlapping same-class events. Multi-ACCDOA extends the output format by introducing a "track" dimension of size $N$, producing up to $N$ (three in the original work) ACCDOA vectors per class and frame. Each track is responsible for one source; multiple tracks per class permit simultaneous same-class localization.
Assignment between ground-truth sources and tracks is handled with class-wise permutation-invariant training (PIT). Auxiliary Duplicating PIT (ADPIT) further duplicates the ground truth across extra tracks for frames with fewer than $N$ sources, to avoid learning degenerate zero outputs. The loss is

$$\mathcal{L}^{\mathrm{ADPIT}} = \frac{1}{CT} \sum_{c=1}^{C} \sum_{t=1}^{T} \min_{\alpha \in \mathrm{Perm}(c,t)} \frac{1}{N} \sum_{n=1}^{N} l^{\mathrm{ACCDOA}}_{\alpha,c,t,n},$$

where $l^{\mathrm{ACCDOA}}_{\alpha,c,t,n}$ is the per-track MSE between the $n$-th predicted ACCDOA vector and its target under permutation $\alpha$.
This allows high recall in same-class overlap scenarios without a growth in parameter count or output dimensionality proportional to the worst-case polyphony (Shimada et al., 2021, Shimada et al., 2023).
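The class-wise PIT assignment can be sketched for a single class and frame: the per-track MSE is minimized over all permutations of tracks against the (possibly duplicated) targets. A simplified sketch with $N = 2$ tracks (helper names are illustrative):

```python
import itertools
import numpy as np

def pit_loss_per_frame(pred, target):
    """Permutation-invariant MSE for one class at one frame.

    pred, target: (N, 3) arrays of track-wise ACCDOA vectors.
    Returns the minimum mean track loss over all track permutations.
    """
    n = pred.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        loss = np.mean(np.sum((pred[list(perm)] - target) ** 2, axis=-1))
        best = min(best, loss)
    return best

# Two tracks, one actual source: ADPIT duplicates the single ground-truth
# vector across both tracks so no track is forced toward a degenerate zero.
gt = np.array([0.0, 1.0, 0.0])
target = np.stack([gt, gt])                   # duplicated target
pred = np.stack([gt, [0.0, 0.98, 0.05]])      # both tracks near the source
loss = pit_loss_per_frame(pred, target)
```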
5. Zero- and Few-Shot SELD With ACCDOA Embeddings
Recent work leverages ACCDOA in a zero- or few-shot SELD setting, coupling the ACCDOA representation per track with a high-dimensional embedding (e.g., a CLAP audio/text embedding). This "embed-ACCDOA" approach allows the model to output not only activity and DOA per track but also directly assign class semantics via embedding similarity, supporting compositional and previously unseen class detection.
Training uses permutation-invariant loss over both the ACCDOA vectors and the embedding space. This enables the system to perform class assignment post hoc by matching track embeddings to support sets derived from audio or text (Shimada et al., 2023). The architecture includes parallel embedding and ACCDOA branches with shared early layers and MHSA blocks.
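The post hoc class assignment can be sketched as nearest-neighbor matching in embedding space (a toy sketch with 4-dimensional vectors standing in for high-dimensional CLAP embeddings; names and the cosine-similarity choice are assumptions, not the papers' exact interface):

```python
import numpy as np

def assign_class(track_embedding, support_embeddings):
    """Assign a class to a track via cosine similarity to a support set.

    track_embedding: (D,) embedding predicted alongside the track's ACCDOA.
    support_embeddings: (K, D) embeddings from audio/text support examples.
    Returns the index of the most similar support class.
    """
    t = track_embedding / np.linalg.norm(track_embedding)
    s = support_embeddings / np.linalg.norm(
        support_embeddings, axis=1, keepdims=True)
    return int(np.argmax(s @ t))

support = np.array([[1.0, 0.0, 0.0, 0.0],    # class 0 (e.g., "dog bark")
                    [0.0, 1.0, 0.0, 0.0]])   # class 1 (e.g., "speech")
track = np.array([0.9, 0.1, 0.0, 0.0])
cls = assign_class(track, support)
```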
6. Empirical Performance and Comparative Analysis
On DCASE benchmarks:
- RD3Net-ACCDOA single-model achieves LE=7.9°, F=76.8%, LR=80.5%, ER=0.32 on the 2020-2021 test splits, outperforming both two-stage baselines and prior state-of-the-art (Shimada et al., 2020, Shimada et al., 2020, Shimada et al., 2021).
- Ensembles based on ACCDOA and EINV2 models, fused by output averaging, further improve metrics (e.g., F up to 69.6% in DCASE2021 with LE=10.7°) (Shimada et al., 2021).
- Multitrack ACCDOA with class-wise ADPIT improves overlap recall and performance in polyphonic settings, approaching high-complexity track-wise architectures with an order of magnitude fewer parameters (Shimada et al., 2021).
In zero/few-shot experiments, embed-ACCDOA models with multiple tracks and CLAP embeddings achieve location-dependent F-scores of 19.2–19.3 with ten support shots, compared to 29.4 for the fully supervised official baseline, indicating strong open-set SELD capabilities (Shimada et al., 2023).
7. Advantages, Limitations, and Future Directions
Advantages:
- ACCDOA provides a unified formulation for SED and DOA regression, collapses model complexity, and eliminates manual loss weight tuning (Shimada et al., 2020).
- Performance gains are observed in F-score, localization accuracy, and resource efficiency compared to conventional or dual-branch alternatives (Shimada et al., 2020, Shimada et al., 2021).
- Extensions to multi-track and embedding-augmented variants address major limitations, such as overlapping same-class sources and open-set class localization (Shimada et al., 2021, Shimada et al., 2023).
Limitations:
- The original formulation cannot represent more than one same-class source per frame, which is mitigated by multi-ACCDOA (Shimada et al., 2021).
- In fully polyphonic scenes, performance still degrades compared to single-source segments (Shimada et al., 2020).
- Interpreting vector magnitude as a direct probability remains less transparent than typical SED sigmoid outputs (Shimada et al., 2020).
Ongoing work explores: further model refinements (e.g., transformer encoders), advanced augmentation, architectural tweaks such as time–frequency RNNs, and extending ACCDOA-style coupling to other sensor fusion or multimodal spatial localization tasks (Shimada et al., 2020, Shimada et al., 2021).
References
- "Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3net" (Shimada et al., 2020)
- "ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection" (Shimada et al., 2020)
- "Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training" (Shimada et al., 2021)
- "Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection" (Shimada et al., 2021)
- "Zero- and Few-shot Sound Event Localization and Detection" (Shimada et al., 2023)