Underwater Acoustic Target Recognition
- Underwater Acoustic Target Recognition is the automated process of identifying and classifying underwater acoustic signals using passive sonar and specialized deep learning methods.
- It addresses challenges like multipath propagation, noise interference, and limited labeled data through architectural innovations including CNNs, Transformers, and graph-based models.
- Innovative strategies such as regularization, data augmentation, and multimodal fusion enhance accuracy and robustness in real-world underwater environments.
Underwater Acoustic Target Recognition (UATR) is the automated identification and classification of underwater acoustic sources—typically ship-radiated noise or marine objects—using passive sonar recordings. UATR systems address unique technical challenges arising from the complex propagation characteristics of underwater sound, non-stationary and overlapping signal sources, and a significant scarcity of labeled data. Recent research emphasizes architectural innovation, regularization methods, interpretable models, and task-specific feature engineering to address data diversity, overfitting, robustness, and practical deployment constraints.
1. Fundamentals and Domain-Specific Challenges
The underwater acoustic channel is characterized by multipath propagation, scattering, frequency-dependent attenuation, and high levels of ambient and anthropogenic noise. Signal features vary unpredictably with vessel type, environmental conditions (temperature, salinity, wind, depth), and recording geometry. These factors result in high intra-class diversity—where recordings from the same class (vessel type) differ significantly—and inter-class similarity—acoustic patterns shared across classes due to common machinery or propulsion (Xie et al., 2024). Data scarcity is endemic: acquiring and annotating large-scale underwater datasets is expensive and technically demanding (Xu et al., 2023, Xie et al., 2023). Overfitting, domain shift, and poor generalization are persistent issues.
Conventional supervised learning pipelines (handcrafted features + SVM or GMM, or fixed CNN architectures) struggle on such data due to limited robustness and contextual modeling. Deep learning approaches (CNNs, Transformers, GNNs, MoE) now dominate for their hierarchical feature extraction and capacity to learn complex relationships.
2. Feature Extraction, Representations, and Neural Descriptors
Feature engineering in UATR typically centers on time–frequency representations tailored to physical signal attributes:
- Spectrograms (STFT, Mel, Bark, CQT): Core input modalities. STFT and Mel-filterbanks emphasize line spectra and periodic harmonics affiliated with engine or propeller noise (Xu et al., 2023, Xie et al., 2024). CQT enhances high-frequency temporal modulations, supporting fine discrimination tasks.
- Edge, Histogram, and Texture Descriptors: NEHD extracts both structural (edge) and statistical (histogram) textures from spectrograms, achieving accuracy competitive with large pre-trained CNN and Transformer models at a small fraction of their parameter count and computational cost (Agashe et al., 17 Mar 2025).
| Model | Accuracy (%) | Parameters (K) |
|---|---|---|
| PANN-CNN14 | 69.92 ± 1.00 | 79,700 |
| ResNet-50 | 65.63 ± 0.46 | 23,500 |
| NEHD | 65.80 ± 0.41 | 13.6 |
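As a rough illustration of the edge-plus-histogram idea (not the published NEHD implementation), the sketch below pairs an edge-orientation histogram (structural texture) with an intensity histogram (statistical texture) over a spectrogram; the function name `edge_histogram_features` and the global, patch-free pooling are simplifying assumptions.

```python
import numpy as np

def edge_histogram_features(spec, n_bins=16):
    """Toy edge + histogram descriptor from a spectrogram.

    spec: (freq, time) magnitude spectrogram.
    Returns concatenated [edge-orientation histogram, intensity histogram].
    """
    gy, gx = np.gradient(spec)                        # structural: local edges
    mag = np.hypot(gx, gy)                            # edge strength
    ang = np.arctan2(gy, gx)                          # orientation in [-pi, pi]
    edge_hist, _ = np.histogram(ang, bins=n_bins,
                                range=(-np.pi, np.pi), weights=mag)
    int_hist, _ = np.histogram(spec, bins=n_bins)     # statistical: intensities
    # L1-normalize each part so clip length does not dominate the descriptor
    edge_hist = edge_hist / (edge_hist.sum() + 1e-12)
    int_hist = int_hist / (int_hist.sum() + 1e-12)
    return np.concatenate([edge_hist, int_hist])

rng = np.random.default_rng(0)
feat = edge_histogram_features(rng.random((128, 200)))
print(feat.shape)  # (32,): 16 orientation bins + 16 intensity bins
```

The fixed-length output is what keeps such descriptors tiny: a shallow classifier on a 32-dimensional vector needs orders of magnitude fewer parameters than a ResNet on the raw spectrogram.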
For raw waveforms, biologically-inspired models such as ATCNN use depthwise-separable convolutions and time-dilated blocks to mirror cochlear decomposition and long-term auditory integration (Hu et al., 2020), outperforming handcrafted-feature baselines and conventional 1D CNN/CRNN methods.
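A minimal NumPy sketch of the two building blocks named above—a depthwise-separable 1D convolution with dilation—may make their parameter economy concrete; `depthwise_separable_conv1d` and all shapes here are illustrative assumptions, not the ATCNN architecture itself.

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights, dilation=1):
    """Depthwise-separable 1D convolution with optional dilation.

    x: (C_in, L) multichannel signal.
    dw_kernels: (C_in, K), one filter per input channel (depthwise stage).
    pw_weights: (C_out, C_in), 1x1 channel-mixing (pointwise stage).
    """
    C_in, L = x.shape
    K = dw_kernels.shape[1]
    span = (K - 1) * dilation            # receptive field minus one sample
    out_len = L - span
    dw = np.zeros((C_in, out_len))
    for c in range(C_in):                # depthwise: each channel filtered alone
        for t in range(out_len):
            taps = x[c, t : t + span + 1 : dilation]
            dw[c, t] = taps @ dw_kernels[c]
    return pw_weights @ dw               # pointwise: mix channels per time step

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
y = depthwise_separable_conv1d(x,
                               rng.standard_normal((4, 3)),
                               rng.standard_normal((8, 4)),
                               dilation=2)
print(y.shape)  # (8, 60): 64 - (3-1)*2 valid samples, 8 output channels
```

The split costs C_in·K + C_out·C_in weights instead of C_out·C_in·K for a full convolution, while dilation widens the temporal receptive field without extra taps—the long-term integration role the text attributes to the time-dilated blocks.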
3. Regularization, Augmentation, and Data Diversity Solutions
Advanced UATR methods confront limited data and class imbalance through specialized regularization and augmentation:
- Smoothness-Inducing Regularization: The training objective combines a cross-entropy loss on clean data with a KL-divergence penalty between predictions on clean and simulated signals, weighted by a tunable coefficient. Simulated signals enter only through the regularization term, so label errors from unrealistic data are never propagated, yielding smoother, more robust decision functions (Xu et al., 2023, Xie et al., 2023).
- Spectrogram-Specific Augmentation (LMR): Local Masking and Replicating replaces random spectrogram patches with patches from other classes, maintaining physical realism and allowing the network to directly learn overlapping inter-class spectral structures. Unlike linear mixup, LMR avoids aliasing harmonic and temporal patterns (Xu et al., 2023).
- Adaptive Data Pruning: Discards near-duplicate segments based on cross-entropy similarities of CNN embeddings, mitigating the double-descent phenomenon and sample redundancy prevalent in mechanical periodic noise settings (Xie et al., 2023).
- Self-Supervised and Contrastive Embedding Learning: Conformer-based contrastive learning (with VICReg loss) optimizes representations using vast streams of unlabeled data, then transfers to supervised UATR tasks with competitive accuracy and significant cross-domain robustness (Hummel et al., 19 May 2025). Template-based tri-modal contrastive learning links audio, spectrogram, and descriptive text information into a unified metric space, boosting generalization and sample efficiency (Xie et al., 2023).
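The smoothness-inducing objective in the first bullet above can be sketched in a few lines; the function name `smoothness_regularized_loss`, the logit shapes, and the weight `lam` are illustrative assumptions rather than the published formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smoothness_regularized_loss(logits_clean, logits_sim, labels, lam=0.5):
    """Cross-entropy on clean (labeled) data plus a KL penalty pulling
    predictions on simulated/perturbed inputs toward the clean predictions.

    Simulated logits contribute only through the KL term, so their
    potentially wrong labels are never used; `lam` weights the penalty.
    """
    p_clean = softmax(logits_clean)
    p_sim = softmax(logits_sim)
    n = len(labels)
    ce = -np.log(p_clean[np.arange(n), labels] + 1e-12).mean()
    kl = (p_clean * (np.log(p_clean + 1e-12)
                     - np.log(p_sim + 1e-12))).sum(axis=1).mean()
    return ce + lam * kl

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 5))
noisy = logits + 0.1 * rng.standard_normal((8, 5))
labels = rng.integers(0, 5, size=8)
loss = smoothness_regularized_loss(logits, noisy, labels)
print(loss)
```

Because KL divergence is non-negative, the penalty can only add to the clean cross-entropy: when clean and simulated predictions agree, the regularizer vanishes and the objective reduces to plain supervised training.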
4. Architectural Innovations: Mixture-of-Experts, Multi-Task, and Attention
Emerging architectures improve fine-grained discrimination, adaptability, and generalization:
- Convolution-based Mixture of Experts (CMoE): A routing network dispatches high-level ResNet embeddings to multiple expert MLP heads, each specializing in different intra-class variances or physical characteristics. A balancing regularizer prevents load collapse; optional residual expert guards against routing errors (Xie et al., 2024).
- Multi-Task Learning and Multi-Gate Experts: M3 and M3-TSE frameworks integrate shared and private experts, gated via Welch-spectrum-derived scores. Auxiliary tasks such as target size estimation infuse physical priors into the learning process, yielding improved accuracy and robustness (Xie et al., 2024).
| Method | Type Acc (%) | Auxiliary Acc (%) |
|---|---|---|
| Single-task ResNet-18 | 83.62 ± 1.22 | 87.21 ± 0.00 |
| M3-TSE | 87.07 ± 2.43 | 90.52 ± 1.23 |
- Multi-task balanced attention CNNs (MT-BCA-CNN): Fuse channel attention with reconstruction heads to efficiently handle few-shot scenarios, highlighting key harmonic channels and suppressing environmental noise (Huang et al., 17 Apr 2025).
- Adversarial Multi-Task Learning: AMTNet enforces vessel-type recognition while simultaneously adversarially removing environmental factor information (source range, depth, wind speed), leading to representations that are invariant to acquisition conditions (Xie et al., 2024).
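A toy NumPy forward pass for the CMoE-style routing described above: a linear gate dispatches embeddings to expert heads, and a load-balancing penalty discourages routing collapse. The gate, the linear experts, and the squared coefficient-of-variation penalty are simplified assumptions, not the published architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cmoe_forward(h, gate_w, experts, balance_coef=0.01):
    """Mixture-of-experts head over backbone embeddings h of shape (N, D).

    gate_w: (D, E) routing weights; experts: list of E (D, C) linear heads.
    Returns gate-weighted class scores and a load-balancing penalty.
    """
    gates = softmax(h @ gate_w)                               # (N, E)
    expert_out = np.stack([h @ W for W in experts], axis=1)   # (N, E, C)
    y = (gates[..., None] * expert_out).sum(axis=1)           # (N, C)
    load = gates.mean(axis=0)                  # average use of each expert
    balance = balance_coef * (load.std() / (load.mean() + 1e-12)) ** 2
    return y, balance

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 32))
gate_w = rng.standard_normal((32, 4))
experts = [rng.standard_normal((32, 5)) for _ in range(4)]
y, penalty = cmoe_forward(h, gate_w, experts)
print(y.shape, penalty)
```

Adding `penalty` to the task loss pushes the gate toward using all experts, which is the "balancing regularizer" role described for CMoE; a residual expert would simply be one head that always receives a fixed share of the gate mass.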
5. Multimodal and Non-Euclidean Representation
UATR increasingly leverages data from multiple domains and explores non-Euclidean embeddings:
- Multimodal Fusion and Symbiotic Transformers: Parallel Transformer branches for audio, video, and text (with HetNorm statistical alignment) and multi-channel cross-attention enhance discriminative power and noise robustness, achieving SOTA across 91.7% of recognition metrics and all localization metrics (Liu et al., 2023).
- Graph-Embedded Transformers (UATR-GTransformer): Mel-spectrograms are patchified and encoded as graphs with nodes as local spectral patches. Transformer attention captures global features, while GNN layers model local patch relationships—yielding top performance and interpretability via attention and graph saliency (Feng et al., 12 Dec 2025).
| Model | Overall Accuracy (ShipsEar) |
|---|---|
| ResNet-18 | 0.799 |
| UATR-GTrans | 0.832 |
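To make the patch-to-graph step concrete, here is a minimal sketch assuming non-overlapping patches and k-nearest-neighbour edges in patch-feature space; the actual graph construction in UATR-GTransformer may differ.

```python
import numpy as np

def spectrogram_to_graph(spec, patch=(16, 16), k=4):
    """Patchify a mel-spectrogram into node features and connect each patch
    to its k nearest neighbours in feature space.

    spec: (freq, time) spectrogram; returns (node features, boolean adjacency).
    """
    F, T = spec.shape
    pf, pt = patch
    nodes = []
    for i in range(0, F - pf + 1, pf):           # non-overlapping patch grid
        for j in range(0, T - pt + 1, pt):
            nodes.append(spec[i:i + pf, j:j + pt].ravel())
    X = np.stack(nodes)                                   # (N, pf*pt)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                           # no self-loops
    adj = np.zeros((len(X), len(X)), dtype=bool)
    for n in range(len(X)):
        adj[n, np.argsort(d[n])[:k]] = True               # k nearest neighbours
    return X, adj | adj.T                                 # symmetrize edges

X, A = spectrogram_to_graph(np.random.default_rng(0).random((128, 96)))
print(X.shape)  # (48, 256): an 8x6 grid of 16x16 patches
```

GNN layers then aggregate over `A` (local structure) while Transformer attention operates over all rows of `X` (global structure), matching the local/global split described above.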
6. Transfer Learning, Large Models, and Retrieval Paradigms
Transfer learning with large pretrained models is a central trend:
- Speech Large Model Transfer: Full fine-tuning of speech Transformer encoders (SenseVoiceSmall, 234M parameters) on underwater acoustic data overcomes scale and domain differences, yielding >99% in-domain and ~97% cross-domain accuracy, robust to extreme clip-length variation (Huang et al., 26 Jan 2026).
- Frozen Audio Embeddings and Linear Probing: Pretrained general audio and bioacoustic models (e.g., BEATS, AudioMAE) produce embeddings dominated by recording-specific variance. However, a linear probe can extract ship-type information, enabling strong UATR in sparse-label regimes (Hummel et al., 13 Jan 2026).
- Contrastive Multimodal Retrieval: Datasets such as Oceanship (121h, 15 vessel types, rich AIS metadata) support cross-modal contrastive audio–text pretraining and retrieval. Patch-based transformers (Oceannet) utilizing LoRA adapters deliver exceptional recall and zero-shot generalization well beyond previous baselines (Li et al., 2024).
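The linear-probing recipe from the frozen-embedding bullet above can be sketched with a closed-form ridge classifier; the synthetic clustered "embeddings" stand in for real BEATS/AudioMAE features, and `linear_probe` is an illustrative helper, not an API of those models.

```python
import numpy as np

def linear_probe(emb_train, y_train, emb_test, n_classes, ridge=1e-3):
    """Ridge regression onto one-hot labels as a cheap linear probe over
    frozen embeddings; predict by argmax over the class scores.
    """
    N, D = emb_train.shape
    Y = np.eye(n_classes)[y_train]                        # one-hot targets
    W = np.linalg.solve(emb_train.T @ emb_train + ridge * np.eye(D),
                        emb_train.T @ Y)                  # (D, n_classes)
    return (emb_test @ W).argmax(axis=1)

# Synthetic "frozen embeddings": three well-separated class clusters.
rng = np.random.default_rng(0)
centers = rng.standard_normal((3, 16)) * 3
y = rng.integers(0, 3, size=200)
emb = centers[y] + rng.standard_normal((200, 16))
pred = linear_probe(emb[:150], y[:150], emb[150:], n_classes=3)
acc = (pred == y[150:]).mean()
print(acc)
```

Only the small `W` matrix is trained, which is what makes linear probing attractive in sparse-label regimes: the frozen backbone does the heavy lifting, and the probe merely reads the task signal out of its embedding space.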
7. Benchmark Datasets and Future Directions
The field is propelled by new, publicly released datasets:
- QiandaoEar22 (20 classes, 9h ship noise, 22h background): Benchmarking with DenseNet and spectrum+MFCC features yields up to 99.56% accuracy for specific ship identification; spectrum and MFCC provide maximal physical discriminability (Du et al., 2024, Du et al., 2024).
- Oceanship (121h, 15 vessel types, full AIS): Enables fine-grained, meta-annotated training and sets new standards for generalization and multimodal integration (Li et al., 2024).
Several directions remain active for future research:
- Domain adaptation and self-supervised methods to mitigate label scarcity and environmental shift (Xu et al., 2023, Xie et al., 2023).
- Exploring non-Euclidean and graph-based models for complex signal topology (Feng et al., 12 Dec 2025).
- Multimodal information fusion and knowledge graph construction for explainable UATR (Liu et al., 2023, Xie et al., 2023).
- Lightweight, interpretable, and real-time architectures for edge deployment (Agashe et al., 17 Mar 2025, Huang et al., 26 Jan 2026).
- Automated, template-driven annotation and contrastive alignment (Xie et al., 2023).
UATR research demonstrates that combining domain-specific regularization, innovative neural architectures, robust feature engineering, and principled multimodal fusion leads to models that achieve both high accuracy and generalization, even under challenging real-world conditions and severe data scarcity.