DeepShip & ShipsEar: UATR Benchmarks
- The paper describes DeepShip and ShipsEar, benchmark datasets with 45–47 hours and 8 hours of recordings respectively, which standardize UATR evaluation protocols.
- The paper demonstrates that using frozen pretrained acoustic models with linear probes enhances ship-type decodability despite challenges from recording-specific variance.
- The paper recommends best practices such as diagnostic clustering, regularization, and domain-informed pretraining to overcome limited labeled data in marine acoustics.
DeepShip and ShipsEar Benchmarks
DeepShip and ShipsEar are cornerstone benchmarks for Underwater Acoustic Target Recognition (UATR), the task of classifying ship type from passive sonar recordings. Their establishment catalyzed substantive progress in transfer learning, model analysis, and robust recognition in marine acoustics. This article covers the defining attributes of these benchmarks, prevailing methodologies for UATR, comparative and diagnostic studies, and guidance for future research, drawing solely on evidence and results from the literature.
1. Dataset Composition, Annotation, and Task Protocols
DeepShip and ShipsEar provide contrasting but complementary test beds for ship-radiated noise classification:
- DeepShip: Contains 45–47 hours of single-ship noise recordings from Vancouver waters. There are four vessel-type classes (e.g., cargo, passenger), with time-wise splits that emulate “future” deployment: the test set comprises later-in-time recordings than the train set, minimizing label leakage. All clips are strictly single-target, 10 s in length for most studies (Hummel et al., 13 Jan 2026).
- ShipsEar: Originates from the Ría de Vigo on the northwest coast of Spain, totaling 8 hours, with five ship-type classes (some works report up to 11). The standard protocol is a random 80/20 split at the recording-file level to avoid segment overlap between train and test. Metadata often includes recording conditions, date, and, in extended versions, environmental parameters (Hummel et al., 13 Jan 2026).
- Labeling and Preprocessing: For both benchmarks, ground-truth is categorical ship type. Preprocessing adheres to the requirements of each model but generally involves conversion to Mel-spectrograms (e.g., 128 bins, 16 kHz sample rate, FFT size 1 024, hop 512) or retaining raw waveform at rates from 8–48 kHz. Each clip is unique (one ship/clip), and train/test splits are non-overlapping by time or file.
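The framing arithmetic behind this preprocessing can be sketched in numpy. The parameter values (16 kHz sample rate, FFT size 1 024, hop 512, 10 s clips) follow the text; the code itself is illustrative and omits the 128-bin mel filterbank that a real pipeline would apply on top of the power spectrogram:

```python
import numpy as np

def frame_signal(x, n_fft=1024, hop=512):
    # Split the waveform into overlapping frames (no padding).
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def power_spectrogram(x, n_fft=1024, hop=512):
    # Windowed short-time power spectrum; a mel filterbank (128 bins)
    # would be applied to this in a full pipeline.
    frames = frame_signal(x, n_fft, hop) * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, n_fft)) ** 2

sr = 16_000
clip = np.random.randn(10 * sr)   # stand-in for one 10 s single-ship clip
spec = power_spectrogram(clip)
print(spec.shape)                 # (311, 513): ~311 frames, 513 frequency bins
```

A 10 s clip at 16 kHz thus yields roughly 311 frames at hop 512, which is the time resolution the Mel-spectrogram models operate on.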
These benchmarks are widely used for supervised as well as transfer-learned and unsupervised UATR, allowing for cross-study comparability (Hummel et al., 13 Jan 2026, Xie et al., 2024, Xu et al., 2023).
2. Acoustic Modeling Approaches and Transfer Learning
Multiple UATR modeling paradigms have been benchmarked on DeepShip and ShipsEar, illustrating the evolution from hand-crafted feature pipelines to deep transfer learning:
- Frozen Pretrained Encoder + Linear Probe: Eighteen pretrained models from general audio (AudioMAE, BEATS, HuBERT-AS), speech (Wav2Vec 2.0, Data2Vec, WavLM, HuBERT), bioacoustic, and marine-life domains have been evaluated as frozen feature extractors, feeding a trained linear classifier. The embeddings’ geometry is typically dominated by recording-specific, not ship-type, variance (Hummel et al., 13 Jan 2026).
- Deep Convolutional/Attention Backbones: Standard backbones (e.g., ResNet-18 + multi-head attention) trained “from scratch” or with data-augmentations, constitute the non-transfer learning baseline (Xu et al., 2023, Xie et al., 2024).
- Mixture-of-Experts (CMoE): Models with multiple, trainable expert MLPs gated by lightweight networks that decide for each sample which “expert” processes the embedding (Xie et al., 2024). This allows fine-grained representation of high intra-class diversity (due to variations in operation, speed, or noise).
- Regularized and Augmented Models: Smoothness-inducing regularization, which enforces output invariance to simulated perturbations, and spectrogram-specific augmentations (such as local masking and replicating, LMR) address overfitting and distributional mismatch (Xu et al., 2023, Xie et al., 2023).
- Self-supervised and Contrastive Learning: Emerging methods employ unsupervised encoders or interpretable contrastive strategies, though these are less commonly benchmarked explicitly on DeepShip/ShipsEar (Xie et al., 2024).
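As an illustration of the gating idea behind such mixture-of-experts heads, here is a toy dense-gated head in numpy; all dimensions and weights are hypothetical, and the published CMoE architecture differs in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 768-d embeddings, 4 experts, 4 ship classes.
d, n_experts, n_classes = 768, 4, 4
W_gate = 0.01 * rng.normal(size=(d, n_experts))
W_experts = 0.01 * rng.normal(size=(n_experts, d, n_classes))

def moe_head(z):
    # Lightweight gate decides, per sample, how much each expert contributes.
    gate = softmax(z @ W_gate)                            # (batch, n_experts)
    expert_logits = np.einsum('bd,edc->bec', z, W_experts)  # every expert's logits
    return np.einsum('be,bec->bc', gate, expert_logits)   # gate-weighted mixture

z = rng.normal(size=(8, d))
logits = moe_head(z)
print(logits.shape)   # (8, 4)
```

The gate lets different experts specialize on different operating conditions (speed, load, noise), which is the mechanism the CMoE work uses to model high intra-class diversity.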
3. Diagnosing Structure in the Embedding Space
A principal challenge in transfer learning for UATR is that pretrained audio embeddings organize primarily by recording-specific factors (sensor, ambient noise, location), not ship type:
- Clustering by Recording, Not Class: t-SNE and PCA visualizations show strong clustering by recording id rather than vessel class. This is quantified by Normalized Mutual Information (NMI): K-means clusters (K = #classes) vs. class labels show low mutual information (NMI ≈ 0.00–0.30), while clusters vs. recording id labels yield NMI ≈ 0.62–0.66, indicating dominance of recording-specific information (Hummel et al., 13 Jan 2026).
- Cosine Similarity Evaluation: ROC-AUC values for same-class versus same-recording embedding neighbor similarity confirm this pattern: same-recording AUC ≫ same-ship-type AUC (Hummel et al., 13 Jan 2026).
- Ablation/Control: Label-shuffling experiments, where labels are randomly permuted across recordings, cause drastic accuracy drops (e.g., 65.4%→22.9% in DeepShip for BEATS embeddings), providing evidence of spurious, non-causal correlations in the embedding space (Hummel et al., 13 Jan 2026).
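The recording-versus-class NMI diagnostic can be reproduced on synthetic data. The geometry below (per-site Gaussian blobs whose ship classes cut across sites) is invented to mimic the reported failure mode, not taken from the datasets:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)

# 4 deployment "sites", each contributing 2 recordings of different classes,
# so embedding geometry follows recording/site, not ship type.
n_rec, per_rec, d = 8, 50, 32
site = np.repeat(np.arange(4), 2)                 # recording -> site
centers = 10.0 * np.eye(4, d)[site] + 0.5 * rng.normal(size=(n_rec, d))
emb = np.concatenate([c + rng.normal(size=(per_rec, d)) for c in centers])
rec_id = np.repeat(np.arange(n_rec), per_rec)
ship_type = rec_id % 4                            # classes cut across sites

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(emb)
nmi_class = normalized_mutual_info_score(ship_type, km.labels_)
nmi_rec = normalized_mutual_info_score(rec_id, km.labels_)
print(f"NMI vs class: {nmi_class:.2f}, NMI vs recording id: {nmi_rec:.2f}")
```

K-means recovers the site/recording structure, so NMI against recording id comes out well above NMI against class — the same qualitative pattern reported for the frozen embeddings.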
4. Linear Probing: Mechanism and Effectiveness
Despite the lack of global ship-type separation in frozen embedding space, a simple linear probe can extract discriminative subspaces:
- Probe Formulation: Given a frozen embedding z ∈ ℝ^d, a weight matrix W ∈ ℝ^{C×d}, and a bias b ∈ ℝ^C (C the number of ship classes), the classifier is ŷ = softmax(Wz + b), trained with cross-entropy loss, typically for 50–100 epochs with Adam or SGD (Hummel et al., 13 Jan 2026).
- Selection of Ship-relevant Dimensions: Feature-importance analyses demonstrate that a minority of embedding dimensions are crucial for accuracy. The linear head acts as a selective extractor of ship-type features, suppressing dimensions irrelevant for classification (Hummel et al., 13 Jan 2026).
- Effect on Space Structure: After probing, NMI between K-means clusters on probe logits and class labels increases (to ≈0.36–0.42), indicating enhanced alignment with the true class structure (Hummel et al., 13 Jan 2026).
- Head-to-head Model Results:
| Dataset  | Top Models and Accuracy (%)                  | Logistic Baseline (%) |
|----------|----------------------------------------------|-----------------------|
| DeepShip | BEATS 65.4, Animal2Vec 64.0, BirdMAE 63.4    | 56.4                  |
| ShipsEar | Wav2Vec 2.0 78.0, BEATS 74.0, Perch 2.0 73.2 | 48.6                  |
These outcomes substantiate that linear probing substantially improves ship-type decodability over raw embeddings and logistic Mel baselines in both low- and moderate-data regimes (Hummel et al., 13 Jan 2026).
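A minimal sketch of the probing setup on synthetic "frozen embeddings", where only a small subset of dimensions is informative; all sizes, and the 8-dimension signal subspace, are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "frozen embeddings": only the first 8 of 768 dimensions carry
# ship-type information; the rest mimic nuisance (recording) variance.
n, d, n_classes, k_sig = 800, 768, 4, 8
y = rng.integers(0, n_classes, size=n)
class_means = 2.0 * rng.normal(size=(n_classes, k_sig))
X = rng.normal(size=(n, d))
X[:, :k_sig] += class_means[y]

# Linear probe = multinomial logistic regression on frozen features.
probe = LogisticRegression(max_iter=2000).fit(X[:600], y[:600])
acc = probe.score(X[600:], y[600:])

# The probe's weights concentrate on the informative subspace,
# acting as a selective extractor of ship-type dimensions.
importance = np.abs(probe.coef_).sum(axis=0)
top = set(np.argsort(importance)[-k_sig:])
print(acc, sorted(top))
```

Inspecting `importance` mirrors the feature-importance analyses above: the highest-weight dimensions coincide with the (here, known) signal subspace, while nuisance dimensions are suppressed.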
5. Evaluation Metrics and Protocols
Benchmarking UATR methods on DeepShip and ShipsEar employs rigorous, multi-dimensional evaluations:
- Supervised Classification: Segment-level accuracy (Top-1), measured on strictly held-out test sets.
- Unsupervised Clustering: K-means clustering (K = #classes) on embeddings/logits, with NMI between cluster assignments and ground-truth labels.
- Similarity Ranking: Given an embedding, test whether its nearest neighbors belong to the same class, aggregated as ROC-AUC over positive (same-type) and negative (different-type) pairs within the test set.
- Statistical Analysis: Most works report single-run accuracy; confidence intervals or statistical tests are generally not reported in published tables (Hummel et al., 13 Jan 2026).
This evaluation regime allows assessment not only of classification accuracy, but also of the intrinsic “structure” and separability of the learned representation.
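The similarity-ranking metric can be sketched as follows on toy embeddings; the class structure and dimensions are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy embeddings with mild class structure: 4 classes, 40 clips each.
n_per, d = 40, 16
means = 1.5 * rng.normal(size=(4, d))
X = np.concatenate([m + rng.normal(size=(n_per, d)) for m in means])
y = np.repeat(np.arange(4), n_per)

# Pairwise cosine similarity over all unique pairs (no self-pairs).
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T
iu = np.triu_indices(len(y), k=1)
same = (y[iu[0]] == y[iu[1]]).astype(int)   # positives: same ship type
auc = roc_auc_score(same, S[iu])
print(f"same-type ROC-AUC: {auc:.2f}")
```

The same computation with `same` replaced by a same-recording indicator yields the same-recording AUC; comparing the two quantifies which factor dominates the embedding geometry.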
6. Comparative Findings Across Architectures and Pretraining Domains
Experiments consistently reveal the following domain-dependent effects:
- General-Audio and Bioacoustic Models: Transfer best for ship-radiated noise; e.g., BEATS and bioacoustic models (Animal2Vec, BirdMAE) consistently outperform speech and marine-life-sound specific backbones for UATR (Hummel et al., 13 Jan 2026).
- Speech Models: Transfer moderately well (notably, Wav2Vec 2.0 achieves 78.0% accuracy on ShipsEar).
- Marine-Life-Sound Models: Models matched to marine mammal sounds (Google Whale, SurfPerch) perform poorly on ship classes in DeepShip/ShipsEar, likely due to distinct spectral characteristics.
- Baseline and Data Limitations: Both benchmarks are too small to train large backbones from scratch without overfitting or undertraining. No fine-tuned (full-model) baselines surpass linear probing in these studies (Hummel et al., 13 Jan 2026).
- Practical Recipe: For low-data UATR:
- Freeze a large, pretrained general-audio or bioacoustic backbone.
- Attach and train a linear probe via cross-entropy.
- Use only hundreds of labeled examples/class.
- Validate against label-shuffling or control experiments to detect recording-id leakage.
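The label-shuffling control in the last step can be sketched on synthetic embeddings that deliberately encode recording identity. In this leaky setup, shuffled-label accuracy stays far above chance, which is the leakage signature the control is meant to expose; in the cited studies, shuffling instead collapsed accuracy toward chance, indicating genuine class information:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Embeddings dominated by recording identity -- the failure mode to detect.
n_rec, per_rec, d = 8, 60, 32
centers = 6.0 * rng.normal(size=(n_rec, d))
X = np.concatenate([c + rng.normal(size=(per_rec, d)) for c in centers])
rec = np.repeat(np.arange(n_rec), per_rec)
y = rec // 2                      # 4 ship classes, 2 recordings per class

# Segment-level split: clips from every recording land in train AND test.
idx = rng.permutation(len(y))
tr, te = idx[:360], idx[360:]

def probe_acc(labels):
    clf = LogisticRegression(max_iter=2000).fit(X[tr], labels[tr])
    return clf.score(X[te], labels[te])

# Control: reassign classes to recordings at random, consistently
# across train and test, then retrain.
y_shuffled = (rng.permutation(n_rec) // 2)[rec]

acc_true, acc_shuf = probe_acc(y), probe_acc(y_shuffled)
print(f"true labels: {acc_true:.2f}, shuffled labels: {acc_shuf:.2f}")
```

If `acc_shuf` remains far above chance (0.25 here), the probe is exploiting recording identity rather than ship type, and the split or features need revisiting.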
7. Impact, Best Practices, and Recommendations
The DeepShip and ShipsEar benchmarks have led to the following conclusions:
- Decodability Over Linear Separability: Although ship-type information is not linearly separable or well-clustered in frozen spaces, it is decodable by linear selection from high-capacity embeddings. This property is essential for transfer learning in limited-label marine acoustic tasks (Hummel et al., 13 Jan 2026).
- Model/Probe Design: Rather than elaborate, nonlinear heads, a sparse linear probe head with cross-entropy loss suffices; feature-importance and clustering diagnostics can be used post-hoc to confirm probe focus on semantically relevant subspaces.
- Domain-Informed Pretraining: For small marine acoustic datasets, general-audio and diverse-bioacoustic pretrained models yield superior performance over speech or marine-life-only pretraining.
- Error Detection: Rigorous control experiments (label shuffling, recording-id NMI) are mandatory to guard against spuriously high test accuracy from recording leakage or dataset-specific artefacts.
- Scalability and Compute: The transfer+probe paradigm allows practical, compute-efficient UATR pipelines that avoid the need for extensive fine-tuning or prohibitive data augmentation.
In summary, DeepShip and ShipsEar have become the de facto standard benchmarks for UATR. Their use has established best practices with frozen backbone transfer, linear probing, and diagnostic embedding space analysis, enabling accurate, robust ship-type recognition in the presence of limited labeled data and variable recording conditions (Hummel et al., 13 Jan 2026).