SAM Audio Judge (SAJ)
- SAM Audio Judge (SAJ) is a unified multimodal evaluation tool that uses text, visual, and temporal prompts to generate perceptually aligned scores.
- It integrates diverse modality encoders within a Transformer fusion module to compute recall, precision, faithfulness, and overall quality metrics that closely mirror expert judgments.
- SAJ enables practical applications like output reranking, data filtering, and difficulty stratification while demonstrating near-human reliability and scalability in benchmarking.
SAM Audio Judge (SAJ) is a unified, multimodal, reference-free framework for evaluating audio separation systems with high fidelity to human perception. Designed to address the persistent misalignment between objective metrics and subjective listening tests, SAJ provides fine-grained, prompt-aware scoring across speech, music, and general sound domains. Leveraging text, visual, and temporal prompts, SAJ outputs perceptual judgments that strongly track expert listening scores, enabling scalable, automatable benchmarking, reranking, and data curation in audio separation research (Wang et al., 27 Jan 2026).
1. Multimodal Architecture and Model Design
SAJ is architected as a prompt-conditioned, reference-free evaluator that integrates disparate modality encoders within a Transformer-based multimodal fusion module. The main model components are:
- Audio and Text Encoders: Both the input mixture and the separated output are transformed into frame-level embeddings using a frozen PE-AV audio encoder backbone, supporting robustness across audio domains. Text prompts are embedded using the PE-AV text module.
- Visual and Span Encoders: Visual prompts (e.g., segmentation masks on video frames) are encoded using the PE vision core. Temporal span prompts are mapped to a learnable embedding aligned with audio frame rates.
- Multimodal Fusion: All temporal embeddings are resampled to a unified frame count and concatenated. A self-attention block fuses these streams, and the text embedding is injected via cross-attention, producing a joint multimodal representation. This representation is refined by a Transformer stack and projected via linear heads to four scalar scores: recall, precision, faithfulness, and overall.
- Proxy Pretraining: Before human-score fine-tuning, an auxiliary alignment task is used: the system is trained to distinguish true target stems from randomly selected non-targets, with a binary cross-entropy loss. This pretraining enhances cross-modal grounding, yielding a 2–6% improvement in final human-alignment correlations (Wang et al., 27 Jan 2026).
This architecture admits any combination of text, visual, or span prompts, allowing SAJ to flexibly match the prompted extraction modality used in upstream separation systems (Shi et al., 19 Dec 2025).
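The fusion step described above can be sketched in miniature. The sketch below is an illustration, not the paper's implementation: the nearest-neighbour resampling, the additive text injection (a stand-in for cross-attention), and all dimensions are assumptions chosen for exposition.

```python
import numpy as np

def resample_frames(x: np.ndarray, target_len: int) -> np.ndarray:
    """Nearest-neighbour resampling of a (frames, dim) embedding to target_len frames."""
    idx = np.linspace(0, len(x) - 1, target_len).round().astype(int)
    return x[idx]

def fuse(audio_emb, visual_emb, span_emb, text_emb, target_len=50):
    """Resample each modality's frame embeddings to a unified length, concatenate
    along the feature axis, then inject the text embedding by broadcasting it
    over frames (a simplified stand-in for the cross-attention in SAJ)."""
    streams = [resample_frames(e, target_len) for e in (audio_emb, visual_emb, span_emb)]
    joint = np.concatenate(streams, axis=-1)   # (target_len, sum of per-modality dims)
    return joint + text_emb                    # text embedding broadcast over all frames

# Toy inputs: three modality streams at different frame rates, one text vector.
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 8))
visual = rng.normal(size=(30, 8))
span = rng.normal(size=(50, 8))
text = rng.normal(size=(24,))

joint = fuse(audio, visual, span, text)        # (50, 24)
heads = rng.normal(size=(24, 4))               # linear heads: recall, precision, faithfulness, overall
scores = joint.mean(axis=0) @ heads            # four scalar scores
```

The key design point mirrored here is that heterogeneous frame rates are reconciled before fusion, so a single stack can attend over all prompt modalities jointly.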
2. Supported Audio Domains and Evaluation Criteria
SAJ supports broad, domain-agnostic evaluation across:
- Speech: Including multi-speaker utterances and speech enhancement tasks.
- Music: Encompassing solo instruments, mixes, and full multi-instrumental music tracks.
- General Sound Events: Such as environmental noises, animal calls, and mechanical sounds.
Each separated output is evaluated along four perceptual axes:
- Recall: Completeness of target recovery, defined by the relative overlap of target and prediction; schematically, Recall = |T ∩ P̂| / |T|, where T is the target and P̂ the prediction.
- Precision: Fraction of the extracted content that pertains to the true target; schematically, Precision = |T ∩ P̂| / |P̂|.
- Faithfulness: Degree to which the recovered target preserves the original's timbral/dynamic qualities, using a perceptual inverse-distortion proxy of the form Faithfulness = 1 / (1 + D(T, P̂)) for a perceptual distortion measure D.
- Overall Quality: Holistic, Likert-style judgment that aggregates all dimensions, presented as a continuous value in [1, 5].
Grounding these scores in perceptual overlap and fidelity metrics is central to ensuring interpretability and cross-domain consistency (Wang et al., 27 Jan 2026).
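As a minimal illustration of the first three axes, one can compute toy proxies on magnitude envelopes. The energy-overlap and inverse-MSE formulas below are assumptions for exposition only; SAJ predicts these scores with a learned model rather than computing closed-form signal statistics.

```python
import numpy as np

def recall(target: np.ndarray, pred: np.ndarray) -> float:
    """Fraction of the target's energy covered by the prediction (toy overlap proxy)."""
    overlap = np.minimum(np.abs(target), np.abs(pred)).sum()
    return float(overlap / (np.abs(target).sum() + 1e-8))

def precision(target: np.ndarray, pred: np.ndarray) -> float:
    """Fraction of the prediction's energy that belongs to the target (toy proxy)."""
    overlap = np.minimum(np.abs(target), np.abs(pred)).sum()
    return float(overlap / (np.abs(pred).sum() + 1e-8))

def faithfulness(target: np.ndarray, pred: np.ndarray) -> float:
    """Inverse-distortion proxy: 1 / (1 + mean squared error)."""
    return float(1.0 / (1.0 + np.mean((target - pred) ** 2)))

# Toy envelopes: the prediction recovers half the target and adds one spurious frame.
target = np.array([1.0, 1.0, 0.0, 0.0])
pred = np.array([1.0, 0.0, 0.0, 1.0])
r, p, f = recall(target, pred), precision(target, pred), faithfulness(target, pred)
```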
3. Training Procedures and Calibration Protocol
The SAJ training pipeline consists of two carefully calibrated stages:
- Proxy Pretraining: Hundreds of hours of simulated mixtures, targets, and misaligned outputs are used to optimize a binary alignment loss. The AdamW optimizer is used with a 5K-step warmup and a weight decay of 0.1, over 200K updates.
- Human-Score Fine-Tuning: SAJ is further refined on 357.9 hours of human-labeled data, with three independent five-point ratings per sample for recall, precision, faithfulness, and overall. Loss is a hybrid of MAE and MSE, empirically superior to classification-based objectives. Data is loudness-normalized and stratified to ensure uniform coverage of the perceptual scale. All judgments are calibrated against a pool of 128 expert raters under a unified guideline, enabling cross-domain comparability of scores (e.g., a “4” in speech aligns with a “4” in music) (Wang et al., 27 Jan 2026).
4. Empirical Evaluation and Human Perceptual Alignment
The alignment of SAJ scores to human listening tests was established via extensive side-by-side evaluation and correlation analysis:
- Baselines: CLAP cosine similarity (Wu et al., 2023), AES-PC, SI-SDR estimators, and few-shot LLM ratings were used for comparative evaluation.
- Correlation Metrics: Pearson and Spearman correlations between SAJ overall scores and human ratings are reported separately for the speech, music, and sound domains.
- Distribution Analysis: Score distributions are well-calibrated and balanced, with strong pairwise correlations observed between the recall, precision, faithfulness, and overall quality sub-metrics.
- Baseline Comparison: SAJ outperforms the strongest baseline (CLAP) and other competitors on all axes and domains (Wang et al., 27 Jan 2026), demonstrating near-human reliability and robust generalization.
5. Practical Applications
SAJ’s prompt-conditioned, reference-free structure enables diverse and scalable applications:
- Reranking: SAJ selects optimal separation outputs from multiple candidates, improving mean opinion score (MOS) from 3.2 (random) or 3.6 (CLAP) to over 4.2 on SAM Audio-Bench (Wang et al., 27 Jan 2026).
- Data Filtering/Pseudo-Labeling: Imposing an SAJ overall-score threshold (e.g., 4.2) on pseudo-labeled datasets retains only the highest-quality candidates while preserving human MOS above 4.0, more than doubling the efficiency of prior approaches.
- Difficulty Stratification: An auxiliary SAJ difficulty model, trained on mixture plus text, predicts fine-grained task difficulty (levels 1–4), supporting robust benchmarking and test set curation.
SAJ operates efficiently, with inference times around 0.1 s per 10 s audio clip on GPU, making it suitable for real-time reranking and quality gating in large-scale pipelines (Shi et al., 19 Dec 2025).
6. Limitations and Opportunities for Future Development
Despite its strong performance, SAJ’s current formulation is subject to several acknowledged limitations:
- Multi-Stem Extraction: Handling overlapping prompts (e.g., concurrent drum and vocal separation) remains an open extension.
- User-Intent Modeling: SAJ does not yet incorporate application-specific aspects of “overall quality” (e.g., preferences for broadcast or assistive hearing).
- Real-Time Operation: Streaming-capable, low-latency variants are not yet available.
- Domain Generalization: Extension to specialized domains (e.g., bioacoustics, underwater audio) will require new data and calibration.
Further research is needed to overcome these challenges and extend SAJ’s applicability across a wider array of separation paradigms and real-world deployments (Wang et al., 27 Jan 2026).
7. Relationship to Adjacent Approaches
SAJ expands upon both classical and recent methodologies in audio evaluation. Unlike metrics such as SI-SDR that require ground-truth references, or LLM-based judges reliant on textual transcriptions, SAJ delivers prompt-conformant, domain-agnostic, and reference-free assessments. Related works—such as AudioJudge, which ensembles specialized large audio models for multi-aspect judgment (Manakul et al., 17 Jul 2025), or rationale-augmented LLMs for explainable prediction (Ge et al., 28 Aug 2025)—highlight the trend towards human-aligned, transparent, and scalable evaluation. However, these rely either on system-level text prompts or ASR intermediates, whereas SAJ directly fuses multimodal prompts and audio content for continuous scoring. In spatial audio, SAQAM similarly provides reference-free, deep-feature-based perceptual metrics for listening and spatialization quality (Manocha et al., 2022); yet, it does not generalize to arbitrary prompt-based separations or offer recall/precision axes.
SAJ’s unique integration of multimodal prompting, reference-free objective scoring, and rigorous benchmark alignment establishes it as a standard for data-centric, perceptually valid evaluation in audio separation research (Wang et al., 27 Jan 2026).