SonicBench: Perceptual Benchmark for LALMs
- SonicBench is a psychophysically calibrated benchmark that evaluates large audio language models by isolating fundamental sound properties such as pitch, loudness, and timbre.
- It employs controlled stimulus generation, using differences above human just-noticeable-differences and dual evaluation paradigms (recognition and comparison) to probe model performance.
- Empirical results reveal that many models perform near chance on low-level auditory tasks, highlighting a disconnect between high-level semantic processing and sensory perception.
SonicBench is a psychophysically calibrated benchmark for evaluating the perceptual grounding of Large Audio Language Models (LALMs), with an explicit focus on measuring model sensitivity to low-level physical attributes of sound such as pitch, loudness, spatial characteristics, and timbral qualities. Unlike prior benchmarks that concentrate on semantic or paralinguistic performance, SonicBench systematically probes the ability of models to extract, discriminate, and reason about core physical properties that are foundational to auditory intelligence. Its methodology combines controlled stimulus generation exceeding human just-noticeable differences (JNDs), dual paradigms of recognition and relative comparison, and a rigorous evaluation protocol to localize perceptual bottlenecks in modern multimodal architectures (Sun et al., 16 Jan 2026).
1. Rationale and Conceptual Foundations
Contemporary LALMs have demonstrated state-of-the-art capabilities across a range of semantic ("what is being said?") and paralinguistic ("who is speaking?") tasks. However, empirical investigation into their proficiency at perceiving basic physical cues (attributes such as pitch, brightness, loudness, directionality, and reverberation) has remained incomplete. Human perceptual expertise critically depends on precise encoding of such cues for scene understanding, source identification, and communication.
SonicBench was designed to address this gap by instantiating a test-bed where these attributes are isolated and systematically varied using established psychophysical margins. The benchmark spans five perceptual dimensions: Spectral & Amplitude (pitch, brightness, loudness, velocity), Temporal (duration, tempo), Spatial & Environmental (direction, distance, reverberation), Timbre (instrument/source identity), and Scene-Level (event counting), each operationalized by specific, manipulable signal properties. All twelve core attributes selected exhibit universal presence in acoustic environments, well-studied human JNDs, and precise signal-processing recipes for isolation (Sun et al., 16 Jan 2026).
2. Stimulus Generation and Benchmark Structure
SonicBench introduces a customizable and extensible stimulus generation pipeline with two principal modes:
- User-customized mode: The user provides a reference audio segment and a specification of the intended change in a particular physical attribute. Attribute-specific signal processing is then applied (e.g., pitch shift via PSOLA or phase-vocoder, loudness controlled by LUFS-based gain adjustment, spatial cues by Head-Related Transfer Function for direction, convolution with measured impulse responses for reverberation).
- Large-scale sampling mode: The benchmark samples attributes across their domain. For instance, in a pitch comparison task, the fundamental frequency ($f_0$) is sampled from a musical range with a pitch separation $\Delta c \geq 20$ cents (well above the ~10-cent JND in humans), and all other attributes are held constant or symmetrically randomized.
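The large-scale sampling mode described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the sample rate, musical range, and pure-tone stimulus are assumptions; only the 4-s clip length, the ~10-cent JND, and the 2× JND margin come from the text.

```python
import numpy as np

SR = 16_000          # sample rate (Hz); an assumption, not specified in the text
JND_CENTS = 10.0     # approximate human pitch JND cited in the text
MARGIN = 2.0         # benchmark sets differences to at least 2x the JND

def sine_tone(f0: float, dur: float = 4.0, sr: int = SR) -> np.ndarray:
    """A 4-s pure tone, matching the clip length used by SonicBench."""
    t = np.arange(int(dur * sr)) / sr
    return 0.1 * np.sin(2 * np.pi * f0 * t)

def sample_pitch_pair(rng: np.random.Generator,
                      lo: float = 110.0, hi: float = 880.0):
    """Sample a reference f0 from a (hypothetical) musical range, then a
    comparison f0 shifted by at least MARGIN * JND_CENTS cents, sign random."""
    f_ref = rng.uniform(lo, hi)
    cents = rng.uniform(MARGIN * JND_CENTS, 100.0) * rng.choice([-1.0, 1.0])
    f_cmp = f_ref * 2 ** (cents / 1200)   # cents -> frequency ratio
    return sine_tone(f_ref), sine_tone(f_cmp), cents

rng = np.random.default_rng(0)
a, b, cents = sample_pitch_pair(rng)
```

Every other attribute (loudness, duration, spatial rendering) would be held fixed between the two clips, per the design principle below.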
A core design principle is that presented stimuli differ only along the tested cue, with all other features matched to prevent confounds. All physical changes are set to at least twice the human JND value, ensuring that performance deficits reflect architectural or representational limitations rather than marginal discriminability (Sun et al., 16 Jan 2026).
3. Evaluation Paradigms and Protocol
SonicBench features two complementary evaluation paradigms:
- Recognition (Absolute Judgment): The model is presented with a single 4-s clip and a binary forced-choice prompt (e.g., "Is the pitch high or low?"); its output is parsed into one of the two options ('A' or 'B').
- Comparison (Relative Judgment): The model receives a concatenated stimulus pair (a 4-s clip A, 0.5 s of silence, a 4-s clip B) and must make a relative judgment (e.g., "Which clip is louder?").
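The comparison stimulus layout (clip A, 0.5-s gap, clip B) can be sketched directly; the sample rate here is an assumption, while the clip and gap durations follow the description above.

```python
import numpy as np

SR = 16_000  # assumed sample rate; the text specifies durations, not SR

def comparison_stimulus(clip_a: np.ndarray, clip_b: np.ndarray,
                        gap_s: float = 0.5, sr: int = SR) -> np.ndarray:
    """Concatenate two clips with a silent gap, as in the Comparison paradigm."""
    gap = np.zeros(int(gap_s * sr), dtype=clip_a.dtype)
    return np.concatenate([clip_a, gap, clip_b])

a = np.random.default_rng(1).standard_normal(4 * SR)  # stand-in 4-s clip A
b = np.random.default_rng(2).standard_normal(4 * SR)  # stand-in 4-s clip B
pair = comparison_stimulus(a, b)   # 8.5 s total at SR
```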
Stimuli construction uses explicit signal-processing formulae, such as:
- Pitch shift in cents: $c = 1200 \log_2(f_{\text{new}} / f_{\text{ref}})$
- Loudness difference (LUFS): $\Delta L = L_B - L_A$, realized as a linear gain $g = 10^{\Delta L / 20}$
- Direction sampling: an azimuth $\theta$ is drawn and rendered binaurally via HRTF convolution, $y_{L,R}(t) = (x * h^{\theta}_{L,R})(t)$
- Reverberation application: convolution with a measured room impulse response, $y(t) = (x * h_{\mathrm{RIR}})(t)$
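The standard recipes named in this list reduce to a few lines of numpy; this is a generic sketch of those textbook operations, not the benchmark's own code (HRTF rendering is omitted since it requires a measured filter set).

```python
import numpy as np

def cents(f_ref: float, f_new: float) -> float:
    """Pitch interval in cents between two frequencies."""
    return 1200.0 * np.log2(f_new / f_ref)

def apply_gain_db(x: np.ndarray, delta_db: float) -> np.ndarray:
    """Apply a loudness offset in dB; for a fixed signal, a LUFS
    difference reduces to a broadband gain."""
    return x * 10.0 ** (delta_db / 20.0)

def reverberate(x: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry audio with a room impulse response, trimmed to
    the original length."""
    return np.convolve(x, rir)[: len(x)]

assert round(cents(440.0, 880.0)) == 1200  # one octave = 1200 cents
```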
Each model is evaluated under strict zero-shot prompting protocols, with accuracy (fraction of correct A/B judgments) and abstention rate (fraction with non-extractable outputs) as primary metrics. A human baseline is reported as 91% overall accuracy under identical stimulus conditions (Sun et al., 16 Jan 2026).
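One way to compute the two metrics above, modelling a non-extractable output as `None` (the representation is an assumption; the metric definitions follow the text):

```python
def score(outputs, gold):
    """Accuracy (fraction of correct A/B judgments over all items) and
    abstention rate (fraction of non-extractable outputs, here None)."""
    n = len(gold)
    abstained = sum(o is None for o in outputs)
    correct = sum(o == g for o, g in zip(outputs, gold))
    return correct / n, abstained / n

acc, abstain = score(["A", "B", None, "A"], ["A", "A", "B", "A"])
# acc = 0.5, abstain = 0.25 for this toy run
```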
4. Empirical Findings and Model Analysis
Across a representative sample of 36 systems—including LALMs, Large Audio Reasoning Models (LARMs, with Chain-of-Thought), and Omni LLMs (OLMs)—benchmark runs reveal:
- Nearly half of models perform at or near random (50%) across all tested attributes, even at stimulus differences designed to be trivial for humans.
- The top-performing open-source model achieves 72% overall, still well below human baselines.
- Human participants are reliably more accurate on comparison tasks than on recognition tasks (an advantage of ~15%), whereas models generally show no such benefit; in many cases their comparison and recognition accuracies are equivalent, or the ordering is reversed.
- Chain-of-Thought augmentation provides negligible or negative impact for most systems, with one exception where explicit reasoning ameliorated label bias.
- Scaling decoding-time resources does not mitigate these low-level perception deficits.
This suggests a significant disconnect between semantic integration in LALMs and their representation or use of physical auditory features (Sun et al., 16 Jan 2026).
5. Linear Probing and Bottleneck Localization
To dissect where in the architectural pipeline perceptual information is lost, linear probing analysis was conducted:
- For eight representative LALMs, the audio encoder was extracted and frozen.
- A two-layer probe (Linear → MeanPool → Linear) was trained separately per attribute and task.
- Frozen encoder probes reliably achieved ≥ 60% accuracy (often > 80% for pitch, brightness, reverberation), sharply exceeding the full end-to-end (E2E) model's performance (typically 50–65%).
- Allowing the encoder to update during additional training yielded modest gains for select attributes such as tempo or distance, but the gap with the probe persisted.
- The Qwen-Omni family was the sole architecture where encoder-level gains carried through E2E, presumably due to intentional co-training of the encoder and LLM alignment.
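The probe's forward pass (Linear → mean-pool over time → Linear) can be sketched as below. The feature and hidden dimensions are placeholders; the text only fixes the probe's structure, not its sizes.

```python
import numpy as np

def probe_forward(feats, W1, b1, W2, b2):
    """Two-layer probe over frozen encoder features.
    feats: [T, D] frame-level features from the frozen audio encoder."""
    h = feats @ W1 + b1          # [T, H] framewise linear map
    pooled = h.mean(axis=0)      # [H]  temporal mean pooling
    return pooled @ W2 + b2      # [C]  logits for the binary A/B task

rng = np.random.default_rng(0)
T, D, H, C = 200, 512, 128, 2    # hypothetical sizes for illustration
logits = probe_forward(rng.standard_normal((T, D)),
                       rng.standard_normal((D, H)) * 0.01, np.zeros(H),
                       rng.standard_normal((H, C)) * 0.01, np.zeros(C))
```

Because the encoder stays frozen, any accuracy the probe reaches is a lower bound on the perceptual information already present in its features, which is what licenses the bottleneck argument below.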
This evidence localizes the bottleneck primarily within the cross-modal alignment and decoding stages: encoders capture relevant physical cues, which are subsequently attenuated or erased during projection and large model integration (Sun et al., 16 Jan 2026).
6. Implications, Limitations, and Prospective Directions
SonicBench demonstrates that, even as LALMs attain high competence in semantically complex tasks, their grounding in low-level physical audio signals remains inadequate, limiting robust scene understanding and audio reasoning. The linear-probing results show that perceptual cues are present in intermediate representations but are not preserved through to the language-modelling interface.
Key implications include:
- Alignment module optimization: Research into learnable cross-modal projectors and partial encoder unfreezing holds promise for preserving the signal-level cues necessary for perceptual competence.
- Inductive biases and joint objectives: Injecting relational inductive biases (e.g., contrastive or cross-segment attention) may make relative comparison a native operation of the architecture rather than a post-hoc inference. End-to-end objectives that jointly optimize physical perception and high-level reasoning may also be beneficial.
- Benchmark extensibility: The SonicBench experimental protocol—with its tight control of attribute manipulation, psychophysical margins, and dual-paradigm interrogation—provides a model-agnostic methodology for evaluating new architectures and alignment strategies.
A plausible implication is that progress in grounding LALMs at the sensorimotor level will depend less on scale or further instruction tuning, and more on architectural innovations that ensure signal-level properties are faithfully represented and operationalized through the entire inference stack (Sun et al., 16 Jan 2026).
7. Relationship to Codec Benchmarks and Multidimensional Audio Evaluation
SonicBench's methodology and philosophy contrast with "AudioCodecBench," which emphasizes the evaluation of discrete audio tokenizers on axes of waveform reconstruction, token stability, representation perplexity, and semantic probe task performance. While AudioCodecBench offers a systematic protocol for multidimensional codec evaluation—including rigorous definitions of semantic and acoustic tokens, and their combination in fused or decoupled architectures—SonicBench focuses exclusively on physical property perception via controlled benchmarks rather than token-discrete modeling (Wang et al., 2 Sep 2025).
Both benchmarks reflect a broader shift from single-domain, single-metric evaluation toward holistic, multi-axis characterizations of audio models. They serve complementary roles: SonicBench interrogates perceptual grounding at the attribute level; AudioCodecBench provides a framework for the fair comparison of codecs with respect to information retention across both semantic and acoustic domains. Insights from both are relevant for designing next-generation audio models with tightly-coupled perceptual and semantic faculties (Wang et al., 2 Sep 2025, Sun et al., 16 Jan 2026).