WavJEPA: Audio & Jet Physics Analysis
- WavJEPA is a dual-purpose methodology that integrates a self-supervised predictive framework for audio and a wavelet-based approach for jet physics.
- The audio variant leverages CNN and transformer-based encoders to learn high-level representations from raw waveforms, outperforming spectrogram methods.
- The jet physics variant applies 2D discrete wavelet transforms to denoise and enhance jet substructure recognition in high-energy collision data.
WavJEPA refers to two distinct, high-impact methodologies leveraging wavelet-based and predictive joint-embedding paradigms for (1) jet physics and event analysis in high-energy collisions (Monk, 2014), and (2) self-supervised learning of audio representations from raw waveforms (Yuksel et al., 27 Sep 2025). Both methodologies exploit the ability of wavelet or transformer-based architectures to capture multi-scale, localized structure within signals, but target different domains and use substantially different algorithms. The contemporary usage of "WavJEPA" in machine learning generally refers to the audio joint-embedding predictive architecture, though the term has a legacy in collider physics.
1. Definition and Motivation
WavJEPA in the context of audio modeling designates a self-supervised, end-to-end system that learns high-level semantic representations from unprocessed audio waveforms, eschewing traditional spectrogram-based front ends and their drawbacks (e.g., phase loss, computational latency). WavJEPA operationalizes the Joint-Embedding Predictive Architecture (JEPA) directly on raw audio, yielding strong universality across speech, environmental, and musical tasks and demonstrating superior performance and robustness compared to prior time-domain models (Yuksel et al., 27 Sep 2025).
Conversely, the WavJEPA concept in collider physics refers to "Wavelet Analysis for Jet Evolution, Pre-processing and Analysis," which leverages discrete wavelet transforms to de-noise and analyze hadronic event images, enhancing jet substructure recognition and event-shape measurements without explicit jet clustering (Monk, 2014).
2. WavJEPA for Audio: Architecture and Training Protocol
The WavJEPA audio framework comprises four principal components (Yuksel et al., 27 Sep 2025):
- Waveform Encoder: Adapted from Wav2Vec 2.0, this stack of six 1D CNN layers transforms 2 s audio segments sampled at 16 kHz into framewise embeddings.
- Context Encoder and Predictor: Both use Vision Transformer architectures (ViT-B and ViT-S, respectively). Context tokens are sampled as random, non-overlapping blocks of 10 frames each across the encoded sequence, designating 10–20% of frames as context. Prediction targets comprise separate blocks (also 10 frames each), with mask tokens inserted at these positions in the predictor's input.
- Target Encoder: A separate ViT-B network whose weights are maintained as an exponential moving average (EMA) of the context encoder, θ_tgt ← τ·θ_tgt + (1 − τ)·θ_ctx, with τ ramped from 0.999 to 0.99999 during warm-up. Instance normalization and top-k (typically k = 8) layer averaging further stabilize and enhance representation quality.
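The EMA update and its momentum ramp can be sketched as follows. This is a minimal illustration over flat parameter lists (function names are ours, not from the paper); a real implementation would iterate over a network's weight tensors:

```python
def ema_update(target_params, context_params, tau):
    """In-place-style EMA step: theta_tgt <- tau * theta_tgt + (1 - tau) * theta_ctx."""
    return [tau * t + (1.0 - tau) * c for t, c in zip(target_params, context_params)]

def tau_schedule(step, warmup_steps, tau_start=0.999, tau_end=0.99999):
    """Linearly ramp the EMA momentum from tau_start to tau_end during warm-up,
    then hold it at tau_end."""
    frac = min(step / warmup_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)
```

A high τ means the target encoder changes slowly, which keeps the prediction targets stable early in training.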
The model is optimized by minimizing the mean squared error between predicted embeddings and the target encoder's embeddings over the M masked target blocks, L = (1/M) Σᵢ ‖ŝᵢ − sᵢ‖², where ŝᵢ is the predictor's output for block i and sᵢ the corresponding target embedding. The overall loss is averaged over the training corpus.
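A minimal sketch of this objective, assuming each block is an array of framewise embeddings (frames × dims):

```python
import numpy as np

def jepa_block_loss(predicted_blocks, target_blocks):
    """Mean squared error between predicted and target-encoder embeddings,
    averaged over the M target blocks (and over frames/dims within each block)."""
    losses = [np.mean((p - t) ** 2) for p, t in zip(predicted_blocks, target_blocks)]
    return float(np.mean(losses))
```

Note the loss is computed in latent space against the target encoder's output, not against raw waveform samples — this is the defining property of the JEPA objective.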
3. Data Regimes, Preprocessing, and Robustness Extensions
WavJEPA is pre-trained on the unbalanced AudioSet corpus (1.74 M clips), with recordings resampled to 16 kHz and segmented into 2 s windows. Each segment is mean-centered and instance-normalized (Yuksel et al., 27 Sep 2025).
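The segmentation and per-segment normalization described above can be sketched as follows (a simplified illustration assuming an already-resampled 16 kHz mono signal as a NumPy array):

```python
import numpy as np

SR = 16_000        # sampling rate in Hz
WIN = 2 * SR       # 2-second window length in samples

def segments(waveform):
    """Yield non-overlapping 2 s windows, each mean-centred and
    instance-normalised (zero mean, unit variance per segment)."""
    for start in range(0, len(waveform) - WIN + 1, WIN):
        seg = waveform[start:start + WIN].astype(np.float64)
        seg = seg - seg.mean()            # mean-centre
        std = seg.std()
        yield seg / std if std > 0 else seg
```

Instance normalization per segment (rather than global corpus statistics) makes each training example scale-invariant.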
For robustness to realistic acoustic degradations, WavJEPA-Nat extends the baseline by simulating 85k binaural room impulse responses (RIRs) with SoundSpaces 2.0/Matterport3D and overlaying spatialized WHAMR! noise (SNR 5–40 dB). Input is two-channel, with dual waveform encoders, and 2D positional encoding replaces the 1D variant. The same JEPA objective is applied, with targets and context blocks kept consistent across channels.
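The noise-overlay step amounts to scaling the noise so the mixture hits a requested signal-to-noise ratio. A minimal sketch (spatialization via the simulated RIRs is omitted; the function name is ours):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`
    decibels, then add it to the clean signal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(p_clean / p_scaled_noise)  =>  solve for the scale
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Sampling snr_db uniformly in [5, 40] per training example reproduces the degradation range described above.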
4. Empirical Benchmarking and Key Ablations
Comprehensive evaluation on HEAR (Holistic Evaluation of Audio Representations), ARCH, and NatHEAR benchmarks demonstrates WavJEPA’s efficacy:
- Clean-Audio: WavJEPA-Base (≈90M parameters, pre-trained for 375k steps) achieves a score of 66.0 on HEAR (vs. ≈55 for the best spectrogram MAE) and 92.3 on ARCH. Gains are largest for acoustic event classification (e.g., +13.8% mean AP on FSD50K, +19.1% on ESC-50).
- Noise/Reverberation Robustness: On NatHEAR, WavJEPA-Base scores 62.1 (vs. 66.0 on clean HEAR), outperforming other waveform models. Dual-channel WavJEPA-Nat further improves stability under degradation (61.2 on NatHEAR, 60.0 on HEAR).
Key ablations confirm that top-8 target-encoder layer averaging is optimal, with the best-performing sampling ratios in the 0.20–0.25 range. The WavJEPA-Nat variant performs best when trained solely on simulated naturalistic data.
5. Computational Profile and Practical Adoption
WavJEPA-Base's parameter count (≈90M) is similar to prior "Base" speech models, but with a dramatically lower data requirement: 1.74M AudioSet clips rather than the 50–60k hours of speech typical of SSL pre-training (Yuksel et al., 27 Sep 2025). Training runs for 375k steps on dual NVIDIA H100 94GB GPUs. Inference latency is ≈10 ms (set by the waveform encoder's receptive field), supporting real-time and streaming use cases.
Compared to spectrogram-based masked autoencoders (MAEs), which typically exceed 200M parameters due to patchification, WavJEPA is both more memory- and computationally-efficient.
6. WavJEPA in Collider Physics: Multi-scale Jet Analysis
Separate from audio modeling, WavJEPA ("Wavelet Analysis for Jet Evolution, Pre-processing and Analysis") in high-energy physics exploits 2D discrete wavelet transforms (DWT) on rapidity–azimuth (y, φ) event images (Monk, 2014). The methodology:
- Rasterizes events into N × N bins (N a power of 2, commonly 128), summing particle transverse momentum (pT) per bin.
- Applies 2D DWT (typically Daubechies d4), decomposing the image into scale/location-indexed coefficients.
- Thresholds coefficients below 1 GeV (or a pile-up–dependent, scale-based threshold) for denoising.
- Inverse DWT and particle reweighting restore a filtered image; pixel-wise reweighting is applied to input particles.
This approach enhances mass-peak sharpness, suppresses QCD background, and permits jet substructure studies (evolution profiles) without explicit jet clustering, serving as a robust pre-processing step for subsequent analysis.
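The threshold-and-reconstruct step can be illustrated as follows. Monk (2014) uses a Daubechies d4 wavelet (available via libraries such as PyWavelets); for a self-contained sketch we substitute a single-level 2D Haar transform in pure NumPy, which is the simplest member of the same family:

```python
import numpy as np

def haar2d(img):
    """One level of a 2D Haar DWT: returns the approximation (LL) and the
    three detail sub-bands (LH, HL, HH) of an even-sized image."""
    a = (img[0::2] + img[1::2]) / 2.0          # rows: pairwise average
    d = (img[0::2] - img[1::2]) / 2.0          # rows: pairwise difference
    ll, lh = (a[:, 0::2] + a[:, 1::2]) / 2.0, (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl, hh = (d[:, 0::2] + d[:, 1::2]) / 2.0, (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d."""
    a = np.empty((ll.shape[0], ll.shape[1] * 2))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    img = np.empty((a.shape[0] * 2, a.shape[1]))
    img[0::2], img[1::2] = a + d, a - d
    return img

def denoise(img, threshold=1.0):
    """Hard-threshold small detail coefficients (e.g. below 1 GeV),
    then reconstruct the filtered event image."""
    ll, lh, hl, hh = haar2d(img)
    bands = [np.where(np.abs(b) < threshold, 0.0, b) for b in (lh, hl, hh)]
    return ihaar2d(ll, *bands)
```

The filtered image can then be compared pixel-wise to the original to derive the particle reweighting factors mentioned above.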
7. Research Implications and Future Directions
The JEPA framework's success on raw audio implies that semantic predictive objectives applied in latent space (rather than at the token/frame level) generalize more effectively across domains and tasks, especially for environmental, acoustic-event, and scene-understanding applications. The dual-channel WavJEPA-Nat demonstrates that robustness to environmental variability is attainable through realistic simulation and architectural adaptation (e.g., multi-channel networks, 2D positional embeddings).
Future directions posited include multi-modal pre-training (integrating speech with environmental audio), upscaling model size and data, and extending the paradigm to targeted tasks such as source separation, denoising, and generative modeling for real-world acoustic scenarios (Yuksel et al., 27 Sep 2025). In collider physics, further development may integrate wavelet-based preprocessing with advanced grooming and pile-up subtraction techniques.
Selected Benchmark Table for WavJEPA Audio Models
| Model | HEAR Score | ARCH Score | NatHEAR Score |
|---|---|---|---|
| WavJEPA-Base | 66.0 | 92.3 | 62.1 |
| WavJEPA-Nat | 60.0 | - | 61.2 |
Scores reported in (Yuksel et al., 27 Sep 2025).
WavJEPA thus denotes either (i) a state-of-the-art joint-embedding predictive architecture for raw audio, bridging semantic and robust representation learning with computational efficiency, or (ii) a multi-scale, wavelet-based approach for jet physics that enhances structure discovery and background suppression, illustrating the enduring relevance of multi-resolution, locally-adaptive transforms in signal interpretation across domains.