Fast Spectrogram Event Extraction via Offline Self-Supervised Learning: From Fusion Diagnostics to Bioacoustics

Published 23 Feb 2026 in eess.SP, cs.AI, and physics.plasm-ph | (2602.20317v1)

Abstract: Next-generation fusion facilities like ITER face a "data deluge," generating petabytes of multi-diagnostic signals daily that challenge manual analysis. We present a "signals-first" self-supervised framework for the automated extraction of coherent and transient modes from high-noise time-frequency data. We also develop a general-purpose method and tool for extracting coherent, quasi-coherent, and transient modes for fluctuation measurements in tokamaks by employing non-linear optimal techniques in multichannel signal processing with a fast neural network surrogate on fast magnetics, electron cyclotron emission, CO2 interferometers, and beam emission spectroscopy measurements from DIII-D. Results are tested on data from DIII-D, TJ-II, and non-fusion spectrograms. With an inference latency of 0.5 seconds, this framework enables real-time mode identification and large-scale automated database generation for advanced plasma control. Repository is in https://github.com/PlasmaControl/TokEye.


Authors (9)

Summary

The paper introduces a self-supervised framework that efficiently extracts both coherent and transient events from high-noise, time-frequency datasets.
It employs a multichannel nonlinear U-Net and CDF-based global thresholding to achieve high recall and zero-shot cross-domain performance.
The methodology enables real-time event segmentation for fusion diagnostics and bioacoustics, supporting advanced plasma control and scalable database creation.

Fast Spectrogram Event Extraction via Offline Self-Supervised Learning: Signal Processing from Fusion Diagnostics to Bioacoustics

Introduction and Motivation

The proliferation of high-bandwidth diagnostics in contemporary and future fusion devices such as ITER has produced an unprecedented volume of time-series signal data, making manual and heuristic processing infeasible. Traditional threshold- or filter-based approaches are ill-suited to extracting the physically relevant coherent and transient signatures because of nonstationary noise, diagnostic heterogeneity, and limited ground-truth availability. This paper introduces a self-supervised, sensor-agnostic framework for large-scale, automated extraction of events, both coherent (e.g., MHD, Alfvén eigenmodes) and transient (e.g., ELMs), from high-noise time-frequency datasets, enabling real-time analysis and robust database creation for advanced plasma control and physics discovery (2602.20317). The methodology is demonstrated not only on fusion datasets (DIII-D, TJ-II) but also on a non-fusion domain (bioacoustics), which supports its broader generalizability.

Taxonomy of Signal Modes

A formal signal taxonomy is proposed that separates spectrogram events into five categories: coherent, quasi-coherent, transient, broad, and stochastic, further embedded in the broader classes of coherent, broadband, and noise (Figure 1). This data-driven categorization provides explicit signal separation priors essential for algorithmic design, especially when diagnostic-specific features are under-determined or unavailable. The taxonomy also delineates the boundaries between deterministic, random, periodic, and nonstationary behavior, facilitating both baseline removal and event segmentation.

Figure 1: Signal taxonomy with example modes and spectra.

Signal Processing Pipeline

The proposed pipeline (Figure 2) builds on a sequence of modular preprocessing and learning steps:

Time-Frequency Transform: Baseline STFT (Hann window, 500 kHz sampling) is selected for interpretability and compatibility. All downstream enhancement and detection algorithms operate agnostically on the resulting spectrogram.
Broadband Baseline Removal: Nonstationary broadband backgrounds—prevalent in turbulent or integrated measurements—are separated using robust baseline estimation (asymmetric least squares, high regularization), which whitens the spectrum and mitigates power-law biases while preserving narrowband and transient structures.
Multichannel Nonlinear Denoising: Building upon the classical cross-power spectrum, a U-Net is trained in a self-supervised arrangement to reconstruct each channel from all others using real and imaginary STFT components as input, thereby nonlinearly suppressing uncorrelated stochastic noise without erasing localized transients or weak coherent modes.
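The broadband baseline removal step can be illustrated with a minimal sketch of asymmetric least squares in the style of Eilers and Boelens. This is not the authors' implementation: the function name `als_baseline`, the iteration count, and the synthetic spectrum are assumptions made for illustration, and applying the fit along the frequency axis of one time slice is also an assumption. The defaults p=0.001 and λ=10^6 are the settings quoted in the Knowledge Gaps list further down this page.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def als_baseline(y, lam=1e6, p=0.001, n_iter=10):
    """Asymmetric least-squares baseline of a 1-D spectrum (hypothetical helper).

    lam penalizes curvature of the baseline; p is the weight given to points
    that lie above the current baseline estimate (peaks), 1 - p to the rest.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-difference operator, shape (n - 2, n)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve((W + penalty).tocsc(), w * y)
        # Down-weight points above the baseline so peaks do not pull it up
        w = np.where(y > z, p, 1.0 - p)
    return z

# Synthetic spectrum slice in dB: sloped background, noise, one narrow peak
rng = np.random.default_rng(0)
bins = np.arange(512)
spectrum_db = 30.0 - 0.03 * bins + 0.2 * rng.standard_normal(bins.size)
spectrum_db[200:203] += 10.0

baseline = als_baseline(spectrum_db)
residual = spectrum_db - baseline  # background flattened, narrow peak retained
```

Because p is small, the fitted curve follows the lower envelope of the spectrum, so narrowband structure survives in `residual` while the smooth slope is removed.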

Figure 2: Signal processing pipeline for ECE shot demonstrating progressive separation of coherent modes from broadband background and stochastic noise.

The bias-variance tradeoff of linear averaging is illustrated (Figure 3), with the neural network estimator providing a less lossy, channel-aware alternative.

Figure 3: (left) Averaging signals can introduce a bias that removes individual channel information; (right) Multichannel nonlinear reconstruction preserves channel-unique events.
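The averaging bias described in Figure 3 can be reproduced with a small synthetic experiment. The sketch below is only an illustration under assumed values (eight channels, a burst present on channel 0 only); none of the numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_samples = 8, 4096
t = np.arange(n_samples)

shared = np.sin(2 * np.pi * 0.05 * t)   # mode seen by every channel
unique = np.zeros(n_samples)
unique[2000:2100] = 3.0                 # burst seen only by channel 0

signals = shared + 0.5 * rng.standard_normal((n_channels, n_samples))
signals[0] += unique

# Plain channel averaging suppresses noise but also dilutes the unique burst
mean_signal = signals.mean(axis=0)

# Burst height measured relative to the pre-burst level
burst_in_ch0 = signals[0, 2000:2100].mean() - signals[0, :2000].mean()
burst_in_mean = mean_signal[2000:2100].mean() - mean_signal[:2000].mean()
print(f"burst in channel 0: {burst_in_ch0:.2f}, in the average: {burst_in_mean:.2f}")
```

With K channels, an event present on a single channel is scaled by roughly 1/K in the average, which is the loss of channel-specific information that the multichannel reconstruction is meant to avoid.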

Results demonstrate that state-of-the-art single-image and blind-spot denoising methodologies are insufficient due to long-range correlations in sparse spectra (Figure 4).

Figure 4: Blind-spot denoising (AT-BSN) fails to fully suppress noise in sparse, correlated spectral data, especially with small convolution kernels.

A parameter-free, global thresholding method based on CDF knee-point detection is introduced to separate salient, physically meaningful events from spectrogram background—a substantial improvement over heuristic quantile or Otsu methods in the context of sparse, heavy-tailed distributions (Figure 5).

Figure 5: Top: original denoised spectrum; Middle: CDF and optimal threshold (knee); Bottom: binary segmented spectrum.
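One way to realize a knee-point threshold on the CDF is sketched below. The summary does not specify the authors' exact knee criterion, so this uses an assumed, commonly used rule: the point of the normalized empirical CDF that lies farthest above the chord joining its endpoints. The function name `cdf_knee_threshold` and the grid size are hypothetical.

```python
import numpy as np

def cdf_knee_threshold(values, n_points=512):
    """Return the intensity at the knee of the empirical CDF (assumed criterion)."""
    v = np.asarray(values, dtype=float).ravel()
    if v.max() == v.min():
        return float(v.min())           # flat input: no knee to find
    grid = np.linspace(v.min(), v.max(), n_points)
    cdf = np.searchsorted(np.sort(v), grid, side="right") / v.size
    # Normalize both axes to [0, 1] so the chord is the line y = x
    x = (grid - grid[0]) / (grid[-1] - grid[0])
    y = (cdf - cdf[0]) / (cdf[-1] - cdf[0])
    knee = int(np.argmax(y - x))
    return float(grid[knee])

# Synthetic denoised spectrogram: dim background plus a sparse bright event
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((100, 100)))
spec[40:42, :] = rng.uniform(8.0, 10.0, size=(2, 100))

threshold = cdf_knee_threshold(spec)
mask = spec > threshold                 # binary segmentation of bright pixels
```

For sparse, heavy-tailed intensities the CDF rises steeply through the background and then flattens, so the knee falls between the background bulk and the event pixels without a user-chosen quantile.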

Detection refinement employs a secondary single-channel U-Net, leveraging symmetric BCE loss and elastic augmentations to recover edge-case or false-negative regions, further enhancing mask quality without manual intervention.

Surrogate Neural Representation

The processed and refined quasi-labels (about 40,000 spectrogram fragments) are used to train a robust, fast surrogate U-Net for direct event segmentation in new data, supporting real-time inference (< 0.5 s/shot on GPU). Surrogate networks are trained with SpecAugment, robust percentile normalization, and multi-scale windows (window sizes 256–2048) to ensure invariance with respect to input parameter variations. Figure 6 summarizes the end-to-end data extraction and surrogate modeling flow.
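The normalization and augmentation steps named above can be sketched as follows. This is a simplified illustration, not the training code: the helper names and mask widths are assumptions, the 0.1/99.9 percentile clip is the value quoted in the Knowledge Gaps list, and only the masking part of SpecAugment is shown (no time warping).

```python
import numpy as np

def percentile_normalize(spec, lo=0.1, hi=99.9):
    """Clip to the [lo, hi] percentiles, then rescale to [0, 1]."""
    p_lo, p_hi = np.percentile(spec, [lo, hi])
    clipped = np.clip(spec, p_lo, p_hi)
    return (clipped - p_lo) / max(p_hi - p_lo, 1e-12)

def spec_augment(spec, rng, max_freq_width=8, max_time_width=16):
    """Zero one random frequency band and one random time band (assumed widths)."""
    out = spec.copy()
    n_freq, n_time = out.shape
    fw = int(rng.integers(0, max_freq_width + 1))
    f0 = int(rng.integers(0, n_freq - fw + 1))
    out[f0:f0 + fw, :] = 0.0
    tw = int(rng.integers(0, max_time_width + 1))
    t0 = int(rng.integers(0, n_time - tw + 1))
    out[:, t0:t0 + tw] = 0.0
    return out

rng = np.random.default_rng(0)
spec = rng.lognormal(size=(64, 128))    # stand-in for a power spectrogram
normalized = percentile_normalize(spec)
augmented = spec_augment(normalized, rng)
```

Clipping at extreme percentiles keeps a few very bright pixels from compressing the rest of the dynamic range, and the random masks discourage the network from relying on any single band or time interval.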

Figure 6: Automated data extraction pipeline and surrogate model training for real-time event extraction.

Architecture-wise, all U-Nets use a transpose-free upsampling scheme to avoid checkerboard artifacts (Figure 7).

Figure 7: U-Net architecture applied to self-supervision, segmentation refinement, and surrogate modeling.
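The checkerboard issue that motivates transpose-free upsampling can be seen in a one-dimensional toy case. The summary does not say which transpose-free scheme the authors use; the sketch below assumes the common "resize, then convolve" alternative (nearest-neighbor upsampling followed by an ordinary convolution), and both helper functions are hypothetical.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution: each input sample stamps the kernel."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += xi * kernel
    return out

def upsample_then_conv1d(x, kernel, factor):
    """Nearest-neighbor upsampling followed by an ordinary 'same' convolution."""
    up = np.repeat(x, factor)
    return np.convolve(up, kernel, mode="same")

x = np.ones(8)        # constant input: an ideal upsampler returns a constant
kernel = np.ones(3)

tc = transposed_conv1d(x, kernel, stride=2)
rc = upsample_then_conv1d(x, kernel, factor=2)

# Compare interior samples only, to ignore boundary effects
tc_interior = tc[2:-2]   # alternates between 2 and 1: uneven kernel overlap
rc_interior = rc[2:-2]   # constant 3: every output sees the same overlap
```

When the kernel size is not divisible by the stride, the transposed convolution covers output positions unevenly, which appears as a periodic pattern even for a constant input; resizing first removes that unevenness.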

Empirical Evaluation

The framework is validated in several empirical domains:

DIII-D Fusion Spectrograms: On magnetic, ECE, CO2 interferometric, and BES channels, the pipeline yields high-fidelity, channel-specific event segmentation; e.g., efficient extraction of Alfvén modes and tearing/pre-tearing events with time/frequency localization (Figures 8, 9, 10, 11).
Generalization to TJ-II: Without retraining, the surrogate achieves 0.825 recall on expert-annotated ECE spectrograms, with only minor error at low-frequency cutoffs. The model effectively captures unannotated structures and background-quiet intervals (Figure 8).
Figure 9: Magnetic spectrogram segmentation (top), extracted events (middle), and amplitude-gated outputs (bottom) for coherent mode identification.
Generalization to Non-Fusion: Cross-domain benchmarking on bioacoustic datasets (DCLDE dolphin calls) demonstrates a recall of 0.77–0.80 in a zero-shot configuration (Figures 13, 14). The only consistent performance drop occurs for single-pixel-annotated events at spectral boundaries, an issue that can be adjusted through post-processing for fusion-centric tasks.
Figure 10: DCLDE dolphin call spectrogram, with expert annotation (top) and model segmentation (bottom) at 50% threshold.
Computational Efficiency: End-to-end processing scales efficiently (5k segmentations in 5 hours on a single A100); training takes ~12 hours. Inference is real-time compatible: 0.5 s/shot on GPU, 5–10 s/shot on multicore CPU.

Theoretical and Practical Implications

Self-Supervised Scalability: The combination of taxonomy-driven modularization, self-supervised nonlinear denoising, and parameter-free thresholding removes the need for costly expert annotation, which is essential for scaling to petabyte-scale operations.
Diagnostic Agnosticism: The signal-agnostic formulation (including surrogate model invariance to TF parameters, normalization, and multi-channel priors) is crucial for both legacy data mining and new sensor integration in fusion environments with evolving diagnostics. Strong cross-domain generalization to bioacoustics attests to domain-robustness.
Real-Time Plasma Control: With inference latency < 1 s, this architecture enables inter-shot/real-time mode detection and automated selection for active plasma control, supporting advanced techniques such as AE feedback [rothstein_initial_2024] and physics-driven, explainable event chains [farre-kaga_interpreting_2025].

Contrasts with Prior Art and Strong Claims

Bold Numerical Results: Achieved 0.825 recall (TJ-II), 0.77–0.80 recall (bioacoustics) in zero-shot evaluation.
Computational Performance: Surrogate model achieves 0.5 s/shot inference, robust to diagnostic input heterogeneity.
Model Generalizability: Demonstrated strong cross-device, cross-domain performance with no retraining.
Methodological Contrast: The authors demonstrate (Figures 3, 4) that prior methods (linear averaging, masked self-supervised denoising) either fail to capture sparse, correlated events or introduce noise artifacts, whereas the proposed nonlinear, multichannel denoising achieves superior event reconstruction with minimal bias.

Future Directions

Extensions include more efficient self-supervised blind-spot schemes for multichannel scalability, explicit extraction of turbulence and phase information, and integrated diagnostic correlation analysis in a single model pass. Quantization and pruning for CPU inference, along with integration into large, physics-interpretable models and model-based control frameworks, are clear application directions.

Conclusion

This work provides an algorithmic and empirical foundation for fully automated, real-time spectrogram event extraction in fusion and other high-throughput scientific settings. Leveraging a sequence of self-supervised, physically informed signal processing modules, the framework provides robust, scalable, and diagnostic-agnostic event segmentation, with demonstrated transferability to non-fusion domains and suitability for real-time control scenarios.


Explain it Like I'm 14

What is this paper about?

This paper tackles a big problem in fusion energy research: modern fusion machines (like tokamaks) produce huge amounts of data every day, and it’s hard and slow for humans to find important events in that data. The authors built an AI-based method that automatically finds and highlights meaningful patterns in “spectrograms” (pictures that show how signal energy changes across time and frequency). Their system works fast, across different kinds of sensors, and even transfers to other fields like animal sound analysis.

What questions did the researchers ask?

Put simply, they asked:

Can we automatically and quickly find important signal events (like steady tones, chirps, and short bursts) in very noisy time–frequency data?
Can one general method work across many different fusion sensors (magnetics, ECE, CO₂ interferometers, BES) and on other devices (like TJ-II) without hand-tuning?
Can we do this in a way that doesn’t rely on lots of human labels, but still builds a useful database and runs fast enough for real-time monitoring?

How did they do it?

They designed a “signals-first” pipeline: start from the raw signals and systematically separate useful patterns from the background. Here are the main ideas in everyday terms.

Turning signals into pictures (spectrograms)

A spectrogram is like a piano-roll view of sound: it shows which “notes” (frequencies) are active at each moment.
They used a standard tool called the Short-Time Fourier Transform (STFT). Think of sliding a short window over the signal and, at each step, listing which frequencies are present.
This creates a time–frequency image where bright spots or lines can indicate interesting events.
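To make the sliding-window idea concrete, here is a small sketch that builds a spectrogram of a noisy test tone with SciPy. The 500 kHz sample rate, window length of 1024, and 87.5% overlap are the settings quoted in the Knowledge Gaps list on this page; the 80 kHz tone, the noise level, and the variable names are made up for illustration.

```python
import numpy as np
from scipy.signal import stft

fs = 500_000                              # samples per second
t = np.arange(0, 0.05, 1 / fs)            # 50 ms of signal
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 80_000 * t) + rng.standard_normal(t.size)

# Hann window of 1024 samples; 896 overlapping samples is 87.5% overlap
freqs, times, Z = stft(x, fs=fs, window="hann", nperseg=1024, noverlap=896)

power_db = 10 * np.log10(np.abs(Z) ** 2 + 1e-12)   # the spectrogram "picture"
peak_freq = freqs[np.argmax(power_db.mean(axis=1))]
print(f"brightest horizontal line near {peak_freq / 1e3:.1f} kHz")
```

Each column of `power_db` is one window position and each row is one frequency, so a steady tone shows up as a bright horizontal line.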

Separating the background “hum” (baseline removal)

Many fusion signals have a strong low-frequency “slope” or rumbling background (like wind noise in a microphone).
The authors estimate this smooth background (“baseline”) and subtract it, which “whitens” the spectrogram. That makes faint, narrow patterns stand out more clearly without throwing away important low-frequency details.

Using many sensors together (multichannel denoising)

Fusion machines use many sensors at once—like having many microphones around a stage.
Instead of simply averaging (which can erase rare events), they train a neural network (a U-Net) that learns how signals relate across channels. It uses information from all the other channels to predict and clean up each target channel.
This self-supervised approach works without clean labels: it learns to keep consistent, shared patterns and suppress random noise.
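The "predict each channel from all the others" setup can be sketched as a data-preparation step. This only shows how training pairs might be assembled from complex STFTs, with real and imaginary parts stacked as separate input planes; the function name and array layout are assumptions, and the U-Net that would be trained on these pairs is not shown.

```python
import numpy as np

def leave_one_out_pairs(Z):
    """Build (inputs, target) pairs from complex STFTs of shape (channels, freq, time).

    For each target channel k, the inputs are the real and imaginary parts of
    every other channel, and the target is the real and imaginary part of k.
    """
    Z = np.asarray(Z)
    pairs = []
    for k in range(Z.shape[0]):
        others = np.delete(Z, k, axis=0)
        inputs = np.concatenate([others.real, others.imag], axis=0)
        target = np.stack([Z[k].real, Z[k].imag], axis=0)
        pairs.append((inputs, target))
    return pairs

# Four hypothetical channels with random complex STFT values
rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 33, 20)) + 1j * rng.standard_normal((4, 33, 20))
pairs = leave_one_out_pairs(Z)
inputs, target = pairs[0]
print(inputs.shape, target.shape)
```

Since the target channel never appears in its own input, a model trained this way can only reproduce what the other channels also contain, which is why noise that is independent between channels tends to be suppressed.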

Picking out the events (smart thresholding)

After cleaning, they need to decide which pixels in the spectrogram belong to real events.
Rather than using a fixed cutoff, they look at the overall intensity distribution and find the “knee” point (the place where the curve bends). It’s like choosing a sensible boundary between background and meaningful signal automatically.
This selects bright, sparse events (steady tones and short bursts) while avoiding over-selection.

Training a fast helper model (surrogate)

They use the cleaned and auto-labeled data to train a final “surrogate” U-Net that can quickly spot events on new spectrograms.
This model then runs in about half a second per shot on a GPU, enabling near real-time detection and building large databases automatically.

Before doing all this, they also define a simple taxonomy (categories) of what they’re looking for:

Coherent (steady or slowly changing tones)
Quasi-coherent (mostly steady but a bit fuzzy)
Transient (short spikes, like claps)
Broad (wide-band background or drifts, like a steady rumble)
Stochastic (random noise)

What did they find?

Here are the main results, explained in plain language:

The pipeline can reliably separate faint, steady patterns (like Alfvén modes) from heavy background and noise across multiple sensor types (magnetics, ECE, CO₂ interferometers, BES) on the DIII-D tokamak.
It uncovers interesting physics automatically—for example, changes in high-frequency activity during tearing mode suppression (an important control topic in fusion).
The surrogate model generalizes well:
- On the TJ-II stellarator’s data, it matched expert labels strongly without retraining.
- On marine bioacoustics (dolphin calls), it detected vocalizations in a zero-shot setting (no retraining), showing the method works beyond fusion.
It’s fast: around 0.5 seconds per shot on a GPU, which is suitable for real-time or inter-shot analysis. A CPU run is slower (about 5–10 seconds), but still practical.
It created a large, auto-labeled database (tens of thousands of spectrogram snippets), which helps future AI training and science studies.

Why is this important?

Fusion devices are moving toward “burning plasma” conditions and will produce enormous data (petabytes per day in future facilities). Manually scanning that data isn’t feasible.
Automatic, reliable detection of events helps scientists understand and control the plasma more quickly, potentially preventing damaging disruptions.
A single general method reduces the need to handcraft different tools for different sensors, saving time and reducing mistakes.
The approach builds high-quality training data without heavy human labeling, which is crucial for modern AI systems.
Because it generalizes to other machines and even other fields (like bioacoustics), it can accelerate scientific discovery across domains.

Takeaway

The authors created a practical, fast, and general AI pipeline that turns messy, noisy spectrograms into clear maps of important events—without relying on lots of human labels. It works across different fusion diagnostics, transfers to other devices and domains, and runs quickly enough to help with real-time plasma monitoring and control. This kind of tool can make a big difference as fusion research moves into data-heavy, high-performance operation.


Knowledge Gaps

Below is a consolidated list of the paper’s unresolved knowledge gaps, limitations, and open questions, framed to be actionable for future research.

Formal validation of the proposed signal taxonomy: No quantitative or expert-reviewed assessment of the five classes (coherent, quasi-coherent, transient, broad, stochastic) across devices, diagnostics, and operating regimes; need inter-rater agreement studies and confusion analyses between “broad” and deterministic non-periodic phenomena.
Separation of broad vs transient events: The paper suggests impulse-based heuristics but does not implement or evaluate a robust transient detector (e.g., ELM/sawtooth/pellet), nor quantify false separations of broad drifts vs turbulence.
Baseline removal parameters and generality: Fixed asymmetric least-squares settings (p=0.001, λ=10^6, pre-emphasis α=1) lack adaptive selection, calibration, or sensitivity analyses across sensors and shots; edge effects (up to 4 kHz) are noted but not systematically mitigated or quantified.
Objective evaluation of “whitening”: The claim that baseline removal whitens the residual is not validated with spectral whiteness tests (e.g., Ljung–Box, autocorrelation decay) or color assessment across frequencies and diagnostics.
STFT design choices: The selection of window size (N=1024), overlap (87.5%), and resampling to 500 kHz is not justified via ablations; effects on time-frequency resolution, aliasing (especially for CO2 signals originally at 1 MHz), and sensitivity to parameter changes remain unquantified.
Alternative T-F representations: The paper avoids wavelets/multitaper/synchrosqueezing for interpretability but does not benchmark them empirically against the proposed pipeline to quantify trade-offs (e.g., faint mode retention vs baseline suppression).
Complex-valued denoising fidelity: Operating on real/imag components lacks constraints to enforce STFT phase consistency; no evaluation of phase errors, spectral leakage, or artifact formation; open question whether complex losses (e.g., magnitude–phase) improve reconstruction.
Assumption of zero-mean, independent noise: The self-supervised denoising relies on noise independence across channels, which may be violated (electronics, environment); needs measurements of inter-channel noise correlation and robustness to colored residuals post-baseline removal.
Single-channel diagnostics: The multichannel predictor cannot recover information unique to a single channel; performance on single-channel sensors (or sparsely instrumented systems) is underexplored and lacks alternative denoisers tested.
TV-based early stopping: Total-variation criterion is proposed to avoid learning noise, but no thresholds, stopping rules, or correlation with noise content are reported; open whether TV is both necessary and sufficient across diverse noise types.
CDF knee-point thresholding stability: While parameter-free, behavior in pure-noise spectra (no meaningful modes), multi-modal histograms, and time-varying noise floors is not quantified; need robust fallback thresholds and local adaptivity tests.
Post-threshold morphological processing: The pipeline does not evaluate whether skeletonization, contour smoothing, or connectivity constraints improve precision without compromising recall, especially for narrow-band AE chirps and dolphin calls.
Detection refinement details: “High-entropy measurements” inclusion is not formalized; criteria, thresholds, and impact on label quality (false positives/negatives) are missing; risk of confirmation bias in self-generated masks remains unaddressed.
Label noise and dataset quality: The 40,000-fragment database is unlabeled and self-supervised; there is no systematic QA/QC protocol, human-in-the-loop validation, or error auditing to estimate label noise and its downstream effect on surrogate training.
Surrogate evaluation metrics: Performance is reported primarily via recall on TJ-II and DCLDE; precision, IoU, F1, and calibration curves (e.g., probability-threshold sweeps) are missing; device-specific differences and thresholds (50%) are not stress-tested.
BES results absent: Although BES is listed, no figures or quantitative metrics are reported; the pipeline’s robustness to BES-specific artifacts and lower SNR regimes remains untested.
Physics validation and interpretability: Mode-number (n,m) inference, radial localization, and cross-diagnostic coherence checks are not demonstrated; need end-to-end validation linking segmentations to known physics events (ELMs, NTMs, AEs) across many shots.
Integration with real-time control: Claims of suitability for advanced plasma control lack a demonstration of closed-loop integration (latency budgets, streaming architecture, trigger reliability, false-alarm rates) and risk analyses for control actions.
Cross-diagnostic fusion: The current approach reconstructs one view at a time; joint multi-diagnostic fusion, alignment across sampling rates, time synchronization, and conflict resolution (inconsistent detections) is an open design and evaluation space.
Mode classification vs binary segmentation: The surrogate produces binary masks of “coherent/transient” regions; a classifier to map detections into specific physics classes (e.g., AE, NTM, GAM, ELM, sawtooth) is absent and needed for science and control use.
Robustness to device/domain shifts: Generalization beyond DIII-D and TJ-II (e.g., JET, KSTAR, EAST, NSTX-U, ITER-like diagnostics) is untested; domain adaptation strategies and failure modes under changed noise characteristics and sampling are unknown.
Quantitative SNR improvements: No rigorous SNR/PSNR/SSIM or physics-specific metrics show how much denoising and baseline removal improve detectability of faint modes; sensitivity curves (minimum detectable amplitude/chirp rate) are missing.
Threshold sensitivity and uncertainty: No uncertainty quantification (e.g., predictive confidence maps, calibrated probabilities) accompanies detections; threshold selection procedures and their stability across diagnostics are not documented.
CPU deployment constraints: Inference on CPUs (5–10 s/shot) is slow for variable-size inputs; quantization/pruning is proposed but untested; memory footprint, throughput scaling, and energy costs for plant deployment remain open.
Scalability with channel count: The multichannel U-Net scales memory with 2k input channels; no profiling for tens/hundreds of channels (e.g., MHR arrays); need architectures that scale (blind-spot, attention, low-rank fusion) with demonstrated speed/memory trade-offs.
Choice of loss function: MAE on real/imag may not be optimal; investigation of complex-domain losses, magnitude–phase decompositions, or perceptual T-F losses tailored to mode detection is missing.
Pre-emphasis effects: The α=1 pre-emphasis may bias high-frequency regions; its impact on faint low-frequency NTMs and near-DC coherent modes is not studied; adaptive pre-emphasis or frequency-dependent regularization could be evaluated.
Baseline vs HPS vs morphological filtering: The paper motivates baseline removal via spectroscopy analogs but does not benchmark against harmonic–percussive separation or morphological filters on T-F planes; comparative studies are needed.
Data and code reproducibility: The repository link is provided, but details on datasets (availability, licenses), exact preprocessing pipelines, seeds, and environment (versions) required for deterministic reproduction are not documented.
Annotation mismatch in DCLDE: Precision is low due to annotation width differences; boundary-aware metrics (e.g., Hausdorff distance, boundary IoU) or label dilation studies are not conducted to fairly evaluate segmentation performance.
Handling coherent non-physical artifacts: Antenna scans and instrument-induced coherent signatures are acknowledged but not explicitly detected/filtered; a method to distinguish physical modes from coherent sensor artifacts is needed.
Streaming and online operation: The pipeline is offline; a streaming variant (sliding STFT windows, continuous baseline tracking, online thresholds) with bounded latency and drift correction is not developed or benchmarked.
Parameter auto-tuning: The pipeline uses several fixed parameters (STFT, ALS, TV thresholds, augmentation configs); automated per-shot/dataset tuning (e.g., via Bayesian optimization) and its stability across diagnostics is unstudied.
Data normalization/calibration: Diagnostic-specific normalization (cumulative mean/std, clipping at 0.1/99.9 percentile) may hinder cross-diagnostic amplitude comparability; mapping to physical units and cross-shot calibration procedures are not provided.
Evaluation on edge frequency bins: The approach acknowledges low-frequency edge effects but does not quantify detection reliability near DC or at the highest frequency bins; needs boundary-aware analyses and mitigation strategies.
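Several of the gaps above (missing precision, IoU, and F1; annotation-width mismatch in DCLDE) could be examined with a small metrics helper. The sketch below is one possible approach, not something from the paper: `mask_metrics` and its `tolerance` parameter are hypothetical, and the tolerance is implemented as a simple binary dilation of the reference mask before matching.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def mask_metrics(pred, truth, tolerance=0):
    """Precision, recall, F1, and IoU for binary masks.

    With tolerance > 0, a predicted pixel counts as correct if it lies within
    `tolerance` dilation steps of a labeled pixel, and vice versa for recall.
    IoU is always computed on the undilated masks.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if tolerance > 0:
        truth_wide = binary_dilation(truth, iterations=tolerance)
        pred_wide = binary_dilation(pred, iterations=tolerance)
    else:
        truth_wide, pred_wide = truth, pred
    precision = (pred & truth_wide).sum() / max(pred.sum(), 1)
    recall = (truth & pred_wide).sum() / max(truth.sum(), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    iou = (pred & truth).sum() / max((pred | truth).sum(), 1)
    return {"precision": float(precision), "recall": float(recall),
            "f1": float(f1), "iou": float(iou)}

# A one-pixel-wide annotation versus a three-pixel-wide prediction
truth = np.zeros((32, 32), dtype=bool)
truth[10, 5:26] = True
pred = np.zeros((32, 32), dtype=bool)
pred[9:12, 5:26] = True

strict = mask_metrics(pred, truth)
tolerant = mask_metrics(pred, truth, tolerance=1)
```

In this toy case the strict precision is one third even though the prediction follows the annotated line exactly, while a one-pixel tolerance removes the penalty; that is the kind of width effect a label dilation study would quantify.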


Practical Applications

Immediate Applications

Below are deployable use cases that can be adopted now using the paper’s released codebase (TokEye) and described methods, with sector links, potential tools/workflows, and key dependencies.

Automated inter-shot mode cataloging in fusion experiments
- Sectors: Energy (fusion), Software/AI
- Tools/Workflows: Batch processing of STFT spectrograms from ECE, CO2 interferometers, magnetics, BES; event masks stored in a searchable database; amplitude/chirp summaries auto-generated per shot
- Assumptions/Dependencies: Access to raw multi-diagnostic data; STFT preprocessing compatible with local pipelines; compute availability (GPU for ~0.5 s/shot or CPU for ~5–10 s/shot)
Real-time operator dashboards with mode overlays
- Sectors: Energy (fusion), HMI/Visualization
- Tools/Workflows: Live spectrograms with coherent/transient segmentation overlays; alarm thresholds for AE/ELM-like detections; integration with control room UIs
- Assumptions/Dependencies: Low-latency data stream; GPU in the loop for sub-second inference; stable STFT parameters; signal contains detectable features (knee-threshold requires signal presence)
Cross-diagnostic event alignment and quick-look physics analysis
- Sectors: Energy (fusion), Research/Academia
- Tools/Workflows: Synchronized event masks across ECE/CO2/MHD/BES channels; rapid identification of onset times, chirps, and amplitude envelopes; export to MDSplus-compatible tags
- Assumptions/Dependencies: Time synchronization across diagnostics; sufficient channel count for multichannel denoising (best performance with ≥2 correlated channels)
Rapid auto-label generation to train/benchmark ML for fusion control
- Sectors: Energy (fusion), Software/AI
- Tools/Workflows: Surrogate event segmentations used as labels for disruption predictors, NTM/AE classifiers, and anomaly detectors; dataset curation pipelines via FAITH + TokEye
- Assumptions/Dependencies: Agreement on label schemas within a group; acceptance of self-supervised labels for bootstrapping (with optional human QA)
Post-shot mode amplitude gating and chirp tracking
- Sectors: Energy (fusion)
- Tools/Workflows: Thresholded event masks gating amplitude and frequency bands; per-event statistics (fmin/fmax, duration, amplitude); CSV/JSON exports for analysis notebooks
- Assumptions/Dependencies: Stable thresholding; correct handling of low-frequency baseline bias (pre-emphasis and baseline removal tuned per diagnostic)
Diagnostic commissioning and health monitoring
- Sectors: Energy (fusion), Test & Measurement
- Tools/Workflows: Baseline removal to quantify noise floors, 1/f drifts, and edge artifacts; trend monitoring of sensor health across campaigns
- Assumptions/Dependencies: Regular calibration data; consistent acquisition settings to track drifts
Zero-shot bioacoustic call triage (marine mammal surveys)
- Sectors: Environment, Bioacoustics, Industry (offshore energy, shipping)
- Tools/Workflows: Batch segmentation of dolphin calls in passive acoustic datasets; pre-screening to reduce human annotation burden; triage for shrimp noise clutter
- Assumptions/Dependencies: Spectrograms computed with similar parameter ranges; acceptance that precision may be lower without fine-tuning; post-processing to match narrow annotation conventions
General-purpose spectrogram preprocessing library for labs
- Sectors: Academia, Software/AI
- Tools/Workflows: Drop-in Python modules for baseline removal (asymmetric least squares + TV), multichannel self-supervised denoising (Noise2Noise-inspired), knee-CDF thresholding
- Assumptions/Dependencies: Access to STFTs or compatible T–F transforms; zero-mean independent noise for best denoising performance
Pilot predictive maintenance on vibration/acoustic spectrograms
- Sectors: Manufacturing, Robotics
- Tools/Workflows: Apply baseline separation + denoising to motor/gearbox audio or accelerometer spectrograms; flag quasi-coherent harmonics and transients as early warnings
- Assumptions/Dependencies: Minimal fine-tuning likely needed; domain validation against labeled fault data; multichannel sensors improve robustness
Seismology and geophysics quick screening of tremor bursts
- Sectors: Earth Sciences
- Tools/Workflows: Rapid segmentation of tremor-like time–frequency patches in regional networks; triage for analyst review
- Assumptions/Dependencies: Parameter tuning for domain-specific frequency bands; validation on representative datasets
Teaching modules for time–frequency event extraction
- Sectors: Education, Academia
- Tools/Workflows: Open-source examples using TokEye/FAITH; lab assignments on baseline removal, denoising, and segmentation; reproducible pipelines
- Assumptions/Dependencies: Basic Python environment; spectrogram-ready datasets for classroom use
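For the post-shot gating and per-event statistics workflow listed above, a minimal sketch using connected-component labeling is shown here. It does not use the TokEye API; the function `event_table`, its field names, and the toy mask are assumptions made for illustration.

```python
import json
import numpy as np
from scipy.ndimage import label, find_objects

def event_table(mask, power, freqs, times):
    """Summarize each connected region of a binary event mask (freq x time)."""
    labels, n_events = label(mask)
    rows = []
    for idx, (f_sl, t_sl) in enumerate(find_objects(labels), start=1):
        region = labels[f_sl, t_sl] == idx      # pixels of this event only
        rows.append({
            "event": idx,
            "fmin": float(freqs[f_sl.start]),
            "fmax": float(freqs[f_sl.stop - 1]),
            "t_start": float(times[t_sl.start]),
            "duration": float(times[t_sl.stop - 1] - times[t_sl.start]),
            "mean_power": float(power[f_sl, t_sl][region].mean()),
        })
    return rows

# Toy mask with two separate events on a 20 x 30 time-frequency grid
freqs = np.arange(20) * 1000.0        # Hz
times = np.arange(30) * 0.001         # s
mask = np.zeros((20, 30), dtype=bool)
mask[2:4, 5:10] = True
mask[10:15, 20:22] = True
power = np.ones((20, 30))

rows = event_table(mask, power, freqs, times)
print(json.dumps(rows, indent=2))     # JSON export for analysis notebooks
```

The duration here is measured between the first and last time bins of the event's bounding box, so it excludes the width of the final bin; a real pipeline would need to pick a convention and apply it consistently.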

Long-Term Applications

These applications will benefit from further research, scaling, integration, or validation before widespread deployment.

Closed-loop plasma control (in-shot) for mode suppression/avoidance
- Sectors: Energy (fusion), Control Systems
- Tools/Workflows: Integration with PCS for ECCD targeting (NTM), fast AE/ELM triggers; quantized/pruned models on real-time controllers; sub-100 ms end-to-end latency
- Assumptions/Dependencies: Deterministic latency guarantees; hardware acceleration on PCS; rigorous safety and reliability testing
ITER/SPARC-scale automated labeling and data governance
- Sectors: Energy (fusion), Data Infrastructure, Policy
- Tools/Workflows: Petabyte/day pipelines to auto-label spectrograms; standardized event schemas; metadata catalogs and FAIR-compliant archives
- Assumptions/Dependencies: HPC resources; consensus on annotation standards; data access and retention policies across facilities
Joint cross-diagnostic inference (multi-view fusion)
- Sectors: Energy (fusion), Software/AI
- Tools/Workflows: Models that learn inter-sensor correlations (phase, turbulence, multi-physics) in one pass; improved sensitivity to faint modes
- Assumptions/Dependencies: Synchronized multi-sensor datasets; architectural advances to avoid memory/compute blowup with many channels
Explainable mode discovery and cross-device meta-analysis
- Sectors: Academia, Energy (fusion)
- Tools/Workflows: Large, curated event catalogs enabling statistical discovery of new mode families and operating envelopes; explainability overlays (e.g., saliency on T–F)
- Assumptions/Dependencies: Harmonized metadata across devices; reproducible pipelines; community tools for cross-lab analysis
Clinical EEG/ECG event detection and triage
- Sectors: Healthcare
- Tools/Workflows: Self-supervised denoising + segmentation for seizure/spike detection or arrhythmia bursts in T–F space; clinician triage dashboards
- Assumptions/Dependencies: Extensive clinical validation; regulatory approvals; strong privacy/security safeguards; domain-specific thresholds
Edge bioacoustic mitigation for marine operations
- Sectors: Environment, Maritime, Offshore Energy
- Tools/Workflows: On-buoy/on-vessel embedded inference for marine mammal detection; real-time alerts for ship-strike avoidance or survey shut-downs
- Assumptions/Dependencies: Efficient, low-power hardware; model compression; robust performance under variable ambient noise; acceptance by regulators and other stakeholders
Embedded anomaly detection in robotics/manufacturing assets
- Sectors: Robotics, Manufacturing, Industrial IoT
- Tools/Workflows: PLC/edge-device deployment for continuous vibration/acoustic monitoring; predictive maintenance scheduling
- Assumptions/Dependencies: Integration with existing PLCs/SCADA; domain adaptation; labeled failure data for benchmarking
Defense/sonar/radar signal extraction in cluttered environments
- Sectors: Defense, Aerospace
- Tools/Workflows: Self-supervised multichannel denoising and event segmentation for weak targets amidst broadband clutter
- Assumptions/Dependencies: Access to representative datasets; security/ITAR compliance; real-time processing constraints
Astronomy and remote sensing event detection (e.g., gravitational-wave glitches, radio bursts)
- Sectors: Space/Astronomy, Earth Observation
- Tools/Workflows: Automated T–F event extraction to triage large-scale spectrogram archives; anomaly discovery workflows
- Assumptions/Dependencies: Domain-specific preprocessing (e.g., whitening pipelines); validation against existing pipelines
Smart-home and on-device audio event detection with privacy
- Sectors: Consumer Electronics, Privacy Tech
- Tools/Workflows: Robust noise-aware event detectors (e.g., glass break, alarms) running locally; reduced false alarms in nonstationary noise
- Assumptions/Dependencies: Model compression for embedded hardware; curated datasets; privacy-by-design constraints
“Spectrogram Event Extractor” platform and standards
- Sectors: Software/AI, Policy, Standards
- Tools/Workflows: Productized APIs and plugins (MDSplus/EPICS/CODAC); streaming microservices (Kafka/GStreamer) for T–F analytics; open annotation standards
- Assumptions/Dependencies: Community buy-in for standards; sustainable maintenance; interoperability with legacy systems
Procurement and regulatory guidance for data-ready instrumentation
- Sectors: Policy, Energy (fusion), Environment
- Tools/Workflows: Guidelines to ensure future diagnostics or monitoring arrays produce AI-ready, synchronized, multi-channel T–F data; compliance checklists for environmental monitoring
- Assumptions/Dependencies: Coordination across labs/agencies; balancing data volume with cost; clear governance for data sharing

Key Cross-Cutting Assumptions and Dependencies

Noise model: Best denoising performance assumes zero-mean, independent noise across channels; performance may degrade otherwise.
Broadband baseline: Background must exhibit 1/f-like structure for baseline removal to be effective; parameters may require tuning per sensor.
Channel availability: Multichannel inputs improve denoising; single-channel operation is supported but less robust (mitigated by refinement).
Spectrogram configuration: STFT windowing and resampling must match the signal’s bandwidth and dynamics; multi-scale training helps but is not a panacea.
Latency and compute: GPU acceleration brings inference to ~0.5 s per shot; CPU-only deployments may need quantization/pruning to meet tighter latency targets.
Thresholding: Knee-CDF thresholding assumes that events are present; purely quiescent spectra may yield spurious sparse detections and need rule-based gating.
Domain transfer: Zero-shot generalization is promising (TJ-II, DCLDE) but fine-tuning typically improves precision and alignment with domain-specific annotation practices.
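
The thresholding caveat above can be made concrete with a small sketch. The paper's exact knee-finding rule is not given here, so this example assumes a common heuristic: the knee is the point of the empirical CDF farthest above the chord joining its endpoints. The gate for quiescent spectra (`min_contrast`, based on a median/MAD contrast check) is a hypothetical rule added for illustration only, as are the function names.

```python
import numpy as np


def knee_cdf_threshold(spec, n_grid=256):
    """Global threshold at the knee of the empirical CDF of `spec`."""
    values = np.sort(np.asarray(spec, dtype=float).ravel())
    lo, hi = values[0], values[-1]
    if hi == lo:  # constant image: no knee to find
        return float(hi)
    grid = np.linspace(lo, hi, n_grid)
    cdf = np.searchsorted(values, grid, side="right") / values.size
    # Knee = grid point where the CDF is farthest above the chord
    # that joins the first and last CDF samples.
    x = (grid - lo) / (hi - lo)
    chord = cdf[0] + (cdf[-1] - cdf[0]) * x
    return float(grid[np.argmax(cdf - chord)])


def segment(spec, min_contrast=10.0):
    """Boolean event mask; empty if the spectrum looks quiescent."""
    spec = np.asarray(spec, dtype=float)
    med = np.median(spec)
    sigma = 1.4826 * np.median(np.abs(spec - med))  # robust scale (MAD)
    # Hypothetical gate: require the brightest pixel to stand well
    # above the background before trusting the knee threshold.
    if spec.max() - med < min_contrast * max(sigma, 1e-12):
        return np.zeros(spec.shape, dtype=bool)
    return spec > knee_cdf_threshold(spec)


# Synthetic check: same background, with and without one coherent line.
rng = np.random.default_rng(0)
quiet = np.abs(rng.normal(0.0, 0.1, size=(128, 128)))
busy = quiet.copy()
busy[40, :] += 5.0  # a horizontal (coherent-mode-like) stripe

mask_busy = segment(busy)
mask_quiet = segment(quiet)
```

In this toy case the stripe is recovered in `mask_busy`, while `mask_quiet` stays empty because the gate rejects the event-free spectrogram before any knee is computed. The right gate and its setting would need to be validated per diagnostic.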


Glossary

1/f noise: A type of signal noise where power spectral density scales inversely with frequency (often denoted 1/f^χ). "it is observed to follow a $1/f^{\chi}$ dropoff"
Alfvén eigenmodes (AE): Magnetically driven plasma oscillations associated with Alfvén waves in tokamaks/stellarators. "including coherent modes such as \acrfull{mhd} instabilities and \acrfull{ae},"
AP-BSN: A blind-spot denoising network variant that masks target pixels to avoid identity mapping and learn noise statistics. "Methods such as AP-BSN, which uses variable kernel sizes has been tested,"
asymmetric least squares: A baseline estimation method that fits a smooth curve while weighting points above the fit far less than points below it, so the estimate settles beneath spectral peaks and follows curved backgrounds. "We use the standard value of $\lambda = 10^6$ for asymmetric least squares, which balances fitting fidelity with smoothness constraints"
AT-BSN: A blind-spot denoising model variant optimized for efficiency relative to AP-BSN. "Blind-spot denoising example with AT-BSN, a more efficient form of AP-BSN."
beam emission spectroscopy (BES): A diagnostic measuring plasma density fluctuations via light from injected neutral beams. "and beam emission spectroscopy measurements from DIII-D."
BF-16 (bfloat16) precision: A 16-bit floating-point format with 8-bit exponent for efficient deep learning training/inference. "Training was done using \acrfull{bf}-16 precision."
BM3D: A state-of-the-art image denoising algorithm based on block matching and collaborative filtering in 3D transform domains. "Linear filters (e.g., Gaussian/Wiener filters, BM3D) tend to average out the faint, transient, and non-stationary signals"
Brownian noise: Integrated white noise with a 1/f² power spectrum, so power rises toward lower frequencies. "also known as Brownian noise,"
Chebyshev Type I decimation: Downsampling preceded by an IIR Chebyshev Type I low-pass filter to control aliasing. "8th-order Chebyshev Type I decimation with phase preservation"
CO2 interferometer: A density diagnostic using CO2 laser interferometry to measure line-integrated electron density. "CO2 interferometers,"
correlation electron cyclotron emission (CECE): A dual-channel ECE technique that exploits correlated fluctuations to improve SNR. "such as \acrfull{cece} which effectively provides two measures of the same phenomena"
cross-power spectrum (CPS): The frequency-domain product of one signal and the complex conjugate of another, used to reveal coherent components. "similar to the classical \acrfull{cps}"
cumulative distribution function (CDF): The function giving the fraction of values at or below a given level; used here to locate a knee point for spectrogram thresholding. "the knee point of the spectrograms \acrfull{cdf}"
curvelet transform: A multiscale directional transform suited for sparsely representing edges/curves in images and spectrograms. "Conversely, sparse transforms (e.g., wavelet, curvelet) rely on strong priors"
DCLDE 2011 dataset: A benchmark bioacoustics dataset for detection/classification/localization of marine mammals. "we deploy it on a well known \acrfull{dclde} 2011 Ocodonte dataset"
electron cyclotron current drive (ECCD): A technique that drives plasma current using electron cyclotron waves to control instabilities. "199607 is \acrfull{eccd} supressed,"
electron cyclotron emission (ECE): Radiation emitted by electrons gyrating in magnetic fields, used to infer electron temperature/fluctuations. "We also check \acrshort{ece}."
electroencephalography (EEG): A neurophysiological measurement of brain electrical activity; used here for analogies to spectrogram event analysis. "which is widely used in \acrfull{eeg} signal processing"
edge localized modes (ELM): Bursty instabilities at the plasma edge causing transient transport events. "and transient modes such as \acrfull{elm} and neutral particle spikes,"
empirical mode decomposition (EMD): A data-driven method that decomposes signals into intrinsic mode functions for nonstationary analysis. "[e.g., \acrfull{emd}, \acrfull{vmd}]"
event-related spectral perturbation (ERSP): A measure of spectral changes time-locked to events; adapted here for baseline removal. "This is based on \acrfull{ersp} which is widely used in \acrfull{eeg} signal processing"
FAITH (Fusion Artificial Intelligence and Toolkit Hub): A Python toolkit and data hub for high-performance ML on fusion data. "Data is processed via \acrfull{faith}, a Python package and fusion database"
geodesic acoustic modes (GAM): Axisymmetric oscillations in toroidal plasmas linked to geodesic curvature and zonal flows. "These can include \acrfull{ntm}, \acrfull{gam} \acrshort{tm}, \acrshort{ae}, and kinks."
Hann window: A tapering function used in STFT to reduce spectral leakage. "processed using a Hann window ( $N=1024$ , overlap=87.5\%)"
harmonic-percussive separation (HPS): A technique separating horizontal (harmonic) and vertical (percussive) components in time-frequency data. "This problem resembles that of \acrfull{hps}"
HDF5: A hierarchical data format for large, complex datasets commonly used in scientific computing. "With the given starting HDF5 Dataset format,"
ITER: An international tokamak project aiming to demonstrate net energy gain from fusion. "Next-generation fusion facilities like ITER face a "data deluge,""
Kolmogorov-type cascades: Energy transfer across scales in turbulence following Kolmogorov’s theory, affecting spectral slopes. "roughly follows Kolmogorov-type cascades"
magnetohydrodynamic (MHD): The macroscopic theory of conducting fluids (like plasmas) in magnetic fields; describes many plasma instabilities. "including coherent modes such as \acrfull{mhd} instabilities"
Magnetics High Resolution (MHR): A high-resolution magnetic diagnostic used for fluctuation measurements. "This is demonstrated on four representative case studies: \acrfull{ece}, \acrfull{co2}, \acrfull{mhr}, and \acrfull{bes}"
mean absolute error (MAE): A regression loss measuring average absolute deviation, used for robust training. "The network is optimized by minimizing the \acrfull{mae}"
MODESPEC: A fusion workflow/tool for spectral analysis referenced for compatibility with the STFT pipeline. "compatibility with existing fusion workflows e.g., MODESPEC"
MUSIC (Multiple Signal Classification): A high-resolution subspace method for frequency/DOA estimation. "such as \acrfull{ssa} or MUSIC scale exponentially with channel count,"
neoclassical tearing modes (NTM): Magnetic islands driven by neoclassical effects, a key class of MHD instabilities. "These can include \acrfull{ntm}, \acrfull{gam} \acrshort{tm}, \acrshort{ae}, and kinks."
Noise2Noise: A self-supervised denoising paradigm learning to map noisy inputs to noisy targets to recover signal. "we can combine this with a scheme closer to Self-inspired Noise2Noise for full coverage"
Otsu (thresholding method): An automatic global thresholding technique assuming a bimodal histogram; ill-suited for sparse spectrograms. "Standard image thresholding methods such as Otsu are designed for bimodal distributions"
pre-emphasis filter: A frequency-dependent weighting, often of the form f^α, that counteracts low-frequency dominance (spectral color). "a pre-emphasis filter which follows the equation"
short-time Fourier transform (STFT): A time-frequency transform applying the Fourier transform on windowed signal segments. "Therefore we utilize the \acrfull{stft} for its computational efficiency"
singular spectrum analysis (SSA): A decomposition method using trajectory matrices and SVD for trend/noise separation. "such as \acrfull{ssa} or MUSIC scale exponentially with channel count,"
Slepian-based multitaper methods: Spectral estimation using multiple orthogonal tapers (Slepian sequences) to reduce variance/leakage. "While wavelet decomposition and Slepian-based multitaper methods offer theoretical advantages,"
SpecAug (SpecAugment): Data augmentation on spectrograms (time/frequency masking) to improve model robustness. "specifically SpecAug"
stellarator: A toroidal magnetic confinement device shaped to confine plasma without large plasma currents. "from the TJ-II stellarator in Spain."
tearing modes (TM): MHD instabilities where magnetic reconnection forms islands, degrading confinement. "These can include \acrfull{ntm}, \acrfull{gam} \acrshort{tm}, \acrshort{ae}, and kinks."
tokamak: A toroidal magnetic confinement device using strong plasma current for fusion research. "Understanding the complex internal state of a tokamak relies on interpreting a vast array of diagnostic signals."
total variation (TV): A regularization/metric measuring signal/image total gradient, used here as a stopping criterion. "we employed a \acrfull{tv} stopping criterion"
U-Net: An encoder–decoder convolutional neural network with skip connections for segmentation/denoising. "In this paper, we have used a U-Net three separate times."
variational mode decomposition (VMD): A decomposition method that extracts band-limited intrinsic modes via variational optimization. "[e.g., \acrfull{emd}, \acrfull{vmd}]"
Welch periodogram: A spectral estimator averaging periodograms over segments to reduce variance. "Welch periodogram, which is the time averaged spectrogram."
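
To illustrate the asymmetric least squares entry above, here is a minimal sketch of the widely used Eilers-Boelens formulation, with the smoothness weight set to the quoted value of λ = 10^6. Whether the paper uses this exact variant is an assumption, and the asymmetry parameter `p`, the iteration count, and the synthetic spectrum are illustrative choices rather than values from the paper.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def als_baseline(y, lam=1e6, p=0.01, n_iter=10):
    """Estimate a smooth baseline under the peaks of a 1-D spectrum."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-difference operator, shape (n - 2, n); penalises curvature.
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w, 0)
        # Solve (W + lam * D'D) z = W y for the current weights.
        z = spsolve((W + penalty).tocsc(), w * y)
        # Points above the fit (likely peaks) get the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


# Synthetic spectrum (assumption): sloped background plus two narrow peaks.
x = np.arange(400)
true_baseline = 5.0 - 0.01 * x
peaks = (8.0 * np.exp(-0.5 * ((x - 120) / 2.0) ** 2)
         + 6.0 * np.exp(-0.5 * ((x - 300) / 2.0) ** 2))
rng = np.random.default_rng(0)
y = true_baseline + peaks + 0.05 * rng.standard_normal(x.size)

baseline = als_baseline(y)
residual = y - baseline  # peaks stand out once the baseline is removed
```

The effective stiffness of a given λ depends on the number of frequency bins, so the same value may behave differently on spectra of very different lengths.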


Open Problems

We found no open problems mentioned in this paper.

Continue Learning

GitHub

GitHub - PlasmaControl/tokeye: TokEye is an open-source Python-based application for automatic classification and localization of fluctuating signals. It is designed to be used in the context of plasma physics, but can be used for any type of fluctuating signal. (4 stars)

Tweets

After spending a year on this, I finally made a label-free way to automatically isolate any events in any noisy spectrogram with <1s latency. I’m really excited to get the community's thoughts. (29 points, 3 comments)
