Dynamic Spectral Contour in Audio Restoration

Updated 23 January 2026

Dynamic Spectral Contour (DSC) is a time-varying control signal that tracks the spectral edge, allowing precise, framewise specification of bandwidth targets in audio restoration.
It employs a method combining magnitude STFT, binary masking, Gaussian and median filtering to extract robust spectral edges from degraded audio signals.
When integrated into flow-matching restoration models, DSC achieves superior fidelity and control adherence compared to traditional spectral descriptors like centroid and roll-off.

Dynamic Spectral Contour (DSC) is a time-varying scalar control signal, introduced as an interpretable conditioning mechanism for precise bandwidth extension in audio restoration tasks. DSC enables frame-wise specification of the spectral edge (the highest frequency bin with significant energy) in a degraded audio signal, supporting fine-grained control unattainable with traditional spectral descriptors. Its deployment within conditional generative restoration frameworks, specifically single-step flow-matching architectures, yields superior output fidelity and control adherence when compared to alternatives such as spectral centroid and roll-off (Hernandez-Olivan et al., 20 Jan 2026).

1. Definition and Motivation

DSC, denoted $c_{\rm dsc}(t)$ at frame $t$, tracks the spectral edge of an audio waveform: the point in the frequency domain beyond which energy is negligible. Motivated by shortcomings of global spectral features for restoration (notably erratic behavior in low-energy or silent regions), DSC was developed to allow intuitive, frame-wise specification of bandwidth targets during audio restoration. By conditioning the restoration model on DSC, practitioners gain precise, temporally resolved control over the reconstructed signal’s upper frequency limit, avoiding the undesirable excursions seen with conventional global features.

2. Mathematical Formulation

DSC is computed from a discrete-time audio signal $x[n]$, sampled at $f_s$. The methodology involves:

Magnitude STFT calculation:

$X(f,t) = |\mathrm{STFT}\{x\}[f,t]|$

for $f \in \{0,\dots,N_{\rm fft}/2\}$, $t \in \{1,\dots,T\}$.

Binary masking with energy threshold $q = 10^{-1.6}$:

$M(f,t) = \mathbf{1}\{ X(f,t) > q \}$

Gaussian frequency smoothing ($\sigma_f=9$):

$\widetilde{M}(f,t) = (G_{\sigma_f} *_{f} M(\cdot, t))(f)$

Frequency cutoff extraction, selecting the smallest bin $f$ at which the smoothed mask falls below $\gamma=0.07$:

$s_0(t) = \min\{ f : \widetilde{M}(f,t) < \gamma \}$

Temporal smoothing by median filtering ($m_f=9$):

$c_{\rm dsc}(t) = \mathrm{MedianFilter}\bigl(s_0(\cdot), m_f\bigr)(t)$

Conversion to hertz is performed post hoc by multiplying $c_{\rm dsc}(t)$ by $f_s / N_{\rm fft}$.

This pipeline mitigates spurious peak detection and ensures robust, temporally stable output, even in low-energy frames.
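The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names (`magnitude_stft`, `dsc_bins`, `bins_to_hz`), the Hann window, the zero-padded frequency smoothing, the edge-padded median filter, and the fallback to the top bin when no bin falls below $\gamma$ are all assumptions. The source also does not state the STFT normalization, which affects how the threshold $q$ should be interpreted.

```python
import numpy as np

def magnitude_stft(x, n_fft=2048, hop=512):
    """Magnitude STFT X(f, t), shape (n_fft // 2 + 1, T). Hann window is an assumption."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 4 sigma (truncation is an assumption)."""
    radius = int(4 * sigma)
    k = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    return k / k.sum()

def dsc_bins(X, q=10 ** -1.6, sigma_f=9, gamma=0.07, m_f=9):
    """DSC as a bin index per frame, from a magnitude spectrogram X of shape (bins, T)."""
    n_bins, T = X.shape
    # Binary mask: 1 where the magnitude exceeds the threshold q.
    mask = (X > q).astype(float)
    # Gaussian smoothing along frequency, frame by frame (zero padding at the edges).
    kernel = gaussian_kernel(sigma_f)
    smoothed = np.stack(
        [np.convolve(mask[:, t], kernel, mode="same") for t in range(T)], axis=1
    )
    # Smallest bin whose smoothed mask is below gamma.
    below = smoothed < gamma
    # Assumption: if no bin is below gamma, use the top bin (full-band frame).
    s0 = np.where(below.any(axis=0), below.argmax(axis=0), n_bins - 1)
    # Median filter of length m_f along time, with edge padding (an assumption).
    pad = m_f // 2
    padded = np.pad(s0, pad, mode="edge")
    windows = np.stack([padded[i : i + T] for i in range(m_f)])
    return np.median(windows, axis=0)

def bins_to_hz(c, fs, n_fft=2048):
    """Convert a bin-index contour to hertz by multiplying with fs / n_fft."""
    return c * fs / n_fft
```

In this sketch a single-frame outlier in the raw edge estimate is absorbed by the median filter, which is the temporal stability the pipeline is designed to provide.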

3. Conditioning Flow-Matching Restoration Models

DSC is integrated into the conditional flow-matching (CFM) framework, building on the FLowHigh approach. Training involves feeding the model:

Narrow-band mel-spectrogram $x_0$ (from the degraded low-pass input).
Control matrix $c \in \mathbb{R}^{m \times T}$ comprising conditioning signals (DSC, centroid, roll-off, etc.).

A learned vector field $v_\theta(x_t, t, c)$ drives noisy/interpolated spectrograms toward their full-band targets. With probability $p$, the control input is dropped, facilitating both conditional and unconditional field learning.

During generation, classifier-free guidance (cfg-zero★) produces a guided velocity:

$\hat{v} = (1-w)\, s^\star\, v_\theta(x_t, t, \emptyset) + w\, v_\theta(x_t, t, c)$

where $w \geq 1$ is the guidance weight ($w = 1$ recovers purely conditional sampling, since the unconditional term vanishes), and

$s^\star = \frac{ \langle v_\theta(x_t, t, c), v_\theta(x_t, t, \emptyset) \rangle }{ \| v_\theta(x_t, t, \emptyset) \|_2^2 }$

aligns vector field magnitudes. Integrating $\hat{v}$ yields mel-spectrograms adhering to the frame-wise DSC bandwidth profile and preserving low-frequency fidelity.
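The guided velocity and the scale $s^\star$ can be written out directly. The sketch below assumes the conditional and unconditional velocities have already been evaluated as arrays for a single example; the function name `cfg_zero_star_velocity` and the small `eps` guarding against a zero denominator are illustrative additions, not part of the source.

```python
import numpy as np

def cfg_zero_star_velocity(v_cond, v_uncond, w, eps=1e-8):
    """Return (v_hat, s_star) for the guided velocity formula.

    v_cond   : v_theta(x_t, t, c), conditional velocity
    v_uncond : v_theta(x_t, t, empty), unconditional velocity
    w        : guidance weight
    eps      : numerical guard (an assumption, not from the source)
    """
    # s_star = <v_cond, v_uncond> / ||v_uncond||^2, taken over all elements.
    dot = np.sum(v_cond * v_uncond)
    norm_sq = np.sum(v_uncond ** 2)
    s_star = dot / (norm_sq + eps)
    # v_hat = (1 - w) * s_star * v_uncond + w * v_cond
    v_hat = (1.0 - w) * s_star * v_uncond + w * v_cond
    return v_hat, s_star
```

With `w = 1` the function returns the conditional velocity unchanged. When the two velocities differ only by a scalar factor, $s^\star$ rescales the unconditional branch so that the guided result again equals the conditional velocity for any `w`.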

4. Model Architecture and Objective

The architecture used for DSC-conditioned flow matching features:

A two-layer Transformer with 35.4 million parameters, 16 attention heads, embedding dimension 1024, and FFN dimension 4096.
Input: $x_0$ (narrow-band mel-spectrogram) and control matrix $c$.
Loss function: standard flow-matching objective:

$\mathcal{L}(\theta) = \mathbb{E}_{x_0, x_1, t} \left\| v_\theta(x_t, t, c) - v^\star(x_t, t) \right\|_2^2$

plus an unconditional branch ( $c=\emptyset$ ), with no explicit control signal penalty.
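A single training step of this objective, including the control dropout described in Section 3, might look like the sketch below. It assumes a linear interpolation path $x_t = (1-t)x_0 + t x_1$ with target velocity $v^\star = x_1 - x_0$, which is a common flow-matching choice but is not confirmed by the source. The names `flow_matching_loss`, `model`, and `p_drop` are hypothetical, and the mean over elements stands in for the squared norm up to a constant factor.

```python
import numpy as np

def flow_matching_loss(model, x0, x1, c, p_drop, rng):
    """One-sample flow-matching loss with control dropout.

    model  : callable (x_t, t, c_or_None) -> predicted velocity (hypothetical interface)
    x0, x1 : source and full-band target spectrograms of equal shape
    c      : control matrix (e.g. DSC), dropped with probability p_drop
    rng    : numpy.random.Generator
    """
    t = rng.uniform()
    # Assumed linear path between x0 and x1 and its constant velocity.
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    # Control dropout: pass None to train the unconditional branch.
    c_in = None if rng.uniform() < p_drop else c
    v_pred = model(x_t, t, c_in)
    # Mean squared error, proportional to the squared L2 norm in the objective.
    return float(np.mean((v_pred - v_target) ** 2))
```

Because the same network sees both `c` and `None`, it can supply both velocities needed for guided sampling at inference time.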

The model’s full-band prediction is rendered to the time domain using a frozen BigVGAN vocoder, with sub-DSC target frequencies ( $< f_c$ ) sourced directly from input to suppress artifacts.
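One plausible way to source the low band from the input is a frequency-domain splice, sketched below. The source does not say in which domain the replacement happens, so operating on time-aligned STFT arrays, the fixed scalar cutoff, and the name `splice_low_band` are all assumptions made for illustration.

```python
import numpy as np

def splice_low_band(S_input, S_generated, f_c, fs, n_fft=2048):
    """Keep bins below f_c from the input and the rest from the generated signal.

    S_input, S_generated : STFT arrays of shape (n_fft // 2 + 1, T), assumed time-aligned
    f_c                  : cutoff frequency in hertz
    fs                   : sampling rate in hertz
    """
    # Bin index corresponding to f_c (bin k sits at k * fs / n_fft hertz).
    cutoff_bin = int(np.floor(f_c * n_fft / fs))
    out = S_generated.copy()
    out[:cutoff_bin, :] = S_input[:cutoff_bin, :]
    return out
```

Copying the band that the degraded input already contains means the generative model and vocoder only contribute content above the cutoff.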

5. Comparative Ablation and Quantitative Evaluation

Empirical analysis indicates that DSC outperforms spectral centroid and roll-off as a control for single-step bandwidth extension. In experiments on 4 kHz low-pass test data, under both pure conditioning ($w=1$) and guided sampling ($w=3$), DSC consistently delivers the lowest log-spectral distance (LSD) and the tightest adherence between output and control contour.
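The two signal-level metrics can be sketched as follows. The source does not give the exact conventions, so the natural logarithm in the control error, the power-spectrum form of LSD, the `eps` guards, and the function names are assumptions. Published LSD definitions differ in scaling, so absolute values from this sketch need not match the reported numbers.

```python
import numpy as np

def control_log_error(c_hat, c, eps=1e-8):
    """Mean absolute log error between the contour measured on the output and the target."""
    c_hat = np.asarray(c_hat, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(np.mean(np.abs(np.log(c_hat + eps) - np.log(c + eps))))

def log_spectral_distance(X_ref, X_est, eps=1e-8):
    """One common LSD form: RMS over frequency of the log-power difference, averaged over frames.

    X_ref, X_est : magnitude spectrograms of shape (bins, T)
    """
    diff = np.log10(X_ref ** 2 + eps) - np.log10(X_est ** 2 + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))
```

Both functions return 0 for identical inputs. The control error grows with the ratio between measured and requested contour, regardless of which one is larger.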

| Control | FAD$_{\rm CLAP}$ $\downarrow$ | LSD (dB) $\downarrow$ | Abs. $\lvert \log \hat{c} - \log c \rvert$ $\downarrow$ |
|---|---|---|---|
| Centroid | 0.41 | 4.04 | 1.41 |
| Roll-off | 0.19 | 1.69 | 0.30 |
| DSC | 0.12 | 0.99 | 0.18 |

Manipulation of the DSC, via scaling (e.g., by 0.5 or 2.0), yields systematic modulation of bandwidth in the output, tightly tracking the control except for attempts beyond the model’s training support, which result in degradation and artifacts.

6. Practical Deployment and Guidelines

Implementation of DSC requires standard STFT parameters ($n_{\mathrm{fft}}=2048$, hop=512), energy threshold $q=10^{-1.6}$, Gaussian smoothing ($\sigma_f=9$), cutoff $\gamma=0.07$, and temporal median filter ($m_f=9$). Throughout, DSC in hertz/bin index is derived from the actual spectral edge of the reference signal.

Framewise DSC is preferable for temporally-resolved control, outperforming globally-averaged spectral features in bandwidth-extension scenarios. Guidance weight $w \approx 3$ yields a balanced trade-off between bandwidth adherence and overall audio quality.

Operationally, scaling DSC yields intermediate restoration effects, but setting targets above the natural maximum induces artifacts, suggesting model limitations beyond its training support.
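Scaling a contour before passing it to the model is a one-line operation. The sketch below adds a clip to an upper bound, which defaults to the Nyquist frequency. The clip, the `max_hz` parameter, and the name `scale_dsc` are assumptions intended to keep requested targets within a plausible range, not a procedure described in the source.

```python
import numpy as np

def scale_dsc(c_hz, factor, fs, max_hz=None):
    """Scale a DSC contour (in hertz) and clip it to [0, max_hz].

    c_hz   : per-frame spectral edge in hertz
    factor : multiplicative scaling, e.g. 0.5 or 2.0
    fs     : sampling rate in hertz
    max_hz : optional upper bound (hypothetical); defaults to Nyquist, fs / 2
    """
    upper = fs / 2.0 if max_hz is None else max_hz
    return np.clip(np.asarray(c_hz, dtype=float) * factor, 0.0, upper)
```

Setting `max_hz` to the highest edge seen in training data would be one way to avoid requesting bandwidths outside the model's training support.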

7. Contextual Significance and Recommended Use

DSC provides a lightweight, interpretable, and pluggable method for specifying instantaneous bandwidth extension profiles in generative audio restoration, facilitating direct user control without the instability of traditional features. Empirical findings indicate that DSC-integrated flow-matching restoration achieves competitive or superior performance compared to legacy conditioning approaches, especially for archives or materials requiring both temporal and spectral preservation (Hernandez-Olivan et al., 20 Jan 2026).

A plausible implication is that DSC’s conceptual simplicity and robust empirical behavior may generalize to other audio restoration tasks necessitating fine-grained spectral control, supporting future research into controllable, high-fidelity generative modeling of temporally complex audio signals.

References (1)

Single-step Controllable Music Bandwidth Extension With Flow Matching (2026)
