Time-Domain Masking and Demasking Methods

Updated 15 January 2026
  • Time-domain masking and demasking methods are signal processing techniques that isolate overlapping components directly from waveform frames.
  • They leverage deep neural architectures like Conv-TasNet and statistical models such as generalized additive models (GAMs) to perform precise source separation and intelligibility enhancement.
  • These approaches offer lower latency, enhanced computational efficiency, and improved performance over traditional time-frequency methods.

Time-domain masking and demasking methods constitute a class of signal processing and machine learning techniques where overlapping or interfering components are isolated, suppressed, or estimated directly in the temporal (waveform) domain, as opposed to the more traditional time-frequency (spectrogram or filterbank) domain. These approaches have been successfully developed for single- and multi-channel speech separation, music source separation, physiological signal demasking, and intelligibility enhancement under reverberant or corrupted conditions. Distinct from frequency-domain masking, these methods operate through either direct, frame-wise multiplicative masking in a learned or latent time-domain space, or via statistical demasking through subtraction of modeled interfering effects.

1. Principles and Formalism of Time-Domain Masking and Demasking

In time-domain masking for audio, as exemplified by Conv-TasNet, the input signal is segmented into overlapping frames. Each frame is encoded via a learned 1-D convolution with $N$ basis functions, yielding a latent representation $w_k \in \mathbb{R}^N$. For a mixture $x[n]$, the encoder maps frames $x_k$ to latent vectors $w_k = \mathrm{Enc}(x_k) = H(x_k \ast U)$, with $U \in \mathbb{R}^{N \times L}$. A mask-estimation network, typically a Temporal Convolutional Network (TCN), then estimates frame-wise multiplicative masks $M_i \in [0,1]^{N \times K}$ for $C$ sources. Source separation is performed by multiplying each mask with the encoder output: $D_i = M_i \odot W$. The masked features are decoded back to source waveforms $\hat{S}_i$ via transposed convolution with a learned decoder matrix $V$ (Luo et al., 2018).
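As a rough numerical sketch of this encoder–mask–decoder pipeline (with random matrices standing in for the learned encoder $U$, masks $M_i$, and decoder $V$, all of which would come from training in a real system):

```python
import numpy as np

rng = np.random.default_rng(0)

L, N, C = 16, 64, 2             # frame length, basis count, sources
x = rng.standard_normal(1024)   # placeholder mixture waveform

# Segment into 50%-overlapping frames: K frames of length L.
hop = L // 2
K = (len(x) - L) // hop + 1
frames = np.stack([x[k * hop : k * hop + L] for k in range(K)])  # (K, L)

# Random basis stands in for the trained 1-D conv encoder U (N x L).
U = rng.standard_normal((N, L))
W = np.maximum(frames @ U.T, 0.0)           # H = ReLU, latent (K, N)

# Masks would come from a trained TCN; here random values in [0, 1],
# normalized so the C masks sum to 1 per latent bin.
M = rng.random((C, K, N))
M /= M.sum(axis=0, keepdims=True)

# Decoder V (N x L) maps masked latents back; overlap-add reconstructs.
V = rng.standard_normal((N, L))
est = np.zeros((C, len(x)))
for i in range(C):
    D = M[i] * W                             # element-wise masking
    for k in range(K):
        est[i, k * hop : k * hop + L] += D[k] @ V
```

Because the masks are normalized to sum to one, they partition the latent representation between sources; a trained network learns masks that align this partition with the true sources.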

In physiological signal analysis, such as circadian demasking, the observed measurement (e.g., core body temperature) is modeled as a sum:

$$\mathrm{CBT}(t_i) = B_0 + f_c(t_i; \theta) + f_s(t_i; \phi_s) + f_a(\mathrm{HR}_i; \phi_a) + \varepsilon_i$$

where $f_c$ is the circadian component, $f_s$ is the sleep-wake masking component, and $f_a$ is the activity-based masking term. Demasking proceeds by estimating and subtracting the non-circadian masking effects to recover $\widehat{\mathrm{CBT}}_{\mathrm{circ}}$ (Nguyen et al., 2024).
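A minimal linear sketch of this demasking-by-subtraction idea, using synthetic data and ordinary least squares in place of the full GAM (the cosinor pair and heart-rate regressor below are illustrative stand-ins for the paper's spline and gamma terms):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 48, 48 * 12)             # hours, 5-min sampling

# Synthetic ground truth: baseline + 24 h circadian rhythm + activity masking.
circ = 0.4 * np.cos(2 * np.pi * (t - 5) / 24)
hr = 60 + 20 * rng.random(len(t))           # hypothetical heart-rate proxy
cbt = 36.8 + circ + 0.005 * (hr - 60) + 0.02 * rng.standard_normal(len(t))

# Linear stand-in for the GAM: intercept, cosinor pair, activity regressor.
X = np.column_stack([
    np.ones_like(t),
    np.cos(2 * np.pi * t / 24),
    np.sin(2 * np.pi * t / 24),
    hr - 60,
])
beta, *_ = np.linalg.lstsq(X, cbt, rcond=None)

# Demasking: subtract the fitted activity (masking) component.
cbt_demasked = cbt - X[:, 3] * beta[3]      # beta[3] recovers ~0.005
```

The same subtraction extends to the sleep-wake term; the GAM's advantage is that spline and parametric terms capture masking shapes a plain cosinor cannot.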

Time-domain demasking for speech intelligibility improvement (ARA_NSD) leverages a frame-wise non-stationarity index to identify masked segments. These segments are adaptively attenuated in the waveform domain based on the degree of masking as estimated by non-stationarity measures, with all processing occurring directly on the signal frames (Zucatelli et al., 2019).

2. Architectures and Algorithmic Realizations

Deep Neural Time-Domain Masking

Conv-TasNet employs a fully-convolutional encoder–mask–decoder architecture. The encoder is a 1-D convolution, the mask estimator is a deep TCN with stacked 1-D dilated convolutional blocks, and the decoder is a transposed 1-D convolution. The TCN comprises $R$ repeats of $M$ blocks, with each block employing depthwise-separable convolutions, PReLU activations, and layer normalization. Residual and skip connections are used as in WaveNet. The masking operation is element-wise: $D_i[n, k] = M_i[n, k] \cdot W[n, k]$.
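One practical consequence of the dilated structure is a large receptive field from few layers. A small helper makes this concrete, assuming kernel size $P$ and dilations $1, 2, \dots, 2^{M-1}$ within each of $R$ repeats, as in Conv-TasNet:

```python
# Receptive field (in latent frames) of a TCN with R repeats of M dilated
# conv blocks: each block with kernel size P and dilation d extends the
# receptive field by (P - 1) * d samples.
def tcn_receptive_field(P: int, M: int, R: int) -> int:
    return 1 + R * sum((P - 1) * 2 ** m for m in range(M))

# Conv-TasNet's default configuration: P = 3, M = 8, R = 3.
print(tcn_receptive_field(3, 8, 3))  # -> 1531
```

This exponential growth is why the TCN can model long-range temporal context at modest depth and parameter count.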

HTMD-Net extends this paradigm by appending a denoising ("demasking") module after masking. The initial mask-based source estimate is further refined by an encoder–decoder U-Net with skip connections and a bidirectional LSTM bottleneck, reducing artifacts from masking and increasing suppression in silent regions. All transformations remain in the time domain, with normalization and activation choices adapted for stability and efficiency (Garoufis et al., 2021).

Statistical and Adaptive Demasking

In circadian analysis, masking and demasking are formulated within a Generalized Additive Model (GAM), jointly fitting circadian, sleep-wake, and activity-related components. Parametric and spline-based terms model the various masking phenomena, and residuals correspond to unexplained variance. Demasking is achieved by explicit subtraction of fitted masking terms from raw measurements to reveal the intrinsic circadian trajectory (Nguyen et al., 2024).

ARA_NSD for reverberation absorption segments the input into overlapping "Reverberation Groups" and measures the Index of Non-Stationarity (INS) across multiple scales. For segments identified as heavily masked (low $\delta_{\mathrm{INS}}$), an adaptive gain, governed by a two-branch sigmoid law and updated upper bounds per group, is applied in the time domain to suppress masking effects. All segmentation, measurement, and gain application steps are performed directly on waveform frames (Zucatelli et al., 2019).
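The paper's exact sigmoid parameters and thresholds are not reproduced here, so the following is a purely illustrative sketch of frame-wise adaptive attenuation driven by a non-stationarity index (the function name, threshold, and slope values are hypothetical):

```python
import numpy as np

def adaptive_gain(delta_ins, threshold=0.5, slope=8.0):
    """Two-branch sigmoid gain (illustrative parameters, not the paper's):
    frames with low non-stationarity (delta_ins below threshold) are treated
    as masked and attenuated; the rest are passed nearly unchanged."""
    g = 1.0 / (1.0 + np.exp(-slope * (delta_ins - threshold)))
    return np.where(delta_ins < threshold, g, 0.5 + 0.5 * g)

# Apply per-frame gains directly to waveform frames.
rng = np.random.default_rng(2)
frames = rng.standard_normal((10, 256))      # 10 frames of 256 samples
delta_ins = rng.random(10)                   # per-frame INS-based index
out = frames * adaptive_gain(delta_ins)[:, None]
```

The key property is that processing never leaves the waveform domain: detection (the index) and suppression (the gain) both operate on the same frames.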

| Approach | Masking Basis | Demasking Mechanism |
|---|---|---|
| Conv-TasNet | Encoder latent space | Multiplicative masking, decoding |
| HTMD-Net | Encoder latent space | Masking + denoising U-Net |
| Circadian GAM | Additive model | Subtraction of fitted masking |
| ARA_NSD | Waveform frames | Adaptive absorption gain |

3. Evaluation Metrics and Empirical Performance

Separation and Intelligibility

Conv-TasNet achieves SI-SNR improvement of 15.3 dB and SDRi 15.6 dB on WSJ0-2mix, surpassing ideal time-frequency masks such as IRM (12.2 dB), IBM (13.0 dB), and WFM (13.4 dB) (Luo et al., 2018). Model size is 5.1 M parameters, latency is 2 ms, and processing is real-time on CPU. HTMD-Net, on musdb18, achieves song-wise SDR = 5.16 dB and SIR = 10.24 dB, matching or exceeding Conv-TasNet on SIR and improving suppression in silent segments (lower PES: –62.2 dB vs –57.8 dB), with 4.5 M parameters and greater computational efficiency (Garoufis et al., 2021).

ARA_NSD demonstrates superior intelligibility improvement over competing methods across both objective (ESII, ASIIST, SRMR$_{\mathrm{norm}}$) and perceptual measures. For example, in the Stairway environment ($T_{60} = 1.1$ s), $\Delta$ESII is $+12.4 \times 10^{-2}$, and SRMR$_{\mathrm{norm}}$ is 0.85 (vs. 0.69 for unprocessed). Intelligibility at SNR = 0 dB increases from 52% for unprocessed speech to 60% with ARA_NSD (Zucatelli et al., 2019).

Physiology Demasking Accuracy

In circadian demasking, the recosinor GAM model achieves a Pearson $R = 0.90$ [0.83–0.96] between fitted and observed CBT, compared to 0.81 [0.55–0.93] for cosinor + first-harmonic methods. The bias in $T_{\min}$ (estimated minus “measured”) is 0.2 h (–0.5 to 0.3 h) for the new model, compared to 1.4 h (1.1 to 1.7 h) for traditional approaches (Nguyen et al., 2024).

4. Applications and Specific Domains

Time-domain masking/demasking has demonstrated success in several domains:

  • Speech Separation: Conv-TasNet and HTMD-Net demonstrate that deep, fully time-domain frameworks can surpass time-frequency masking in accuracy, computational efficiency, and latency for single-channel, speaker-independent separation, as well as music source separation (Luo et al., 2018, Garoufis et al., 2021).
  • Intelligibility Enhancement: ARA_NSD improves intelligibility in noisy-reverberant speech by detecting and selectively attenuating masked waveform regions, independent of room or noise statistics (Zucatelli et al., 2019).
  • Circadian Rhythm Demasking: GAM-based demasking enables higher-fidelity extraction of circadian phase and amplitude in physiological recordings by removing quantifiable, non-circadian masking effects such as sleep-wake transitions and activity-induced temperature changes (Nguyen et al., 2024).

5. Assumptions, Limitations, and Model Extensions

Conv-TasNet’s time-domain masking approach presumes that a learned encoder, mask estimator, and decoder can jointly discover an optimal latent space for multiplicative masking. Masking is based on information present in the encoded frames, and performance depends on the capacity and inductive bias of the TCN.

HTMD-Net directly improves on conventional masking by augmenting with a denoising U-Net, but effectiveness can be limited by denoiser depth, bottleneck representation, and mismatch between denoiser and masking artifacts (Garoufis et al., 2021).

Statistical demasking for circadian signals assumes that sleep-wake masking can be captured as a single parametric gamma impulse per transition, and proxies like heart rate adequately represent all activity-related masking. Population and context generalization require further study beyond healthy adults under controlled conditions (Nguyen et al., 2024).

ARA_NSD assumes that non-stationarity adequately indexes masked regions and that adaptive attenuation is preferable to estimation or inversion. Performance is highly parameter-dependent, with hand-tuned sigmoid laws and thresholds held fixed across speakers, rooms, and noise conditions (Zucatelli et al., 2019).

6. Comparative Summary and Significance

Time-domain masking and demasking methods advance the state of the art in source separation, intelligibility enhancement, and physiological signal analysis by leveraging direct temporal representations, learned or explicit. These approaches unify detection and suppression of overlapping or interfering sources via learned masking, adaptive attenuation, or additive modeling, offering distinct advantages in latency, model size, and accuracy across modalities.

A plausible implication is that, as learning architectures and statistical models for masking/demasking continue to diversify, time-domain methodologies will increasingly supplant frequency-domain baselines, especially in applications demanding low latency, real-time processing, or explicit interpretability of masking sources. Performance benchmarks, especially those cited above, substantiate this shift and provide rigorous empirical support for continued development in this domain (Luo et al., 2018, Garoufis et al., 2021, Zucatelli et al., 2019, Nguyen et al., 2024).
