Layer Fusion-Separation Blocks (LFSB)
- LFSB is a neural network module that alternates explicit fusion and separation to disentangle and recombine feature representations from distinct sources.
- It integrates techniques like cross-stream projection, attention mechanisms, and residual feed-forward networks to improve tasks such as image reflection suppression and speech recognition.
- Empirical results demonstrate that LFSBs enhance performance metrics like PSNR and WER by mitigating interstream confusion and promoting robust feature disentanglement.
A Layer Fusion–Separation Block (LFSB) is a neural network architectural module designed for disentangling and recombining feature representations corresponding to distinct layers or sources within a signal. LFSBs have become central to recent advances in visual layer separation, single-image reflection suppression, burst image fusion, and foundation model representation learning in speech. Their core principle is to alternate explicit information sharing (fusion) with explicit disentanglement (separation), mitigating confusion between interdependent signal components under nonlinear or otherwise entangled observation models. LFSB designs are instantiated in hybrid encoder-decoder networks for image decomposition as well as convolutional and attention-based interfaces for multi-model speech processing (Lee et al., 24 Jan 2026, Shih et al., 11 Nov 2025, Nam et al., 2021).
1. Architectural Foundations
LFSBs are characterized by their dual-stream or multi-stream structure. Each stream processes features associated with a specific source, such as transmission versus reflection in images, or outputs of different deep models/layers in speech. At each stage, the LFSB module receives stream-specific feature tensors, fuses information via projection and/or attention, applies learnable mechanisms for inter-stream disentanglement, and emits updated features for downstream processing (Lee et al., 24 Jan 2026, Nam et al., 2021, Shih et al., 11 Nov 2025).
In ReflexSplit for single-image reflection separation, each LFSB at a given decoder level takes as input two feature tensors $F_t$ and $F_r$, corresponding to transmission and reflection, and outputs updated tensors $F_t^{next}$ and $F_r^{next}$. Internal modules include:
- Early Fusion (bidirectional cross-stream projection),
- Differential Dual-Dimensional Attention,
- Late Fusion (feed-forward residual subnetwork).
A related paradigm in neural implicit image representations uses MLP-based functions for fusing multiple perturbed or occluded image frames into a single, coordinate-based canonical view and then separating constituent layers via auxiliary streams (Nam et al., 2021). For speech, LFSBs are implemented as interface modules that hierarchically fuse hidden layer activations from multiple foundation models, followed by final task-specific projections (Shih et al., 11 Nov 2025).
2. Mathematical Formulations and Operations
The LFSB applies a progression of operations alternately emphasizing fusion and separation.
Image Reflection Separation (ReflexSplit LFSB) (Lee et al., 24 Jan 2026):
- Early Fusion (Cross-Stream Projection):
  $F_t' = W_t\,[F_t \| F_r]$, $\;F_r' = W_r\,[F_r \| F_t]$, where $[\cdot \| \cdot]$ denotes channel-wise concatenation; $W_t$, $W_r$ are learnable.
- Dual-Dimensional Attention:
  - Self-Attention (SA) across the batch dimension of the concatenated streams.
  - Cross-Attention (CA) across the sequence dimension.
  - Outputs $A_t^{SA}, A_r^{SA}, A_t^{CA}, A_r^{CA}$.
- Differential Separation:
  $w = \sigma(\lambda_l)$, $\;A_t^{diff} = (A_t^{SA} + A_t^{CA}) - w\,(A_r^{SA} + A_r^{CA})$, $\;A_r^{diff} = (A_r^{SA} + A_r^{CA}) - w\,(A_t^{SA} + A_t^{CA})$; $\lambda_l$ is a learnable scalar, $\sigma$ is the sigmoid.
- Late Fusion (Residual Feed-Forward):
  $F_t^{next} = F_t + \mathrm{FFN}(A_t^{diff})$, $\;F_r^{next} = F_r + \mathrm{FFN}(A_r^{diff})$.
Forward Pass Pseudocode

```python
def LFSB_forward(F_t, F_r, lambda_l):
    # Early fusion: bidirectional cross-stream projection
    F_t_prime = Linear_t(concat_channels(F_t, F_r))
    F_r_prime = Linear_r(concat_channels(F_r, F_t))
    # Dual-dimensional attention over batch and sequence axes
    A_t_SA, A_r_SA = SelfAttention(concat_batch(F_t_prime, F_r_prime))
    A_t_CA, A_r_CA = CrossAttention(concat_sequence(F_t_prime, F_r_prime))
    # Differential separation with sigmoid-gated cancellation
    w = sigmoid(lambda_l)
    A_t_diff = (A_t_SA + A_t_CA) - w * (A_r_SA + A_r_CA)
    A_r_diff = (A_r_SA + A_r_CA) - w * (A_t_SA + A_t_CA)
    # Late fusion: residual feed-forward update
    F_t_next = F_t + FFN(A_t_diff)
    F_r_next = F_r + FFN(A_r_diff)
    return F_t_next, F_r_next
```
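As a runnable companion to the pseudocode, a minimal NumPy sketch follows; the single-head scaled dot-product attention, the ReLU feed-forward stand-in, and all shapes and weight initializations are illustrative assumptions, not ReflexSplit's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 16, 8  # tokens, channels (illustrative)

# Learnable parameters (randomly initialized for this sketch)
W_t = rng.normal(0, 0.1, (2 * C, C))  # early-fusion projection, transmission
W_r = rng.normal(0, 0.1, (2 * C, C))  # early-fusion projection, reflection
lambda_l = 0.0                        # per-layer separation scalar

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention (stand-in for the SA/CA blocks)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ffn(x):
    # ReLU stand-in for the residual feed-forward subnetwork
    return np.maximum(x, 0.0)

def lfsb_forward(F_t, F_r, lambda_l):
    # Early fusion: bidirectional cross-stream projection
    F_t_p = np.concatenate([F_t, F_r], axis=-1) @ W_t
    F_r_p = np.concatenate([F_r, F_t], axis=-1) @ W_r
    # Dual-dimensional attention: self- and cross-attention per stream
    A_t_SA = attention(F_t_p, F_t_p, F_t_p)
    A_r_SA = attention(F_r_p, F_r_p, F_r_p)
    A_t_CA = attention(F_t_p, F_r_p, F_r_p)
    A_r_CA = attention(F_r_p, F_t_p, F_t_p)
    # Differential separation: subtract the sigmoid-gated opposite stream
    w = 1.0 / (1.0 + np.exp(-lambda_l))
    A_t_diff = (A_t_SA + A_t_CA) - w * (A_r_SA + A_r_CA)
    A_r_diff = (A_r_SA + A_r_CA) - w * (A_t_SA + A_t_CA)
    # Late fusion: residual feed-forward update
    return F_t + ffn(A_t_diff), F_r + ffn(A_r_diff)

F_t = rng.normal(size=(N, C))
F_r = rng.normal(size=(N, C))
F_t2, F_r2 = lfsb_forward(F_t, F_r, lambda_l)
print(F_t2.shape, F_r2.shape)  # (16, 8) (16, 8)
```

Note that both streams keep their input shape, so the block can be stacked at every decoder level.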
Multi-Image Fusion and Layer Separation (Nam et al., 2021):
- Motion-aligned canonical view fusion: each observed frame is modeled as $I_i(x) \approx f_s(\mathcal{T}_i(x)) + f_d(x, i)$, with $\mathcal{T}_i$ a learned mapping (homography or flow), $f_s$ the scene stream, $f_d$ the dynamic/interference stream.
- Principal loss: the reconstruction residual over all frames, $\mathcal{L} = \frac{1}{N}\sum_i \big\| I_i(x) - f_s(\mathcal{T}_i(x)) - f_d(x, i) \big\|$.
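This fusion objective can be illustrated with a toy 1-D analogue in NumPy, where simple closures stand in for the coordinate MLPs and the per-frame warps are plain shifts (all names and signal choices here are illustrative, not from the paper):

```python
import numpy as np

# Toy 1-D analogue of motion-aligned canonical-view fusion.
x = np.linspace(0, 1, 64)        # pixel coordinates
shifts = [0.0, 0.05, -0.05]      # per-frame "motion": T_i(x) = x + s_i

f_scene = lambda u: np.sin(2 * np.pi * u)                    # scene stream f_s
f_dyn = lambda u, i: 0.1 * (i + 1) * np.cos(6 * np.pi * u)   # interference f_d

# Observed frames: warped scene plus per-frame interference
frames = [f_scene(x + s) + f_dyn(x, i) for i, s in enumerate(shifts)]

def principal_loss(frames, f_s, f_d, shifts):
    # Mean reconstruction residual: |I_i(x) - (f_s(T_i(x)) + f_d(x, i))|
    total = 0.0
    for i, (I, s) in enumerate(zip(frames, shifts)):
        total += np.mean(np.abs(I - (f_s(x + s) + f_d(x, i))))
    return total / len(frames)

print(principal_loss(frames, f_scene, f_dyn, shifts))  # 0.0 at the generating model
```

Minimizing this residual over the scene stream, the interference stream, and the warps is what drives the unsupervised fusion-separation.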
Speech Foundation Models (Shih et al., 11 Nov 2025):
- Merge step (additive fusion): aligned layer activations are summed across models, $\bar{h}_\ell = \sum_m h_\ell^{(m)}$.
- Hierarchical 1D convolution along the layer axis forms the fused representation.
- Fused features are directly routed to downstream task heads.
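A minimal NumPy sketch of additive fusion followed by a layer-axis convolution is shown below; the hierarchical reduction scheme, shapes, and kernel are assumptions for illustration (the actual HConv module is a learned 1D CNN):

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, T, D = 2, 4, 10, 8  # models, layers, time steps, feature dim (illustrative)

# Hidden activations h[m, l] from frozen upstream models
h = rng.normal(size=(M, L, T, D))

# Merge step (additive fusion): sum aligned layer features across models
h_bar = h.sum(axis=0)  # shape (L, T, D)

def hconv(feats, kernel):
    # Hierarchical 1-D convolution over the layer axis: repeatedly mix
    # adjacent layers with the kernel until a single fused map remains.
    # feats: (L, T, D); kernel: (k,) mixing weights, k >= 2
    assert len(kernel) >= 2
    k = len(kernel)
    while feats.shape[0] > 1:
        out = []
        for l in range(feats.shape[0] - k + 1):
            out.append(sum(kernel[j] * feats[l + j] for j in range(k)))
        feats = np.stack(out)
    return feats[0]  # (T, D) fused representation

z = hconv(h_bar, kernel=np.array([0.5, 0.5]))  # stride-1, width-2 layer conv
print(z.shape)  # (10, 8)
```

With a width-2 kernel the layer count shrinks by one per pass (4 → 3 → 2 → 1), so the fused output has the per-layer shape (T, D).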
3. Training, Regularization, and Curriculum Strategies
In imaging tasks, separation strength is governed by learnable per-layer scalars $\lambda_l$, regulated with curriculum training: $\lambda_l$ is initialized low and increased to progressively emphasize disentanglement over decoder depth and training epochs. ReflexSplit employs epoch-wise warmup and depth-dependent initialization to modulate the cross-stream cancellation term and stabilize gradient flow. Losses include pixel-level reconstruction, interference sparsity ($\ell_1$ regularization on residual features), exclusion loss (decorrelating the spatial gradients of the two streams), and smoothness penalties for motion fields (Lee et al., 24 Jan 2026, Nam et al., 2021).
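One possible shape for such a curriculum, sketched in Python (warmup length, maximum value, and depth scaling are hypothetical parameters, not the paper's settings):

```python
import math

def lambda_schedule(epoch, depth, warmup_epochs=10, lambda_max=2.0, n_levels=4):
    """Illustrative curriculum for the per-layer separation scalar lambda_l:
    epoch-wise linear warmup combined with depth-dependent scaling, so the
    sigmoid-gated cancellation weight sigma(lambda_l) grows gradually from
    0.5 upward. The exact schedule used by ReflexSplit may differ."""
    warmup = min(1.0, epoch / warmup_epochs)  # epoch-wise warmup in [0, 1]
    depth_scale = (depth + 1) / n_levels      # deeper decoder levels separate more
    return lambda_max * warmup * depth_scale

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Early training, shallow level: weak cancellation
print(round(sigmoid(lambda_schedule(0, 0)), 3))   # 0.5
# Late training, deepest level: strong cancellation
print(round(sigmoid(lambda_schedule(10, 3)), 3))  # 0.881
```

Starting the gate near 0.5 keeps early gradients symmetric across streams; the cancellation only strengthens once both streams carry stable features.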
In speech, the upstream model weights are typically frozen; only the fusion–separation interface and downstream projection heads are trained. Loss is dictated by the downstream task (e.g., CTC for ASR, binary cross-entropy for speaker verification) (Shih et al., 11 Nov 2025).
4. Empirical Performance and Ablative Analyses
LFSBs deliver quantifiable improvements across domains.
| Model/Setting | Baseline | LFSB-integrated | Metric/Improvement |
|---|---|---|---|
| ReflexSplit, Real20 | no LFSB | LFSB | PSNR: 19.95 dB → +1.3 dB with LFSB |
| ReflexSplit, Real20 | no LFSB | LFSB | NCC (inter-stream confusion): 0.9254 without LFSB, reduced with LFSB |
| Speech (LibriSpeech) | weighted sum (WS) | HConv (LFSB) | WER: 6.32% → 5.80% (single model) |
| Speech (LibriSpeech) | weighted sum (WS) | CHConv (LFSB) | WER: 5.42% → 4.82% (two models) |
Ablation in ReflexSplit shows LFSB reduces feature-space interstream confusion (NCC) and increases PSNR. t-SNE and PCA analyses document a progressive and structured disentanglement across layers, with LFSB yielding richer, higher-dimensional representations in deep layers—strongly correlating with improved generalization and robustness (Lee et al., 24 Jan 2026). For speech, LFSB fusion modules (HConv/CHConv) consistently outperform baseline fusion techniques (weighted sum, Gumbel-softmax selection), and concatenative fusion offers marginal but consistent gains at higher parameter costs (Shih et al., 11 Nov 2025).
5. Domain-Specific Variants and Implementation Details
Image Layer Separation
ReflexSplit's LFSB implements fusion and separation at multiple decoder scales, alternating the two to balance shared structure extraction with inter-layer isolation. In coordinate-based NIR fusion (Nam et al., 2021), the LFSB paradigm takes the form of MLPs with scene-aligned inputs, enabling unsupervised fusion into a canonical view and explicit separation of dynamic/interference layers.
Speech Foundation Models
LFSB is instantiated as a fusion–separation interface operating over multiple models and layers. Key variants are:
- HConv (additive): summing aligned model/layer feature maps.
- CHConv (concatenative): concatenating then convolving across layers. After hierarchical 1D convolutional fusion, features are sent to task heads without additional separation; the "separation" is limited to final projections by the downstream head.
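The concatenative path can be contrasted with the additive one in a short NumPy sketch; the projection matrix, kernel, and mean pooling are illustrative assumptions standing in for the learned CHConv module:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, T, D = 2, 4, 10, 8           # models, layers, time steps, feature dim (illustrative)
h = rng.normal(size=(M, L, T, D))  # frozen upstream activations h[m, l]

# CHConv sketch: concatenate model features per layer, project back to D,
# then convolve across the layer axis and pool. W_proj and the width-2
# kernel are illustrative learnable parameters (randomly initialized here).
W_proj = rng.normal(0, 0.1, (M * D, D))
h_cat = np.concatenate([h[m] for m in range(M)], axis=-1)  # (L, T, M*D)
h_proj = h_cat @ W_proj                                    # (L, T, D)

kernel = np.array([0.5, 0.5])
conv = np.stack([kernel[0] * h_proj[l] + kernel[1] * h_proj[l + 1]
                 for l in range(L - 1)])                   # (L-1, T, D)
z = conv.mean(axis=0)                                      # fused feature, (T, D)
print(z.shape)  # (10, 8)
```

The extra projection over the concatenated model axis is where the additional parameter cost of the concatenative variant arises.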
6. Theoretical Insights and Representational Advantages
Alternating fusion and separation within an LFSB counteracts both under- and over-sharing between streams. Immediate separation after fusion blocks the bleed-through of non-target cues and enforces layer specificity, while repeated fusion allows crucial global structure to be shared and utilized, especially under nonlinear mixing of the constituent layers. This prevents premature stream collapse or excessive independence, both of which degrade disentanglement in deep models (Lee et al., 24 Jan 2026). Residual feed-forward components stabilize gradient propagation, and a progressive curriculum on separation strength makes deep coordination tractable.
Richer intermediate features, as shown by t-SNE/PCA, and qualitative reduction in artifacts such as color bleeding and residual reflections attest to the effectiveness of LFSB frameworks in both reconstruction fidelity and interpretability (Lee et al., 24 Jan 2026, Nam et al., 2021). In speech processing, strong gains are achievable by leveraging the model/layer fusion inductive bias for more effective use of diverse upstream knowledge (Shih et al., 11 Nov 2025).
7. Applications and Evaluation Protocols
LFSBs have demonstrated effectiveness in:
- Single-image reflection/transmission separation with high perceptual and quantitative accuracy (Lee et al., 24 Jan 2026).
- Multi-frame fusion for burst denoising, de-moiréing, super-resolution, occlusion removal, and dynamic scene analysis, using unsupervised NIR-driven architectures (Nam et al., 2021).
- Foundation model fusion in speech, improving downstream ASR, speaker verification, and emotion recognition performance (Shih et al., 11 Nov 2025).
Evaluation metrics include PSNR, SSIM, NCC, and structure indices for image tasks, and word error rate (WER), character error rate (CER), equal error rate (EER), and accuracy for speech, under standardized datasets and controlled ablation protocols. The reported LFSB designs introduce no post-processing steps and do not rely on learned generative priors (Lee et al., 24 Jan 2026, Nam et al., 2021).
In summary, the Layer Fusion–Separation Block is a modular pattern that underpins state-of-the-art layer disentanglement and fusion across image and speech tasks, balancing shared representation learning with robust, stream-specific feature refinement. Its design directly addresses the core challenge of nonlinear source entanglement, enabling new performance levels in signal separation and representation integration (Lee et al., 24 Jan 2026, Shih et al., 11 Nov 2025, Nam et al., 2021).