Layer Fusion-Separation Blocks (LFSB)
- LFSB is a neural network module that alternates explicit fusion and separation to disentangle and recombine feature representations from distinct sources.
- It integrates techniques like cross-stream projection, attention mechanisms, and residual feed-forward networks to improve tasks such as image reflection suppression and speech recognition.
- Empirical results demonstrate that LFSBs enhance performance metrics like PSNR and WER by mitigating interstream confusion and promoting robust feature disentanglement.
A Layer Fusion–Separation Block (LFSB) is a neural network architectural module designed for disentangling and recombining feature representations corresponding to distinct layers or sources within a signal. LFSBs have become central to recent advances in visual layer separation, single-image reflection suppression, burst image fusion, and foundation model representation learning in speech. Their core principle is to alternate explicit information sharing (fusion) with explicit disentanglement (separation), mitigating confusion between interdependent signal components under nonlinear or otherwise entangled observation models. LFSB designs are instantiated in hybrid encoder-decoder networks for image decomposition as well as convolutional and attention-based interfaces for multi-model speech processing (Lee et al., 24 Jan 2026, Shih et al., 11 Nov 2025, Nam et al., 2021).
1. Architectural Foundations
LFSBs are characterized by their dual-stream or multi-stream structure. Each stream processes features associated with a specific source, such as transmission versus reflection in images, or outputs of different deep models/layers in speech. At each stage, the LFSB module receives stream-specific feature tensors, fuses information via projection and/or attention, applies learnable mechanisms for inter-stream disentanglement, and emits updated features for downstream processing (Lee et al., 24 Jan 2026, Nam et al., 2021, Shih et al., 11 Nov 2025).
In ReflexSplit for single-image reflection separation, each LFSB at a given decoder level takes as input two feature tensors $F_t$ and $F_r$, corresponding to transmission and reflection, and outputs updated tensors $F_t^{next}$ and $F_r^{next}$. Internal modules include:
- Early Fusion (bidirectional cross-stream projection),
- Differential Dual-Dimensional Attention,
- Late Fusion (feed-forward residual subnetwork).
A related paradigm in neural implicit image representations uses MLP-based functions for fusing multiple perturbed or occluded image frames into a single, coordinate-based canonical view and then separating constituent layers via auxiliary streams (Nam et al., 2021). For speech, LFSBs are implemented as interface modules that hierarchically fuse hidden layer activations from multiple foundation models, followed by final task-specific projections (Shih et al., 11 Nov 2025).
2. Mathematical Formulations and Operations
The LFSB applies a progression of operations alternately emphasizing fusion and separation.
Image Reflection Separation (ReflexSplit LFSB) (Lee et al., 24 Jan 2026):
- Early Fusion (Cross-Stream Projection):
  $F_t' = W_t\,[F_t \| F_r]$, $\;F_r' = W_r\,[F_r \| F_t]$, where $[\cdot \| \cdot]$ denotes channel-wise concatenation; $W_t$, $W_r$ are learnable.
- Dual-Dimensional Attention:
  - Self-Attention (SA) across the batch dimension of the concatenated streams.
  - Cross-Attention (CA) across the sequence dimension.
  - Outputs $A_t^{SA}, A_r^{SA}, A_t^{CA}, A_r^{CA}$.
- Differential Separation:
  $w = \sigma(\lambda_l)$, $\;A_t^{diff} = (A_t^{SA} + A_t^{CA}) - w\,(A_r^{SA} + A_r^{CA})$, $\;A_r^{diff} = (A_r^{SA} + A_r^{CA}) - w\,(A_t^{SA} + A_t^{CA})$; $\lambda_l$ is a learnable scalar, $\sigma$ is the sigmoid.
- Late Fusion (Residual Feed-Forward):
  $F_t^{next} = F_t + \mathrm{FFN}(A_t^{diff})$, $\;F_r^{next} = F_r + \mathrm{FFN}(A_r^{diff})$.
Forward Pass Pseudocode

```python
def LFSB_forward(F_t, F_r, lambda_l):
    # Early fusion: bidirectional cross-stream projection
    F_t_prime = Linear_t(concat_channels(F_t, F_r))
    F_r_prime = Linear_r(concat_channels(F_r, F_t))
    # Dual-dimensional attention over batch and sequence axes
    A_t_SA, A_r_SA = SelfAttention(concat_batch(F_t_prime, F_r_prime))
    A_t_CA, A_r_CA = CrossAttention(concat_sequence(F_t_prime, F_r_prime))
    # Differential separation with sigmoid-gated cancellation
    w = sigmoid(lambda_l)
    A_t_diff = (A_t_SA + A_t_CA) - w * (A_r_SA + A_r_CA)
    A_r_diff = (A_r_SA + A_r_CA) - w * (A_t_SA + A_t_CA)
    # Late fusion: residual feed-forward update
    F_t_next = F_t + FFN(A_t_diff)
    F_r_next = F_r + FFN(A_r_diff)
    return F_t_next, F_r_next
```
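As a runnable companion to the pseudocode, a minimal NumPy sketch follows; the single-head scaled dot-product attention, the ReLU feed-forward stand-in, and all shapes and weight initializations are illustrative assumptions, not ReflexSplit's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 16, 8  # tokens, channels (illustrative)

# Learnable parameters (randomly initialized for this sketch)
W_t = rng.normal(0, 0.1, (2 * C, C))  # early-fusion projection, transmission
W_r = rng.normal(0, 0.1, (2 * C, C))  # early-fusion projection, reflection
lambda_l = 0.0                        # per-layer separation scalar

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention (stand-in for the SA/CA blocks)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ffn(x):
    # ReLU stand-in for the residual feed-forward subnetwork
    return np.maximum(x, 0.0)

def lfsb_forward(F_t, F_r, lambda_l):
    # Early fusion: bidirectional cross-stream projection
    F_t_p = np.concatenate([F_t, F_r], axis=-1) @ W_t
    F_r_p = np.concatenate([F_r, F_t], axis=-1) @ W_r
    # Dual-dimensional attention: self- and cross-attention per stream
    A_t_SA = attention(F_t_p, F_t_p, F_t_p)
    A_r_SA = attention(F_r_p, F_r_p, F_r_p)
    A_t_CA = attention(F_t_p, F_r_p, F_r_p)
    A_r_CA = attention(F_r_p, F_t_p, F_t_p)
    # Differential separation: subtract the sigmoid-gated opposite stream
    w = 1.0 / (1.0 + np.exp(-lambda_l))
    A_t_diff = (A_t_SA + A_t_CA) - w * (A_r_SA + A_r_CA)
    A_r_diff = (A_r_SA + A_r_CA) - w * (A_t_SA + A_t_CA)
    # Late fusion: residual feed-forward update
    return F_t + ffn(A_t_diff), F_r + ffn(A_r_diff)

F_t = rng.normal(size=(N, C))
F_r = rng.normal(size=(N, C))
F_t2, F_r2 = lfsb_forward(F_t, F_r, lambda_l)
print(F_t2.shape, F_r2.shape)  # (16, 8) (16, 8)
```

Note that both streams keep their input shape, so the block can be stacked at every decoder level.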
Multi-Image Fusion and Layer Separation (Nam et al., 2021):
- Motion-aligned canonical view fusion: each observed frame is modeled as $I_i(x) \approx f_s(\mathcal{T}_i(x)) + f_d(x, i)$, with $\mathcal{T}_i$ a learned mapping (homography or flow), $f_s$ the scene stream, $f_d$ the dynamic/interference stream.
- Principal loss: the reconstruction residual over all frames, $\mathcal{L} = \frac{1}{N}\sum_i \big\| I_i(x) - f_s(\mathcal{T}_i(x)) - f_d(x, i) \big\|$.
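This fusion objective can be illustrated with a toy 1-D analogue in NumPy, where simple closures stand in for the coordinate MLPs and the per-frame warps are plain shifts (all names and signal choices here are illustrative, not from the paper):

```python
import numpy as np

# Toy 1-D analogue of motion-aligned canonical-view fusion.
x = np.linspace(0, 1, 64)        # pixel coordinates
shifts = [0.0, 0.05, -0.05]      # per-frame "motion": T_i(x) = x + s_i

f_scene = lambda u: np.sin(2 * np.pi * u)                    # scene stream f_s
f_dyn = lambda u, i: 0.1 * (i + 1) * np.cos(6 * np.pi * u)   # interference f_d

# Observed frames: warped scene plus per-frame interference
frames = [f_scene(x + s) + f_dyn(x, i) for i, s in enumerate(shifts)]

def principal_loss(frames, f_s, f_d, shifts):
    # Mean reconstruction residual: |I_i(x) - (f_s(T_i(x)) + f_d(x, i))|
    total = 0.0
    for i, (I, s) in enumerate(zip(frames, shifts)):
        total += np.mean(np.abs(I - (f_s(x + s) + f_d(x, i))))
    return total / len(frames)

print(principal_loss(frames, f_scene, f_dyn, shifts))  # 0.0 at the generating model
```

Minimizing this residual over the scene stream, the interference stream, and the warps is what drives the unsupervised fusion-separation.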
Speech Foundation Models (Shih et al., 11 Nov 2025):
- Merge step (additive fusion): aligned layer activations are summed across models, $\bar{h}_\ell = \sum_m h_\ell^{(m)}$.
- Hierarchical 1D convolution along the layer axis forms the fused representation.
- Fused features are directly routed to downstream task heads.
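A minimal NumPy sketch of additive fusion followed by a layer-axis convolution is shown below; the hierarchical reduction scheme, shapes, and kernel are assumptions for illustration (the actual HConv module is a learned 1D CNN):

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, T, D = 2, 4, 10, 8  # models, layers, time steps, feature dim (illustrative)

# Hidden activations h[m, l] from frozen upstream models
h = rng.normal(size=(M, L, T, D))

# Merge step (additive fusion): sum aligned layer features across models
h_bar = h.sum(axis=0)  # shape (L, T, D)

def hconv(feats, kernel):
    # Hierarchical 1-D convolution over the layer axis: repeatedly mix
    # adjacent layers with the kernel until a single fused map remains.
    # feats: (L, T, D); kernel: (k,) mixing weights, k >= 2
    assert len(kernel) >= 2
    k = len(kernel)
    while feats.shape[0] > 1:
        out = []
        for l in range(feats.shape[0] - k + 1):
            out.append(sum(kernel[j] * feats[l + j] for j in range(k)))
        feats = np.stack(out)
    return feats[0]  # (T, D) fused representation

z = hconv(h_bar, kernel=np.array([0.5, 0.5]))  # stride-1, width-2 layer conv
print(z.shape)  # (10, 8)
```

With a width-2 kernel the layer count shrinks by one per pass (4 → 3 → 2 → 1), so the fused output has the per-layer shape (T, D).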
3. Training, Regularization, and Curriculum Strategies
In imaging tasks, separation strength is governed by learnable per-layer scalars $\lambda_l$, regulated with curriculum training: $\lambda_l$ is initialized low and increased to progressively emphasize disentanglement over decoder depth and training epochs. ReflexSplit employs epoch-wise warmup and depth-dependent initialization to modulate the cross-stream cancellation term and stabilize gradient flow. Losses include pixel-level reconstruction, interference sparsity ($\ell_1$ regularization on residual features), exclusion loss (decorrelating the spatial gradients of the two streams), and smoothness penalties for motion fields (Lee et al., 24 Jan 2026, Nam et al., 2021).
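One possible shape for such a curriculum, sketched in Python (warmup length, maximum value, and depth scaling are hypothetical parameters, not the paper's settings):

```python
import math

def lambda_schedule(epoch, depth, warmup_epochs=10, lambda_max=2.0, n_levels=4):
    """Illustrative curriculum for the per-layer separation scalar lambda_l:
    epoch-wise linear warmup combined with depth-dependent scaling, so the
    sigmoid-gated cancellation weight sigma(lambda_l) grows gradually from
    0.5 upward. The exact schedule used by ReflexSplit may differ."""
    warmup = min(1.0, epoch / warmup_epochs)  # epoch-wise warmup in [0, 1]
    depth_scale = (depth + 1) / n_levels      # deeper decoder levels separate more
    return lambda_max * warmup * depth_scale

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Early training, shallow level: weak cancellation
print(round(sigmoid(lambda_schedule(0, 0)), 3))   # 0.5
# Late training, deepest level: strong cancellation
print(round(sigmoid(lambda_schedule(10, 3)), 3))  # 0.881
```

Starting the gate near 0.5 keeps early gradients symmetric across streams; the cancellation only strengthens once both streams carry stable features.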
In speech, the upstream model weights are typically frozen; only the fusion–separation interface and downstream projection heads are trained. Loss is dictated by the downstream task (e.g., CTC for ASR, binary cross-entropy for speaker verification) (Shih et al., 11 Nov 2025).
4. Empirical Performance and Ablative Analyses
LFSBs deliver quantifiable improvements across domains.
| Model/Setting | Baseline | LFSB-integrated | Metric/Improvement |
|---|---|---|---|
| ReflexSplit, Real20 | no LFSB | LFSB | PSNR: 19.95 dB → +1.3 dB with LFSB |
| ReflexSplit, Real20 | no LFSB | LFSB | NCC (inter-stream confusion): 0.9254 without LFSB, reduced with LFSB |
| Speech (LibriSpeech) | weighted sum (WS) | HConv (LFSB) | WER: 6.32% → 5.80% (single model) |
| Speech (LibriSpeech) | weighted sum (WS) | CHConv (LFSB) | WER: 5.42% → 4.82% (two models) |
Ablation in ReflexSplit shows LFSB reduces feature-space interstream confusion (NCC) and increases PSNR. t-SNE and PCA analyses document a progressive and structured disentanglement across layers, with LFSB yielding richer, higher-dimensional representations in deep layers—strongly correlating with improved generalization and robustness (Lee et al., 24 Jan 2026). For speech, LFSB fusion modules (HConv/CHConv) consistently outperform baseline fusion techniques (weighted sum, Gumbel-softmax selection), and concatenative fusion offers marginal but consistent gains at higher parameter costs (Shih et al., 11 Nov 2025).
5. Domain-Specific Variants and Implementation Details
Image Layer Separation
ReflexSplit's LFSB implements fusion and separation at multiple decoder scales, alternating the two to balance shared structure extraction with inter-layer isolation. In coordinate-based NIR fusion (Nam et al., 2021), the LFSB paradigm takes the form of MLPs with scene-aligned inputs, enabling unsupervised fusion into a canonical view and explicit separation of dynamic/interference layers.
Speech Foundation Models
LFSB is instantiated as a fusion–separation interface operating over multiple models and layers. Key variants are:
- HConv (additive): summing aligned model/layer feature maps.
- CHConv (concatenative): concatenating then convolving across layers. After hierarchical 1D convolutional fusion, features are sent to task heads without additional separation; the "separation" is limited to final projections by the downstream head.
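The concatenative path can be contrasted with the additive one in a short NumPy sketch; the projection matrix, kernel, and mean pooling are illustrative assumptions standing in for the learned CHConv module:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, T, D = 2, 4, 10, 8           # models, layers, time steps, feature dim (illustrative)
h = rng.normal(size=(M, L, T, D))  # frozen upstream activations h[m, l]

# CHConv sketch: concatenate model features per layer, project back to D,
# then convolve across the layer axis and pool. W_proj and the width-2
# kernel are illustrative learnable parameters (randomly initialized here).
W_proj = rng.normal(0, 0.1, (M * D, D))
h_cat = np.concatenate([h[m] for m in range(M)], axis=-1)  # (L, T, M*D)
h_proj = h_cat @ W_proj                                    # (L, T, D)

kernel = np.array([0.5, 0.5])
conv = np.stack([kernel[0] * h_proj[l] + kernel[1] * h_proj[l + 1]
                 for l in range(L - 1)])                   # (L-1, T, D)
z = conv.mean(axis=0)                                      # fused feature, (T, D)
print(z.shape)  # (10, 8)
```

The extra projection over the concatenated model axis is where the additional parameter cost of the concatenative variant arises.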
6. Theoretical Insights and Representational Advantages
Alternating fusion and separation within an LFSB counteracts both under- and over-sharing between streams. Immediate separation after fusion blocks the bleed-through of non-target cues and enforces layer specificity, while repeated fusion allows crucial global structure to be shared and utilized, especially under nonlinear mixing of the constituent layers. This prevents premature stream collapse or excessive independence, both of which degrade disentanglement in deep models (Lee et al., 24 Jan 2026). Residual feed-forward components stabilize gradient propagation, and a progressive curriculum on separation strength makes deep coordination tractable.
Richer intermediate features, as shown by t-SNE/PCA, and qualitative reduction in artifacts such as color bleeding and residual reflections attest to the effectiveness of LFSB frameworks in both reconstruction fidelity and interpretability (Lee et al., 24 Jan 2026, Nam et al., 2021). In speech processing, strong gains are achievable by leveraging the model/layer fusion inductive bias for more effective use of diverse upstream knowledge (Shih et al., 11 Nov 2025).
7. Applications and Evaluation Protocols
LFSBs have demonstrated effectiveness in:
- Single-image reflection/transmission separation with high perceptual and quantitative accuracy (Lee et al., 24 Jan 2026).
- Multi-frame fusion for burst denoising, de-moiréing, super-resolution, occlusion removal, and dynamic scene analysis, using unsupervised NIR-driven architectures (Nam et al., 2021).
- Foundation model fusion in speech, improving downstream ASR, speaker verification, and emotion recognition performance (Shih et al., 11 Nov 2025).
Evaluation metrics include PSNR, SSIM, NCC, and structure indices for image tasks, and word error rate (WER), character error rate (CER), equal error rate (EER), and accuracy for speech, under standardized datasets and controlled ablation protocols. The reported LFSB designs introduce no post-processing steps and do not rely on learned generative priors (Lee et al., 24 Jan 2026, Nam et al., 2021).
In summary, the Layer Fusion–Separation Block is a modular pattern that underpins state-of-the-art layer disentanglement and fusion across image and speech tasks, balancing shared representation learning with robust, stream-specific feature refinement. Its design directly addresses the core challenge of nonlinear source entanglement, enabling new performance levels in signal separation and representation integration (Lee et al., 24 Jan 2026, Shih et al., 11 Nov 2025, Nam et al., 2021).