
Conformer-Based Bottleneck Fusion Module

Updated 25 January 2026
  • The module introduces a constrained cross-modal fusion using a limited set of bottleneck tokens, ensuring precise audio-visual information exchange.
  • It employs modality-specific Conformer encoders with joint enhancement and recognition objectives, resulting in significant WER improvements in noisy environments.
  • Empirical results on the LRS3 benchmark show that this approach reduces redundancy and enhances semantic preservation compared to traditional mask-based methods.

A Conformer-Based Bottleneck Fusion Module is an architecture designed for robust audio-visual speech recognition (AVSR) under noisy conditions, integrating a constrained cross-modal information flow using Conformer blocks and explicit bottleneck tokens. This approach is characterized by the introduction of a limited number of bottleneck tokens that mediate audio-visual exchange, structured Conformer-based encoders per modality, and joint training with enhancement and recognition objectives. The design avoids explicit noise mask mechanisms, instead relying on bottleneck-mediated purification to reduce redundancy and guide information sharing (Wu et al., 18 Jan 2026).

1. Architecture and Data Flow

The underlying pipeline comprises two modality-specific front-end encoders (one for audio, one for video), followed by a multi-layer bottleneck fusion module and a fusion Conformer. Audio input $x_a \in \mathbb{R}^{F \times 80}$ is first passed through two 1D sub-sampling convolution layers (kernel $=3$, stride $=2$, output channels $d=512$), yielding $N_a = \lceil F/4 \rceil$ time steps, then a linear projection, positional embedding, and three Conformer encoder blocks, resulting in feature sequence $h_a \in \mathbb{R}^{N_a \times d}$. Video input $x_v \in \mathbb{R}^{T \times 96 \times 96}$ (mouth ROI, 25 Hz) is processed by a 3D convolution (kernel $=5\times 7\times 7$, stride $=(1,2,2)$), a 2D ResNet-18, then a linear projection and positional embedding, with three Conformer blocks, yielding $h_v \in \mathbb{R}^{N_v \times d}$, $N_v = T$.
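The sub-sampling arithmetic can be checked with the standard convolution output-length formula; the padding value of 1 is an assumption (the source states only kernel and stride):

```python
import math

def conv1d_out_len(n, kernel=3, stride=2, padding=1):
    # standard 1D convolution output-length formula
    return (n + 2 * padding - kernel) // stride + 1

# two stacked stride-2 sub-sampling convolutions give N_a = ceil(F / 4)
for F in (97, 100, 160):  # example audio frame counts (hypothetical)
    assert conv1d_out_len(conv1d_out_len(F)) == math.ceil(F / 4)
```

With kernel 3, stride 2, and padding 1, each layer halves the length (rounding up), so two layers yield the stated $N_a = \lceil F/4 \rceil$.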

A set of $K=4$ bottleneck tokens $b^0 \in \mathbb{R}^{K \times d}$ is initialized. For $L=3$ layers, in layer $l$, each branch (audio, video) receives the concatenation of its modality features $h_m^l$ and the current bottleneck tokens $b^l$:

  • $S_m^l = \text{concat}(h_m^l, b^l) \in \mathbb{R}^{(N_m + K) \times d}$, $m \in \{a, v\}$.
  • This passes through a modality-specific Conformer, yielding updated feature and bottleneck outputs $[h_m^{l+1}; b_m^{l+1}]$.
  • The updated bottleneck tokens are fused by simple averaging: $b^{l+1} = \frac{1}{2}(b_v^{l+1} + b_a^{l+1})$.

After $L$ layers, purified modality sequences $z_a, z_v$ are obtained; these are concatenated ($Z = [z_a; z_v]$), processed by an additional three-block fusion Conformer, and split for the CTC and decoder attention heads.
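As a minimal sketch of this data flow, the concatenate/split/average wiring can be written as follows; the identity `conformer` stand-in, the sequence lengths, and the random initialization are hypothetical placeholders for the trained modules:

```python
import numpy as np

d, K, L = 512, 4, 3        # embedding dim, bottleneck width, fusion layers
Na, Nv = 25, 25            # example sequence lengths (hypothetical)

rng = np.random.default_rng(0)
h = {"a": rng.standard_normal((Na, d)), "v": rng.standard_normal((Nv, d))}
b = rng.standard_normal((K, d))                 # shared bottleneck tokens b^0

def conformer(x, modality):
    # stand-in for the modality-specific Conformer encoder
    return x

for layer in range(L):
    b_out = {}
    for m in ("a", "v"):
        S = np.concatenate([h[m], b], axis=0)   # S_m^l = [h_m^l; b^l]
        out = conformer(S, m)
        h[m], b_out[m] = out[:-K], out[-K:]     # split features / tokens
    b = 0.5 * (b_out["a"] + b_out["v"])         # averaged bottleneck update

Z = np.concatenate([h["a"], h["v"]], axis=0)    # Z = [z_a; z_v] for fusion
assert Z.shape == (Na + Nv, d) and b.shape == (K, d)
```

Only the $K$ shared token slots carry information across modalities; the feature slots $h_m$ never see the other modality directly.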

2. Bottleneck Fusion Mechanism and Mathematical Formulation

The central mechanism enforces that inter-modal transfer occurs exclusively through a small set of bottleneck tokens, formalized per fusion layer $l$:

  • $b^l \in \mathbb{R}^{K \times d}$, $h_m^l \in \mathbb{R}^{N_m \times d}$.
  • $S_m^l = [h_m^l; b^l]$, input to $\text{Conformer}_m$.
  • Output: $[h_m^{l+1}; b_m^{l+1}] = \text{Conformer}_m(S_m^l)$.
  • Bottleneck update: $b^{l+1} = \frac{1}{2}(b_v^{l+1} + b_a^{l+1})$.

Each Conformer block operates as:

  • $Y_1 = X + \frac{1}{2}\,\text{FFN}(\text{LN}(X))$
  • $Y_2 = Y_1 + \text{MHSA}(\text{LN}(Y_1))$
  • $Y_3 = Y_2 + \text{ConvModule}(\text{LN}(Y_2))$
  • $Y_4 = Y_3 + \frac{1}{2}\,\text{FFN}(\text{LN}(Y_3))$
  • Output: $\text{LN}(Y_4)$
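The residual wiring of this Macaron-style block translates directly to code; the sub-modules are passed in as callables and replaced by identities below, since the point is the half-step FFN residuals and final LayerNorm, not the trained weights:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-feature LayerNorm (no learned scale/shift, for illustration)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conformer_block(x, ffn1, mhsa, conv_module, ffn2):
    y1 = x + 0.5 * ffn1(layer_norm(x))          # half-step feed-forward
    y2 = y1 + mhsa(layer_norm(y1))              # multi-head self-attention
    y3 = y2 + conv_module(layer_norm(y2))       # depth-wise conv module
    y4 = y3 + 0.5 * ffn2(layer_norm(y3))        # second half-step FFN
    return layer_norm(y4)                       # final normalization

ident = lambda z: z                             # placeholder sub-modules
x = np.random.default_rng(1).standard_normal((10, 8))
y = conformer_block(x, ident, ident, ident, ident)
assert y.shape == x.shape
```

Every sub-layer is residual, so the block preserves sequence length and width, which is what allows the concatenated $[h_m^l; b^l]$ input to be split back into features and tokens afterward.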

The bottleneck tokens impose a narrow information funnel of $K$ tokens between the modality-specific features: all cross-modal information must traverse this low-capacity conduit, which purifies and compresses cross-modal dependencies and limits the propagation of modality-specific noise (Wu et al., 18 Jan 2026).

3. Training Objectives and Optimization

Training employs a hybrid loss incorporating speech enhancement and AVSR recognition objectives. For enhancement, a weighted sum of a spectral $L_1$ loss and a perceptual $L_2$ loss drives the purified audio tokens $z_a$ toward fidelity with a clean reference:

  • $L_\text{recon} = \| \hat{x}_a - x_a^\text{clean} \|_1$
  • $L_\text{percep} = \| \mathcal{F}(\hat{x}_a) - \mathcal{F}(x_a^\text{clean}) \|_2^2$, with $\mathcal{F}$ a frozen front-end encoder
  • $L_\text{enh} = \alpha_1 L_\text{recon} + \alpha_2 L_\text{percep}$, $\alpha_1 = \alpha_2 = 0.1$
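A scalar sketch of the enhancement loss, under the assumption that $\mathcal{F}$ can be any frozen feature extractor (an identity function is used below purely for checking):

```python
import numpy as np

def enhancement_loss(x_hat, x_clean, frontend, a1=0.1, a2=0.1):
    # L_enh = alpha1 * L_recon (spectral L1) + alpha2 * L_percep (L2 in the
    # frozen front-end's feature space); alpha1 = alpha2 = 0.1
    l_recon = np.abs(x_hat - x_clean).sum()
    diff = frontend(x_hat) - frontend(x_clean)
    l_percep = (diff ** 2).sum()
    return a1 * l_recon + a2 * l_percep

x = np.ones((10, 80))
assert enhancement_loss(x, x, frontend=lambda z: z) == 0.0
```

The loss vanishes exactly when the purified audio matches the clean reference in both the spectral and the feature domain.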

Recognition loss $L_\text{AVSR}$ consists of:

  • CTC term: $L_\text{ctc} = -\log p_\text{ctc}(y \mid f_v) - \log p_\text{ctc}(y \mid f_a)$, using projected fusion Conformer outputs $f_v$, $f_a$
  • Attention decoder cross-entropy: $L_\text{att} = -\log p_\text{att}(y \mid f_a)$
  • Combination: $L_\text{AVSR} = \lambda L_\text{ctc} + (1 - \lambda) L_\text{att}$, $\lambda = 0.1$
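As a minimal numeric sketch, the recognition loss is a convex mixture of the two terms; the scalar log-probabilities below stand in for the per-utterance CTC and attention scores:

```python
def avsr_loss(logp_ctc_v, logp_ctc_a, logp_att, lam=0.1):
    # L_AVSR = lambda * L_ctc + (1 - lambda) * L_att, with lambda = 0.1
    l_ctc = -(logp_ctc_v + logp_ctc_a)  # CTC over both projected heads
    l_att = -logp_att                   # attention-decoder cross-entropy
    return lam * l_ctc + (1 - lam) * l_att

# with log-probabilities of -1 per head: L_ctc = 2, L_att = 1
assert abs(avsr_loss(-1.0, -1.0, -1.0) - 1.1) < 1e-9
```

With $\lambda = 0.1$, the attention decoder dominates the gradient while the CTC heads act as an alignment regularizer.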

Total objective: $L_\text{total} = L_\text{AVSR} + L_\text{enh}$. Curriculum learning is applied, starting with high-SNR audio and only the AVSR loss, then moving to the full SNR range with the joint loss, optimizing with AdamW and data augmentation.

4. Key Hyper-parameters and Regularization Schemes

The architecture selects $d=512$ (embedding dimension), $K=4$ (bottleneck width), $L=3$ (fusion layers), and Conformer block configurations of $h=4$ attention heads, $d_\text{ff}=2048$, and kernel size $31$ for the depth-wise convolution. The Transformer decoder comprises 6 layers, 4 heads, $d=512$, $d_\text{ff}=2048$. Optimization uses AdamW ($\text{lr}=10^{-3}$, cosine annealing, linear warm-up, batch size $16$, 70 epochs). Regularization includes an SNR curriculum, time-masking, audio noise-mixing ($-7.5$ to $17.5$ dB), and video data augmentation.
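The reported settings can be gathered into a single configuration mapping; the key names are hypothetical, chosen only to mirror the symbols in the text:

```python
# hypothetical config collecting the reported hyper-parameters
config = {
    "d_model": 512, "bottleneck_K": 4, "fusion_layers_L": 3,
    "attn_heads": 4, "d_ff": 2048, "conv_kernel": 31,
    "decoder_layers": 6, "decoder_heads": 4,
    "optimizer": "AdamW", "lr": 1e-3, "schedule": "cosine+warmup",
    "batch_size": 16, "epochs": 70, "snr_range_db": (-7.5, 17.5),
}
assert config["d_model"] == 512 and config["bottleneck_K"] == 4
```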

5. Empirical Results and Comparative Analysis

On the LRS3 benchmark, the module achieves notable word error rates (WER) under various SNRs:

  • Under $-5$ dB babble noise, bottlenecked fusion ($K=4$) yields $8.5\%$ WER, outperforming direct cross-attention ($K=0$, $\sim 12.8\%$).
  • The full pipeline averages $3.9\%$ WER across clean and noisy splits, surpassing prior mask-based AVSR methods (e.g., V-CAFE, Joint AVSE-AVSR, AV-RelScore; best prior $4.3\%$).
  • The clean condition improves to $2.1\%$ WER ($2.8\%$ for the AV-RelScore baseline).
  • Ablations reveal that $L_\text{enh}$ offers a $1.7\%$ absolute WER reduction, underscoring the value of bottleneck-based purification.

6. Significance and Context within AVSR Research

Mask-based feature fusion has been prevalent in AVSR to filter audio noise, at the cost of potentially discarding speech-relevant information. The Conformer-based bottleneck fusion module addresses this by enforcing all cross-modal communication through a minimal $K$-token bottleneck, guided by joint enhancement and recognition losses, without explicit mask generation. This structure reduces modality redundancy and enhances semantic preservation. A plausible implication is that bottlenecked fusion architectures may represent a generalizable direction for robust multimodal learning under noisy input regimes.

7. Connections to Conformer Family and Broader Methodological Framework

The bottleneck fusion mechanism leverages the Conformer architecture’s capacity to jointly model global attention (via MHSA) and local context (via convolutional modules), aligning with precedents in ASR backbone design (Song et al., 2022). While the FusionFormer variant targets layer and operator fusion for efficient normalization-free inference, the bottleneck fusion applies Conformer blocks in a staged, cross-modal manner, underscoring the adaptability of the Conformer model class to both efficiency and robustness-driven AVSR innovations.
