
Conformer-Based Classifier

Updated 28 January 2026
  • Conformer-based classifiers are predictive models that combine convolution for local feature extraction with self-attention for modeling long-range dependencies.
  • They employ either monolithic stacks or modular routing with domain-specific augmentations and hierarchical pooling to optimize performance across diverse input modalities.
  • Empirical results show state-of-the-art accuracy in ASR, audio deepfake detection, biomedical audio, MEG decoding, and image classification.

A Conformer-based classifier is a predictive model that leverages the Conformer architecture—an integration of convolutional modules and self-attention mechanisms initially introduced for sequence modeling in speech and vision—to perform classification tasks across diverse signal domains. Such classifiers exploit the complementary strengths of convolutions for local pattern extraction and Transformers for long-range dependency modeling. This hybridization, coupled with domain-specific architectural augmentations, yields state-of-the-art performance in applications spanning automatic speech recognition (ASR), biomedical audio, magnetoencephalographic (MEG) decoding, image categorization, and audio deepfake detection.

1. Architectural Foundations of Conformer-Based Classifiers

Conformer architectures blend CNN-based local feature extraction with Transformer-style global self-attention. Canonically, each Conformer block is composed of a sequence of submodules: (a) pre-norm feed-forward network (FFN), (b) multi-head self-attention (MHSA) with optional relative positional encoding, (c) convolutional module (often comprising pointwise convolution, GLU, depthwise convolution, normalization, and nonlinearity), followed by (d) post-norm FFN. All modules are connected by residual links and interleaved normalization layers.
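The canonical block ordering above can be sketched in PyTorch as follows. This is a minimal illustration: relative positional encoding, dropout, and padding masks are omitted, and all hyperparameters are placeholders rather than values from any cited system.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Illustrative Conformer block: FFN -> MHSA -> conv module -> FFN,
    all with residual connections and interleaved normalization."""

    def __init__(self, d_model: int = 144, n_heads: int = 4,
                 ffn_expansion: int = 4, conv_kernel: int = 31):
        super().__init__()
        self.ffn1 = self._ffn(d_model, ffn_expansion)
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        # pointwise conv -> GLU -> depthwise conv -> norm -> nonlinearity -> pointwise conv
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),
        )
        self.ffn2 = self._ffn(d_model, ffn_expansion)
        self.norm_out = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model: int, expansion: int) -> nn.Sequential:
        # pre-norm feed-forward network
        return nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # half-step FFN residuals, Macaron-style
        x = x + 0.5 * self.ffn1(x)
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        # conv module operates on (batch, channels, time)
        c = self.norm_conv(x).transpose(1, 2)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.norm_out(x)
```

The odd depthwise kernel with symmetric padding preserves sequence length, so blocks can be stacked freely.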

In practice, Conformer-based classifiers adapt this foundation in one of two forms:

  • Monolithic Conformer Stack: An input sequence is projected to a high-dimensional embedding and passed through a stack of Conformer blocks. Temporal pooling or a classification-token mechanism produces a fixed-dimensional representation for subsequent dense classification heads (e.g., as in (Marocchi et al., 26 Jan 2026, Shin et al., 2023, Zuazo et al., 1 Dec 2025)).
  • Modular Routing Structures: The base Conformer stack is partitioned into “experts” or modular blocks, with domain-aware gating via a dedicated classifier (the “router”) determining expert selection per input (exemplified by (Gibson et al., 2024)).
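The monolithic form reduces to a simple pattern: project, encode, pool over time, classify. A hedged sketch, using a plain TransformerEncoder as a stand-in for a Conformer stack; all names and sizes here are illustrative:

```python
import torch
import torch.nn as nn

class PooledClassifier(nn.Module):
    """Monolithic-stack pattern: encoder blocks -> temporal pooling -> dense head."""

    def __init__(self, in_dim: int = 80, d_model: int = 144,
                 n_blocks: int = 3, n_classes: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)  # project features to model dim
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.proj(feats))   # (batch, time, d_model)
        pooled = h.mean(dim=1)               # temporal average pooling
        return self.head(pooled)             # (batch, n_classes) logits
```

Swapping the encoder for a stack of Conformer blocks leaves the surrounding projection, pooling, and head unchanged.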

The distinguishing structural features of each task's classifier are summarized in the following table:

| Application Domain | Input Features | Core Conformer Variant | Classification Output Mechanism |
|---|---|---|---|
| Noise-aware ASR | log-Mel FBANK, 80-dim | Modular routing with CNN router, early expert blocks | One-hot domain token or CNN-derived gating, CTC+attn loss (Gibson et al., 2024) |
| Audio Deepfake | LFCC frames, conv subsample | 6-block stack with hierarchical pooling, multi-level CLS aggregation | Multi-level classification tokens, OC-Softmax losses (Shin et al., 2023) |
| Biomedical Audio | MFCC (multi-channel), 128-dim | 3-block stack, contrastive+CE loss | Avg pooling + MLP or SVM, subject-level voting (Marocchi et al., 26 Jan 2026) |
| MEG Decoding | Raw MEG, 306-ch | Macaron conformer, dynamic config per task | Task-specific heads, mean pooling, weighted loss (Zuazo et al., 1 Dec 2025) |
| Visual Recognition | RGB image, 224×224 | Dual-branch (CNN + Transformer), FCU fusion | Logit summation from CNN/Transformer branches (Peng et al., 2021) |

2. Input Modalities and Preprocessing Strategies

The input preparation for conformer-based classifiers is tightly coupled to the signal domain:

  • Audio (speech, PCG): Features include log-Mel filterbank energies (Gibson et al., 2024), LFCC with delta/delta-delta for deepfake (Shin et al., 2023), or multi-channel MFCC for biomedical tasks (Marocchi et al., 26 Jan 2026). Preprocessing may include normalization, bandpass filtering (e.g., 25–450 Hz for heart sounds), and per-channel energy-based noisy-segment rejection to enhance noise robustness (Marocchi et al., 26 Jan 2026).
  • MEG: Raw, multi-channel input is projected via 1D convolution to a lower embedding dimension; task-dependent normalization and MEG-specific augmentation are integral (Zuazo et al., 1 Dec 2025).
  • Vision: Standard 2D image preprocess provides input to a hybrid Conformer backbone, with parallel stem and separate pathways for convolutional and Transformer blocks (Peng et al., 2021).
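The band-pass step mentioned for heart sounds can be sketched with SciPy. This uses a zero-phase Butterworth filter; the 25–450 Hz band follows the text, but the filter family and order are illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_pcg(signal: np.ndarray, fs: float,
                 low: float = 25.0, high: float = 450.0,
                 order: int = 4) -> np.ndarray:
    """Zero-phase Butterworth band-pass, e.g. 25-450 Hz for heart sounds."""
    # second-order sections are numerically stabler than (b, a) polynomials
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # forward-backward filtering avoids phase distortion of the heart sounds
    return sosfiltfilt(sos, signal)
```

Applied to a PCG recording, this removes baseline wander and respiration artifacts below 25 Hz while keeping the S1/S2 frequency range intact.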

This suggests that tailoring the feature-extraction pipeline and normalization to the characteristics of the signal is critical to the final classifier's robustness and accuracy.

3. Conformer Variants and Domain-Specific Adaptations

Current implementations employ domain-specific conformer variants:

  • Modular Routing (Noise-Aware ASR): A small CNN classifier is trained on environment/domain labels and attached as a router to gate between Conformer experts. Fixed routing prepends a one-hot token, while learned routing uses softmax gating at layer-level granularity. Ablations confirm that simple binary (clean/noisy) gating outperforms fine-grained noisy class routing due to non-separability in spectral features and data fragmentation (Gibson et al., 2024).
  • Hierarchical Aggregation (Deepfake Detection): Temporal redundancy is mitigated by interleaving pooling layers, with CLS tokens aggregating increasingly abstracted representations through the Conformer stack. Multiple auxiliary classifiers facilitate deep supervision, with final classification via an embedding projected from pooled multi-level CLS and sequence tokens (Shin et al., 2023).
  • Macaron Configuration (MEG): The sequence of submodules is adjusted (i.e., FFN-attention-conv-FFN-norm) per task, with the number of blocks, heads, and expansion factors tuned empirically for raw sensor decoding (Zuazo et al., 1 Dec 2025).
  • FCU Fusion (Vision): Dual pathways for local (CNN) and global (Transformer) features are fused at each layer via Feature Coupling Units, allowing interactive exchange of spatial dynamics and semantic context (Peng et al., 2021).
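The learned softmax gating described under modular routing can be illustrated as follows. Experts here are plain linear layers standing in for Conformer blocks, and per-utterance gating from a time-averaged input is one plausible granularity, not the exact scheme of the cited work:

```python
import torch
import torch.nn as nn

class GatedExpertLayer(nn.Module):
    """Learned-routing sketch: a softmax gate mixes parallel expert blocks."""

    def __init__(self, d_model: int, n_experts: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)  # lightweight gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # one gate distribution per utterance, from the time-averaged input
        gate = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # (B, E)
        # run all experts and mix their outputs by the gate weights
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (outs * gate[:, None, None, :]).sum(dim=-1)
```

Fixed routing, by contrast, would replace the learned gate with a one-hot vector derived from a separately trained domain classifier.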

4. Training Recipes and Optimization Regimes

Training pipelines adhere to the following patterns:

  • Loss Functions: Cross-entropy is the default, augmented by hybrid losses for contrastive learning (Marocchi et al., 26 Jan 2026), OC-Softmax for verification (Shin et al., 2023), weighted softmax for class imbalance (Zuazo et al., 1 Dec 2025), or composite CTC+attention for sequence transduction (Gibson et al., 2024).
  • Augmentation and Regularization: Data augmentation is extensively employed—SpecAugment, frequency masking, codec perturbation, additive noise (ASVspoof datasets), or MEG-specific masking (“MEGAugment” with bandstop filters) are common (Gibson et al., 2024, Zuazo et al., 1 Dec 2025, Shin et al., 2023). Standard weight decay and dropout are incorporated.
  • Optimization: Adam or AdamW is ubiquitously used, with task-specific schedules (cosine, inverse sqrt decay, exponential decay). Early stopping and model selection by metric plateau (e.g., F1-macro) are routine.
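As a concrete augmentation example, SpecAugment-style frequency masking zeroes random frequency bands of a spectrogram. The mask widths and counts below are illustrative defaults, not values from the cited recipes:

```python
import numpy as np

def freq_mask(spec: np.ndarray, max_width: int = 8, n_masks: int = 2,
              rng=None) -> np.ndarray:
    """SpecAugment-style frequency masking on a (freq, time) spectrogram.

    Zeroes n_masks random frequency bands, each up to max_width bins wide.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()  # leave the input untouched
    n_freq = spec.shape[0]
    for _ in range(n_masks):
        width = int(rng.integers(0, max_width + 1))
        start = int(rng.integers(0, max(1, n_freq - width)))
        out[start:start + width, :] = 0.0
    return out
```

Time masking is the transposed analogue; both are typically applied on the fly during training only.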

Empirically, the data indicate that jointly fine-tuning modular routers after initial classifier pretraining yields superior environment adaptation in ASR (Gibson et al., 2024), and that dynamic grouping/normalization is critical for MEG phoneme classification (Zuazo et al., 1 Dec 2025).

5. Application Benchmarks and Performance Results

Conformer-based classifiers have demonstrated the following results:

| Task | Dataset | Backbone | Metric | Performance | Reference |
|---|---|---|---|---|---|
| Noise-aware ASR | CHiME-4 | Modular Conformer | WER | Clean: 10.0% (–6.4%), Real: 26.3% (–3.4%) | (Gibson et al., 2024) |
| Audio Deepfake Detection | ASVspoof21 DF | HM-Conformer | EER | 15.71% (–16% rel. vs. baseline) | (Shin et al., 2023) |
| CAD Detection (biomedical audio) | PCG (297 subjects) | MFCC-Conformer | Acc/UAR | 78.4% / 78.2% | (Marocchi et al., 26 Jan 2026) |
| MEG Speech Detection | LibriBrain2025 | MEGConformer | F1-macro | 88.90% | (Zuazo et al., 1 Dec 2025) |
| Image Classification | ImageNet-1k | Conformer-S | Top-1 Acc | 83.4% | (Peng et al., 2021) |

Notably, architectural modifications—hierarchical pooling, multi-level tokens, and expert routing—consistently yield substantive improvements over vanilla conformer and prior CNN/Transformer baselines. The "Conformer-S" backbone, for example, achieves +2.3% Top-1 accuracy over DeiT-B for ImageNet (with comparable compute) and +3.7 mAP in COCO detection (Peng et al., 2021).

6. Design Considerations and Empirical Insights

Experimental ablations highlight several critical findings:

  • Binary domain classifiers outperform fine-grained routing in noisy acoustic environments due to poor linear separability after log-Mel front-ends (Gibson et al., 2024).
  • Hierarchical pooling with max reduction better condenses redundant information than average pooling or no pooling (Shin et al., 2023).
  • Dynamic 100-sample averaging and instance-level input normalization are essential for robust MEG phoneme classification (Zuazo et al., 1 Dec 2025).
  • Jointly aggregating multi-level CLS tokens and sequence-level embeddings substantially enhances audio verification accuracy (Shin et al., 2023).
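The max reduction compared in the second bullet amounts to strided max pooling along the time axis. A minimal sketch; the cited system interleaves such layers with CLS-token aggregation, which is omitted here:

```python
import torch
import torch.nn.functional as F

def hierarchical_max_pool(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Shrink the time axis of a (batch, time, dim) sequence by max reduction."""
    # max_pool1d pools the last axis, so swap time into that position and back
    return F.max_pool1d(x.transpose(1, 2), kernel_size=stride,
                        stride=stride).transpose(1, 2)
```

Interleaving such a layer between Conformer blocks halves the sequence length at each stage, forcing later blocks to operate on progressively condensed representations.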

A plausible implication is that modularity—whether in the form of routing experts, hierarchical processing, or hybrid fusion—provides a generally effective inductive bias for classification tasks where feature locality and global context are both critical, especially under distribution shift or adversarially noisy conditions.

7. Practical Recommendations

Domain-specific workflow choices for effective Conformer-based classification include:

  • Insert modular or expert routing at early Conformer layers to adapt to domain variations (Gibson et al., 2024).
  • Employ lightweight CNN routers for robust domain detection, prioritizing binary splits when noise types are not linearly separable (Gibson et al., 2024).
  • Leverage hierarchical pooling and multi-level classification tokens for tasks requiring multi-resolution evidence aggregation (Shin et al., 2023).
  • Use hybrid loss functions and balanced weighting schemes to address class imbalance or verification/identification objectives (Shin et al., 2023, Zuazo et al., 1 Dec 2025).
  • Run the dual branches of hybrid architectures in parallel, fuse convolution and normalization layers at inference, and exploit mixed precision and parallelization to maximize efficiency (Peng et al., 2021).
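The convolution-normalization fusion mentioned in the last bullet is a standard inference-time optimization: an eval-mode BatchNorm can be folded into the preceding convolution's weights and bias. A sketch in PyTorch, assuming 2D convolutions; the helper name is illustrative:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm into the preceding convolution.

    With s = sqrt(running_var + eps): w' = w * gamma/s,
    b' = (b - running_mean) * gamma/s + beta.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding,
                      conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    # scale each output channel's kernel
    fused.weight.data = conv.weight.data * scale[:, None, None, None]
    bias = conv.bias.data if conv.bias is not None \
        else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused
```

The fused layer computes the identical function with one kernel launch instead of two, which is why this folding is routine before deployment.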

These patterns, validated across modalities, support the generalizability and efficiency of Conformer-based classifiers in varied research and applied scenarios.
