Convolutional Audio Transformer (CAT)
- CAT is a neural architecture that integrates convolutional modules for local feature extraction with transformer self-attention for capturing long-range dependencies in audio signals.
- It employs a hierarchical, multi-scale design with advanced training strategies, such as representation regularization, achieving state-of-the-art results on benchmarks like AudioSet and ESC-50.
- Variants including speech recognition CATs, MuSLCAT for music modeling, and multi-resolution CATs demonstrate its flexibility and efficiency across diverse audio processing tasks.
The Convolutional Audio Transformer (CAT) is a neural architecture that integrates convolutional and transformer-based self-attention mechanisms for efficient audio and speech representation learning. CAT variants have been developed for applications in speech recognition, audio tagging, and music modeling, leveraging convolutional modules for local feature extraction and self-attention for global context modeling. These systems are distinguished by their hierarchical, multi-scale organization and advanced training strategies, achieving state-of-the-art efficiency and performance across a range of audio understanding benchmarks (Jeon et al., 2023, Han et al., 29 Jan 2026, Middlebrook et al., 2021).
1. Architectural Foundations
CAT models are defined by the systematic combination of convolutional modules—which confer local, shift-invariant inductive biases—and transformer self-attention modules, which facilitate long-range dependency modeling. A canonical CAT encoder stacks layers that alternate between these two operations:
- Convolutional Module: Typically a sequence of Layer Normalization, pointwise expansion, depthwise convolution, nonlinearity (SwiGLU or similar), and projection, followed by a residual connection.
- Self-Attention Module: A stack including Layer Normalization, multi-head self-attention (MHA), feedforward sublayers (often with SwiGLU), and residual connections.
The forward pass for a single CAT layer is thus:
```python
def CAT_Layer(X):
    # convolutional module
    A = LayerNorm(X)
    A = Linear_pw_expand(A)
    A = DepthwiseConv1D(kernel_size=k)(A)
    A = SwiGLU(A)
    A = Linear_pw_project(A)
    C = X + A
    # self-attention module
    B = LayerNorm(C)
    B = MultiHeadSelfAttention(B)
    C2 = C + B
    # feedforward module
    D = LayerNorm(C2)
    D = FFN(D)
    Y = C2 + D
    return Y
```
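The pseudocode above can be made concrete as a minimal NumPy sketch. The specific shapes, random stand-in weights, single-head attention, and the exact SwiGLU split are simplifying assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, Dp, k = 16, 8, 12, 5  # frames, model dim, expanded dim, conv kernel

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swish(x):
    return x / (1.0 + np.exp(-x))

def depthwise_conv1d(x, w):
    # x: (T, C), w: (k, C); 'same' padding, one filter per channel
    kk, _ = w.shape
    pad = kk // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        out[t] = np.einsum('kc,kc->c', xp[t:t + kk], w)
    return out

# random parameters standing in for learned weights
W_expand = rng.normal(0, 0.1, (D, 2 * Dp))
W_dw     = rng.normal(0, 0.1, (k, 2 * Dp))
W_proj   = rng.normal(0, 0.1, (Dp, D))
W_q, W_k, W_v, W_o = (rng.normal(0, 0.1, (D, D)) for _ in range(4))
W_ff1, W_ff2 = rng.normal(0, 0.1, (D, 4 * D)), rng.normal(0, 0.1, (4 * D, D))

def cat_layer(X):
    # convolutional module with SwiGLU gating
    A = layer_norm(X) @ W_expand
    A = depthwise_conv1d(A, W_dw)
    A1, A2 = np.split(A, 2, axis=-1)
    A = (A1 * swish(A2)) @ W_proj
    C = X + A
    # self-attention module (single head for brevity)
    B = layer_norm(C)
    Q, K, V = B @ W_q, B @ W_k, B @ W_v
    att = np.exp(Q @ K.T / np.sqrt(D))
    att /= att.sum(-1, keepdims=True)
    C2 = C + att @ V @ W_o
    # feedforward module
    Dn = layer_norm(C2)
    return C2 + swish(Dn @ W_ff1) @ W_ff2

X = rng.normal(size=(T, D))
Y = cat_layer(X)
print(Y.shape)  # (16, 8)
```

Note that every sub-module is residual, so the layer preserves the (T, D) shape and stacks cleanly.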
In music modeling, as in MuSLCAT, CAT architectures process raw waveforms via two parallel convolutional-attention networks targeting different spectral bands, followed by a modified BERT backend. Each frontend is constructed from deep stacks of convolution, Squeeze-and-Excitation (SE), and Attention-Augmented Convolution (AAC) blocks (Middlebrook et al., 2021).
Advanced designs such as the Multi-resolution Block (Han et al., 29 Jan 2026) generalize this principle: they extract and fuse features at multiple temporal–spectral patch scales, with hierarchical downsampling and fusion operations prior to the transformer backbone.
2. Mathematical Formalism
The convolutional block in CAT is parameterized as follows for input $X \in \mathbb{R}^{T \times D}$:
- Pointwise expansion: $A = \mathrm{LayerNorm}(X)\,W_{\mathrm{pw1}}$, with $W_{\mathrm{pw1}} \in \mathbb{R}^{D \times 2D'}$,
- Depthwise convolution: $A' = \mathrm{DWConv}_{k}(A)$, applied independently per channel with kernel size $k$,
- SwiGLU nonlinearity: split $A'$ along channels into $(A_1, A_2)$; output $A_1 \odot \mathrm{Swish}(A_2)$,
- Projection: $B = \left(A_1 \odot \mathrm{Swish}(A_2)\right) W_{\mathrm{pw2}}$, with $W_{\mathrm{pw2}} \in \mathbb{R}^{D' \times D}$, added back to $X$ residually.
Self-attention is parameterized:
- $Q = XW_Q$, $K = XW_K$, $V = XW_V$, $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$,
- Multi-head: $\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W_O$ (Jeon et al., 2023).
Multi-resolution CAT modules (Han et al., 29 Jan 2026) define a set of resolution parameters $\{r_i\}$ (e.g., $\{4, 8, 16\}$); each level applies a 2D convolutional "patch embed", a stack of convolutions, and projects feature maps to a consistent downsampled size. Fused, these yield patch tokens for the transformer.
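A hedged NumPy sketch of this multi-resolution fusion: each resolution r pools the spectrogram into r×r patches, embeds them, and reduces every level to a shared token grid before fusing. The pooling-based patch embed, sum-fusion, and the shared 4×4 grid are illustrative assumptions, not the paper's exact operators:

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D = 64, 64, 32            # mel bins, frames, embedding dim
spec = rng.normal(size=(F, T))  # input spectrogram

def patch_embed(x, r, W):
    # split into non-overlapping r x r patches, project each to D dims
    Fp, Tp = x.shape[0] // r, x.shape[1] // r
    patches = x[:Fp * r, :Tp * r].reshape(Fp, r, Tp, r).transpose(0, 2, 1, 3)
    return patches.reshape(Fp * Tp, r * r) @ W  # (num_patches, D)

def pool_tokens(tokens, grid_from, grid_to):
    # average-pool a square token grid down to a coarser shared grid
    g = grid_from // grid_to
    t = tokens.reshape(grid_from, grid_from, -1)
    t = t.reshape(grid_to, g, grid_to, g, -1).mean(axis=(1, 3))
    return t.reshape(grid_to * grid_to, -1)

resolutions = [4, 8, 16]
shared_grid = 4                  # coarsest level: 64/16 = 4 -> 16 tokens
fused = np.zeros((shared_grid * shared_grid, D))
for r in resolutions:
    W = rng.normal(0, 0.1, (r * r, D))
    tok = patch_embed(spec, r, W)             # (64/r)^2 tokens per level
    fused += pool_tokens(tok, 64 // r, shared_grid)
print(fused.shape)  # (16, 32)
```

The key structural point survives the simplifications: all levels end at the same token count, so the transformer backbone sees one fused sequence.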
3. CAT Variants and Specialized Blocks
Distinct CAT variants reflect application-specific requirements:
- Speech Recognition CATs (Jeon et al., 2023): Employ alternating convolution and attention for efficient ASR encoding. Fast-convolution variants mirror Conformer and Squeezeformer architectures. Quantized versions leverage low-bit (BiT) weights, with pure self-attention showing the best resilience to quantization-induced error accumulation.
- Music Modeling CATs (MuSLCAT) (Middlebrook et al., 2021): Fuse parallel frequency-biased CAN frontends with AAC blocks—layerwise concatenations of convolutional and self-attention outputs—and downstream Transformer backends for song- or clip-level tagging.
- Multi-scale Audio Understanding CATs (Han et al., 29 Jan 2026): Include the Multi-resolution Block for hierarchical feature aggregation, extensive patch masking, and additional Representation Regularization objectives to guide learning via external semantic encoders.
A key innovation is the AAC block, where convolution and multi-head attention outputs are concatenated along the channel dimension to form enriched local-global feature maps (Middlebrook et al., 2021).
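The AAC concatenation described above can be sketched in a few lines of NumPy. Channel counts, the 1D setting, and single-head attention are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C_in, C_conv, C_att = 32, 16, 24, 8

def conv1d_same(x, w):
    # x: (T, C_in), w: (k, C_in, C_out); 'same' padding
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.einsum('kc,kco->o', xp[t:t + k], w)
                     for t in range(x.shape[0])])

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    a = np.exp(s - s.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ V

x = rng.normal(size=(T, C_in))
w_conv = rng.normal(0, 0.1, (3, C_in, C_conv))
Wq, Wk, Wv = (rng.normal(0, 0.1, (C_in, C_att)) for _ in range(3))

conv_out = conv1d_same(x, w_conv)        # local features,  (T, 24)
att_out = self_attention(x, Wq, Wk, Wv)  # global features, (T, 8)
# AAC: concatenate along the channel dimension
aac_out = np.concatenate([conv_out, att_out], axis=-1)
print(aac_out.shape)  # (32, 32)
```

Because the two paths run in parallel on the same input, downstream layers receive local and global features side by side rather than sequentially mixed.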
4. Representation Regularization and Training Strategies
Recent CAT developments introduce auxiliary objectives—Representation Regularization (RR)—inspired by generative modeling (Han et al., 29 Jan 2026). This approach guides the student model by aligning its patch and [CLS] representations with "teacher" outputs from frozen external encoders (e.g., CLAP, AST, Audio-MAE):
Given:
- Student encoder $f_s$ and projector $g_s$
- Teacher encoder $f_t$ (EMA-updated, frozen)
- Alignment head $h$
Losses include:
- Patch prediction: $\mathcal{L}_{\mathrm{patch}} = \frac{1}{|M|} \sum_{i \in M} \left\| h\!\left(f_s(X)\right)_i - f_t(X)_i \right\|_2^2$, over masked patch positions $M$
- CLS token alignment: $\mathcal{L}_{\mathrm{cls}} = \left\| h\!\left(z_s^{\mathrm{cls}}\right) - z_t^{\mathrm{cls}} \right\|_2^2$
- External alignment: $\mathcal{L}_{\mathrm{ext}} = \left\| g_s(z_s) - e(X) \right\|_2^2$, with $e$ a frozen external encoder (e.g., CLAP, AST, Audio-MAE)
Full objective: $\mathcal{L} = \mathcal{L}_{\mathrm{patch}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{ext}} \mathcal{L}_{\mathrm{ext}}$, with weights $\lambda$ balancing the terms. Masking strategies (Inverse Block Masking, 80%) efficiently regularize training.
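A sketch of how the combined RR objective is computed, with random arrays standing in for student, teacher, and external-encoder representations. The MSE form, unit loss weights, and the 80% random mask (in place of Inverse Block Masking) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 64, 32                        # patch tokens, representation dim
mask = rng.random(N) < 0.8           # ~80% masking ratio (assumed random mask)

z_student = rng.normal(size=(N, D))  # aligned student patch outputs
z_teacher = rng.normal(size=(N, D))  # frozen teacher patch outputs
cls_s, cls_t = rng.normal(size=D), rng.normal(size=D)
z_ext_s, z_ext = rng.normal(size=D), rng.normal(size=D)  # external encoder

def mse(a, b):
    return np.mean((a - b) ** 2)

L_patch = mse(z_student[mask], z_teacher[mask])  # only masked positions
L_cls = mse(cls_s, cls_t)
L_ext = mse(z_ext_s, z_ext)
L_total = L_patch + L_cls + L_ext    # unit weights assumed for illustration
print(float(L_total))
```

Only the masked positions enter the patch term, which is what makes high masking ratios cheap: most patch predictions are supervised, but the student only encodes the visible subset.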
Standard optimization uses AdamW, cosine decay schedules, and augmentations such as SpecAugment and mixup. Training on AudioSet unbalanced (AS-2M, 1.9M clips) with full pretrain runs of 400K steps converges CAT 5× faster than prior methods, reaching 37.9% mAP on AS-20K after 20K steps (compared to 100K steps for baselines) (Han et al., 29 Jan 2026).
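Mixup, one of the augmentations mentioned above, can be sketched in NumPy: two clips and their multi-hot label vectors are blended with a Beta-sampled coefficient. The shapes, the 527-class AudioSet-style labels, and alpha = 0.5 are assumed hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))  # log-mels
y1 = np.zeros(527); y1[10] = 1.0    # multi-hot label vectors
y2 = np.zeros(527); y2[200] = 1.0

lam = rng.beta(0.5, 0.5)            # mixing coefficient
x_mix = lam * x1 + (1 - lam) * x2
y_mix = lam * y1 + (1 - lam) * y2   # soft labels on both classes
print(x_mix.shape)  # (128, 64)
```

The blended labels turn the tagging loss into a soft-target objective, which regularizes the model between training examples.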
5. Empirical Results and Efficiency Analysis
CAT architectures deliver strong empirical results across several audio domains:
| Model | Params | Pre-train | AS-2M mAP | AS-20K mAP | ESC-50 Acc | SPCV2 Acc |
|---|---|---|---|---|---|---|
| CAT | 91M | AS-2M | 50.2% | 47.8% | 98.6% | 98.3% |
| Next best SSL (ASDA) | 93M | AS | 49.0% | 41.5% | 96.1% | 98.3% |
- 1-bit quantized CATs (BiT) achieve >90% reductions in model size and FLOPs at the cost of a 2–3× increase in ASR WER. Pure self-attention backbones are more robust to aggressive quantization than mixed conv-attention stacks (Jeon et al., 2023).
- Music tagging with MuSLCAT matches or slightly improves upon SampleCNN+SE at 34% fewer parameters: ROC-AUC 0.8239 (MuSLCAT) vs. 0.8233, PR-AUC 0.2793 vs. 0.2784 on MTG-Jamendo (Middlebrook et al., 2021).
- Representation Regularized CAT outperforms baseline self-supervised models by >6 mAP on AudioSet-20K; ablations confirm that both multi-resolution and RR objectives are necessary for optimal convergence and final accuracy (Han et al., 29 Jan 2026).
Practical profiling reveals that Squeezeformer/Conformer-style CATs reduce storage by ∼64% and FLOPs by ∼75% vs. vanilla transformers (e.g., 22.1 GFLOPs and 132 MB vs. 110.5 GFLOPs and 184.4 MB for comparable setups) (Jeon et al., 2023).
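The storage arithmetic behind 1-bit quantization is easy to verify. The sketch below uses sign-times-scale binarization, a common scheme assumed here rather than taken from the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)  # one weight matrix

alpha = np.abs(W).mean()      # per-tensor scale (common choice, assumed)
W_bin = alpha * np.sign(W)    # 1-bit representation: sign + one fp32 scalar

full_bits = W.size * 32
quant_bits = W.size * 1 + 32  # one sign bit per weight + the fp32 scale
reduction = 1 - quant_bits / full_bits
print(f"{reduction:.1%}")     # ~96.9% storage reduction for this layer
```

Per-layer savings approach 1/32 of the original storage, which is why whole-model reductions above 90% are attainable once activations and a few full-precision layers are accounted for.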
6. Design Hyperparameters and Optimization Guidelines
Guidelines for effective CAT design include:
- Convolutional kernel size: 15–31 for a receptive field of 0.3–0.6s at 50Hz frame rate (Jeon et al., 2023).
- Expansion ratio: D′ ≈ 1.5–2×D for pointwise convolution projections.
- Multi-head attention: Maintain the per-head dimension $d_h = D/h$ roughly constant as $D$ scales, rather than growing the head count faster than the model width.
- Quantization: 8-bit quantization yields <10% storage reduction with minimal loss; 1-bit provides 93% reduction but triples error unless using pure self-attention (Jeon et al., 2023).
- Model shape: Deep & narrow (L > 12, D < 300) for on-device or low-memory; shallow & wide (L < 8, D > 512) for higher accuracy at increased cost.
Multi-resolution setups ({4, 8, 16}) provide the optimal tradeoff between parameter count and performance in audio understanding (Han et al., 29 Jan 2026). Adding a fourth scale ({4, 8, 16, 32}) degrades accuracy while increasing computational cost.
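The receptive-field guideline above is simple to check: a depthwise kernel of size k at a 50 Hz frame rate spans k/50 seconds per layer.

```python
frame_rate = 50.0  # frames per second
for k in (15, 31):
    print(k, k / frame_rate)  # 15 -> 0.3 s, 31 -> 0.62 s
```

So the recommended 15–31 kernel range corresponds directly to the stated 0.3–0.6 s window of local context per convolutional layer.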
7. Extensions and Future Directions
CAT frameworks permit several extensions:
- Sparse and dynamic convolution: Strided or dynamically-learned kernels for efficient time subsampling and adaptive locality.
- Hybrid quantization: Assigning lower bit width to self-attention layers than to convolutional modules to optimize the quantization-robustness tradeoff.
- External guidance and distillation: CATs can be further improved by ensembling or aligning representations with multiple pre-trained semantic encoders (e.g., CLAP, Audio-MAE), employing knowledge distillation for compact deployment models (Han et al., 29 Jan 2026).
- Task transfer: CATs generalize to speech recognition pre-training with CTC or seq2seq heads, and can be adapted for symbolic or generative audio tasks.
- Layerwise regularization: Empirical data indicate that aligning student and teacher representations at the final transformer layer is most beneficial for downstream metrics.
A plausible implication is that the modular and multi-scale design of CAT architectures will continue to offer state-of-the-art tradeoffs in accuracy, inference efficiency, and deployability as audio understanding and generation models expand in scope and capacity (Han et al., 29 Jan 2026).