CLAP: Cross-Layer Attention Pruning

Updated 21 December 2025
  • Cross-Layer Attention Pruning (CLAP) is a method for reducing neural network redundancy by selectively merging or skipping attention modules in both Transformer and convolutional architectures.
  • It uses importance aggregation for KV-group merging and KL-divergence-based techniques to identify and preserve critical attention components during aggressive pruning.
  • CLAP enhances model efficiency in frameworks like Pangu Light and ELA, achieving improved accuracy and faster training with reduced computational overhead.

Cross-Layer Attention Pruning (CLAP) refers to a class of mechanisms for reducing the redundancy and computational footprint of neural networks by selectively pruning attention modules or merging attention parameters across layers. In recent literature, two principal instantiations have emerged: a weight re-initialization and KV-group merging procedure for aggressive pruning in Transformer-based architectures, exemplified by the "Pangu Light" framework (Chen et al., 26 May 2025), and a KL-divergence-based module-skipping protocol for redundant layer-attention modules in deep convolutional networks, notably in the Efficient Layer Attention (ELA) architecture (Li et al., 9 Mar 2025). Both approaches share the objective of preserving model representational capacity and stabilizing training under compression, while differing sharply in their formal methodology and target architectural motifs.

1. Algorithmic Foundations of CLAP

In Transformer-based models, Cross-Layer Attention Pruning is deployed after structured pruning, focusing on the selective harvesting and transfer of high-importance Key-Value (KV) groups from layers scheduled for removal (specifically, from layer l+1 to layer l) (Chen et al., 26 May 2025). The importance of each attention head j in layer l is defined by:

S_{\text{head},l}^{j} = \sum_{\mathbf{X} \in \mathcal{X}_l} \Bigl\| \mathrm{AttnHead}_j(\mathbf{X};\, W_{l,j}^Q, W_{l,j}^K, W_{l,j}^V) \Bigr\|_2

After within-layer head pruning, the importance of each surviving KV-group G_g is computed as:

S_{\text{kv-group}}^{g} = \frac{1}{N_q(G_g)} \sum_{j \in \mathrm{Heads}_s(G_g)} S_{\text{head},\, \mathrm{ori}(j)}^{j}

The top-K KV-groups (across layers l and l+1) are then selected, merged, and used to re-initialize projections in layer l.
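The importance aggregation and joint ranking above can be sketched in a few lines. This is a minimal sketch, not the Pangu Light implementation: the array shapes, group layout, and random calibration data are illustrative stand-ins for real attention-head outputs.

```python
import numpy as np

def head_importance(head_outputs):
    """Sum of L2 norms of one head's outputs over a calibration set X_l.

    head_outputs: (num_samples, d) array standing in for the outputs of
    AttnHead_j on each calibration input X.
    """
    return float(np.linalg.norm(head_outputs, axis=1).sum())

def kv_group_importance(head_scores, group_heads):
    """Mean importance of the surviving query heads in one KV-group."""
    return sum(head_scores[j] for j in group_heads) / len(group_heads)

def select_top_k_groups(group_scores, k):
    """Jointly rank KV-groups from layers l and l+1 and keep the top-K."""
    return sorted(group_scores, key=group_scores.get, reverse=True)[:k]

# Toy example: eight heads split into four KV-groups across two layers.
rng = np.random.default_rng(0)
head_scores = {j: head_importance(rng.normal(size=(8, 16))) for j in range(8)}
groups = {("l", 0): [0, 1], ("l", 1): [2, 3],
          ("l+1", 0): [4, 5], ("l+1", 1): [6, 7]}
group_scores = {g: kv_group_importance(head_scores, hs) for g, hs in groups.items()}
kept = select_top_k_groups(group_scores, k=2)
```

The retained `kept` keys identify which KV-groups (from either layer) survive and seed the re-initialized projections of layer l.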

Separately, in convolutional backbones augmented with multi-head retrieval layer attention (MRLA), CLAP quantifies redundancy between adjacent layer attention distributions via KL-divergence (Li et al., 9 Mar 2025):

D_{\mathrm{KL}}(P \,\|\, P') = \sum_{i=1}^{l} p_i \log \frac{p_i}{p_i'} + \epsilon \log \frac{\epsilon}{p_{l+1}'}

Here, P and P' are the attention distributions for layers l and l+1, respectively; since P has one fewer entry than P', it is padded with a small constant ε so the two supports match, which yields the final term. Small divergence indicates near-redundant attention mixing, and such modules can be adaptively pruned using Enhanced Beta Quantile Mapping (EBQM).

2. Motivation and Theoretical Justifications

CLAP is motivated by the adverse effects of joint width and depth reductions in large-scale neural networks. Depth pruning often severs vital attention subspaces that are non-redundant, causing sharp accuracy drops. By merging or skipping highly informative attention elements across layers, CLAP mitigates the representational deficit while providing a stabilized initialization for subsequent fine-tuning (Chen et al., 26 May 2025).

In ELA, the motivation is to address feature redundancy across sequential layer-attention modules, wherein attention weights learned by adjacent modules often become similar—leading to inefficient feature recombination and unnecessary training cost (Li et al., 9 Mar 2025). KL-divergence acts as an empirical redundancy detector, enabling precise identification and removal of modules that offer little additional value.

A plausible implication is that cross-layer attention pruning adapts the model's architectural depth and width in a content-aware fashion, rather than through naïve fixed-threshold or uniform strategies.

3. Step-by-Step Procedures and Pseudocode

For Transformer-based CLAP, the process following aggressive pruning is as follows (Chen et al., 26 May 2025):

  1. Prune low-importance query heads within each KV-group in the retained (l) and soon-to-be-pruned (l+1) layers.
  2. Compute each KV-group's importance as the mean importance of its surviving heads.
  3. Jointly rank all KV-groups from both layers.
  4. Select the top-K groups to match the capacity of the original layer l.
  5. Initialize new projection matrices for layer l and successively copy (merge) parameters from the selected KV-groups.
  6. Replace layer l's projections with the merged matrices.
  7. Remove layer l+1 from the architecture.
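Steps 5–6, the merge and re-initialization, amount to a block-wise copy of the selected group parameters into fresh projection matrices. The sketch below uses an illustrative (layer, group) keying and uniform group width; these are assumptions for clarity, not the paper's exact layout.

```python
import numpy as np

def merge_kv_groups(layers, selected, d_model):
    """Re-initialize layer l's K/V projections from selected KV-groups.

    layers: dict mapping (layer, group) -> {"Wk": ..., "Wv": ...},
    each matrix of shape (d_model, d_group). selected: (layer, group)
    keys in joint-ranking order. Parameters are copied block-by-block
    into fresh projection matrices for layer l.
    """
    d_group = next(iter(layers.values()))["Wk"].shape[1]
    Wk = np.zeros((d_model, d_group * len(selected)))
    Wv = np.zeros_like(Wk)
    for slot, key in enumerate(selected):
        cols = slice(slot * d_group, (slot + 1) * d_group)
        Wk[:, cols] = layers[key]["Wk"]  # step 5: copy (merge) group params
        Wv[:, cols] = layers[key]["Wv"]
    return Wk, Wv  # step 6: these replace layer l's projections

# Toy example: one group survives from each of layers l and l+1.
layers = {("l", 0):   {"Wk": np.full((4, 2), 1.0), "Wv": np.full((4, 2), 2.0)},
          ("l+1", 0): {"Wk": np.full((4, 2), 3.0), "Wv": np.full((4, 2), 4.0)}}
Wk, Wv = merge_kv_groups(layers, [("l+1", 0), ("l", 0)], d_model=4)
```

After the merge, step 7 simply drops layer l+1 from the module list, since its high-importance groups now live in layer l.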

For ELA, CLAP employs EBQM for module selection (Li et al., 9 Mar 2025):

  • Compute KL-divergences between adjacent modules.
  • At designated pruning epochs, sort the KL values, perform γ-level (e.g., γ = 0.5) quantile mapping, normalize the scores, and apply the Beta CDF for smoothed thresholding.
  • Mark modules as "to keep" or "to skip" using the EBQM output.
  • Apply the retention mask to select which layer attention modules remain active in the final backbone.

| CLAP Variant | Redundancy Detection | Pruning Mechanism |
| --- | --- | --- |
| Transformer-based (Chen et al., 26 May 2025) | Importance aggregation | KV-group merging |
| ELA (ConvNet) (Li et al., 9 Mar 2025) | KL-divergence / EBQM | Dynamic module skipping |
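The EBQM selection steps above can be sketched as a normalize → Beta-CDF → γ-quantile pipeline. This is only an illustration of that idea, not the paper's exact mapping; it uses the closed-form Beta(2, 2) CDF (3z² − 2z³) to avoid a SciPy dependency.

```python
import numpy as np

def ebqm_mask(kl_values, gamma=0.5):
    """Sketch of EBQM-style module selection from adjacent-module KL scores.

    Normalizes KL values to [0, 1], smooths them through a Beta(2, 2)
    CDF, and skips every module whose smoothed score falls below the
    gamma-level quantile. True = keep module, False = skip it.
    """
    kl = np.asarray(kl_values, dtype=float)
    z = (kl - kl.min()) / (kl.max() - kl.min() + 1e-12)  # normalize scores
    smoothed = 3 * z**2 - 2 * z**3                       # Beta(2, 2) CDF
    return smoothed >= np.quantile(smoothed, gamma)      # gamma-level cut

# Low-KL (redundant) modules are skipped; high-KL modules are kept.
print(ebqm_mask([0.01, 0.02, 0.5, 0.8]).tolist())
```

The boolean mask plays the role of the retention mask applied to the backbone's layer attention modules.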

4. Implementation Considerations and Integration

CLAP is integrated immediately after standard pruning steps and prior to re-training. In Transformer-based applications such as Pangu Light, no additional learnable parameters are introduced; only high-importance weights are reused from pruned layers, and the KV-group structure is preserved at the module level (Chen et al., 26 May 2025). This design ensures that query-to-KV mappings remain semantically consistent.

In convolutional ELA structures, CLAP acts as a dynamic module selector within the backbone, preserving the forward path for layers bypassed during pruning. EBQM is enacted at specific scheduled epochs, and stability is ensured by windowed averaging of KL scores and smoothed thresholding (Li et al., 9 Mar 2025). For detection models, CLAP is applied analogously with modified hyperparameters.

Both protocols are compatible with standard stochastic gradient optimizers (SGD with momentum), canonical learning-rate schedules, and data augmentation pipelines. After pruning, training resumes with usual optimization recipes; in Pangu Light, knowledge distillation is used during recovery (Chen et al., 26 May 2025).

5. Empirical Validation and Benchmark Results

Pangu Light, augmented with CLAP, achieves notable improvements in model accuracy and efficiency after aggressive pruning (Chen et al., 26 May 2025). On six language modeling and reasoning benchmarks, CLAP provides an average score increase of +2.9 over the Minitron baseline, with larger gains on MMLU and C-Eval. Combining CLAP with stabilized LayerNorm re-initialization further raises the average gain to +3.6. Such preservation of high-importance attention subspaces enables deeper pruning with diminished accuracy loss.

In ELA, CLAP yields 30–35% faster training while boosting top-1 classification accuracy across CIFAR and ImageNet (e.g., ResNet-20 baseline: 91.35% → ELA: 92.45% for CIFAR-10) (Li et al., 9 Mar 2025). Object detection performance is maintained or modestly improved (Faster R-CNN AP: MRLA-L 40.4 → ELA 40.5). Ablation studies demonstrate that EBQM outperforms threshold-based and fixed-layer pruning (ResNet-56/CIFAR-100: EBQM 74.31%±0.24 vs. threshold 73.89%±0.44).

Ablations over the distribution used in the quantile mapping mechanism (Beta vs. Gamma vs. Normal) likewise favor the Beta CDF for stability and accuracy.

CLAP should be distinguished from naïve thresholding, fixed-layer skipping, and scalar-importance pruning. Its aggregation-based KV-group merging (Transformers) and KL-divergence-driven selection (layer attention modules) are more robust, as supported by ablation results (Chen et al., 26 May 2025, Li et al., 9 Mar 2025). The smoothed quantile mapping provided by EBQM in ELA avoids abrupt accuracy collapses and facilitates retraining convergence.

However, CLAP requires accurate estimation of head and group importance (for Transformers) or reliable KL-divergence computation (for layer-attention modules), both of which may be sensitive to calibration duration, the choice of averaging windows, and the representational diversity of inputs. This suggests a need for further research on adaptive importance/redundancy quantification under domain shift or adversarial corruptions.

A plausible implication is that CLAP-style cross-layer attention selection protocols can be generalized to diverse architectures for more granular model adaptation and compression, particularly as neural backbone complexity increases. The modularity and non-parametric weight transfer in CLAP bear potential for synergistic integration with knowledge distillation, structured sparsity, and dynamic inference acceleration.
