Compression-AliGnEd Attack (CAGE) in ML
- CAGE is a family of adversarial techniques that exploit compression weaknesses in ML models by manipulating token rankings and planting compression-activated backdoor triggers.
- The methodology employs gradient-based optimization to disrupt token importance, enabling both poisoning attacks and robust model exfiltration under lossy compression.
- Empirical findings show that CAGE can lower robust accuracy, achieve high attack success rates, and drastically reduce data exfiltration times by aligning with compression mechanics.
The Compression-AliGnEd Attack (CAGE) encompasses a family of adversarial methodologies exploiting the interplay between compression mechanisms and model vulnerability in modern machine learning systems. The CAGE concept spans three primary domains: (1) adversarial attacks on compressed large vision-language models (LVLMs) exploiting token selection instability, (2) data poisoning attacks leveraging signal-preserving compressions as natural backdoor triggers, and (3) model exfiltration attacks where adversaries optimize compression pipelines to rapidly transmit model parameters through constrained communication channels. Core to all variants is adversarial alignment: the attack is informed by or targeted toward the specifics of the deployed compression, maximizing effectiveness while evading standard detection and mitigation strategies.
1. Compression Alignment in Vision-Language Attacks
CAGE attacks on LVLMs target the vulnerability induced by token pruning or merging. Modern LVLMs extract long visual token sequences and then aggressively compress them using plug-and-play schemes (e.g., VisionZip, VisPruner) to meet efficiency constraints (Zhang et al., 29 Jan 2026). Under compression, models compute token importance, rank tokens, and retain only the highest-scoring subset. This bottleneck introduces a weak link: adversarial input perturbations can manipulate the importance scores, causing the compressor to discard semantically critical tokens and retain corrupted or irrelevant ones (Zhang et al., 17 Jan 2026).
Token Importance Mechanism:
Given visual tokens $\{v_1,\dots,v_N\}$ at compression layer $\ell$, the score $s_i$ of token $v_i$ is computed using multi-head attention, averaging the attention from the selected text-token indices $\mathcal{T}$ onto $v_i$:

$$s_i = \frac{1}{H\,|\mathcal{T}|}\sum_{h=1}^{H}\sum_{t\in\mathcal{T}} A^{(h)}_{t,i}$$

Tokens are then sorted by score, and only the top-$k$ are retained:

$$\mathcal{S} = \operatorname{TopK}\big(\{s_i\}_{i=1}^{N},\,k\big)$$

Perturbations $\delta$ of small norm ($\|\delta\|_\infty \le \epsilon$) can reorder importance, making the compressor preserve the most affected tokens and causing prediction collapse under compression while leaving the uncompressed model's output unchanged (Zhang et al., 17 Jan 2026, Zhang et al., 29 Jan 2026).
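As a toy sketch (illustrative scores and budget, not the papers' implementation), the ranking-prune step and its fragility look like this in Python:

```python
def topk_indices(scores, k):
    """Return indices of the k highest-scoring tokens (the retained set)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Importance scores for four visual tokens on a clean input.
clean_scores = [0.30, 0.29, 0.05, 0.28]
print(topk_indices(clean_scores, 2))  # [0, 1]

# A small input perturbation only needs to nudge the scores slightly to
# push an irrelevant token above a semantically critical one.
perturbed_scores = [0.30, 0.25, 0.31, 0.28]
print(topk_indices(perturbed_scores, 2))  # [2, 0]
```

The retained set flips even though every score moved by at most 0.05, which is why small-norm input perturbations suffice.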
2. CAGE Optimization Objectives
CAGE formalizes the attack with mathematically aligned objectives that explicitly account for the compression stage.
2.1 Feature Disruption and Rank Alignment
CAGE attacks against compressed LVLMs (as in (Zhang et al., 29 Jan 2026)) use two aligned losses:
- Expected Feature Disruption (EFD): focus perturbation on tokens most likely to survive compression:

$$\mathcal{L}_{\mathrm{EFD}} = \sum_{i=1}^{N} p_i\, d\big(f_i(x),\, f_i(x+\delta)\big)$$

where $p_i$ is the survival probability of token $i$ under a prior on unknown compression budgets, and $d(\cdot,\cdot)$ is the normalized cosine distance between clean and perturbed features.
- Rank Distortion Alignment (RDA): encourage the compressor to select highly corrupted tokens:

$$\mathcal{L}_{\mathrm{RDA}} = \sum_{i=1}^{N} \tilde{s}_i(x+\delta)\, d_i$$

with $\tilde{s}_i = \operatorname{softmax}_i\big(s(x+\delta)/\tau\big)$ the softened importance distribution and $d_i = d\big(f_i(x), f_i(x+\delta)\big)$ the per-token feature distortion.
The overall objective is

$$\max_{\|\delta\|_\infty \le \epsilon}\; \mathcal{L}_{\mathrm{EFD}} + \lambda\,\mathcal{L}_{\mathrm{RDA}}$$

Gradient-based optimization (PGD-style sign-gradient ascent with projection onto the $\ell_\infty$ ball) is employed, and the approach is agnostic to the compressor's specifics as long as token scores are accessible (Zhang et al., 29 Jan 2026).
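The combined objective can be sketched in plain, dependency-free Python; the survival prior, softmax weighting, and lambda below are assumptions for illustration, not the papers' exact code:

```python
import math

def cosine_dist(u, v):
    """Normalized cosine distance between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def softmax(xs, tau=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cage_objective(clean_feats, adv_feats, survival_probs, adv_scores, lam=1.0):
    """EFD + lambda * RDA; the attacker ascends this with PGD-style steps."""
    dists = [cosine_dist(c, a) for c, a in zip(clean_feats, adv_feats)]
    # EFD: distortion weighted by each token's chance of surviving pruning.
    efd = sum(p * d for p, d in zip(survival_probs, dists))
    # RDA: reward importance rankings that place the most-corrupted
    # tokens on top, so the compressor keeps the damaged ones.
    rda = sum(s * d for s, d in zip(softmax(adv_scores), dists))
    return efd + lam * rda
```

In a real attack the feature extractor and scores come from the target LVLM and the ascent uses automatic differentiation; this sketch only shows how the two terms combine.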
2.2 White-box and Black-box Attack Variants
White-box attacks optimize over loss terms that directly manipulate token ranking and semantic drift under known compression. Black-box variants “transfer” universal templates (e.g., high-importance borders) learned on surrogate models to trigger failures in unseen models/configurations by boosting or suppressing ranking scores in plausible compression boundaries (Zhang et al., 17 Jan 2026).
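The transferred "high-importance border" idea can be pictured with a small sketch; the frame width and pixel value are illustrative assumptions, not values from the paper:

```python
def apply_border_template(image, value, width=2):
    """Overwrite a thin frame of pixels with a universal template.

    The frame is crafted on a surrogate model to attract high importance
    scores, biasing the compressor's ranking on unseen target models.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input stays untouched
    for i in range(h):
        for j in range(w):
            if i < width or j < width or i >= h - width or j >= w - width:
                out[i][j] = value
    return out

# A 4x4 image with a width-1 frame: only the central 2x2 patch is untouched.
img = [[0] * 4 for _ in range(4)]
stamped = apply_border_template(img, 255, width=1)
```

Because the template does not depend on the target model's weights, it transfers across models and compression configurations in the black-box setting.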
3. CAGE in Poisoning and Backdoor Attacks
In the poisoning context, CAGE exploits the prevalence and properties of lossy compression algorithms or feature-space alignment to plant robust triggers:
Natural Compression Artifacts as Triggers:
By using a standard compressor such as JPEG, JPEG2000, or WEBP as the "trigger" transformation $T$, an adversary poisons a fraction of the data:
- For each sample $(x_i, y_i)$ in the poisoned subset $\mathcal{D}_p$, set $x_i \leftarrow T(x_i)$ (and, in the dirty-label regime, $y_i \leftarrow y_t$)
- Train using standard cross-entropy loss without modifying the architecture or adding explicit triggers

The resulting classifier predicts the target class when presented with compressed images, even in clean-label and low-rate regimes. This formulation achieves near-maximum attack success rates (ASR) with invisibly distributed, natural artifacts (Yang et al., 2023).
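The poisoning step can be sketched end to end; coarse quantization stands in for a real codec such as JPEG, and the step size, poisoning rate, and labels are illustrative:

```python
import random

def lossy_compress(pixels, step=32):
    # Stand-in for a lossy codec such as JPEG: coarse quantization yields
    # small, globally distributed artifacts that act as the trigger.
    return [min(255, (p // step) * step + step // 2) for p in pixels]

def poison_dataset(dataset, rate, target_label, seed=0):
    """Compress (and relabel) a random fraction of the training samples."""
    rng = random.Random(seed)
    out = []
    for x, y in dataset:
        if rng.random() < rate:
            # Dirty-label variant; the clean-label regime would keep y.
            out.append((lossy_compress(x), target_label))
        else:
            out.append((x, y))
    return out
```

Training then proceeds with ordinary cross-entropy on the poisoned set; no architectural change or visible patch is needed.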
Feature Consistency Training:
To counteract the degradation of traditional triggers under compression, feature consistency regularization is introduced (Xue et al., 2022):

$$\mathcal{L}_{\mathrm{fc}} = \big\|\,\phi(x_{\mathrm{trig}}) - \phi\big(C(x_{\mathrm{trig}})\big)\,\big\|_2^2$$

where $\phi$ denotes the network's deep feature extractor and $C$ a compression operation. This loss enforces that the deep representations of (uncompressed, compressed) trigger images remain close, making the backdoor robust to any (even unseen) compression operation. Combined with the standard classification loss, it yields high attack success rates on compressed backdoored inputs, achieving, e.g., ASR(JPEG), ASR(JP2), ASR(WEBP) ≈ 98%, versus <10% for unaligned baselines (Xue et al., 2022).
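A minimal sketch of such a consistency term, assuming a squared-L2 penalty over pooled feature vectors (the paper's exact distance and feature layer may differ):

```python
def feature_consistency_loss(feat_uncompressed, feat_compressed):
    # Penalize representation drift between the trigger image and its
    # compressed counterpart; added to the usual classification loss so
    # the backdoor survives (even unseen) compression operations.
    return sum((a - b) ** 2 for a, b in zip(feat_uncompressed, feat_compressed))
```

During training, both versions of each trigger image are forwarded through the network and this penalty is minimized alongside cross-entropy.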
4. CAGE for Model Exfiltration Attacks
Another CAGE instantiation arises in model-stealing scenarios, targeting the compressibility of neural network weights as the attack vector (Brown et al., 3 Jan 2026). Here, the attacker's goal is to minimize the exfiltration time by compressing weight tensors far beyond the requirements of standard inference-time compression:

$$t_{\mathrm{exfil}} = \frac{\mathrm{size}(\theta)\,/\,\mathrm{CR}}{B}$$

where $\mathrm{size}(\theta)$ is the raw parameter footprint, $\mathrm{CR}$ the achieved compression ratio, and $B$ the bandwidth of the constrained channel. The attack pipeline includes aggressive block-wise quantization, multi-stage residual vector quantization, and customized codebook encoding. By relaxing decompression constraints (accepting high reconstruction cost and post-hoc fine-tuning), attackers achieve compression ratios far beyond inference-grade schemes, reducing exfiltration durations from months to days. Model-level and system-level defenses (covariance penalties, rotation symmetries, watermarks, rate limits) vary in efficacy, with forensic watermarking being notably robust (Brown et al., 3 Jan 2026).
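The underlying bandwidth arithmetic is simple to sketch; the parameter count, precision, compression ratio, and channel rate below are illustrative assumptions, not figures from the paper:

```python
def exfiltration_days(n_params, bytes_per_param, compression_ratio, bandwidth_bps):
    # Exfiltration time = compressed model size / channel bandwidth.
    bits = n_params * bytes_per_param * 8 / compression_ratio
    return bits / bandwidth_bps / 86_400  # 86,400 seconds per day

# A 70B-parameter fp16 model over a slow covert channel: raising the
# compression ratio shrinks months of transfer into days.
slow = exfiltration_days(70e9, 2, 1, 50_000)
fast = exfiltration_days(70e9, 2, 80, 50_000)
```

Because time scales inversely with the compression ratio, every extra factor of compression buys a proportional cut in exposure on the wire.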
| CAGE Domain | Core Mechanism | Target Vulnerability |
|---|---|---|
| LVLM adversarial (EFD) | Distortion on likely-surviving tokens | Token compression ranking |
| Backdoor/poison | Lossy compression as trigger | Input data path |
| Exfiltration | Aggressive model compression | Weight storage/transfer |
5. Empirical Findings and Effectiveness
The effectiveness of CAGE attacks is consistently demonstrated across settings:
- LVLM Token Compression: CAGE lowers robust accuracy by as much as 12.2 percentage points relative to encoder-based baselines at aggressive budgets (e.g., retaining 64 tokens on TextVQA), exposing vulnerabilities that standard attacks mask (Zhang et al., 29 Jan 2026).
- Backdoor Success Rates: On standard datasets such as CIFAR-10, CAGE backdoors using natural compression triggers maintain both high clean accuracy and near-maximum ASR on compressed (e.g., JPEG) triggers, even at low poisoning rates, outperforming traditional patch-based triggers under compression (Yang et al., 2023, Xue et al., 2022).
- Model Exfiltration: CAGE pipelines achieve compression ratios far beyond inference-grade schemes with a smooth rate-distortion curve, cutting exfiltration time from 11 months to under 4 days for 70B+ parameter models. Forensic watermarks survive adversarial fine-tuning, providing avenues for post hoc detection (Brown et al., 3 Jan 2026).
6. Defense Strategies and Limitations
Multiple defenses have been evaluated against CAGE attacks in each context:
- LVLMs: Randomization (stochastic selection pool), robustness-aware scoring, and attention-mass detectors marginally improve robustness but cannot eliminate the fundamental ranking-prune vulnerability; detection AUC is near random on adversarial inputs (Zhang et al., 17 Jan 2026, Zhang et al., 29 Jan 2026).
- Backdoor Triggers: STRIP, fine-pruning, Neural Cleanse, and Grad-CAM all fail to reliably detect CAGE, as the backdoor (compression) artifacts are both globally distributed and visually indistinguishable from benign compression noise (Yang et al., 2023).
- Model Exfiltration: Covariance penalties and weight-space symmetries offer moderate compression increases; watermarks are more robust post-hoc, but no current defense eliminates the leakage pathway given a determined, compression-aligned adversary (Brown et al., 3 Jan 2026).
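For intuition, the stochastic selection-pool defense for LVLMs can be sketched as randomized top-k; the pool size and sampling scheme are assumptions:

```python
import random

def stochastic_topk(scores, k, pool_factor=2, seed=None):
    # Sample the k retained tokens from a larger high-score pool, so an
    # attacker cannot fully predict which tokens survive pruning.
    rng = random.Random(seed)
    pool = sorted(range(len(scores)), key=lambda i: -scores[i])[:k * pool_factor]
    return sorted(rng.sample(pool, k))
```

Randomization of this kind raises the attacker's cost, but as the evaluations above note, it does not remove the underlying ranking-prune vulnerability.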
Empirically and theoretically, the efficiency-robustness-security trade-off pervades these settings: integrating adversarial robustness at the compressor design stage or within adversarial training appears necessary for future defenses (Zhang et al., 29 Jan 2026).
7. Broader Implications and Research Directions
The emergence of CAGE attacks establishes a critical paradigm for evaluating robustness and privacy in ML pipelines involving any form of lossy transformation—whether for efficiency (token pruning, quantization) or for data storage/communication (compression codecs, archival). The unifying principle is that adversaries can adapt attack strategies (perturbations, poisoning triggers, or bit encodings) to the deployed compression, amplifying their effectiveness and stealth.
This suggests that any security or robustness benchmark which fails to include compression-aligned adversaries may substantially overestimate a system’s safety. Future research must explore mechanism co-design—building compressors, model architectures, and training procedures with provable resistance to CAGE-style attacks, and develop robust detection or attribution when alignment is feasible (Zhang et al., 29 Jan 2026, Zhang et al., 17 Jan 2026, Xue et al., 2022, Yang et al., 2023, Brown et al., 3 Jan 2026).