SCMA: Self-Compression via MARL
- The paper introduces SCMA, a paradigm that uses multi-agent reinforcement learning to compress verbose chain-of-thought outputs without losing logical integrity.
- SCMA employs cooperative segmentation and chunk-level importance estimation to isolate and preserve critical reasoning steps while pruning redundancy.
- Empirical results indicate that SCMA significantly reduces response lengths and maintains or improves accuracy compared to traditional reinforcement learning compression methods.
Self-Compression via Multi-Agent Reinforcement Learning (SCMA) is a paradigm designed to address the inference inefficiency of Large Reasoning Models (LRMs) by selectively compressing their Chain-of-Thought (CoT) outputs without compromising accuracy. The framework employs a multi-agent reinforcement learning (MARL) approach, incorporating cooperative segmentation and chunk-level importance estimation to drive concise yet logically complete reasoning. SCMA achieves substantial reductions in CoT length and improves accuracy relative to traditional reinforcement learning (RL) methods that use undifferentiated length penalties (Chen et al., 29 Jan 2026).
1. Motivating Redundancy-Control in Chain-of-Thought Reasoning
LRMs often generate excessively verbose CoT traces, including meta-statements and redundant verification, which inflate inference latency and degrade the interactive experience. Standard RL-based compression methods impose a uniform length penalty, formalized as

$$R(y) = R_{\text{acc}}(y) - \lambda \, f(|y|),$$

where $R_{\text{acc}}$ is the outcome-based reward and $f$ maps CoT length $|y|$ to a penalty. This approach struggles to balance brevity and correctness, often excising steps vital to logical integrity in pursuit of shorter responses. The core insight underlying SCMA is the need to (a) decompose the CoT into discrete logical chunks, and (b) target penalties exclusively at redundant, low-importance chunks, thereby conserving essential reasoning logic (Chen et al., 29 Jan 2026).
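This uniform penalty can be sketched in a few lines; the outcome reward, the penalty weight `lam`, and the linear choice of $f$ are illustrative assumptions, not the paper's exact instantiation:

```python
# Hypothetical sketch of the uniform length penalty that SCMA improves on.
# Every token is penalized equally, regardless of its importance to the answer.
def uniform_penalty_reward(correct: bool, cot_length: int, lam: float = 1e-4) -> float:
    """Outcome-based reward minus an undifferentiated length penalty lam * f(|y|),
    with f taken to be the identity map for illustration."""
    r_acc = 1.0 if correct else 0.0
    return r_acc - lam * cot_length
```

Because the penalty is blind to content, a correct trace that spends tokens on an essential derivation is punished exactly as much as one that spends them on redundant verification.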
2. Multi-Agent Decomposition: Architecture of SCMA
SCMA is implemented as a cooperative MARL system with three functional agents, all sharing the same base LLM parameterization but prompted distinctly:
- Reasoning Agent ($\pi_R$): Receives the problem prompt and a "think step by step" directive, generating the full CoT $y$. This agent seeks to maximize a global reward balancing accuracy and brevity at chunk-level granularity.
- Segmentation Agent ($\pi_S$): Takes $y$ and applies segmentation prompts, marking chunk boundaries with delimiter tokens. The objective is to parse $y$ into chunks $c_1, \dots, c_K$, capturing minimal, semantically integral units for subsequent evaluation.
- Scoring Agent ($\pi_C$): Consumes the chunk sequence and, via distinct scoring prompts, labels each chunk $c_i$ with an importance score $s_i$, indicating its necessity for correct answer derivation. High-importance chunks are insulated from length penalties, preserving interpretive fidelity.
At inference time, only the Reasoning Agent $\pi_R$ is deployed, ensuring that SCMA does not add runtime overhead, unlike agent ensembles requiring downstream evaluation (Chen et al., 29 Jan 2026).
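The three-role decomposition can be illustrated as three prompts over one shared model. The `llm` stub, the `<SEP>` delimiter, and the prompt wordings below are hypothetical placeholders standing in for the shared base LLM and the paper's actual prompts:

```python
# Minimal sketch of SCMA's decomposition: one shared model, three distinct prompts.
def llm(prompt: str) -> str:
    # Stub standing in for the shared base LLM; a real system samples from it.
    return "step1 <SEP> step2"

def reasoning_agent(problem: str) -> str:
    # pi_R: full chain-of-thought generation.
    return llm(f"{problem}\nThink step by step.")

def segmentation_agent(cot: str) -> list[str]:
    # pi_S: mark chunk boundaries, then parse the CoT into integral units.
    marked = llm(f"Insert <SEP> between logical chunks:\n{cot}")
    return [c.strip() for c in marked.split("<SEP>")]

def scoring_agent(chunks: list[str]) -> list[float]:
    # pi_C: label each chunk with an importance score in [0, 1] (stubbed here).
    return [1.0 for _ in chunks]
```

At deployment only `reasoning_agent` runs; the other two roles exist solely to shape the training-time reward.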
3. Importance-Weighted Reward Signal
SCMA leverages an importance-weighted length penalty, designed to minimize redundancy selectively:
- Score Mapping: For chunk $c_i$ with length $\ell_i$ and importance score $s_i$, the importance-adjusted penalty uses the weight $(1 - s_i)$, such that $s_i = 1$ yields a weight of $0$ (no penalty).
- Chunk Penalty: The aggregate penalty per sample is
$$P = \sum_{i} (1 - s_i)\, \ell_i.$$
- Normalization: The total $P$ is divided by the maximum penalty over correct candidates in the training batch, constructing a normalized penalty $\hat{P} \in [0, 1]$.
- Total Reward: The net reward is
$$R = R_{\text{acc}} - \lambda\, \hat{P},$$
ensuring that brevity incentives are adaptively relaxed on high-importance content.
This contrasts with single-agent length penalties, which indiscriminately discourage all steps equally, predisposing models to omit crucial inferential links (Chen et al., 29 Jan 2026).
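The importance-weighted reward above can be sketched as follows; the weight mapping $(1 - s_i)$ and max-normalization over correct candidates are plausible readings of the construction, and `lam` is an illustrative coefficient:

```python
# Sketch of SCMA's importance-weighted length penalty and total reward.
def chunk_penalty(lengths: list[int], scores: list[float]) -> float:
    """P = sum_i (1 - s_i) * l_i: chunks with s_i -> 1 incur no penalty."""
    return sum((1.0 - s) * l for l, s in zip(lengths, scores))

def scma_reward(correct: bool, lengths: list[int], scores: list[float],
                batch_penalties: list[float], lam: float = 0.5) -> float:
    """Outcome reward minus the normalized, importance-weighted penalty.
    batch_penalties: chunk penalties of the correct candidates in the batch,
    used to normalize P into [0, 1]."""
    p = chunk_penalty(lengths, scores)
    norm = max(batch_penalties) or 1.0  # guard against an all-zero batch
    r_acc = 1.0 if correct else 0.0
    return r_acc - lam * (p / norm)
```

A chunk scored $s_i = 1$ contributes nothing to the penalty, so the brevity pressure falls entirely on low-importance spans.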
4. Cooperative Policy Optimization
Training is formulated as a Markov Game with parameters shared among $\pi_R$, $\pi_S$, and $\pi_C$, optimized using Group Relative Policy Optimization (GRPO):
- Sampling: For each input $x$ in the batch, $G$ collaborative trajectories are rolled out, and the total reward $R_j$ is computed for each trajectory $j = 1, \dots, G$.
- Advantage Estimation: Group-normalized advantages $\hat{A}_j = (R_j - \mathrm{mean}(\{R_k\})) / \mathrm{std}(\{R_k\})$ are computed for each token, aligning policy gradients with the collective global reward.
- Parameter Update: Each agent's policy is updated with step size $\alpha$ in a clipped PPO-inspired fashion:
$$J(\theta) = \mathbb{E}\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$
with $r_t(\theta)$ the token-level probability ratio, $\epsilon$ the clipping threshold, and $\beta$ the KL penalty scaling (Chen et al., 29 Jan 2026).
Through this cooperative process, the segmentation and scoring subpolicies evolve to reliably delineate and evaluate logical structure, enabling the Reasoning Agent to output compressed, high-density reasoning with minimal risk of logic loss.
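The two numerical pieces of this update, group-normalized advantages and the clipped surrogate, can be sketched directly; this is a minimal illustration of standard GRPO/PPO machinery, not the paper's training code:

```python
import math

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO: normalize each trajectory's reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero in degenerate groups
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio: float, adv: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * adv, clipped * adv)
```

Because all three agents share parameters, a single gradient step through this objective simultaneously shapes the reasoning, segmentation, and scoring subpolicies.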
5. Empirical Evaluation and Ablation
SCMA has been benchmarked across multiple model and dataset scales:
- Models: DeepSeek-R1-Distill-Qwen (1.5B, 7B), Qwen3 (4B, 8B)
- Datasets: GSM8K, MATH500, AMC23 (2023), AIME24/25
Performance Metrics
- Answer Accuracy (%)
- Average CoT Length (tokens)
| Method | Accuracy (%) | Avg. Length (tokens) | Δ Accuracy (pts) | Δ Length (%) |
|---|---|---|---|---|
| Vanilla | 94.49 | 6,459 | — | — |
| GRPO | 94.92 | 5,877 | +0.43 | –9.0 |
| LC-R1_LP | 93.70 | 4,643 | –0.79 | –28.1 |
| RL+LP | 94.01 | 4,436 | –0.48 | –31.3 |
| SCMA (ours) | 94.16 | 3,944 | –0.33 | –39.0 |
Across all scales, SCMA achieves an 11.1%–39.0% reduction in response length and accuracy boosts of 4.33%–10.02%, outperforming both vanilla RL and single-agent length-penalty baselines in balancing brevity with correctness (Chen et al., 29 Jan 2026).
Ablation experiments reveal that freezing the Segmentation/Scoring agents (i.e., using off-the-shelf Qwen or GPT-4o APIs) while training only the Reasoning Agent causes a drop in overall accuracy (by 0.6–3.0 points) and an increase in CoT length (by 3–12%), indicating the necessity of joint MARL optimization.
6. Training Dynamics and Emergent Behaviors
SCMA induces several emergent properties during training:
- Semantic Concentration: Average chunk count decreases while mean importance scores rise, demonstrating that Reasoning and Scoring agents co-evolve to distill logic into fewer, more substantial units.
- Content-Adaptive Segmentation: Standard deviation in chunk lengths increases, indicating a shift from uniform chunking to semantically-driven partitions; complex computations produce longer segments, with simple transitions remaining concise.
- Convergence Stability: Unlike single-agent RL+LP, which can collapse into “No-Think” minimalism by over-suppressing reasoning and thereby harming accuracy, SCMA’s chunk-wise modulation relaxes compression incentives once redundancy is eliminated, yielding stable policy convergence (Chen et al., 29 Jan 2026).
A qualitative study demonstrates that SCMA-generated CoT traces present only high-density logical steps—e.g., in an arithmetic problem: calculating total eggs used, updating counts, and computing revenue—eliding meta-statements and repeated verification common in GRPO baselines.
7. Theoretical and Practical Implications
SCMA operationalizes fine-grained, self-compression of reasoning traces in LRMs by integrating cooperative chunk segmentation and importance estimation into the RL training workflow. By ensuring that reasoning compression is semantically informed and adaptive, SCMA enables LRMs to attain low-latency, high-fidelity inference without any increase in deployment overhead. This approach currently demonstrates superior cost–performance tradeoffs and highlights the potential for further MARL-driven advances in efficient, robust reasoning architectures (Chen et al., 29 Jan 2026).