SCMA: Self-Compression via MARL

Updated 5 February 2026
  • The paper introduces SCMA, a paradigm that uses multi-agent reinforcement learning to compress verbose chain-of-thought outputs without losing logical integrity.
  • SCMA employs cooperative segmentation and chunk-level importance estimation to isolate and preserve critical reasoning steps while pruning redundancy.
  • Empirical results indicate that SCMA significantly reduces response lengths and maintains or improves accuracy compared to traditional reinforcement learning compression methods.

Self-Compression via Multi-Agent Reinforcement Learning (SCMA) is a paradigm designed to address the inference inefficiency of Large Reasoning Models (LRMs) by selectively compressing their Chain-of-Thought (CoT) outputs without compromising accuracy. The framework employs a multi-agent reinforcement learning (MARL) approach, incorporating cooperative segmentation and chunk-level importance estimation to drive concise yet logically complete reasoning. SCMA achieves substantial reductions in CoT length and improves accuracy relative to traditional reinforcement learning (RL) methods that use undifferentiated length penalties (Chen et al., 29 Jan 2026).

1. Motivating Redundancy-Control in Chain-of-Thought Reasoning

LRMs often generate excessively verbose CoT traces, including meta-statements and redundant verification, which inflate inference latency and degrade the interactive experience. Standard RL-based compression methods impose a uniform length penalty, formalized as

$R(y|x) = R_{acc}(y|x) - \lambda \cdot f(|y|)$

where $R_{acc}$ is the outcome-based reward and $f(\cdot)$ maps CoT length to a penalty. This approach struggles to balance brevity and correctness, often excising steps vital to logical integrity in pursuit of shorter responses. The core insight underlying SCMA is the need to (a) decompose the CoT into discrete logical chunks, and (b) target penalties exclusively at redundant, low-importance chunks, thereby conserving essential reasoning logic (Chen et al., 29 Jan 2026).
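To make the failure mode concrete, here is a minimal sketch of the uniform penalty above. The linear choice of $f$ and the value of $\lambda$ are illustrative assumptions; the paper leaves $f(\cdot)$ generic.

```python
# Uniform length penalty: R(y|x) = R_acc(y|x) - lambda * f(|y|).
# Assumption: f(|y|) = |y| (linear in token count); lambda is illustrative.

def uniform_length_reward(r_acc: float, n_tokens: int,
                          lam: float = 1e-4) -> float:
    """Outcome reward minus an undifferentiated length penalty."""
    return r_acc - lam * float(n_tokens)

# Two correct traces: a verbose one (2000 tokens) and a concise one (500).
long_r = uniform_length_reward(1.0, 2000)   # 1.0 - 0.2  = 0.80
short_r = uniform_length_reward(1.0, 500)   # 1.0 - 0.05 = 0.95
```

Because every token is penalized identically, the optimizer is pushed to shorten indiscriminately, including tokens that carry essential reasoning steps.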

2. Multi-Agent Decomposition: Architecture of SCMA

SCMA is implemented as a cooperative MARL system with three functional agents, all sharing the same base LLM parameters $\theta_{base}$ but prompted distinctly:

  • Reasoning Agent ($\pi_{reason}$): Receives the problem prompt $x$ and a "think step by step" directive, generating the full CoT $y = (y_1, \ldots, y_T)$. This agent seeks to maximize a global reward balancing accuracy and brevity at a chunk-level granularity.
  • Segmentation Agent ($\pi_{seg}$): Takes $y$ and applies segmentation prompts, marking chunk boundaries with $\langle seg\rangle \ldots \langle/seg\rangle$. The objective is to parse $y$ into $S = \{s_1, \ldots, s_n\}$, capturing minimal, semantically integral units for subsequent evaluation.
  • Scoring Agent ($\pi_{score}$): Consumes the chunk sequence $S$ and, via distinct scoring prompts, labels each chunk $s_i$ with an importance score $w_i \in \{1, \ldots, 5\}$, indicating its necessity for correct answer derivation. High-importance chunks are insulated from length penalties, preserving interpretive fidelity.

At inference time, only $\pi_{reason}$ is deployed, ensuring that SCMA does not add runtime overhead, unlike agent ensembles requiring downstream evaluation (Chen et al., 29 Jan 2026).
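The three-role decomposition can be sketched as follows. This is a minimal illustration, not the paper's implementation: `base_model` stands in for the shared LLM with parameters $\theta_{base}$, and all prompt wordings are invented for the sketch.

```python
# Three agents sharing one base model, distinguished only by their prompts.
# Assumptions: `base_model` is a placeholder for the shared LLM call; the
# prompt texts below are illustrative, not the paper's exact templates.

def base_model(prompt: str) -> str:
    # Placeholder for a call to the shared LLM (parameters theta_base).
    return f"<output for: {prompt.splitlines()[0]}>"

def reasoning_agent(x: str) -> str:
    return base_model("Solve the problem. Think step by step.\n" + x)

def segmentation_agent(y: str) -> str:
    return base_model("Wrap each minimal reasoning chunk in <seg>...</seg>.\n" + y)

def scoring_agent(chunks: list[str]) -> list[int]:
    # The real agent prompts the model for a 1-5 importance score per chunk;
    # a constant score keeps this sketch runnable.
    return [3 for _ in chunks]

# Only the reasoning agent runs at inference time, so deployment cost is
# unchanged; segmentation and scoring are training-time only.
answer = reasoning_agent("A farmer sells eggs at $2 per dozen ...")
```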

3. Importance-Weighted Reward Signal

SCMA leverages an importance-weighted length penalty, designed to minimize redundancy selectively:

  • Score Mapping: For chunk $s_i$ with length $L_i = |s_i|$, the importance-adjusted penalty uses $\varphi(w_i) = 5 - w_i$, such that $w_i = 5$ yields $\varphi = 0$ (no penalty).
  • Chunk Penalty: Aggregate penalty per sample is

$C(y) = \sum_{i=1}^{n} \varphi(w_i) \cdot L_i$

  • Normalization: The total is divided by $L_{norm} = \max_{y' \in C_{batch}} \sum_j |s'_j|$ over correct candidates in the training batch, constructing

$\ell(y) = C(y) / L_{norm}$

  • Total Reward: The net reward is

$R_{total}(y|x) = R_{acc}(y|x) - \lambda \sum_{i=1}^{n} (5 - w_i)\, L_i / L_{norm}$

ensuring that brevity incentives are adaptively relaxed on high-importance content.
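The importance-weighted reward above can be computed directly from per-chunk lengths and scores. The value of $\lambda$ and all numbers below are illustrative assumptions:

```python
# Importance-weighted reward: R_total = R_acc - lambda * C(y) / L_norm,
# where C(y) = sum_i (5 - w_i) * L_i. Inputs: per-chunk token lengths L_i,
# importance scores w_i in {1,...,5}, and L_norm from the batch.
# Assumption: lambda = 0.5 is illustrative.

def scma_reward(r_acc: float,
                chunk_lengths: list[int],
                scores: list[int],
                l_norm: int,
                lam: float = 0.5) -> float:
    # phi(w) = 5 - w: a score of 5 fully shields a chunk from the penalty.
    c = sum((5 - w) * l for w, l in zip(scores, chunk_lengths))
    return r_acc - lam * c / l_norm

# Two correct traces of equal total length; the one whose tokens sit in
# high-importance chunks is penalized less.
r_high = scma_reward(1.0, [100, 100], [5, 5], l_norm=200)   # penalty 0
r_low = scma_reward(1.0, [100, 100], [1, 1], l_norm=200)    # penalty 0.5*800/200
```

Identical lengths thus receive very different penalties depending on where the importance mass sits, which is exactly the selectivity a uniform penalty lacks.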

This contrasts with single-agent length penalties, which indiscriminately discourage all steps equally, predisposing models to omit crucial inferential links (Chen et al., 29 Jan 2026).

4. Cooperative Policy Optimization

Training is formulated as a Markov Game with shared parameters $\theta$ among $\{\pi_{reason}, \pi_{seg}, \pi_{score}\}$ and optimized using Group Relative Policy Optimization (GRPO):

  1. Sampling: For each input $x$ in the batch, $G$ collaborative trajectories are rolled out:
    • $y^{(k)} \sim \pi_{reason}(\cdot \mid x; \theta)$
    • $S^{(k)} \sim \pi_{seg}(\cdot \mid y^{(k)}; \theta)$
    • $w^{(k)} \sim \pi_{score}(\cdot \mid S^{(k)}; \theta)$
    • Compute $R^{(k)} = R_{acc} - \alpha \cdot \ell(y^{(k)})$
  2. Advantage Estimation: Compute group-normalized advantages $A_k$ for each token, aligning policy gradients with the collective global reward.
  3. Parameter Update: Each agent's policy is updated in a clipped PPO-inspired fashion:

$\theta \leftarrow \theta + \eta \nabla_\theta \, \mathbb{E}\big[\min\big(r \cdot A,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) \cdot A\big) - \beta \cdot \mathrm{KL}(\pi_{old} \,\|\, \pi_{ref})\big]$

with $\eta$ the step size, $\epsilon$ the clipping threshold, and $\beta$ the KL penalty coefficient (Chen et al., 29 Jan 2026).
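The group-normalized advantage and the clipped surrogate can be sketched at the level of scalar rewards. This is a simplified illustration of the GRPO mechanics, omitting the KL term and token-level bookkeeping; all numeric values are invented:

```python
# GRPO-style building blocks: group-relative advantages over G rollouts,
# and the PPO-style clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A).
# Assumptions: scalar rewards per rollout; KL penalty omitted for brevity.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward by the group mean and std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

def clipped_surrogate(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Clipped policy-gradient objective for one token."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

# One group of G = 4 cooperative rollouts with illustrative rewards:
advs = group_advantages([1.0, 0.7, 0.2, -0.1])
# For a positive advantage, an importance ratio of 1.3 is clipped to 1.2:
obj = clipped_surrogate(1.3, advs[0])
```

Because all three agents share $\theta$, a single update computed from these advantages improves segmentation, scoring, and reasoning jointly.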

Through this cooperative process, the segmentation and scoring subpolicies evolve to reliably delineate and evaluate logical structure, enabling the Reasoning Agent to output compressed, high-density reasoning with minimal risk of logic loss.

5. Empirical Evaluation and Ablation

SCMA has been benchmarked across multiple model and dataset scales:

  • Models: DeepSeek-R1-Distill-Qwen (1.5B, 7B), Qwen3 (4B, 8B)
  • Datasets: GSM8K, MATH500, AMC23 (2023), AIME24/25

Performance Metrics

  • Answer Accuracy (%)
  • Average CoT Length (tokens)
Method         Accuracy (%)   Avg. Length (tokens)   Δ Accuracy   Δ Length (%)
Vanilla        94.49          6,459                  —            —
GRPO           94.92          5,877                  +0.43        –9.0
LC-R1_LP       93.70          4,643                  –0.79        –28.1
RL+LP          94.01          4,436                  –0.48        –31.3
SCMA (ours)    94.16          3,944                  +8.70        –39.0

Across all scales, SCMA achieves an 11.1%–39.0% reduction in response length and accuracy gains of 4.33%–10.02%, outperforming both vanilla RL and single-agent length-penalty baselines in balancing brevity with correctness (Chen et al., 29 Jan 2026).

Ablation experiments reveal that freezing the Segmentation and Scoring agents (i.e., using off-the-shelf Qwen or GPT-4o APIs) while training only $\pi_{reason}$ causes a drop in overall accuracy (by 0.6–3.0 points) and an increase in CoT length (by 3–12%), indicating the necessity of joint MARL optimization.

6. Training Dynamics and Emergent Behaviors

SCMA induces several emergent properties during training:

  • Semantic Concentration: Average chunk count decreases while mean importance scores $w_i$ rise, demonstrating that the Reasoning and Scoring agents co-evolve to distill logic into fewer, more substantial units.
  • Content-Adaptive Segmentation: Standard deviation in chunk lengths increases, indicating a shift from uniform chunking to semantically-driven partitions; complex computations produce longer segments, with simple transitions remaining concise.
  • Convergence Stability: Unlike single-agent RL+LP, which can collapse into “No-Think” minimalism by over-suppressing reasoning and thereby harming accuracy, SCMA’s chunk-wise modulation relaxes compression incentives once redundancy is eliminated, yielding stable policy convergence (Chen et al., 29 Jan 2026).
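The diagnostics behind these observations are simple per-trace statistics: chunk count, mean importance, and the spread of chunk lengths. A small sketch with invented example data:

```python
# Training-dynamics diagnostics: chunk count (semantic concentration),
# mean importance score, and chunk-length std (content-adaptive segmentation).
# Assumption: all example data below is invented for illustration.
import statistics

def trace_stats(chunk_lengths: list[int], scores: list[int]) -> dict:
    return {
        "n_chunks": len(chunk_lengths),
        "mean_importance": statistics.mean(scores),
        "length_std": statistics.pstdev(chunk_lengths),
    }

early = trace_stats([50, 52, 49, 51, 50, 48], [2, 3, 2, 3, 2, 3])  # uniform chunks
late = trace_stats([120, 15, 90, 20], [5, 4, 5, 4])                # adaptive chunks

# Fewer chunks, higher mean importance, larger length spread after training:
assert late["n_chunks"] < early["n_chunks"]
assert late["mean_importance"] > early["mean_importance"]
assert late["length_std"] > early["length_std"]
```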

A qualitative study demonstrates that SCMA-generated CoT traces present only high-density logical steps—e.g., in an arithmetic problem: calculating total eggs used, updating counts, and computing revenue—eliding meta-statements and repeated verification common in GRPO baselines.

7. Theoretical and Practical Implications

SCMA operationalizes fine-grained, self-compression of reasoning traces in LRMs by integrating cooperative chunk segmentation and importance estimation into the RL training workflow. By ensuring that reasoning compression is semantically informed and adaptive, SCMA enables LRMs to attain low-latency, high-fidelity inference without any increase in deployment overhead. This approach currently demonstrates superior cost–performance tradeoffs and highlights the potential for further MARL-driven advances in efficient, robust reasoning architectures (Chen et al., 29 Jan 2026).
