SCMA: Self-Compression via MARL

Updated 5 February 2026
  • The paper introduces SCMA, a paradigm that uses multi-agent reinforcement learning to compress verbose chain-of-thought outputs without losing logical integrity.
  • SCMA employs cooperative segmentation and chunk-level importance estimation to isolate and preserve critical reasoning steps while pruning redundancy.
  • Empirical results indicate that SCMA significantly reduces response lengths and maintains or improves accuracy compared to traditional reinforcement learning compression methods.

Self-Compression via Multi-Agent Reinforcement Learning (SCMA) is a paradigm designed to address the inference inefficiency of Large Reasoning Models (LRMs) by selectively compressing their Chain-of-Thought (CoT) outputs without compromising accuracy. The framework employs a multi-agent reinforcement learning (MARL) approach, incorporating cooperative segmentation and chunk-level importance estimation to drive concise yet logically complete reasoning. SCMA achieves substantial reductions in CoT length and improves accuracy relative to traditional reinforcement learning (RL) methods that use undifferentiated length penalties (Chen et al., 29 Jan 2026).

1. Motivating Redundancy-Control in Chain-of-Thought Reasoning

LRMs often generate excessively verbose CoT traces, including meta-statements and redundant verification, which inflate inference latency and degrade the interactive experience. Standard RL-based compression methods impose a uniform length penalty, formalized as

$R(y|x) = R_{acc}(y|x) - \lambda \cdot f(|y|)$

where $R_{acc}$ is the outcome-based reward and $f(\cdot)$ maps CoT length to a penalty. This approach struggles to balance brevity and correctness, often excising steps vital to logical integrity in pursuit of shorter responses. The core insight underlying SCMA is the need to (a) decompose the CoT into discrete logical chunks, and (b) target penalties exclusively at redundant, low-importance chunks, thereby conserving essential reasoning logic (Chen et al., 29 Jan 2026).
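To make the failure mode concrete, here is a minimal sketch of the uniform penalty above. The linear choice of $f$ and the value of $\lambda$ are illustrative assumptions; the paper leaves $f(\cdot)$ generic.

```python
# Uniform length penalty: R(y|x) = R_acc(y|x) - lambda * f(|y|).
# Assumption: f(|y|) = |y| (linear in token count); lambda is illustrative.

def uniform_length_reward(r_acc: float, n_tokens: int,
                          lam: float = 1e-4) -> float:
    """Outcome reward minus an undifferentiated length penalty."""
    return r_acc - lam * float(n_tokens)

# Two correct traces: a verbose one (2000 tokens) and a concise one (500).
long_r = uniform_length_reward(1.0, 2000)   # 1.0 - 0.2  = 0.80
short_r = uniform_length_reward(1.0, 500)   # 1.0 - 0.05 = 0.95
```

Because every token is penalized identically, the optimizer is pushed to shorten indiscriminately, including tokens that carry essential reasoning steps.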

2. Multi-Agent Decomposition: Architecture of SCMA

SCMA is implemented as a cooperative MARL system with three functional agents, all sharing the same base LLM parameters $\theta_{base}$ but prompted distinctly:

  • Reasoning Agent ($\pi_{reason}$): Receives the problem prompt $x$ and a "think step by step" directive, generating the full CoT $y = (y_1, \ldots, y_T)$. This agent seeks to maximize a global reward balancing accuracy and brevity at a chunk-level granularity.
  • Segmentation Agent ($\pi_{seg}$): Takes $y$ and applies segmentation prompts, marking chunk boundaries with $\langle seg\rangle \ldots \langle/seg\rangle$. The objective is to parse $y$ into $S = \{s_1, \ldots, s_n\}$, capturing minimal, semantically integral units for subsequent evaluation.
  • Scoring Agent ($\pi_{score}$): Consumes the chunk sequence $S$ and, via distinct scoring prompts, labels each chunk $s_i$ with an importance score $w_i \in \{1, \ldots, 5\}$, indicating its necessity for correct answer derivation. High-importance chunks are insulated from length penalties, preserving interpretive fidelity.

At inference time, only $\pi_{reason}$ is deployed, ensuring that SCMA does not add runtime overhead, unlike agent ensembles requiring downstream evaluation (Chen et al., 29 Jan 2026).
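The three-role decomposition can be sketched as follows. This is a minimal illustration, not the paper's implementation: `base_model` stands in for the shared LLM with parameters $\theta_{base}$, and all prompt wordings are invented for the sketch.

```python
# Three agents sharing one base model, distinguished only by their prompts.
# Assumptions: `base_model` is a placeholder for the shared LLM call; the
# prompt texts below are illustrative, not the paper's exact templates.

def base_model(prompt: str) -> str:
    # Placeholder for a call to the shared LLM (parameters theta_base).
    return f"<output for: {prompt.splitlines()[0]}>"

def reasoning_agent(x: str) -> str:
    return base_model("Solve the problem. Think step by step.\n" + x)

def segmentation_agent(y: str) -> str:
    return base_model("Wrap each minimal reasoning chunk in <seg>...</seg>.\n" + y)

def scoring_agent(chunks: list[str]) -> list[int]:
    # The real agent prompts the model for a 1-5 importance score per chunk;
    # a constant score keeps this sketch runnable.
    return [3 for _ in chunks]

# Only the reasoning agent runs at inference time, so deployment cost is
# unchanged; segmentation and scoring are training-time only.
answer = reasoning_agent("A farmer sells eggs at $2 per dozen ...")
```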

3. Importance-Weighted Reward Signal

SCMA leverages an importance-weighted length penalty, designed to minimize redundancy selectively:

  • Score Mapping: For chunk $s_i$ with length $L_i = |s_i|$, the importance-adjusted penalty uses $\varphi(w_i) = 5 - w_i$, such that $w_i = 5$ yields $\varphi = 0$ (no penalty).
  • Chunk Penalty: Aggregate penalty per sample is

$C(y) = \sum_{i=1}^{n} \varphi(w_i) \cdot L_i$

  • Normalization: The total is divided by $L_{norm} = \max_{y' \in C_{batch}} \sum_j |s'_j|$ over correct candidates in the training batch, constructing

$\ell(y) = C(y) / L_{norm}$

  • Total Reward: The net reward is

$R_{total}(y|x) = R_{acc}(y|x) - \lambda \sum_{i=1}^{n} (5 - w_i)\, L_i / L_{norm}$

ensuring that brevity incentives are adaptively relaxed on high-importance content.
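The importance-weighted reward above can be computed directly from per-chunk lengths and scores. The value of $\lambda$ and all numbers below are illustrative assumptions:

```python
# Importance-weighted reward: R_total = R_acc - lambda * C(y) / L_norm,
# where C(y) = sum_i (5 - w_i) * L_i. Inputs: per-chunk token lengths L_i,
# importance scores w_i in {1,...,5}, and L_norm from the batch.
# Assumption: lambda = 0.5 is illustrative.

def scma_reward(r_acc: float,
                chunk_lengths: list[int],
                scores: list[int],
                l_norm: int,
                lam: float = 0.5) -> float:
    # phi(w) = 5 - w: a score of 5 fully shields a chunk from the penalty.
    c = sum((5 - w) * l for w, l in zip(scores, chunk_lengths))
    return r_acc - lam * c / l_norm

# Two correct traces of equal total length; the one whose tokens sit in
# high-importance chunks is penalized less.
r_high = scma_reward(1.0, [100, 100], [5, 5], l_norm=200)   # penalty 0
r_low = scma_reward(1.0, [100, 100], [1, 1], l_norm=200)    # penalty 0.5*800/200
```

Identical lengths thus receive very different penalties depending on where the importance mass sits, which is exactly the selectivity a uniform penalty lacks.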

This contrasts with single-agent length penalties, which indiscriminately discourage all steps equally, predisposing models to omit crucial inferential links (Chen et al., 29 Jan 2026).

4. Cooperative Policy Optimization

Training is formulated as a Markov Game with shared parameters $\theta$ among $\{\pi_{reason}, \pi_{seg}, \pi_{score}\}$ and optimized using Group Relative Policy Optimization (GRPO):

  1. Sampling: For each input $x$ in the batch, $G$ collaborative trajectories are rolled out:
    • $y^{(k)} \sim \pi_{reason}(\cdot \mid x; \theta)$
    • $S^{(k)} \sim \pi_{seg}(\cdot \mid y^{(k)}; \theta)$
    • $w^{(k)} \sim \pi_{score}(\cdot \mid S^{(k)}; \theta)$
    • Compute $R^{(k)} = R_{acc} - \alpha \cdot \ell(y^{(k)})$
  2. Advantage Estimation: Compute group-normalized advantages $A_k$ for each token, aligning policy gradients with the collective global reward.
  3. Parameter Update: Each agent's policy is updated in a clipped PPO-inspired fashion:

$\theta \leftarrow \theta + \eta \nabla_\theta \, \mathbb{E}\big[\min\big(r \cdot A,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) \cdot A\big) - \beta \cdot \mathrm{KL}(\pi_{old} \,\|\, \pi_{ref})\big]$

with $\eta$ the step size, $\epsilon$ the clipping threshold, and $\beta$ the KL penalty coefficient (Chen et al., 29 Jan 2026).
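The group-normalized advantage and the clipped surrogate can be sketched at the level of scalar rewards. This is a simplified illustration of the GRPO mechanics, omitting the KL term and token-level bookkeeping; all numeric values are invented:

```python
# GRPO-style building blocks: group-relative advantages over G rollouts,
# and the PPO-style clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A).
# Assumptions: scalar rewards per rollout; KL penalty omitted for brevity.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward by the group mean and std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

def clipped_surrogate(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Clipped policy-gradient objective for one token."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

# One group of G = 4 cooperative rollouts with illustrative rewards:
advs = group_advantages([1.0, 0.7, 0.2, -0.1])
# For a positive advantage, an importance ratio of 1.3 is clipped to 1.2:
obj = clipped_surrogate(1.3, advs[0])
```

Because all three agents share $\theta$, a single update computed from these advantages improves segmentation, scoring, and reasoning jointly.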

Through this cooperative process, the segmentation and scoring subpolicies evolve to reliably delineate and evaluate logical structure, enabling the Reasoning Agent to output compressed, high-density reasoning with minimal risk of logic loss.

5. Empirical Evaluation and Ablation

SCMA has been benchmarked across multiple model and dataset scales:

  • Models: DeepSeek-R1-Distill-Qwen (1.5B, 7B), Qwen3 (4B, 8B)
  • Datasets: GSM8K, MATH500, AMC23 (2023), AIME24/25

Performance Metrics

  • Answer Accuracy (%)
  • Average CoT Length (tokens)
Method         Accuracy (%)   Avg. Length (tokens)   Δ Accuracy   Δ Length (%)
Vanilla        94.49          6,459                  —            —
GRPO           94.92          5,877                  +0.43        –9.0
LC-R1_LP       93.70          4,643                  –0.79        –28.1
RL+LP          94.01          4,436                  –0.48        –31.3
SCMA (ours)    94.16          3,944                  +8.70        –39.0

Across all scales, SCMA achieves an 11.1%–39.0% reduction in response length and accuracy gains of 4.33%–10.02%, outperforming both vanilla RL and single-agent length-penalty baselines in balancing brevity with correctness (Chen et al., 29 Jan 2026).

Ablation experiments reveal that freezing the Segmentation and Scoring agents (i.e., using off-the-shelf Qwen or GPT-4o APIs) while training only $\pi_{reason}$ causes a drop in overall accuracy (by 0.6–3.0 points) and an increase in CoT length (by 3–12%), indicating the necessity of joint MARL optimization.

6. Training Dynamics and Emergent Behaviors

SCMA induces several emergent properties during training:

  • Semantic Concentration: Average chunk count decreases while mean importance scores $w_i$ rise, demonstrating that the Reasoning and Scoring agents co-evolve to distill logic into fewer, more substantial units.
  • Content-Adaptive Segmentation: Standard deviation in chunk lengths increases, indicating a shift from uniform chunking to semantically-driven partitions; complex computations produce longer segments, with simple transitions remaining concise.
  • Convergence Stability: Unlike single-agent RL+LP, which can collapse into “No-Think” minimalism by over-suppressing reasoning and thereby harming accuracy, SCMA’s chunk-wise modulation relaxes compression incentives once redundancy is eliminated, yielding stable policy convergence (Chen et al., 29 Jan 2026).
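The diagnostics behind these observations are simple per-trace statistics: chunk count, mean importance, and the spread of chunk lengths. A small sketch with invented example data:

```python
# Training-dynamics diagnostics: chunk count (semantic concentration),
# mean importance score, and chunk-length std (content-adaptive segmentation).
# Assumption: all example data below is invented for illustration.
import statistics

def trace_stats(chunk_lengths: list[int], scores: list[int]) -> dict:
    return {
        "n_chunks": len(chunk_lengths),
        "mean_importance": statistics.mean(scores),
        "length_std": statistics.pstdev(chunk_lengths),
    }

early = trace_stats([50, 52, 49, 51, 50, 48], [2, 3, 2, 3, 2, 3])  # uniform chunks
late = trace_stats([120, 15, 90, 20], [5, 4, 5, 4])                # adaptive chunks

# Fewer chunks, higher mean importance, larger length spread after training:
assert late["n_chunks"] < early["n_chunks"]
assert late["mean_importance"] > early["mean_importance"]
assert late["length_std"] > early["length_std"]
```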

A qualitative study demonstrates that SCMA-generated CoT traces present only high-density logical steps—e.g., in an arithmetic problem: calculating total eggs used, updating counts, and computing revenue—eliding meta-statements and repeated verification common in GRPO baselines.

7. Theoretical and Practical Implications

SCMA operationalizes fine-grained, self-compression of reasoning traces in LRMs by integrating cooperative chunk segmentation and importance estimation into the RL training workflow. By ensuring that reasoning compression is semantically informed and adaptive, SCMA enables LRMs to attain low-latency, high-fidelity inference without any increase in deployment overhead. This approach currently demonstrates superior cost–performance tradeoffs and highlights the potential for further MARL-driven advances in efficient, robust reasoning architectures (Chen et al., 29 Jan 2026).
