
ArmoRM+MoE: Interpretable Reward Modeling

Updated 20 February 2026
  • The paper introduces ArmoRM+MoE, a two-stage framework that decouples reward modeling into interpretable multi-objective linear probing and dynamic MoE gating.
  • It leverages a frozen Llama-3 8B backbone and a regression head to map prompt-response pairs to 19 human-interpretable scores with debiasing for verbosity.
  • Empirical results on RewardBench demonstrate near state-of-the-art performance, offering transparent insights into objective weighting for safety, correctness, and more.

ArmoRM+MoE is a two-stage framework for interpretable, prompt-conditioned reward modeling in LLM alignment. This approach addresses the opacity of conventional black-box reward models (RMs) in RLHF by making sub-scores for human-interpretable objectives explicit and enabling dynamic weight adjustment per prompt context. The method achieves near-state-of-the-art performance on the RewardBench benchmark using only an 8B-parameter Llama-3 backbone, while providing granular insight into model preferences and choices (Wang et al., 2024).

1. Absolute-Rating Multi-Objective Reward Model Construction

ArmoRM is trained on multi-dimensional absolute ratings instead of traditional pairwise preferences. For each prompt-response tuple (x, y), a vector of absolute scores r = [r_1, ..., r_k]^\top is provided, with each r_i \in [0, 1] representing a human-interpretable objective such as honesty, correctness, verbosity, safety, instruction-following, or code-readability. The datasets used for label acquisition include HelpSteer, UltraFeedback, BeaverTails, CodeUltraFeedback, Prometheus, Argilla-Capybara, and Argilla-Math, among others. Collectively, ArmoRM sees 19 objectives across approximately 600,000 absolute-rating examples, where each task has its own rubric.

For model architecture, a frozen Llama-3 8B transformer is used as a feature extractor f_\theta. The concatenated prompt and response x \oplus y passes through f_\theta, producing the final-token hidden state h \in \mathbb{R}^d. A regression head W \in \mathbb{R}^{d \times k} maps h to sub-score predictions \hat{y} = W^\top h \in \mathbb{R}^k. The regression head W is trained via mean squared error, using only the labels present for each example:

\min_{W} \ \mathbb{E}_{(x, y, r)} \| W^\top f_\theta(x \oplus y) - r \|_2^2.

This design enables explicit supervision for each interpretable axis available per example.
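The masked least-squares step above can be sketched as follows. This is a minimal illustration on synthetic features, not the paper's code: the function name, shapes, and NaN-masking convention are assumptions, and the frozen-backbone features are stand-ins drawn at random.

```python
import numpy as np

def fit_regression_head(H, R):
    """Fit W (d x k) by least squares, one objective at a time, using only
    the examples whose label for that objective is present (non-NaN).
    H: (n, d) precomputed backbone features; R: (n, k) ratings with NaN gaps."""
    d, k = H.shape[1], R.shape[1]
    W = np.zeros((d, k))
    for i in range(k):
        mask = ~np.isnan(R[:, i])  # keep only rows labeled for objective i
        W[:, i], *_ = np.linalg.lstsq(H[mask], R[mask, i], rcond=None)
    return W

# Toy check: recover a planted linear map despite ~30% missing labels.
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 8))
W_true = rng.normal(size=(8, 3))
R = H @ W_true
R[rng.random(R.shape) < 0.3] = np.nan
W_hat = fit_regression_head(H, R)
```

Because each objective is fit independently on its own labeled subset, no imputation of missing dimensions is needed, matching the paper's "only on present labels" training rule.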

2. Mixture-of-Experts Gating and Prompt-Specific Scalarization

To integrate the k-dimensional output of ArmoRM into a scalar reward R suitable for ranking or PPO, a fixed linear aggregation is insufficiently flexible. Instead, a Mixture-of-Experts (MoE) gating network g_\phi dynamically selects a convex combination v(x) \in \Delta^{k-1} over objectives based solely on the prompt feature f_\theta(x). The gating network is a three-layer ReLU MLP with 1024 units per layer and a terminal softmax layer:

v(x) = \mathrm{softmax}\left( \mathrm{MLP}_\phi(f_\theta(x)) \right) \in \mathbb{R}^k

with \sum_i v_i = 1,\ v_i \geq 0.
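The gating computation can be sketched as a plain NumPy forward pass. The hidden width of 32 here stands in for the paper's 1024 units, and the random weights are purely illustrative; the point is the ReLU stack followed by a stable softmax that places the output on the simplex, and the final scalarization v(x)^\top r.

```python
import numpy as np

def gate(prompt_feat, params):
    """Three-layer ReLU MLP with a terminal softmax, mapping a prompt
    feature to a convex weight vector over the k objectives."""
    h = prompt_feat
    for Wl, bl in params[:-1]:
        h = np.maximum(h @ Wl + bl, 0.0)  # ReLU hidden layers
    Wk, bk = params[-1]
    logits = h @ Wk + bk
    z = np.exp(logits - logits.max())     # numerically stable softmax
    return z / z.sum()

rng = np.random.default_rng(1)
d, hidden, k = 16, 32, 19                 # 19 objectives, as in ArmoRM
params = [
    (rng.normal(size=(d, hidden)), np.zeros(hidden)),
    (rng.normal(size=(hidden, hidden)), np.zeros(hidden)),
    (rng.normal(size=(hidden, k)), np.zeros(k)),
]
v = gate(rng.normal(size=d), params)
R = v @ rng.uniform(size=k)               # scalar reward: v(x)^T r
```

Because the softmax guarantees nonnegative weights summing to one, the scalar reward is always a convex combination of the per-objective scores, which is what makes v(x) directly interpretable as objective weighting.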

Substantial attention is given to verbosity debiasing, as many objectives correlate strongly with response length. For each target objective, the linear correlation with verbosity is removed:

r'_i = r_i - \lambda_i \cdot r_\text{verbose}

choosing \lambda_i such that the debiased objectives are uncorrelated with verbosity on a held-out UltraFeedback reference set.

The scalar reward becomes

R(x, y) = v(x)^\top r'(x, y)

where r'(x, y) contains the debiased sub-scores.

3. Two-Stage Training Protocol

Training proceeds in two decoupled phases:

  1. ArmoRM linear probing: The 8B Llama-3 backbone is frozen and features h = f_\theta(x \oplus y) are precomputed. The regression head W is optimized using a multi-output least-squares regression solver (e.g., scikit-learn on CPU).
  2. MoE gating optimization: f_\theta and W remain fixed. The gating network g_\phi is trained with a Bradley-Terry loss using pairwise comparisons from ten preference datasets. The loss is given by

\min_{\phi, \beta} \ -\mathbb{E}\left[ \log\left( \frac{\exp(\beta R_\text{chosen})}{\exp(\beta R_\text{chosen}) + \exp(\beta R_\text{rejected})} \right) \right]

with the scalar temperature \beta initialized to 100. Training proceeds for 10,000 steps on an A6000 GPU using AdamW (learning rate 1 \times 10^{-3}, batch size 1024, cosine learning-rate decay).
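The Bradley-Terry objective above reduces to a logistic loss on the reward margin, which is how it would typically be computed in practice. A minimal numerically stable sketch (the toy reward values are arbitrary):

```python
import numpy as np

def bt_loss(r_chosen, r_rejected, beta):
    """Bradley-Terry negative log-likelihood over preference pairs:
    -E[log sigmoid(beta * (R_chosen - R_rejected))].
    logaddexp(0, -m) computes log(1 + e^{-m}) = -log sigmoid(m) stably,
    even for large margins like beta = 100."""
    margin = beta * (r_chosen - r_rejected)
    return np.mean(np.logaddexp(0.0, -margin))

# Toy pairs: the second pair is mis-ranked, dominating the loss.
rc = np.array([1.0, 0.9, 0.2])
rr = np.array([0.5, 1.1, 0.1])
loss = bt_loss(rc, rr, beta=100.0)
```

Writing the loss through logaddexp matters here: with \beta = 100 a naive exp-based implementation overflows on margins that are routine for well-separated pairs.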

4. Empirical Performance on RewardBench

RewardBench tests reward models on their ability to correctly rank preferred over rejected responses in five categories: Chat, Chat-Hard, Safety, Reasoning (each with weight 1.0), and Prior-Sets (weight 0.5). Key comparative results:

| Model | RewardBench Weighted Accuracy (%) |
| --- | --- |
| Nemotron-4 340B (HelpSteer2 RM) | 89.3 |
| ArmoRM + MoE (Llama-3 8B) | 89.0 |
| HelpSteer2 RM on Llama-3 70B | 86.3 |
| LLM-as-a-judge (GPT-4 Turbo) | 84.2 |
| Bradley-Terry RM on Llama-3 8B (backbone) | 83.6 |
| LLM-as-a-judge (GPT-4o) | 83.3 |

ArmoRM+MoE matches the 340B baseline on Safety and Prior-Sets and excels at the Chat and Reasoning sub-tasks. Notably, ArmoRM+MoE (8B) significantly outperforms the LLM-as-a-judge paradigm as implemented with GPT-4 judges.

5. Interpretability and Practical Auditing

The MoE gating network outputs a per-prompt vector v(x) revealing the weight assigned to each human-centric objective. This design enables direct auditing and interpretability:

  • On safety-critical prompts, v(x) assigns 70% or more weight to "is-safe," with minimal allocation to verbosity.
  • For mathematical questions, the mass shifts to correctness, truthfulness, and instruction-following.

Practitioners can inspect v(x) to diagnose unintended emphasis (such as over-weighting verbosity), and optionally steer model behavior by clamping specific weights to zero. This provides a direct mechanism to audit alignment, investigate model errors, and mitigate reward hacking.
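The clamping intervention described above is a simple operation on the gating output. A sketch, where the four-objective weight vector and its index layout are invented for illustration:

```python
import numpy as np

def clamp_and_renormalize(v, zero_idx):
    """Audit intervention: zero out selected objective weights
    (e.g., verbosity) and renormalize the rest back onto the simplex."""
    v = v.copy()
    v[list(zero_idx)] = 0.0
    return v / v.sum()

# Hypothetical gating output: [is-safe, correctness, helpfulness, verbosity]
v = np.array([0.70, 0.15, 0.10, 0.05])
v_audited = clamp_and_renormalize(v, zero_idx=[3])  # drop verbosity entirely
```

Renormalizing keeps the audited reward a convex combination of the remaining sub-scores, so the scalar reward R(x, y) stays on the same scale after the intervention.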

6. Limitations and Open Questions

Several limitations and areas for future work are identified:

  • Non-joint training: The regression head W and gating network g_\phi are learned sequentially; joint fine-tuning might improve overall alignment.
  • Handling of missing labels: Many data points lack complete coverage of all k objectives; current training ignores missing dimensions without imputation. A principled approach to missingness may enhance robustness and interpretability.
  • Prompt-only gating: The gating network considers only the prompt, not the response. For some objectives (e.g., factuality), conditioning on the response may be beneficial.
  • Static debiasing: Verbosity de-correlation uses fixed \lambda_i and a fixed reference distribution; adaptivity to novel domains is not addressed.
  • Human oversight effect: While the transparency of v(x) makes the model auditable, no user studies confirm improvement in human oversight or reduction of reward hacking.

A plausible implication is that future research incorporating joint fine-tuning, dynamic debiasing strategies, or response-aware gating may further close the performance gap to larger models and enhance trust in RM outputs.

7. Summary and Context

ArmoRM+MoE operationalizes interpretable multi-objective reward modeling by decoupling axes of human preference and introducing prompt-specific, auditable scalarization. The framework demonstrates that a relatively compact 8B model suffices for high reward modeling accuracy, rivaling massive baselines while supporting fine-grained scrutiny of alignment. These results position ArmoRM+MoE as a candidate methodology for transparent and trustworthy LLM reward modeling, encouraging further work on joint learning, adaptive debiasing, and human-centered evaluation (Wang et al., 2024).
