ArmoRM+MoE: Interpretable Reward Modeling
- The paper introduces ArmoRM+MoE, a two-stage framework that decouples reward modeling into interpretable multi-objective regression via linear probing and dynamic, prompt-conditioned MoE gating.
- It leverages a frozen Llama-3 8B backbone and a regression head to map prompt-response pairs to 19 human-interpretable scores with debiasing for verbosity.
- Empirical results on RewardBench demonstrate near state-of-the-art performance, offering transparent insights into objective weighting for safety, correctness, and more.
ArmoRM+MoE is a two-stage framework for interpretable, prompt-conditioned reward modeling in LLM alignment. This approach addresses the opacity of conventional black-box reward models (RMs) in RLHF by making sub-scores for human-interpretable objectives explicit and enabling dynamic weight adjustment per prompt context. The method achieves near-state-of-the-art performance on the RewardBench benchmark using only an 8B-parameter Llama-3 backbone, while providing granular insight into model preferences and choices (Wang et al., 2024).
1. Absolute-Rating Multi-Objective Reward Model Construction
ArmoRM is trained on multi-dimensional absolute ratings instead of traditional pairwise preferences. For each prompt-response tuple $(x, y)$, a vector of absolute scores $r \in \mathbb{R}^{k}$ is provided, with each component $r_i$ representing a human-interpretable objective such as honesty, correctness, verbosity, safety, instruction-following, or code-readability. The datasets incorporated for label acquisition include HelpSteer, UltraFeedback, BeaverTails, CodeUltraFeedback, Prometheus, Argilla-Capybara, and Argilla-Math, among others. Collectively, ArmoRM sees $k = 19$ objectives across approximately 600,000 absolute-rating examples, where each task has its own rubric.
For model architecture, a frozen Llama-3 8B transformer is used as a feature extractor $f$. The concatenated prompt and response pass through $f$, producing the final-token hidden state $f(x, y) \in \mathbb{R}^{d}$. A linear regression head $W \in \mathbb{R}^{k \times d}$ maps $f(x, y)$ to sub-score predictions $\hat{r} = W f(x, y)$. The regression head is trained via mean squared error only on the labels present for each example:

$$\mathcal{L}_{\text{reg}} = \sum_{(x, y, r)} \sum_{i=1}^{k} m_i \left( \hat{r}_i - r_i \right)^2,$$

where $m_i \in \{0, 1\}$ indicates whether objective $i$ is labeled for that example. This design enables explicit supervision along each interpretable axis available per example.
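As a concrete sketch, the masked-MSE objective above is minimized per objective by an ordinary least-squares fit on the labeled subset. The snippet below illustrates this with NumPy, using random stand-ins for the precomputed backbone features and sparse labels (all sizes are illustrative, far smaller than the paper's $d = 4096$, $k = 19$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for precomputed backbone features and sparse absolute ratings.
n, d, k = 512, 32, 5
features = rng.normal(size=(n, d))       # f(x, y) per example
labels = rng.normal(size=(n, k))         # absolute ratings (where present)
mask = rng.random((n, k)) < 0.6          # True where objective i is labeled

# Fit each row of the regression head W independently, using only the
# examples where that objective has a label (the masked-MSE minimizer).
W = np.zeros((k, d))
for i in range(k):
    obs = mask[:, i]
    W[i], *_ = np.linalg.lstsq(features[obs], labels[obs, i], rcond=None)

preds = features @ W.T                   # (n, k) predicted sub-scores
```

Because the backbone is frozen and the head is linear, this stage reduces to exactly this kind of closed-form multi-output regression.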
2. Mixture-of-Experts Gating and Prompt-Specific Scalarization
To integrate the $k$-dimensional output of ArmoRM into a scalar reward suitable for ranking or PPO, a fixed linear aggregation is insufficiently flexible. Instead, a Mixture-of-Experts (MoE) gating network dynamically selects a convex combination over objectives based solely on the prompt feature $f(x)$. The gating network is a three-layer ReLU MLP with 1024 units per layer and a terminal softmax:

$$w = \operatorname{softmax}\bigl(\operatorname{MLP}(f(x))\bigr),$$

with $w \in \Delta^{k-1}$, i.e. $w_i \ge 0$ and $\sum_{i=1}^{k} w_i = 1$.
Substantial attention is given to verbosity debiasing, as many objectives correlate strongly with response length. For each target objective $i$, the linear correlation with verbosity is removed:

$$r_i' = r_i - \lambda_i \, r_{\text{verbose}},$$

with $\lambda_i$ chosen such that the debiased objectives are uncorrelated with verbosity on a held-out UltraFeedback reference set.
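A minimal sketch of this de-correlation, assuming $\lambda_i$ is taken as the least-squares coefficient $\operatorname{Cov}(r_i, r_{\text{verbose}}) / \operatorname{Var}(r_{\text{verbose}})$ (which zeroes the linear correlation by construction), on a synthetic stand-in for the reference set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy reference set: 3 target objectives plus a verbosity sub-score v.
n = 2000
v = rng.normal(size=n)                   # verbosity sub-score
r = rng.normal(size=(n, 3))              # other objectives
r[:, 0] += 0.8 * v                       # objective 0 leaks verbosity

# lambda_i = Cov(r_i, v) / Var(v) removes the linear correlation with v.
lam = ((r * v[:, None]).mean(axis=0) - r.mean(axis=0) * v.mean()) / v.var()
r_debiased = r - lam[None, :] * v[:, None]

# Residual correlation of the debiased objective with verbosity is ~0.
corr = np.corrcoef(r_debiased[:, 0], v)[0, 1]
```

The recovered `lam[0]` is close to the injected leakage coefficient 0.8, and the residual correlation is zero up to floating-point error.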
The scalar reward then becomes

$$R(x, y) = w^{\top} r',$$

where $r' \in \mathbb{R}^{k}$ contains the debiased sub-scores.
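The gating MLP and the scalarization can be sketched as follows; the MLP weights are random stand-ins (untrained), and the layer sizes follow the description above except for an illustrative feature dimension:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes: d-dim prompt feature, k objectives, 1024-unit layers.
d, k, h = 64, 19, 1024
params = [
    (rng.normal(scale=0.02, size=(d, h)), np.zeros(h)),
    (rng.normal(scale=0.02, size=(h, h)), np.zeros(h)),
    (rng.normal(scale=0.02, size=(h, k)), np.zeros(k)),
]

def gating(prompt_feat):
    """Three-layer ReLU MLP with a terminal softmax -> simplex weights w."""
    a = prompt_feat
    for i, (W, b) in enumerate(params):
        a = a @ W + b
        if i < len(params) - 1:
            a = np.maximum(a, 0.0)          # ReLU on hidden layers only
    return softmax(a)

prompt_feat = rng.normal(size=d)            # f(x): prompt-only feature
r_debiased = rng.normal(size=k)             # debiased sub-scores for (x, y)

w = gating(prompt_feat)
reward = float(w @ r_debiased)              # scalar reward R = w^T r'
```

The softmax guarantees the convex-combination property ($w_i \ge 0$, $\sum_i w_i = 1$) regardless of the MLP parameters.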
3. Two-Stage Training Protocol
Training proceeds in two decoupled phases:
- ArmoRM linear probing: The Llama-3 8B backbone is frozen and its features are precomputed. The regression head $W$ is then fit with a multi-output least-squares solver (e.g., the scikit-learn CPU backend).
- MoE gating optimization: $f$ and $W$ remain fixed. The gating network is trained with a Bradley-Terry loss on pairwise comparisons from ten preference datasets:

  $$\mathcal{L}_{\text{BT}} = -\log \sigma\bigl(\beta \, (R(x, y_{\text{chosen}}) - R(x, y_{\text{rejected}}))\bigr),$$

  with the scalar temperature $\beta$ initialized to 100. Training runs for 10,000 steps on an A6000 GPU using AdamW (batch size 1024, cosine learning-rate decay).
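A sketch of the Bradley-Terry objective on scalar rewards, with `beta` playing the role of the temperature above; the numerically stable `logaddexp` form of $-\log \sigma(\cdot)$ is an implementation choice here, not taken from the paper:

```python
import numpy as np

def bt_loss(reward_chosen, reward_rejected, beta=100.0):
    """Bradley-Terry negative log-likelihood over a batch of preference pairs.

    beta is the scalar temperature applied to the reward margin.
    """
    margin = beta * (np.asarray(reward_chosen) - np.asarray(reward_rejected))
    # -log sigmoid(margin) == log(1 + exp(-margin)), computed stably.
    return np.mean(np.logaddexp(0.0, -margin))

# Sanity check: correctly ordered pairs yield a much smaller loss.
loss_good = bt_loss([0.9, 0.8], [0.1, 0.2])
loss_bad = bt_loss([0.1, 0.2], [0.9, 0.8])
```

During gating training, gradients of this loss flow only into the MLP parameters, since the sub-scores $r'$ come from the frozen regression head.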
4. Empirical Performance on RewardBench
RewardBench evaluates reward models on their ability to rank preferred over rejected responses in five categories: Chat, Chat-Hard, Safety, Reasoning (each with weight 1.0), and Prior-Sets (weight 0.5). Key comparative results:
| Model | RewardBench Weighted Accuracy (%) |
|---|---|
| Nemotron-4 340B (HelpSteer2 RM) | 89.3 |
| ArmoRM + MoE (Llama-3 8B) | 89.0 |
| HelpSteer2 RM on Llama-3 70B | 86.3 |
| Bradley-Terry RM on Llama-3 8B (backbone) | 83.6 |
| LLM-as-a-judge (GPT-4 Turbo) | 84.2 |
| LLM-as-a-judge (GPT-4o) | 83.3 |
ArmoRM+MoE matches the 340B baseline on Safety and Prior-Sets and leads on the Chat and Reasoning sub-tasks. Notably, the 8B ArmoRM+MoE substantially outperforms the LLM-as-a-judge paradigm as implemented with GPT-4 judges.
5. Interpretability and Practical Auditing
The MoE gating network outputs a per-prompt vector $w$ revealing the weight assigned to each human-centric objective. This design enables direct auditing and interpretability:
- On safety-critical prompts, $w$ assigns 70% or more of its mass to "is-safe," with minimal allocation to verbosity.
- For mathematical questions, the mass shifts to correctness, truthfulness, and instruction-following.
Practitioners can inspect $w$ to diagnose unintended emphasis (such as over-weighting verbosity) and optionally steer model behavior by clamping specific weights to zero. This provides a direct mechanism to check alignment, investigate model errors, and mitigate reward hacking.
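One way such clamping could look in practice, as a hypothetical helper (the objective ordering and names are illustrative, not the paper's):

```python
import numpy as np

def clamp_and_renormalize(w, clamp_idx):
    """Zero out selected objective weights, then renormalize to the simplex."""
    w = np.array(w, dtype=float)
    w[clamp_idx] = 0.0
    total = w.sum()
    if total == 0.0:
        raise ValueError("all weights clamped")
    return w / total

# Hypothetical gating output over (helpfulness, correctness, safety, verbosity).
w = np.array([0.3, 0.3, 0.2, 0.2])
w_audited = clamp_and_renormalize(w, clamp_idx=[3])   # suppress verbosity
```

Renormalizing after clamping keeps the result a valid convex combination, so the audited reward remains on the same scale as the original.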
6. Limitations and Open Questions
Several limitations and areas for future work are identified:
- Non-joint training: The regression head and gating network are learned sequentially; joint fine-tuning might improve overall alignment.
- Handling of missing labels: Many data points lack complete coverage of all objectives; current training ignores missing dimensions without imputation. A principled approach to missingness may enhance robustness and interpretability.
- Prompt-only gating: The gating network considers only the prompt, not the response. For some objectives (e.g., factuality), conditioning on the response may be beneficial.
- Static debiasing: Verbosity de-correlation uses fixed coefficients $\lambda_i$ and a fixed reference distribution; adaptivity to novel domains is not addressed.
- Human oversight effect: While the transparency of $w$ makes the model auditable, no user studies confirm improvement in human oversight or reduction of reward hacking.
A plausible implication is that future research incorporating joint fine-tuning, dynamic debiasing strategies, or response-aware gating may further close the performance gap to larger models and enhance trust in RM outputs.
7. Summary and Context
ArmoRM+MoE operationalizes interpretable multi-objective reward modeling by decoupling axes of human preference and introducing prompt-specific, auditable scalarization. The framework demonstrates that a relatively compact 8B model suffices for high reward modeling accuracy, rivaling massive baselines while supporting fine-grained scrutiny of alignment. These results position ArmoRM+MoE as a candidate methodology for transparent and trustworthy LLM reward modeling, encouraging further work on joint learning, adaptive debiasing, and human-centered evaluation (Wang et al., 2024).