
Omni-RRM: Multimodal Rubric Reward Model

Updated 7 February 2026
  • Omni-RRM is a multimodal framework that redefines reward modeling by generating structured, rubric-grounded rationales across text, image, video, and audio.
  • It leverages the automated Omni-Preference dataset and a two-stage training process (SFT and GRPO) to reliably improve model discrimination, especially on challenging pairs.
  • Evaluations reveal significant accuracy gains and enhanced transparency over conventional scalar rewards, enabling robust cross-modal alignment.

Omni-RRM (Omni-Modal Rubric-Grounded Reward Model) is an open-source framework for multimodal reward modeling that generates structured, interpretable, and modality-aware preference judgments across text, image, video, and audio. By replacing opaque scalar scores with multi-dimensional, rubric-grounded comparative rationales, Omni-RRM introduces an automated and scalable alternative to human annotation, advancing the state-of-the-art in alignment for multimodal LLMs (MLLMs) (Kong et al., 31 Jan 2026).

1. Motivation and Conceptual Foundation

Contemporary MLLMs exhibit strong generative capabilities but are constrained by the limitations of prevailing reward models, which are primarily vision-centric, produce only scalar assessments, and are heavily dependent on human-labeled training data. This leads to insufficient support for less-served modalities (notably audio and video), limited error interpretability, and poor scalability to new domains. The Omni-RRM framework reframes reward modeling as a principal–agent problem, where nuanced human preferences are difficult to encode via simplistic scalar rewards. Omni-RRM addresses these shortcomings by generating, for any context and candidate pair, a detailed and explainable comparative judgment consisting of:

  • A rubric-grounded rationale with five shared evaluation dimensions (fluency, relevance, accuracy, reasoning quality, safety), including explicit, modality-specific comparative evidence.
  • A final preference verdict that directly aligns with the dimension-wise judgments.

This design transforms reward modeling from black-box regression into interpretable generation, allowing for fine-grained auditability and robust alignment across modalities.

2. Automated Dataset Construction: Omni-Preference

Omni-RRM leverages the Omni-Preference dataset, which comprises approximately 41,000 carefully filtered, high-confidence preference pairs, each annotated with a five-dimension rubric rationale and verdict—all generated without direct human labeling (Kong et al., 31 Jan 2026). The construction pipeline involves:

  • Automated Candidate Generation: Diverse context–question pairs are sampled from standard multimodal benchmarks (e.g., ActivityNet, Clotho-AQA). Two models of differing capability ("strong" and "weak" generators) are used to produce alternative responses for each prompt.
  • Rubric Annotation and Filtering: Each response pair is evaluated independently by two teacher models (GPT-4o-mini and Doubao-1.5-Pro). Teachers output dimension-wise justifications, integer scores (in [0,10]), and a categorical verdict, all under rubric constraints in a structured JSON schema. Teacher judgments are reconciled: only pairs with aligned verdicts and consistent score ordering are retained, while rule-based filters remove malformed or low-information samples.
  • Difficulty Stratification: The dataset distinguishes "Hard" pairs (low score margins) from "Easy" pairs (high score margins), supporting nuanced assessment of model discrimination.

The resulting dataset provides comprehensive, rubric-grounded preference signals spanning text, image, video, and audio.
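The reconciliation and stratification steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the field names and the margin threshold are our assumptions; the paper specifies only that verdicts must align, score ordering must be consistent, and pairs are split into "Hard" and "Easy" by score margin.

```python
# Sketch of the Omni-Preference reconciliation step: keep a pair only when
# both teacher judges agree on the verdict and on the score ordering, then
# label it Hard or Easy by average score margin.
# Field names and the hard_margin threshold are illustrative assumptions.

def reconcile(judge_a: dict, judge_b: dict, hard_margin: float = 2.0):
    """Return a kept sample with a difficulty label, or None if rejected."""
    # Verdicts must align across the two teacher models.
    if judge_a["verdict"] != judge_b["verdict"]:
        return None
    # Score ordering (which response scores higher) must be consistent.
    order_a = judge_a["score_A"] - judge_a["score_B"]
    order_b = judge_b["score_A"] - judge_b["score_B"]
    if order_a * order_b <= 0:  # disagreement or a tie in either judge
        return None
    # Difficulty: small average score margin -> Hard, large -> Easy.
    margin = (abs(order_a) + abs(order_b)) / 2
    return {"verdict": judge_a["verdict"],
            "difficulty": "Hard" if margin < hard_margin else "Easy"}
```

In this sketch, the rule-based filters for malformed or low-information samples would run before `reconcile`; only pairs surviving both steps enter the dataset.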

3. Model Structure and Training Methodology

Omni-RRM implements a two-stage training paradigm:

  • Supervised Fine-Tuning (SFT): Built atop the Qwen-2.5-Omni backbone with LoRA adapters, the model is first fine-tuned to generate schema-compliant JSON rationales given input context and response pairs. Training minimizes the negative log-likelihood of the target rationale-verdict pairs. After two epochs, the model produces reliable five-dimension critiques.
  • Group Relative Policy Optimization (GRPO): To enhance sensitivity, especially on low-contrast (difficult) preference pairs, the model undergoes reinforcement learning with a GRPO objective. For each context, k candidate responses are generated and scored via a composite reward:
    • R_fmt: Strict schema validity (+1 or −1).
    • R_pref: Preference correctness and consistency.
    • R_rub: Full rubric coverage and explicit A-vs-B contrasts.

A group-normalized advantage is computed for each candidate and used to update the policy with a clipped PPO-style objective and KL penalty to the SFT reference policy.

This regimen first builds the model’s capacity to internalize and output rubric-grounded rationales, then sharpens its ability to discriminate between closely matched alternatives.
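The GRPO scoring step can be sketched as below. The reward values and equal weighting are illustrative assumptions; the paper specifies only the three components (R_fmt as +1/−1 schema validity, R_pref, R_rub) and that advantages are normalized within each group of k candidates.

```python
# Sketch of the GRPO scoring step: a composite reward per sampled critique,
# then a group-normalized advantage across the k candidates for one context.
# Component values and their equal weighting are illustrative assumptions.
from statistics import mean, pstdev

def composite_reward(valid_json: bool, pref_correct: bool,
                     rubric_complete: bool) -> float:
    r_fmt = 1.0 if valid_json else -1.0       # R_fmt: strict schema validity
    r_pref = 1.0 if pref_correct else 0.0     # R_pref: verdict correctness
    r_rub = 1.0 if rubric_complete else 0.0   # R_rub: full rubric coverage
    return r_fmt + r_pref + r_rub

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within the group of k candidates (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The resulting advantages (zero-mean within each group) would then enter the clipped PPO-style update with the KL penalty to the SFT reference policy.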

4. Evaluation and Benchmark Results

Omni-RRM is evaluated on three major modalities using established preference benchmarks:

| Modality | Benchmark(s) | Accuracy (%) | Absolute Gain over Backbone (pp) |
|---|---|---|---|
| Image | VL-RewardBench, MM-RewardBench | 67.1, 72.9 | +16.1, +26.8 |
| Video | ShareGPT-V | 80.2 | +21.0 |
| Audio | Audio-HH-RLHF | 66.8 | +7.1 |
| Overall mean | | 71.8 | +17.7 |
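Assuming the overall row is the unweighted average over the four per-benchmark scores (an assumption about how the table was aggregated), it can be reproduced directly:

```python
# Reproduce the "Overall mean" accuracy as the unweighted average of the
# four per-benchmark accuracies reported above. Treating the two image
# benchmarks as separate entries is an assumption about the aggregation.
accuracies = [67.1, 72.9, 80.2, 66.8]  # image x2, video, audio
overall = sum(accuracies) / len(accuracies)  # 71.75, reported as 71.8
```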

Results are aggregated over three random seeds, with standard deviations under 0.5 percentage points. Omni-RRM matches or surpasses proprietary LLMs (e.g., Gemini-2.5-Pro) and outperforms open RMs such as LLaVA-Critic and UnifiedReward-Think. Statistical analysis reveals that improvements concentrate on hard pairs (average +12.0 pp gain on hard vs. +6.6 pp on easy), directly confirming the effectiveness of the GRPO phase in enhancing nuanced discrimination (Kong et al., 31 Jan 2026).

5. Ablation and Comparative Analysis

Key ablation studies yield the following findings:

  • Rubric Grounding: Ablating dimension-wise justification (reducing to scalar rewards) reduces overall accuracy from 71.8% to 64.4%, demonstrating the vital inductive bias provided by structured rationales.
  • Cross-Modal Transfer: Training on all modalities results in superior performance; removing any modality degrades results across the spectrum, highlighting that shared rubric signals enable effective cross-modal transfer.
  • Data Quality and Scale: When compared with an equal-sized public preference set (LLaVA-Critic), the rubric-grounded Omni-Preference pipeline achieves +9.2 pp increased accuracy on VL-RewardBench, underlining the higher signal quality.

This suggests that richly structured, automatically synthesized preference data can outperform larger but less informative collections.

6. Interpretability, Transparency, and Modality Support

Omni-RRM’s model outputs comprise fully structured JSON rationales, with explicit contrastive textual judgments for both candidates along the five rubric dimensions:

{
  "fluency":    "...",
  "relevance":  "...",
  "accuracy":   "...",
  "reasoning":  "...",
  "safety":     "...",
  "better":     "A" or "B"
}

For each candidate pair, this schema enables direct auditing of the rationale underlying preference decisions, facilitating error diagnosis (e.g., detection of object hallucination, temporal slip, or content mistranscription). The design is modality-general, readily absorbing new modalities or domains through rubric extension. By producing justifications instead of mere scalars, it enables transparent supervision and addresses longstanding deficiencies in the interpretability of reward modeling for multimodal AI.
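A minimal validator for rationales of this shape might look as follows; the key set mirrors the schema shown above, while the function name and strictness choices (non-empty strings, exact key set) are our assumptions.

```python
# Minimal validity check for a rubric rationale in the JSON schema shown
# above: all five dimensions present as non-empty strings, plus a binary
# A/B verdict. This mirrors the kind of strict-format check R_fmt rewards.
import json

DIMENSIONS = ("fluency", "relevance", "accuracy", "reasoning", "safety")

def is_valid_rationale(raw: str) -> bool:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(DIMENSIONS) | {"better"}:
        return False
    if not all(isinstance(obj[d], str) and obj[d].strip() for d in DIMENSIONS):
        return False
    return obj["better"] in ("A", "B")
```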

7. Future Directions and Open Challenges

Planned advancements for Omni-RRM include:

  • Extending coverage to real-world audio (e.g., non-synthetic recordings with noise and speaker variation).
  • Scaling the dataset to more comprehensive, diverse domains.
  • Exploring dynamic rubrics that adapt to evolving application requirements.

The core methodology embodied by Omni-RRM—zero-human-annotation pipeline, rubric-grounded judgment, and progressive optimization—directly addresses modality imbalance and lack of interpretability, paving the way for more reliable, trustworthy AI alignment in unconstrained, multimodal environments (Kong et al., 31 Jan 2026).

For context, Omni-RRM is complemented by related efforts such as Omni-Reward, which pursues generalist reward modeling through free-form criteria and a broader set of modalities, though with a different focus on user-specified criteria and larger, externally constructed datasets (Jin et al., 27 Oct 2025). Both frameworks represent significant advances toward scalable, adaptable, and value-aligned reward modeling for next-generation multimodal agents.
