Rater Feedback Score (RFS) Analysis

Updated 23 January 2026

RFS is an open-loop, human-preference metric that evaluates predicted driving trajectories in rare, safety-critical long-tail scenarios.
It employs trust-region thresholds and exponential decay to align predictions with expert-rated candidates, addressing limitations of traditional metrics.
Empirical results on the Waymo Open Dataset for End-to-End Driving indicate that RFS distinguishes multi-modal, human-aligned behaviors more effectively than conventional ADE/FDE measures.

The Rater Feedback Score (RFS) is an open-loop, human-preference-based metric for evaluating the quality of predicted driving trajectories in rare, safety-critical “long-tail” scenarios, as deployed in the Waymo Open Dataset for End-to-End Driving (WOD-E2E) and adopted in the 2025 Waymo Vision-Based E2E Driving Challenge. RFS operationalizes the alignment of a model’s predicted future trajectory with a set of three expert-rated candidate trajectories, integrating both geometric proximity and explicit human quality assessments to overcome the limitations of conventional single-mode distance metrics such as Average Displacement Error (ADE) and Final Displacement Error (FDE) (Xu et al., 30 Oct 2025, Rowe et al., 12 Jun 2025).

1. Motivation and Conceptual Rationale

RFS was introduced to address the inadequacy of prior open-loop evaluation methods in rare and multi-modal long-tail scenarios, where multiple expert-approved reactions (e.g., evasive maneuvers around unexpected debris) may be equally valid. Traditional metrics such as ADE or FDE compare predictions to a single logged ground-truth path, failing to accommodate the diversity of safe human behaviors and over-penalizing creative or unorthodox—but valid—maneuvers. RFS leverages direct human expert judgment to (i) ensure that plausible alternative behaviors are recognized and (ii) robustly penalize only those deviations that fall outside established trust regions around high-quality behaviors (Xu et al., 30 Oct 2025).

2. Mathematical Formulation

Let $\hat{\tau}(t)$ denote the model’s predicted ego position at future time $t\in\{3\,\text{s},\,5\,\text{s}\}$. For each scenario, raters produce three reference trajectories $\{\tau_i(t)\}_{i=1}^3$ with corresponding quality scores $s_i\in[0,10]$ and initial speeds $v_i$.

Fundamental parameters:

Base trust-region thresholds:

At $t=3$ s: $\bar\tau_{\text{lat}}(3) = 1.0$ m, $\bar\tau_{\text{lng}}(3) = 4.0$ m.
At $t=5$ s: $\bar\tau_{\text{lat}}(5) = 1.8$ m, $\bar\tau_{\text{lng}}(5) = 7.2$ m.

At both timepoints the longitudinal threshold is 4 × the lateral threshold.

Speed-based scaling:

$\mathrm{scale}(v)= \begin{cases} 0.5, & v<1.4 \\ 0.5 + 0.5 \cdot \frac{v-1.4}{9.6}, & 1.4 \leq v < 11 \\ 1.0, & v\geq 11 \end{cases}$

( $v$ in m/s.)

The speed scaling interpolates threshold size from half the nominal at low speeds to full value at high speeds.

Adjusted trust-region thresholds per reference:

$\tau_{\text{lat}}^{(i)}(t) = \mathrm{scale}(v_i)\bar\tau_{\text{lat}}(t)$ , $\tau_{\text{lng}}^{(i)}(t) = \mathrm{scale}(v_i)\bar\tau_{\text{lng}}(t)$ .
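The speed scaling and the adjusted thresholds can be sketched in a few lines of Python. This is a minimal illustration of the formulas above, not the official challenge code; the names `BASE_THRESHOLDS`, `speed_scale`, and `trust_region` are hypothetical.

```python
# Base trust-region thresholds in meters, keyed by horizon in seconds,
# stored as (lateral, longitudinal) pairs taken from the text above.
BASE_THRESHOLDS = {3: (1.0, 4.0), 5: (1.8, 7.2)}


def speed_scale(v: float) -> float:
    """Piecewise-linear scaling factor; v is the reference's initial speed in m/s."""
    if v < 1.4:
        return 0.5
    if v < 11.0:
        # Linear ramp from 0.5 at 1.4 m/s to 1.0 at 11 m/s.
        return 0.5 + 0.5 * (v - 1.4) / 9.6
    return 1.0


def trust_region(v: float, t: int) -> tuple[float, float]:
    """Return the speed-adjusted (lateral, longitudinal) thresholds at horizon t."""
    lat, lng = BASE_THRESHOLDS[t]
    k = speed_scale(v)
    return k * lat, k * lng


if __name__ == "__main__":
    # Illustrative speeds only: near standstill, mid-range, and above 11 m/s.
    for v in (0.0, 6.2, 15.0):
        print(v, trust_region(v, 3), trust_region(v, 5))
```

A reference that starts near standstill therefore gets a trust region half as large as one that starts above 11 m/s, which makes the metric stricter in low-speed scenes.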

Alignment error:

For each $t$ , decompose $\Delta_i(t)=\hat\tau(t)-\tau_i(t)$ into longitudinal and lateral components.

$\epsilon^{(i)}(t) = \max\left\{\frac{|\Delta_{\text{lng}}^{(i)}(t)|}{\tau_{\text{lng}}^{(i)}(t)}, \; \frac{|\Delta_{\text{lat}}^{(i)}(t)|}{\tau_{\text{lat}}^{(i)}(t)}\right\}$
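As a rough sketch, the alignment error for one reference at one timepoint could be computed as follows. The text does not specify how the longitudinal axis is defined, so this example assumes it is the reference trajectory's heading at time $t$; the function name and the `ref_heading` parameter are hypothetical.

```python
import math


def alignment_error(pred_xy, ref_xy, ref_heading, tau_lat, tau_lng):
    """Normalized error epsilon: the larger of the two threshold-scaled offsets.

    ref_heading (radians) is an ASSUMED input that defines the longitudinal
    axis; the source does not state how the decomposition is performed.
    """
    dx = pred_xy[0] - ref_xy[0]
    dy = pred_xy[1] - ref_xy[1]
    cos_h, sin_h = math.cos(ref_heading), math.sin(ref_heading)
    d_lng = dx * cos_h + dy * sin_h    # offset along the heading direction
    d_lat = -dx * sin_h + dy * cos_h   # offset along the left-pointing normal
    return max(abs(d_lng) / tau_lng, abs(d_lat) / tau_lat)


if __name__ == "__main__":
    # Prediction 1.0 m ahead and 0.5 m to the left of a reference heading along +x,
    # evaluated against the unscaled 3 s thresholds (1.0 m lateral, 4.0 m longitudinal).
    print(alignment_error((1.0, 0.5), (0.0, 0.0), 0.0, tau_lat=1.0, tau_lng=4.0))
```

Because the maximum of the two ratios is taken, a prediction lies inside the trust region only when both its lateral and its longitudinal offset are within their thresholds.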

Score at each timepoint:

$s_i^{\text{pred}}(t) = s_i \cdot 0.1^{\max\{\epsilon^{(i)}(t) - 1,\,0\}}$

Within the trust region ($\epsilon^{(i)}(t) \leq 1$) there is no penalty. Outside it, the reference score decays exponentially by a factor of $0.1^{\epsilon^{(i)}(t)-1}$, so each additional unit of normalized error reduces the score tenfold.
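The decay rule is short enough to check numerically. The snippet below is an illustrative sketch; `decayed_score` is a hypothetical name and the rater score of 8 is an arbitrary example value.

```python
def decayed_score(rater_score: float, epsilon: float) -> float:
    """Keep the rater score inside the trust region; decay it by 0.1 per unit of excess error."""
    return rater_score * 0.1 ** max(epsilon - 1.0, 0.0)


if __name__ == "__main__":
    # Arbitrary rater score of 8 evaluated at increasing normalized errors.
    for eps in (0.5, 1.0, 1.5, 2.0, 3.0):
        print(f"epsilon={eps:.1f} -> score={decayed_score(8.0, eps):.3f}")
```

With a rater score of 8, the prediction keeps all 8 points up to $\epsilon = 1$, drops to about 2.53 at $\epsilon = 1.5$, and to 0.8 at $\epsilon = 2$.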

Reference-level aggregation:

For each reference $i$ , $\bar{s}_i = 0.5\left(s_i^{\text{pred}}(3) + s_i^{\text{pred}}(5)\right)$ .

Scenario-level aggregation:

$\tilde{s} = \max_{i=1,2,3} \bar{s}_i$ (best-aligned reference).

Final per-segment RFS:

$\mathrm{RFS}(\hat{\tau}) = \max\{\tilde{s},\,4.0\}$

The per-segment score is floored at 4.0, so no prediction can receive a lower value.
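Putting the timepoint scores, the per-reference average, the best-reference maximum, and the floor together gives a compact per-segment computation. The following is a sketch under the assumption that the normalized errors at 3 s and 5 s have already been computed for each reference; `segment_rfs` and its input layout are hypothetical.

```python
def segment_rfs(references, floor=4.0):
    """Per-segment RFS from (rater_score, eps_at_3s, eps_at_5s) tuples, one per reference."""

    def decayed(score, eps):
        # No penalty inside the trust region, tenfold decay per unit of excess error.
        return score * 0.1 ** max(eps - 1.0, 0.0)

    # Average the 3 s and 5 s scores for each reference.
    per_reference = [
        0.5 * (decayed(s, e3) + decayed(s, e5)) for s, e3, e5 in references
    ]
    # Take the best-aligned reference, then apply the floor.
    return max(max(per_reference), floor)


if __name__ == "__main__":
    # Illustrative values: the prediction sits inside the trust region of the
    # top-rated reference and well outside the other two.
    refs = [(8.0, 0.5, 0.9), (6.0, 2.0, 2.0), (5.0, 3.0, 3.5)]
    print(segment_rfs(refs))
```

Note that the floor is applied after the maximum over references, so a prediction that misses every trust region by a wide margin still scores 4.0.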

Dataset-level RFS:

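For disjoint long-tail categories $C$ with scenario sets $S_c$, where $\mathrm{RFS}^s$ denotes the per-segment score of scenario $s$ (not to be confused with the rater scores $s_i$):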

$\mathrm{RFS} = \frac{1}{|C|} \sum_{c\in C} \left( \frac{1}{|S_c|}\sum_{s\in S_c} \mathrm{RFS}^s \right)$

(Xu et al., 30 Oct 2025, Rowe et al., 12 Jun 2025).
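The dataset-level score is a macro-average: scenarios are averaged within each category first, and the category means are then averaged with equal weight. A minimal sketch follows, with hypothetical category labels and made-up per-segment scores used purely for illustration.

```python
from statistics import mean


def dataset_rfs(scores_by_category):
    """Average per-segment RFS within each category, then average the category means."""
    return mean(mean(scores) for scores in scores_by_category.values())


if __name__ == "__main__":
    # Hypothetical categories and scores, not taken from any benchmark.
    scores = {
        "construction": [8.0, 6.0, 7.0, 9.0],
        "debris": [4.0],
    }
    pooled = mean(s for group in scores.values() for s in group)
    print("category-averaged:", dataset_rfs(scores))
    print("pooled over all segments:", pooled)
```

In this toy case the category-averaged score (5.75) is lower than the pooled mean (6.8), because the single poorly handled segment in the rare category carries as much weight as the four segments in the frequent one.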

3. Rater Annotation Protocol

The process begins with critical-moment selection: expert raters review 20 s scenario segments and identify the earliest decision-critical frame (e.g., the onset of a hazard). A trajectory candidate generator (e.g., Wayformer) then produces up to 64 diverse 5 s rollouts, categorized by lateral direction (left/straight/right) and longitudinal aggressiveness. At most 12 of these candidates are sampled for human review.

From these, raters select three reference trajectories:

Rank 1: The “best/optimal” path (score $\geq 6$ required).
Ranks 2, 3: Plausible sub-optimal alternatives (may receive lower scores).

Each trajectory starts at 10 points and loses points for infractions: major infractions (Safety, Legality, Reaction Time) cost 2 points each, and minor infractions (Braking Necessity, Efficiency) cost 1 point each. Additional deductions are permitted for severe issues. Rationales are documented for all major deductions, and manual audits ensure guideline compliance. At least one reference must score $\geq 6$ (Xu et al., 30 Oct 2025).
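The deduction scheme can be illustrated with a small scoring helper. This is a sketch only: it assumes each infraction category is deducted at most once and that the result is clamped to the stated score range $[0,10]$, neither of which the text confirms. The names `MAJOR`, `MINOR`, and `rater_score` are hypothetical.

```python
MAJOR = {"safety", "legality", "reaction_time"}  # 2 points each
MINOR = {"braking_necessity", "efficiency"}      # 1 point each


def rater_score(infractions, extra_deduction=0.0):
    """Start from 10 points and subtract the per-category penalties.

    ASSUMPTIONS: each category counts once; the score is clamped at 0.
    extra_deduction models the additional penalty allowed for severe issues.
    """
    flagged = set(infractions)
    unknown = flagged - MAJOR - MINOR
    if unknown:
        raise ValueError(f"unknown infraction categories: {sorted(unknown)}")
    score = 10.0 - 2.0 * len(flagged & MAJOR) - 1.0 * len(flagged & MINOR)
    return max(score - extra_deduction, 0.0)


if __name__ == "__main__":
    print(rater_score([]))                                      # clean trajectory
    print(rater_score(["legality", "efficiency"]))              # one major, one minor
    print(rater_score(["safety", "reaction_time"], extra_deduction=3.0))
```

Under these assumptions a trajectory with one major and one minor infraction scores 7, which still clears the threshold of 6 required of the Rank 1 reference.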

4. Comparison with Other Metrics

RFS differs from ADE and FDE by explicitly accommodating multiple plausible behaviors via human-rated references, rewarding precise alignment within scenario-specific trust regions and penalizing outsized deviations exponentially. Unlike single-mode metrics, it provides a nuanced measure of performance in the presence of behavioral ambiguity.

Empirical results show only a weak Pearson correlation ($r\approx0.3$) between ADE and RFS across 19 models, with some methods performing well on one metric but not the other. This supports the distinction between geometric proximity and alignment with human preference (Xu et al., 30 Oct 2025).

5. Empirical Results and Interpretation

Ablation studies confirm that RFS is sensitive to design choices. In WOD-E2E, RFS consistently increases with architecture improvements: from 7.14 (single front camera) to 7.39 (eight-camera fusion with test-time multi-sample scaling). The Poutine model achieves an RFS of 8.12 on the validation set (nearly matching Waymo’s expert baseline of 8.13) and 7.99 on the held-out test set, securing first place in the 2025 challenge. These values indicate performance that is nearly indistinguishable from expert demonstrations under the trust-region protocol. The RFS floor (4.0) limits the influence of low outliers, and category-level averaging prevents overfitting to frequent scenario types (Rowe et al., 12 Jun 2025, Xu et al., 30 Oct 2025).

Model	VLT Pre-train	CoT at Inference	Validation RFS
Qwen-2.5-VL 3B (no pre)	✘	✘	5.59
Qwen-2.5-VL 7B (no pre)	✘	✘	5.40
Poutine-Base (w/ pre-train)	✔︎	✘	8.12
Poutine-Base (CoT)	✔︎	✔︎	8.08

An RFS of about 8.1 or higher is interpreted as expert-level proficiency, given the expert baseline of 8.13. The small drop from validation to test (8.12→7.99) suggests strong generalization across long-tail distributions (Rowe et al., 12 Jun 2025).

6. Strengths, Limitations, and Future Extensions

Strengths

Aligns with human judgment and incorporates multi-modal acceptable behaviors.
Trust-region construct accommodates spatial tolerance, while exponential penalty ensures robustness against large errors.
Category-wise uniform averaging combats over-specialization.

Limitations

RFS requires substantial human annotation effort for each scenario.
Open-loop design does not capture closed-loop or long-horizon effects; only two timepoints (3 s, 5 s) are considered.
Component axes (safety, legality, efficiency) are used for trajectory scoring but RFS itself is a scalar—no explicit sub-scores are reported.
Certain protocol hyperparameters (e.g., exponential decay constant $\alpha$ in the Poutine implementation) remain undisclosed and are challenge-specific, limiting external reproducibility (Rowe et al., 12 Jun 2025, Xu et al., 30 Oct 2025).

Future Extensions

Proposed enhancements include automation of candidate trajectory generation and scoring, extension to continuous or additional timepoints, hybridization with closed-loop simulation, and expansion to multi-agent trajectory judgment (Xu et al., 30 Oct 2025).

7. Significance and Adoption

RFS exemplifies a shift toward preference-aligned, scenario-sensitive evaluation of autonomous driving agents, particularly in domains where geometric or rule-based metrics are insufficient. It serves as the primary evaluation metric for the WOD-E2E benchmark and the 2025 Waymo challenge. Its design has led to high leaderboard selectivity: top-performing models such as Poutine outperform strong baselines by significant margins (e.g., 7.99 vs. <7.9 RFS on held-out test), suggesting meaningful discriminative power on critical real-world rare scenarios (Rowe et al., 12 Jun 2025, Xu et al., 30 Oct 2025).

References

Xu et al. (30 Oct 2025). WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios.

Rowe et al. (12 Jun 2025). Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving.
