Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning

Published 26 Jan 2026 in cs.CL and cs.LG | (2601.18722v1)

Abstract: When asked a question in a language less seen in its training data, current reasoning LLMs (RLMs) often exhibit dramatically lower performance than when asked the same question in English. In response, we introduce \texttt{SP3F} (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without \textit{any} data in the target language(s). First, we supervise fine-tune (SFT) on translated versions of English question-answer pairs to raise base model correctness. Second, we perform RL with feedback from a pairwise judge in a self-play fashion, with the judge receiving the English reference response as \textit{privileged information}. Thus, even when none of the model's responses are completely correct, the privileged pairwise judge can still tell which response is better. End-to-end, \texttt{SP3F} greatly improves base model performance, even outperforming fully post-trained models on multiple math and non-math tasks with less than 1/8 of the training data across the single-language, multilingual, and generalization to unseen language settings.

Summary

  • The paper introduces SP3F, a novel two-stage framework combining supervised fine-tuning on translated responses with reinforcement learning guided by privileged pairwise judges.
  • It employs a unique method where an LLM judge, with privileged access to English reference answers, reliably rewards correct reasoning chains even in low-resource languages.
  • The approach achieves significant gains, with over 3x accuracy improvements in languages like Indonesian and Bengali while using only 1/8 of the typical training data.

SP3F: Advancing Data-Efficient Multilingual Reasoning with Privileged Pairwise Judges

Introduction

Reasoning performance of LLMs often degrades precipitously when deployed in lower-resource languages due to insufficient data and weak cross-lingual transfer from predominantly English-centric training pipelines. The paper "Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning" (2601.18722) introduces SP3F (Self-Play with Privileged Pairwise Feedback), a two-stage post-training framework that improves LLM reasoning performance for target languages even when no data in those languages is available. SP3F combines supervised finetuning (SFT) over translated reference responses and reinforcement learning (RL) guided by a privileged LLM judge leveraging English source answers for feedback.

The SP3F Framework

SP3F comprises two algorithmic phases:

1. SFT on Translated Reference Responses:

Given plentiful English question-answer pairs (x, y^*), both questions and reference solutions are translated into the target language. SFT maximizes the likelihood of generating the translated solution under the model given a translated question, facilitating better mode selection for subsequent RL. Translation is automated, and the approach can be applied to a single language or across a multilingual set.
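
As a hedged sketch, the SFT stage described above can be written as a standard maximum-likelihood objective. The symbols T (machine translation into the target language), D (the English dataset), and π_θ (the model being fine-tuned) are notation introduced here for illustration and are not taken from the paper:

```latex
\max_{\theta} \; \mathbb{E}_{(x,\, y^*) \sim \mathcal{D}}
\left[ \log \pi_{\theta}\big( T(y^*) \,\big|\, T(x) \big) \right]
```

In the multilingual setting, one would presumably also average over the set of target languages, with a separate T for each.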

2. RL with Privileged Pairwise Judge Feedback:

RL updates are driven by a composite reward that integrates standard verifiable signals (correctness, formatting, language fidelity) with preference judgments from an LLM "judge" that has privileged access to the original English reference response. For each prompt in a batch, multiple candidate responses are sampled, and the judge compares them in pairs, preferring the response closer in reasoning and correctness to the reference. This setup makes it more reliable to reward correct chains-of-thought (CoT) even when the final answer is incorrect, which is a crucial step for mitigating the cold-start problem in low-resource languages.

Figure 1: SP3F outperforms Qwen2.5-7B-Instruct across 4 tasks, achieving higher accuracy and language fidelity while using only 1/8 as much training data.

Figure 2: The RL pipeline uses a pairwise judge informed by English reference responses, comparing N outputs per prompt and rewarding based on empirical win-rate.
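
The win-rate reward described above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Judge` signature, the `win_rates` helper, and the word-overlap `toy_judge` stand in for the paper's LLM judge and are not its actual prompts or code. Each pair is queried in both presentation orders and the two outcomes are averaged, which is one common way to dampen positional bias.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical judge interface (an assumption, not the paper's API):
# returns True if `first` is preferred over `second`, given the
# privileged English reference response.
Judge = Callable[[str, str, str], bool]


def win_rates(responses: List[str], reference: str, judge: Judge) -> List[float]:
    """Score each response by its empirical win rate against the others."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        # Ask the judge in both presentation orders and average the outcomes.
        i_wins_when_first = 1.0 if judge(responses[i], responses[j], reference) else 0.0
        i_wins_when_second = 0.0 if judge(responses[j], responses[i], reference) else 1.0
        score_i = (i_wins_when_first + i_wins_when_second) / 2
        wins[i] += score_i
        wins[j] += 1.0 - score_i
    # Each response takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def toy_judge(first: str, second: str, reference: str) -> bool:
    """Stand-in judge: prefers the response sharing more words with the reference.

    Ties go to whichever response is shown first, mimicking positional bias.
    """
    ref_words = set(reference.split())
    overlap_first = len(ref_words & set(first.split()))
    overlap_second = len(ref_words & set(second.split()))
    return overlap_first >= overlap_second
```

With the toy judge, the responses `["x = 4 so answer 8", "answer 7", "no idea"]` and the reference `"x = 4 so the answer is 8"` receive win rates of 1.0, 0.5, and 0.0. Two identical responses each receive 0.5, because the order-averaging cancels the toy judge's first-position preference.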

Experimental Setup

SP3F-7B is derived from Qwen2.5-7B, trained using questions translated into 18 different languages and the corresponding solutions. Datasets include both in-domain (math: MGSM, MT-Math100) and out-of-domain (Belebele, Global MMLU Lite) benchmarks. Evaluation metrics are accuracy (final answer correctness) and language fidelity (fraction of response in target language). Baselines include Qwen2.5-7B-Instruct (1M post-training examples) and "Translate Test" (inference-time translation boosting).

Results: Gains of SP3F

Single-Language and Multilingual Performance

SP3F demonstrates marked improvements over post-trained models, especially in lower-resource languages (e.g., Indonesian, Swahili, Bengali), where accuracy is more than 3× that of Qwen2.5-7B-Instruct. Both the SFT and RL phases are essential: SFT alone raises baseline accuracy, but RL with judge feedback brings robust reasoning gains.

Figure 3: SP3F trained in a single language outperforms Qwen2.5-7B-Instruct in all tasks; performance deltas are largest in low-resource languages.

Figure 4: SP3F-7B gains over Qwen2.5-7B-Instruct are particularly pronounced in math reasoning and out-of-domain tasks for lower-resource languages.

SP3F-7B, trained on multilingual data with the SP3F pipeline, consistently surpasses Qwen2.5-7B-Instruct across all in-language and cross-lingual benchmarks, achieving higher scores with 1/8 of the post-training data. Translate Test inference-time boosts do not close this gap.

Generalization to Unseen Languages

Crucially, SP3F-7B generalizes to unseen languages beyond its training set, outperforming Qwen2.5-7B-Instruct with an average absolute improvement of 18% in reading comprehension and 3.4% in math accuracy.

Privileged Pairwise Judges: Technical Insights

Improving Judge Reliability and Reducing Intransitivity

Providing privileged information (English reference responses) to the LLM judge reduces noise and intransitivity in feedback. Intransitive preferences, where a judge ranks responses in a cycle and thereby undermines reward modeling, become less frequent, and the judge is better at detecting correct CoT even when final answers are incorrect.

Figure 5: Privileged judges demonstrate lower intransitivity compared to non-privileged judges, yielding more reliable pairwise preference orderings.
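
To make the notion of intransitivity concrete, the following Python sketch counts the fraction of response triples whose pairwise preferences form a cycle. This is an illustrative measure only; the function name `cyclic_triple_fraction` and the boolean-matrix input are assumptions, and the result is not claimed to match the metric the paper reports.

```python
from itertools import combinations
from typing import List


def cyclic_triple_fraction(prefers: List[List[bool]]) -> float:
    """Fraction of triples (a, b, c) whose pairwise preferences form a cycle.

    prefers[i][j] is True when the judge prefers response i over response j.
    """
    n = len(prefers)
    triples = list(combinations(range(n), 3))
    if not triples:
        return 0.0
    cyclic = 0
    for a, b, c in triples:
        # A triple is cyclic if a > b > c > a, or the reverse orientation.
        forward = prefers[a][b] and prefers[b][c] and prefers[c][a]
        backward = prefers[b][a] and prefers[c][b] and prefers[a][c]
        if forward or backward:
            cyclic += 1
    return cyclic / len(triples)


# A single 3-cycle: 0 beats 1, 1 beats 2, 2 beats 0.
cycle = [
    [False, True, False],
    [False, False, True],
    [True, False, False],
]

# A consistent ranking: 0 beats 1 and 2, 1 beats 2.
ranking = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
```

Here `cyclic_triple_fraction(cycle)` is 1.0 and `cyclic_triple_fraction(ranking)` is 0.0. A win-rate reward remains well defined in both cases, which is one reason to prefer it over trying to extract a strict ranking from the judge.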

Privileged feedback especially aids the early RL cold start, when correct reasoning chains are sparse. Empirical ablations show privileged judges are better at ranking responses with correct intermediate steps, thus increasing sample efficiency and robustness in model updates.

Figure 6: Non-privileged judges still prefer generations from models trained with privileged feedback, implying broad qualitative improvements.

Training with privileged judges yields superior models in both in-domain and (to a lesser extent) out-of-domain tasks. The effect is validated by comparative pairwise evaluations—models trained with privileged feedback are favored even by non-privileged judges, indicating generalized improvements rather than narrow reward optimization.

Implications and Future Directions

SP3F substantiates that (1) data-efficient cross-lingual reasoning can be unlocked by leveraging privileged information in RL reward modeling, and (2) pairwise self-play protocols are robust against LLM judge intransitivity, allowing learning to be driven by dense, reliable relative feedback. By requiring only English reference responses, SP3F makes it feasible to bring high reasoning fidelity to resource-poor languages, advancing the practical deployment of LLMs globally. The demonstrated generalization to unseen languages and tasks, alongside robust sample efficiency, indicates significant promise for scaling multilingual reasoning without prohibitive data curation costs.

Potential future extensions include broader adoption of privileged pairwise judges in other RL fine-tuning paradigms, integration with advanced regularization for further domain generalization, and tighter analysis of judge composition for optimal feedback reliability.

Conclusion

The SP3F framework sets forth a principled, efficient, and broadly applicable methodology for post-training LLMs for multilingual reasoning in the absence of target-language data. By reusing English reference solutions in both SFT and privileged RL feedback, SP3F-7B outperforms a same-size baseline post-trained on roughly eight times as much data. Theoretical justifications and extensive empirical analysis of privileged judges elucidate their advantages in feedback consistency and sample efficiency. These insights will inform future developments in multilingual LLM reasoning and RL-based model post-training.

Explain it Like I'm 14

Overview

This paper is about helping AI language models think and solve problems well in many different languages, not just English. The authors introduce a training method called SP3F (Self-Play with Privileged Pairwise Feedback) that boosts reasoning skills in languages where there isn’t much training data (like Swahili or Bengali). The key trick: use English solutions as a “secret answer key” to guide a judge that compares the model’s responses in other languages, so the model gets useful feedback even when it’s not fully correct yet.

What questions does the paper ask?

  • Why do AI models do much worse when you ask them questions in non-English languages?
  • How can we teach an AI to reason well in those languages without collecting lots of new data in each language?
  • Can a special “judge” that sees the English answer help train the model to reason better in other languages?
  • Will this approach work across many languages, and even on languages the model wasn’t trained on?

How does the method work?

The training happens in two stages. Think of it like sports practice, then scrimmage games with a referee:

  1. Supervised practice (SFT):
  • The model practices by reading problems and good solutions that were originally in English.
  • These English problems and solutions are translated into the target languages (like Indonesian or Hindi) using a translator tool.
  • The model learns to produce step-by-step reasoning in the target language by copying these translated “good examples.”
  • Analogy: it’s like learning by studying worked examples in your native language, even if those examples came from an English textbook.
  2. Self-play with a privileged judge (RL):
  • The model now generates several different answers to the same problem in the target language.
  • A “judge” AI compares pairs of these answers and decides which one is better.
  • The judge gets “privileged” information: it can see the original English reference solution (the answer key), even though the model cannot.
  • Because the judge can compare each non-English answer to the trusted English solution, it can still pick the best attempt, even if none are perfect.
  • The model earns reward points for:
    • Getting the final answer right (accuracy),
    • Using the expected format (like putting the final result in a box),
    • Responding mostly in the target language (language fidelity),
    • Winning more pairwise matchups (the judge prefers it more often than other samples).
  • Analogy: the model plays mini-matches against its own answers. The referee has the answer key in English and awards points to the attempt that’s closest to correct. The model learns to make better attempts over time.
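
The four reward ingredients listed above can be combined as in the rough Python sketch below. Several details are assumptions made for illustration rather than facts from the paper: the terms are simply added with equal weights, the final answer is read from the last `\boxed{...}` in the response, and the language check passes when at least 70% of the response is judged to be in the target language. The names `RewardWeights` and `composite_reward` are hypothetical, and `target_fraction` is assumed to come from an external language detector.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RewardWeights:
    # Equal weights are an illustrative assumption, not values from the paper.
    accuracy: float = 1.0
    formatting: float = 1.0
    language: float = 1.0
    judge: float = 1.0


# Matches \boxed{...} with no nested braces inside.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def composite_reward(
    response: str,
    gold_answer: str,
    target_fraction: float,  # share of the response in the target language
    win_rate: float,         # empirical pairwise win rate from the judge
    weights: Optional[RewardWeights] = None,
    fidelity_threshold: float = 0.7,
) -> float:
    """Sum three binary verifiable indicators and the judge win rate."""
    w = weights or RewardWeights()
    boxed = BOXED.findall(response)
    formatted = 1.0 if boxed else 0.0
    correct = 1.0 if boxed and boxed[-1].strip() == gold_answer.strip() else 0.0
    in_language = 1.0 if target_fraction >= fidelity_threshold else 0.0
    return (
        w.accuracy * correct
        + w.formatting * formatted
        + w.language * in_language
        + w.judge * win_rate
    )
```

For example, a response ending in `\boxed{8}` with gold answer `"8"`, 90% target-language content, and a win rate of 0.5 scores 3.5 under these assumed weights. A response with no boxed answer and 50% target-language content earns only its win rate, which shows how the judge term still provides a learning signal when every verifiable check fails.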

Two extra ideas make this robust:

  • Pairwise comparisons and win-rate: instead of trying to assign a single “score” to each answer, the judge just decides which of two answers is better, many times. The model’s reward is how often its answer “wins.” This helps when judges are inconsistent.
  • Privileged information: giving the judge the English answer improves judging quality, especially in lower-resource languages, and helps the judge recognize good reasoning steps even when the final answer is wrong.

What did they test, and what did they find?

They trained a 7-billion-parameter model (SP3F-7B) using this method and compared it to a strong baseline model (Qwen2.5-7B-Instruct) that was trained on a much bigger dataset.

Tasks:

  • In-domain math tasks (the training used math problems):
    • MGSM: simple multilingual math word problems.
    • MT-Math100: more demanding math, translated into multiple languages.
  • Out-of-domain non-math tasks (to test generalization):
    • Belebele: reading comprehension across many languages.
    • Global MMLU Lite: general knowledge questions.

Key results (accuracy and answering in the right language):

  • SP3F-7B beat the baseline across math and non-math tasks in many languages.
  • It did this with about 1/8 the amount of training data (about 125k examples vs. 1,000,000).
  • Improvements were especially large in lower-resource languages (like Swahili, Bengali, Indonesian).
  • It also generalized to languages it wasn’t trained on, improving average accuracy by about +18% on reading comprehension (Belebele) and +3.4% on math (MT-Math100) compared to the baseline.

Why these results matter:

  • The privileged judge made better decisions and was more consistent.
  • It helped the model improve its step-by-step thinking in the target language, not just the final answer.
  • The method is data-efficient, since it relies on English solutions and machine translation rather than collecting new target-language datasets.

Why is this important?

  • Fair access: AI that reasons well in many languages can help more people, especially those who don’t primarily use English.
  • Data efficiency: You don’t need to gather tons of new data in every language; you can use English solutions and translation plus a smart judge.
  • Better reasoning: The model gets guidance on its thinking steps, not just whether the final answer is right. This helps learning early on when the model often gets answers wrong.
  • Robust judging: Pairwise comparisons and the English “answer key” make the judge more reliable, even across different languages.

What could this change in the future?

  • Smarter multilingual helpers: AI assistants could become better at math, reading comprehension, and general knowledge in many languages, including those with less online data.
  • Broader applications: The “privileged judge” idea could be used beyond languages—for example, in science or coding—where a trusted reference can help guide learning.
  • More inclusive AI training: Since this method doesn’t depend on huge target-language datasets, it could quickly improve performance in new or underrepresented languages.
  • Better generalization: Training with pairwise judges and verifiable rewards may build models that transfer their skills to new tasks and languages more reliably.

Overall, the paper shows a clever way to “borrow strength” from English solutions to teach AI models to reason well in other languages, efficiently and effectively.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, rendered as concrete items to guide future research.

  • Data quality and translation fidelity:
    • The SFT stage relies on machine-translated queries and responses (GPT‑5‑Nano) across 18 languages with no reported quality checks beyond a single Indonesian MGSM subset; quantify translation error rates and their impact on downstream training.
    • Assess whether translation artifacts (idiom literalization, syntactic calques, tokenization issues in non-Latin scripts, RTL layout) bias learning, and develop robust preprocessing/filters.
    • Evaluate how noisy or misaligned English reference responses (teacher errors or domain drift) propagate through privileged judging and RL.
  • Judge reliability and biases:
    • The privileged judge is GPT‑4o‑mini (closed model); measure judge accuracy/calibration per language and task, and compare against open judges to improve reproducibility.
    • Beyond positional bias (mitigated by averaging), analyze and correct length bias, verbosity bias, and formatting bias in pairwise judgments.
    • Provide formal quantification of how reductions in intransitivity (PNT metric) correlate with policy improvements; is reduced intransitivity necessary/sufficient?
    • Establish statistical confidence in judge comparisons (e.g., bootstrap over pairwise outcomes) and link judge uncertainty to RL objective weighting.
    • Investigate failure modes where “closest to English reference” pushes literal translation over correct reasoning or culturally appropriate expression in the target language.
  • Reward design and generalization:
    • The language fidelity reward uses a binary 70% threshold via lingua; test alternative detectors, thresholds, and code-switch-aware metrics, especially for morphologically rich and low-resource languages.
    • The formatting reward (\boxed{}) is math-centric; define task-specific verifiable rewards for non-math tasks and measure sensitivity to formatting choices.
    • Explore process-level/verifier rewards that grade intermediate CoT steps (e.g., step consistency checks) to reduce reliance on sparse end-answer rewards.
    • Systematically ablate the weights among the four reward components (accuracy, formatting, language, judge win-rate) to understand trade-offs and avoid mode selection.
  • Scope and evaluation coverage:
    • Training data is math-heavy (DeepScaleR); quantify how SP3F scales to other reasoning domains (code, formal logic, scientific QA), and whether privileged judging remains beneficial.
    • Out-of-domain performance shows mixed results relative to RLVR; identify causes (overfitting to math-style CoT? reward misalignment?) and test regularization strategies.
    • Unseen language evaluation covers only eight languages and limited typological diversity; extend to scripts/languages not represented (e.g., Arabic, Thai, Khmer, Amharic, Burmese, Lao).
    • Report confidence intervals/statistical significance for accuracy and language fidelity across tasks and languages to substantiate improvements.
  • Computational efficiency and scalability:
    • Pairwise judging incurs O(N^2) comparisons per prompt (N=8 yields 28 pairs); profile end-to-end compute and propose scalable approximations (e.g., tournament brackets, subset sampling, learned Bradley–Terry models).
    • Analyze sensitivity of results to N (number of samples per prompt), batch size, and training steps; provide guidance for compute-constrained settings.
    • Compare overall training compute (SFT+RL+judging) against the baseline (1M example post-training) to validate the “1/8 data” claim in terms of actual FLOPs and wall-clock.
  • Multilingual training strategy:
    • The multilingual mixture is equal-proportion across 18 languages; study curriculum/weighting strategies (e.g., temperature sampling, resource-aware weighting) and their effect on low-resource gains and high-resource retention.
    • Examine language interference and catastrophic forgetting (e.g., English and high-resource languages) after SP3F; report pre/post performance in English and other major languages.
    • Test whether privileged information from languages other than English (e.g., high-resource hinge languages like Spanish, Mandarin) works equally well or reduces judge bias.
  • Robustness and safety:
    • Evaluate robustness to adversarial prompts (orthographic noise, mixed scripts, code-switching) and to domain shifts (colloquial vs. formal registers).
    • Assess safety/cultural appropriateness in multilingual outputs and whether “closest to English reference” induces cultural bias or suppresses local varieties.
    • Investigate reward hacking beyond language fidelity (e.g., short answers, boilerplate CoT) and introduce detectors/penalties.
  • Methodological ablations and generality:
    • Provide ablations on the necessity of SFT before RL (e.g., RL from base with privileged judge), and quantify cold-start dynamics at very low base accuracies.
    • Test SP3F across different base models and sizes (e.g., Llama, Mistral, Gemma) to establish generality and identify architecture-specific interactions.
    • Compare privileged pairwise judging to alternative supervision (e.g., reference-based edit distance on CoT, semantic alignment via embeddings, monolingual judges with translation adapters).
  • Data leakage and licensing:
    • Check for overlap between DeepScaleR training items and evaluation sets (MGSM, MT‑Math100, Belebele, Global MMLU Lite) to rule out leakage.
    • Document licensing and translation provenance for the 18-language corpus and the new MGSM Indonesian set; ensure reproducibility of the exact data mixture.
  • Theoretical grounding:
    • Formalize the sample-complexity benefits claimed for privileged verifiers in the LLM setting (beyond footnote intuition) and derive conditions under which pairwise self-play with privileged info guarantees improvement.
    • Model how translation errors and judge inconsistencies affect the learned policy (e.g., bounds under noisy rewards and intransitive preferences).
  • Practical deployment considerations:
    • Measure latency and cost of inference-time pairwise judging in real systems; propose amortization strategies (e.g., distilled judges, student reward models) without losing privileged benefits.
    • Explore whether the trained models can serve as efficient self-judges in target languages after distillation from the privileged judge.

These items aim to pinpoint actionable directions to strengthen the method’s validity, scalability, and applicability across languages, domains, and real-world constraints.
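
As a small worked example for the computational-efficiency item in the list above, the Python sketch below counts the unordered pairs for N samples and shows one of the suggested approximations, uniform subset sampling of pairs under a fixed judging budget. The helper names `num_pairs` and `sample_pairs` are hypothetical, and subset sampling is offered here only as an illustration of a possible approximation, not as something the paper evaluates.

```python
import math
import random
from itertools import combinations
from typing import List, Tuple


def num_pairs(n: int) -> int:
    """Number of unordered pairs among n sampled responses: n * (n - 1) / 2."""
    return math.comb(n, 2)


def sample_pairs(n: int, budget: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Pick at most `budget` distinct pairs uniformly at random.

    Returns every pair when the budget already covers the full set.
    """
    pairs = list(combinations(range(n), 2))
    if budget >= len(pairs):
        return pairs
    return random.Random(seed).sample(pairs, budget)
```

`num_pairs(8)` gives 28, matching the figure quoted above. If each pair is judged in both presentation orders to average out positional bias, the number of judge calls per prompt doubles to 56. `sample_pairs(8, 10)` would cut this to 10 pairs, at the cost of noisier win-rate estimates.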

Practical Applications

Immediate Applications

Below is a focused set of applications that can be deployed now or with modest engineering effort, leveraging SP3F’s two-stage training (SFT-on-translations + RL with verifiable rewards and privileged pairwise judges) and its demonstrated gains in multilingual reasoning, especially for lower-resource languages.

  • Multilingual customer support copilots with improved reasoning
    • Sectors: software, telecom, e-commerce, public services
    • Tools/Products/Workflows: fine-tune existing assistants with SP3F; enforce “language fidelity” to ensure replies are in the customer’s language; inference-time pairwise judging to pick the best response; fallback to “translate-test” when needed.
    • Assumptions/Dependencies: availability of English reference answers for common workflows; reliable translation to target languages; judge LLM and language classifier; compute for RL/self-play.
  • Low-resource language math tutoring and homework help (Indonesian, Swahili, Bengali, etc.)
    • Sectors: education
    • Tools/Products/Workflows: deploy SP3F-trained math tutor that provides step-by-step CoT in the student’s language; use privileged judges to grade reasoning chains against canonical English solutions, even if the final answer is wrong.
    • Assumptions/Dependencies: curated English canonical solutions; safety guardrails for pedagogy; domain verifiers for math (format/answer correctness).
  • Translation QA and dataset curation using privileged pairwise judges
    • Sectors: software localization, data operations, academia
    • Tools/Products/Workflows: pairwise judge compares bilingual responses to English references to detect semantic drift; reduce intransitive/noisy judgments; batch-level win-rate scoring for quality assurance.
    • Assumptions/Dependencies: English references of sufficient quality; judge access (e.g., GPT-4o-mini or equivalent); control for positional bias during judging.
  • Multilingual FAQ and forms assistants for government and NGOs
    • Sectors: policy, public sector, social services
    • Tools/Products/Workflows: SP3F pipeline to fine-tune assistants on multilingual public information; language-fidelity reward to keep output in the target language; inference-time judge ranking to ensure coherent reasoning in local languages.
    • Assumptions/Dependencies: English authoritative references for policies and procedures; translation coverage; funding/compute; compliance review.
  • Cross-lingual compliance and policy checkers (assistive, not authoritative)
    • Sectors: finance, insurance, HR/compliance
    • Tools/Products/Workflows: assistants reason over multilingual documents and align to English policy references; pairwise judges prioritize outputs that adhere to the reference policy language.
    • Assumptions/Dependencies: non-legal advisory positioning; domain-specific verifiable rewards (e.g., checklist completion); careful review to prevent overreliance on LLM judgments.
  • Internal documentation and product localization with reasoning consistency
    • Sectors: software, hardware manufacturing, consumer electronics
    • Tools/Products/Workflows: SP3F-tuned models ensure localized instructions maintain logical steps (CoT) and correct outcomes; privileged judge detects reasoning mismatches vs English source.
    • Assumptions/Dependencies: English source-of-truth; translation service; judge reliability; alignment review for safety-critical docs.
  • Model evaluation harnesses using privileged pairwise judges
    • Sectors: AI tooling, MLOps
    • Tools/Products/Workflows: integrate privileged pairwise judges to build robust, transitivity-aware evaluation suites; use empirical win-rates as a training or gating signal for deployments.
    • Assumptions/Dependencies: judge API and cost; mitigation of positional bias; tracking language fidelity thresholds and their impact on accuracy.
  • Data-efficient multilingual post-training for enterprise LLMs
    • Sectors: software/AI platforms
    • Tools/Products/Workflows: adopt SP3F to reach strong multilingual reasoning with ~1/8 the data; reuse English references for SFT and RL; DR.GRPO-compatible self-play loops in Verl or similar frameworks.
    • Assumptions/Dependencies: teacher model to produce English references; pipeline integration; compute budgets; licensing for base models and judges.

Long-Term Applications

These applications likely require further research, scaling, domain adaptation (especially verifiable rewards outside math), safety reviews, or ecosystem standardization before broad deployment.

  • Equitable multilingual clinical reasoning and triage assistants
    • Sectors: healthcare
    • Tools/Products/Workflows: SP3F-like training with domain verifiers (medical factuality, guideline adherence) and privileged canonical pathways; pairwise judges reduce noisy judgments and cold-starts.
    • Assumptions/Dependencies: stringent clinical validation; regulated deployment; robust domain verifiable rewards; data privacy; multilingual clinical corpora and references.
  • Multilingual legal analysis and drafting support
    • Sectors: legal, public policy
    • Tools/Products/Workflows: privileged judges compare outputs to authoritative English legal templates; verifiable rewards adapted to legal correctness and citation integrity; cross-lingual consistency checks.
    • Assumptions/Dependencies: domain-specific verifiers; oversight by professionals; risk management; quality bilingual references; jurisdictional nuances.
  • Standardized “Privileged Judge” services and protocols
    • Sectors: AI platforms, evaluation benchmarks
    • Tools/Products/Workflows: hosted or on-device pairwise judge services; APIs for privileged judging; benchmarks that measure transitivity, language fidelity, and cold-start mitigation.
    • Assumptions/Dependencies: cost-effective judge inference; improved non-transitivity handling; open standards; governance to avoid reward hacking.
  • Comprehensive multilingual educational platforms (beyond math)
    • Sectors: education
    • Tools/Products/Workflows: science, history, and literacy tutors with domain-specific verifiers; privileged references from vetted curricula; longitudinal tracking of reasoning quality across languages.
    • Assumptions/Dependencies: curriculum-aligned references; safety and bias audits; pedagogical research on CoT presentation.
  • Multilingual scientific literature review assistants
    • Sectors: academia, R&D
    • Tools/Products/Workflows: privilege English abstracts or canonical reviews to evaluate foreign-language reasoning; pairwise judges for consistency and factual alignment; domain verifiers (claim verification).
    • Assumptions/Dependencies: high-quality translations; citation-aware verifiable rewards; cross-lingual retrieval integration.
  • Cross-border policy harmonization and regulatory assistants
    • Sectors: government, international organizations
    • Tools/Products/Workflows: privilege English treaty/regulatory references to align multilingual reasoning; pairwise judging to reduce intransitive preference pitfalls; transparent evaluation pipelines.
    • Assumptions/Dependencies: authoritative references and multilingual legal corpora; stakeholder oversight; interpretability tooling.
  • Enterprise-scale training cost reduction and continual multilingual expansion
    • Sectors: software/AI platforms
    • Tools/Products/Workflows: institutionalize SP3F to onboard new languages with minimal data; continual learning with English references; adaptive judges tuned per domain.
    • Assumptions/Dependencies: translation systems for new languages; monitoring for language fidelity vs accuracy trade-offs; scalable RL infrastructure.
  • Domain-specific verifiable rewards beyond math (code, structured data, business rules)
    • Sectors: software engineering, analytics, operations
    • Tools/Products/Workflows: verifiers for code correctness, schema constraints, KPI calculations; privileged references (canonical solutions/specifications) to strengthen judge reliability.
    • Assumptions/Dependencies: robust, tamper-resistant verifiers; careful reward design to prevent gaming; dataset creation.
  • Inference-time pairwise self-play selection with privileged references
    • Sectors: AI application delivery
    • Tools/Products/Workflows: sample N responses, use pairwise judges with references to select the best output per query; caching and budget-aware judging; dynamic language fidelity control.
    • Assumptions/Dependencies: latency and cost tolerance; scalable judging microservices; privacy-sensitive handling of references.
  • Privileged-judge paradigms for non-language domains (planning, robotics, CAD)
    • Sectors: robotics, industrial design
    • Tools/Products/Workflows: apply the principle of privileged pairwise judges (with ground-truth plans/specs) to supervise complex behaviors where scalar rewards are sparse or noisy.
    • Assumptions/Dependencies: access to privileged data in simulations or specs; safe sim-to-real transfer; domain-aligned pairwise tasks; rigorous evaluation.

Notes on general assumptions and dependencies:

  • English reference responses are pivotal; availability and quality determine downstream performance.
  • Translation quality into target languages (and back) affects both SFT and judge reliability.
  • Verifiable rewards must be adapted per domain; math-style correctness/formatting won’t generalize as-is.
  • LLM-as-judge reliability, cost, positional bias, and intransitivity management are critical; privileged information helps but does not eliminate these challenges.
  • Compute budgets and access to base models/judges (licensing) are practical constraints.
  • Language fidelity thresholds can trade off against accuracy; careful tuning and guardrails are needed.
  • Ethical, cultural, and regulatory considerations grow with deployment in sensitive sectors (healthcare, legal, public policy).

Glossary

  • Batch-level judge feedback term: A component of the RL reward that aggregates pairwise comparisons within a batch to score responses. "with rewards given via a composition of four terms: three verifiable binary indicators, and one batch-level judge feedback term."
  • Chain-of-thought (CoT): The step-by-step reasoning text produced by an LLM before the final answer. "data (e.g., chains of thought, CoTs) that is primarily in English"
  • Cold start problem: Difficulty initiating effective learning when correct outputs are rare or training data is scarce. "Put together, we face a cold start problem we can't easily offline fine-tune our way out of."
  • DR.GRPO: A variant of the GRPO reinforcement learning algorithm used for post-training LLMs. "We use the DR.GRPO \citep{liu2025understandingr1zeroliketrainingcritical} variant of GRPO \citep{shao2024deepseekmathpushinglimitsmathematical}."
  • Empirical win rate: The fraction of pairwise comparisons a response wins against others, used as a reward signal. "assign each response a score equal to its empirical win rate."
  • GRPO: A policy-gradient RL algorithm applied to LLM training. "We use the DR.GRPO \citep{liu2025understandingr1zeroliketrainingcritical} variant of GRPO \citep{shao2024deepseekmathpushinglimitsmathematical}."
  • In-domain: Tasks aligned with the training domain of the model. "As seen in \autoref{tab:all_tasks}, SP3F-7B consistently out-performs Qwen2.5-7B-Instruct on both in-domain math tasks and out-of-domain not math tasks."
  • Intransitive preferences: Cyclic judgments where A is preferred to B, B to C, and C to A, breaking consistent ranking. "LLMs often exhibit intransitive (i.e., cyclic) preferences where they might rank $A \succ B$, $B \succ C$, and $C \succ A$"
  • Language classifier: An automated tool to detect the language composition of generated text. "We use an automated language classifier to check what fraction of the response is in the target language."
  • Language fidelity: The extent to which a model’s response is produced in the target language. "both in terms of accuracy and language fidelity (did the model answer in the target language?)."
  • LLM judge: An LLM used to evaluate and compare generated responses, providing preference feedback. "we propose using an LLM judge for supervision on the CoT $z$."
  • Mode selection: A tendency of RL training to overcommit to a narrow subset of behaviors or outputs. "these RL algorithms are particularly prone to mode selection \citep{shao2025spuriousrewardsrethinkingtraining, oertell2025heuristics}, underscoring the need for a preliminary SFT step."
  • Multilingual training: Training across multiple languages to improve generalization and performance. "Finally, to take advantage of the repeatedly observed benefits of multilingual training \citep{yong2025crosslingualreasoningtesttimescaling,shi2022languagemodelsmultilingualchainofthought,muennighoff2023crosslingualgeneralizationmultitaskfinetuning}, we apply the above pipeline with data from 18 different languages (see \autoref{tab:language_list} for full list)."
  • Next-token prediction loss: The standard likelihood objective for supervised fine-tuning in LLMs. "maximize likelihood via a standard next-token prediction loss:"
  • On-policy: An RL training regime where updates are made using data generated by the current policy. "We opt for GRPO-style policy gradients in our work due to their relative simplicity and on-policy nature, which fits in cleanly to the self-play algorithmic template \citep{swamy2024minimaximalistapproachreinforcementlearning}."
  • Online RL: Applying reinforcement learning during generation to improve the model’s policy through interaction. "aiding in downstream mode selection via online RL \cite{yue2025doesreinforcementlearningreally} in the next stage."
  • Oracle: A reference evaluator used as an authoritative source for comparing models or responses. "use $\mathcal{P}_{\mathsf{no-priv}}$ as an oracle to perform model-to-model comparisons."
  • Pairwise judge: An evaluator that compares two responses at a time and selects the better one. "we perform RL with feedback from a pairwise judge in a self-play fashion \cite{swamy2024minimaximalistapproachreinforcementlearning}"
  • Pairwise preferences: Judgments defined over pairs of items (responses) rather than absolute scores. "we adopt a self-play style approach that optimizes pairwise preferences directly"
  • Policy: A mapping from inputs to probability distributions over outputs in reinforcement learning. "We search over policies $\pi \in \Pi \subseteq \{\mathcal{X} \to \Delta(\mathcal{Y})\}$."
  • Policy gradient techniques: RL methods that optimize policies by estimating gradients of expected rewards. "A wide spectrum of policy optimization algorithms have been proposed, from off-policy regression-based losses \citep{rafailov2023direct, gao2024rebel, azar2024general, gao2024regressing}, to policy gradient techniques \citep{ahmadian2024back, shao2024deepseekmathpushinglimitsmathematical, liu2025understandingr1zeroliketrainingcritical}."
  • Positional bias: Systematic preference influenced by the order in which options are presented. "To account for the positional bias of pairwise judges \citep{zheng2023judgingllmasajudgemtbenchchatbot, qin2024large}, we perform the standard averaging of judge preferences across both input orderings:"
  • Privileged information: Additional information available to the evaluator (not the generator) that aids more accurate judgment. "we provide the English reference response as privileged information \cite{vapnik2009new} to the judge"
  • PNT metric: A measure used to quantify non-transitivity in preferences. "We report the PNT metric proposed by \citet{xu2025investigatingnontransitivityllmasajudge}."
  • Reward hacking: Exploiting the reward function to achieve high scores via unintended behaviors. "helped avoid ``reward-hacking'' \citep{hadfield2017inverse}"
  • Reward modeling: Learning a function that maps outputs to scalar rewards consistent with preferences. "making standard reward modeling fundamentally misspecified."
  • RLHF: Reinforcement Learning from Human Feedback, a framework for post-training LLMs. "Popularized by RLHF~\citep{ouyang2022traininglanguagemodelsfollow}, reinforcement learning techniques have gained wide adoption in LLM post-training."
  • RLVR: Reinforcement Learning with Verifiable Rewards, an RL setup emphasizing automatically checkable signals. "By comparing the +RLVR and SP3F-7B rows of \autoref{tab:all_tasks}, we can more precisely identify the benefits of judge feedback."
  • Sample complexity: The amount of data needed to learn an effective model or component. "we have reduced the sample complexity of learning a verifier."
  • Scalar reward function: A single numerical score intended to reflect preferences or performance. "Such intransitivity means no scalar reward function can faithfully represent the judge's preferences"
  • Self-play: Training by comparing multiple generated responses against each other within the model’s own outputs. "in a self-play fashion \cite{swamy2024minimaximalistapproachreinforcementlearning}"
  • Self-Translate Test: An evaluation technique where the model translates the prompt into English and then answers. "Specifically, we use Self-Translate Test~\cite{etxaniz2023multilinguallanguagemodelsthink} where the translation are done using the model itself."
  • SP3F: The proposed framework “Self-Play with Privileged Pairwise Feedback” for multilingual reasoning. "We propose SP3F: Self-Play with Privileged Pairwise Feedback: a method for training multilingual reasoning models without any data in the target language(s)."
  • Teacher model: A stronger model that provides reference answers used for supervision or judgment. "can be relatively easily generated by a teacher model (e.g., o1 \citep{jaech2024openai}, R1 \cite{deepseekai2025deepseekr1incentivizingreasoningcapability})."
  • Total orderings: A global, consistent ranking of items without cycles. "provide more consistent global rankings (i.e., total orderings) over responses."
  • Translate-Test: A baseline where prompts are translated to English before being answered by the model. "In addition, we compare against Translate-Test~\cite{ponti2021modellinglatenttranslationscrosslingual, artetxe2023revisitingmachinetranslationcrosslingual}, where the query is translated into English before being solved by the model."
  • Verifiable rewards: Automatically checkable signals (e.g., correctness, format, language) used for RL supervision. "we perform RL (GRPO, \citet{shao2024deepseekmathpushinglimitsmathematical}) with feedback from verifiable rewards \cite{lambert2025tulu3pushingfrontiers} and a pairwise judge."
  • Win rate: The proportion of pairwise comparisons a response wins, used as the RL reward. "use the average win-rate of each response against the other $N-1$ samples as the reward for RL"
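
To make the "intransitive preferences", "scalar reward function", and "empirical win rate" entries concrete, the sketch below finds cyclic triples in a matrix of pairwise preferences and computes win rates from the same matrix. It is an illustration only and is not the PNT metric cited in the glossary; the function names and the convention that `pref[i][j]` holds the (order-averaged) probability that response `i` beats response `j` are assumptions made here.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def cyclic_triples(pref: Sequence[Sequence[float]]) -> List[Tuple[int, int, int]]:
    """Return triples (a, b, c) where a beats b, b beats c, and c beats a.

    Assumes pref[i][j] is the probability that response i beats response j.
    """
    def beats(i: int, j: int) -> bool:
        return pref[i][j] > 0.5

    cycles = []
    for a, b, c in combinations(range(len(pref)), 3):
        if beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))
        elif beats(a, c) and beats(c, b) and beats(b, a):
            # Same three items, cycle running in the other direction.
            cycles.append((a, c, b))
    return cycles


def empirical_win_rates(pref: Sequence[Sequence[float]]) -> List[float]:
    """Average win probability of each response against the other N-1."""
    n = len(pref)
    return [sum(pref[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


# A rock-paper-scissors style judge: 0 beats 1, 1 beats 2, 2 beats 0.
cyclic = [[0.5, 1.0, 0.0],
          [0.0, 0.5, 1.0],
          [1.0, 0.0, 0.5]]

# A fully transitive judge: 0 beats 1 and 2, 1 beats 2.
transitive = [[0.5, 1.0, 1.0],
              [0.0, 0.5, 1.0],
              [0.0, 0.0, 0.5]]
```

For the cyclic matrix, every response ends up with the same win rate even though each pairwise comparison is decisive, so no per-response scalar reproduces the judge's pairwise verdicts. For the transitive matrix, the win rates recover the full ordering.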

Open Problems

We found no open problems mentioned in this paper.
