SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

Published 22 May 2025 in cs.CL, cs.AI, and cs.LG | (2505.16637v3)

Abstract: LLMs have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a self-rewarding RL framework that uses the model itself to generate reward signals, bypassing the need for external annotated data.
The methodology leverages Group Related Policy Optimization and is validated on English-Chinese benchmarks, demonstrating improved translation performance.
The study indicates that SSR, especially when combined with external rewards, can enhance scalability and performance in low-resource machine translation scenarios.

SSR-Zero: Self-Rewarding Reinforcement Learning for Machine Translation

Introduction

The study "SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation" introduces a novel reinforcement learning (RL) approach specifically tailored for machine translation (MT) that does not rely on external reward models or human-annotated reference data. Despite the impressive abilities of LLMs in MT, the reliance on external supervision such as annotated reference data is a significant barrier due to the expense and difficulty in acquiring such resources. The proposed Simple Self-Rewarding (SSR) RL framework aims to tackle this challenge by implementing a self-judging mechanism using the model itself to generate reward signals. This method potentially reduces the overhead associated with scaling MT systems.

Methodology

The SSR framework employs a pre-trained LLM which acts dually as an actor and a judge within the training process (Figure 1). The model generates candidate translations for given inputs and evaluates these translations to derive reward signals. This approach follows an R1-like RL training paradigm but shifts the evaluation burden away from external models.

Figure 1: Overview of the SSR framework. SSR is an R1-Zero-like RL training method for machine translation, which uses the same model as both actor and judge. It does not require external reward models or human-annotated reference data. Prompts shown here are simplified for clarity.

The SSR method utilizes Group Related Policy Optimization (GRPO) for model optimization. Training involves GRPO updating of model parameters based on rewards computed from self-evaluated translations.

Experimental Setup

The experiments focused on English-Chinese translation tasks. The SSR-Zero model was built upon the Qwen2.5-7B backbone, trained on 13,130 monolingual examples. It was evaluated on several benchmarks, including WMT23, WMT24, and FLORES-200, using COMETKIWI-XXL and XCOMET-XXL metrics.

Performance and Results

SSR-Zero-7B outperformed several existing open-source MT-specific models as well as some larger LLMs across English-to-Chinese (EN→ZH) and Chinese-to-English (ZH→EN) translation tasks. The model indicates substantial improvements in both directions with self-generated rewards leading to notable performance gains.

Enhancing SSR with external reward models such as COMET further improved the model, leading to SSR-X-Zero-7B achieving state-of-the-art (SOTA) performance in the evaluated benchmarks.

Figure 2: Changes in average response length (a) and training rewards (b) of SSR/SSR-X-Zero-7B during GRPO training.

Figure 3: Changes in translation quality during training, measured by the average scores of COMETKIWI-XXL and XCOMET-XXL on the EN → ZH (a) and ZH → EN (b) benchmarks.

Comparative Analysis

The SSR framework has a clear advantage over external LLM-based rewards of similar size, indicating efficacy in leveraging self-generated feedback for performance enhancement. However, dedicated MT evaluation reward models like COMET and COMETKIWI exhibit superior capabilities when trained on substantial annotated data.

Conclusion

The SSR framework offers an effective methodology for training MT models devoid of extensive external resources. The research demonstrates the capability of SSR to enhance translation performance significantly and highlights the complementary benefits when combined with external reward signals. This self-rewarding approach opens avenues for developing scalable MT systems with reduced dependency on external supervision, especially beneficial in low-resource scenarios. Future investigations may expand the applicability of SSR across different languages and model architectures, while exploring the potential of alternative prompting and self-improvement techniques.