
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling

Published 18 Feb 2025 in cs.CL | arXiv:2502.12663v1

Abstract: LLMs are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release the code to foster research along this line.

Summary

  • The paper extends process reward models (PRMs) to multilingual LLMs by translating an English dataset into six additional languages, yielding a training set spanning seven languages.
  • Multilingual PRMs consistently outperform monolingual models, improving average accuracy by up to +1.5 points and reducing early-stage reasoning errors.
  • The findings suggest training multilingual PRMs generalizes reasoning beyond English, offering potential for universal models and step-by-step reinforcement learning.

The paper "Demystifying Multilingual Chain-of-Thought in Process Reward Modeling" investigates the extension of process reward models (PRMs) to multilingual settings for LLMs, focusing on complex tasks requiring multi-step reasoning. The authors address the limitations of current reward models, which are predominantly focused on English, by translating existing PRM datasets from English into six additional languages, creating a comprehensive multilingual dataset.

Key Components and Methodology:

  • Multilingual Dataset Creation: The authors translate existing English datasets (PRM800K and Math-Shepherd) into six additional languages, yielding data in seven languages for training and evaluating multilingual PRMs.
  • Experimental Setups: Three PRM setups are defined:

    1. PRM-mono: Trained and evaluated on a single language.
    2. PRM-cross: Trained on one language but evaluated across multiple languages.
    3. PRM-multi: Trained on multiple languages and evaluated on a broader set.
  • Evaluation Metrics: The work employs two reasoning benchmarks across 11 languages (including both seen and unseen languages) to evaluate the performance of the multilingual PRMs against existing models.
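At inference time, PRMs are commonly used to rank candidate solutions via best-of-N selection, scoring each reasoning step and aggregating the step scores per candidate. The sketch below illustrates this pattern; `score_step` is a hypothetical stand-in for a trained PRM, and aggregating by the minimum step score is one common choice (the product of step scores is another).

```python
from typing import Callable, List

def select_best_candidate(
    candidates: List[List[str]],  # each candidate is a list of reasoning steps
    score_step: Callable[[List[str], str], float],  # PRM: (context, step) -> score
) -> int:
    """Return the index of the candidate whose weakest step scores highest."""
    best_idx, best_score = 0, float("-inf")
    for i, steps in enumerate(candidates):
        # Score each step given the preceding steps as context.
        step_scores = [score_step(steps[:j], steps[j]) for j in range(len(steps))]
        agg = min(step_scores) if step_scores else 0.0
        if agg > best_score:
            best_idx, best_score = i, agg
    return best_idx
```

With a real PRM, `score_step` would run the reward model on the partial solution; here any callable with that shape works, which makes the selection logic easy to test in isolation.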

Findings:

  1. Performance Superiority: Multilingual PRMs consistently outperform monolingual and cross-lingual PRMs across various LLMs, improving the average accuracy by up to +1.5 points over PRM-mono.
  2. Sensitivity: The performance of multilingual PRMs is sensitive to the number of training languages and the volume of English data. Optimal performance is observed when leveraging a moderate number of languages and balancing English data representation.
  3. Error Reduction: Multilingual PRMs reduce early-stage reasoning errors, suggesting that diverse language training enhances reasoning reliability.
  4. Scaling Benefits: Larger numbers of trainable parameters and more candidate responses amplify the benefits of PRMs in multilingual contexts.

Implications:

  • Generalization Beyond English: The findings highlight the potential for training robust multilingual PRMs that generalize effectively across a wide spectrum of languages.
  • Step-by-Step Reinforcement Learning: By leveraging PRMs in a reinforcement learning framework, LLMs can receive more granular feedback, potentially refining reasoning processes further.
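To make the step-by-step feedback idea concrete: in outcome-only RL, only the final answer yields a reward, whereas a PRM can assign a reward to every intermediate step. The following minimal sketch (the `prm` callable is a hypothetical trained reward model, not the paper's implementation) shows dense per-step rewards with an optional outcome bonus on the final step.

```python
from typing import Callable, List

def step_rewards(
    steps: List[str],
    prm: Callable[[List[str], str], float],  # hypothetical PRM: (context, step) -> score
    final_answer_correct: bool,
    outcome_bonus: float = 1.0,
) -> List[float]:
    """Dense per-step rewards from a PRM, plus an outcome bonus on the last step."""
    # Each step is scored in the context of the steps that precede it.
    rewards = [prm(steps[:i], steps[i]) for i in range(len(steps))]
    if rewards and final_answer_correct:
        rewards[-1] += outcome_bonus
    return rewards
```

Compared with a single terminal reward, this per-step signal lets an RL algorithm attribute credit to the specific step where reasoning went wrong, which is the granularity the paper argues multilingual PRMs provide.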

The paper provides substantial empirical evidence, suggesting that diverse multilingual input during training can overcome language-specific biases and improve cross-lingual transfer, ultimately enhancing the global applicability of LLMs. Additionally, the work opens avenues for developing universally applicable reasoning models, thereby addressing key challenges in multilingual AI. The authors also release the code to encourage continued research and development in this area.
