Crosslingual Reasoning through Test-Time Scaling

Published 8 May 2025 in cs.CL, cs.AI, and cs.LG | (2505.05408v1)

Abstract: Reasoning capabilities of LLMs are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning LLMs (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLM's CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potentials, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

Abstract PDF Upgrade to Chat

Summary

The paper's main contribution shows that test-time scaling significantly enhances crosslingual reasoning performance in multilingual benchmarks.
It reveals that models with 3B+ parameters gain notable improvements, especially in STEM domains, when scaling inference compute is applied.
The study also explores language-mixing and forcing, highlighting the need for better strategies in handling low-resource languages.

Crosslingual Reasoning through Test-Time Scaling

Introduction

The paper "Crosslingual Reasoning through Test-Time Scaling" investigates the crosslingual capabilities of reasoning LLMs (RLMs) primarily fine-tuned for English tasks, expanding the exploration to multilingual settings. It provides insights into how test-time inference scaling affects multilingual reasoning, offering practical strategies and highlighting the limitations of these models.

Crosslingual Test-Time Scaling

The paper examines the effectiveness of crosslingual test-time scaling, focusing on models like s1, which are fine-tuned on English reasoning data but evaluated on a multilingual benchmark (MGSM). The analysis shows that scaling inference compute enhances multilingual reasoning performance, surpassing even larger models that do not employ scaling.

Figure 1: Crosslingual test-time scaling of s1 and Qwen models on the MGSM benchmark (excluding English) across different model sizes.

The researchers found that models with parameter sizes of 3B and above benefited significantly from test-time scaling, demonstrating superior performance on multilingual math reasoning tasks.

Language-Mixing Behaviors

The paper explores how English-centric RLMs exhibit language mixing in their reasoning outputs. It identifies a "quote-and-think" pattern where models quote non-English phrases and then process these within English reasoning, highlighting multilingual parsing capabilities.

Figure 2: Proportion of dominant languages in models' entire responses when queried with multilingual math questions.

This behavior reflects the influence of English finetuning on multilingual model outputs, showing that models can maintain non-English phrase understanding while predominantly reasoning in English.

Language Forcing

The study investigates "language forcing," where models are compelled to reason in the language of the input query. Different strategies, such as prefix prompts and system instructions, are tested to control reasoning language.

The experiments reveal that while it is possible to enforce language compliance, reasoning in high-resource languages tends to yield better performance than low-resource ones. The results also underline challenges in maintaining task performance while forcing a specific language for reasoning.

Cross-Domain Generalization

The research evaluates cross-domain generalization, comparing the models' performance on STEM versus non-STEM domains. Test-time scaling proves beneficial for STEM domains but shows limited advantages for cultural reasoning tasks, indicating domain-specific generalization limits.

Figure 3: Effects of thinking time for s1 models on different domains of Global-MMLU benchmark and cultural commonsense benchmarks.

Conclusion

The paper concludes that while English-centric RLMs supplemented with test-time scaling excel in multilingual reasoning, especially in high-resource languages, there is a need for more robust handling of low-resource languages and non-STEM domains. Future directions include exploring data-efficient multilingual training data and developing equitable tokenization strategies to enhance reasoning across diverse languages.

Markdown