Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning

Published 23 Jul 2025 in cs.AI and cs.LG | (2507.17512v1)

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions including mutual enhancements and conflicts that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.

Abstract PDF Upgrade to Chat

Summary

The paper presents a data-centric analysis using RLVR to evaluate multi-domain reasoning, showing both synergy and trade-offs between domains.
The study employs GRPO with the Qwen-2.5-7B model to quantify in-domain gains and negative cross-domain effects among math, code, and puzzles.
It highlights the critical role of template consistency and adaptive reward design in achieving robust and generalizable performance.

Data-Centric Analysis of Multi-Domain Reasoning in RLVR for LLMs

Introduction

This paper presents a comprehensive empirical study of multi-domain reasoning in LLMs under the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm. The investigation is data-centric, focusing on the interplay between mathematical reasoning, code generation, and logical puzzle solving. The authors employ the Group Relative Policy Optimization (GRPO) algorithm and the Qwen-2.5-7B model family to systematically analyze in-domain and cross-domain effects, the impact of supervised fine-tuning (SFT), curriculum learning, reward design, and language-specific factors. The study provides nuanced insights into how domain-specific and cross-domain data influence both specialized and generalizable reasoning capabilities in LLMs.

Experimental Framework and Methodology

The experimental setup is rigorous, leveraging curated datasets for each domain: DeepScaleR and CountDown for math, CodeR1-12k for code, and Knights-and-Knaves (KK) and Logic Puzzle Baron (LPB) for puzzles. All datasets are normalized in size to control for data scale effects. The RLVR training is conducted using GRPO, which eschews a value model in favor of group-based advantage estimation, and all experiments are run on 8×A100 GPUs. Evaluation spans standard benchmarks: MATH500, AIME24, CountDown, HumanEval, MBPP, KK, and ZebraLogicBench, with strict control over prompt templates and few-shot settings to ensure reproducibility.

Single-Domain RLVR: In-Domain Gains and Cross-Domain Trade-offs

Mathematical Reasoning

RLVR on math data yields substantial in-domain improvements, with MATH500 accuracy increasing by up to 19.6 points and CountDown accuracy by over 75 points for the base model. However, this comes at the cost of degraded code generation performance, indicating a negative transfer from math to code. Notably, math training enhances puzzle-solving ability, suggesting shared logical structures between math and puzzles.

Code Generation

RLVR on code data (CodeR1-12k) significantly boosts HumanEval and MBPP scores for both base and instruct models. The instruct model, benefiting from SFT, consistently outperforms the base model. However, code RLVR has mixed cross-domain effects: it improves OOD performance for instruct models but constrains the base model's flexibility, leading to format errors in non-code tasks.

Figure 1: Performance on HumanEval, demonstrating substantial gains from code RLVR, especially for instruct models.

Logical Puzzles

Puzzle-specific RLVR (KK, LPB) dramatically increases in-domain accuracy (e.g., KK accuracy up to 99.14 for instruct models). Puzzle RLVR also transfers positively to math tasks but has inconsistent or negative effects on code generation, especially when training is dominated by fixed-format puzzle data.

Multi-Domain RLVR: Synergy, Conflict, and Generalization

Dual-Domain Combinations

Pairwise domain combinations reveal non-trivial interactions. Math+Puzzle training yields synergistic improvements in math and puzzle tasks but degrades code performance, highlighting the challenge of balancing structurally divergent domains. Puzzle+Code achieves the best overall dual-domain performance, indicating that code and puzzle data can be mutually reinforcing under certain conditions.

Triple-Domain Training

Incorporating all three domains further improves overall performance and task balance, with the highest aggregate accuracy observed in the triple-domain setting. However, negative transfer persists for highly specialized tasks (e.g., puzzle accuracy drops compared to puzzle-only training), underscoring the specialization-generalization trade-off.

Figure 2: Performance comparison of triple-domain and optimal dual-domain data, illustrating the benefits and trade-offs of broader domain coverage.

Template Consistency and Robustness

A critical finding is the extreme sensitivity of RLVR-trained models to prompt template alignment. Mismatched templates between training and evaluation can cause severe performance degradation across all domains, particularly for complex tasks like KK and ZebraLogicBench. This exposes a significant robustness gap in current RLVR approaches.

Curriculum Learning and Policy Refresh

Curriculum learning, implemented via progressive difficulty stratification in the KK dataset, raises the upper bound of puzzle-solving performance. The introduction of a policy refresh strategy—periodically updating the reference model and resetting the optimizer—further accelerates convergence and improves final accuracy, nearly achieving perfect scores.

Figure 3: Model performance on the KK dataset with different curriculum settings, showing the efficacy of curriculum learning and policy refresh.

Reward Design: Task-Dependent Efficacy

Reward scheme selection is shown to be highly task-dependent. Binary rewards are optimal for tasks with low solution sparsity (e.g., KK), while partial and rescaled rewards are necessary for harder, sparser tasks (e.g., LPB). The study highlights the limitations of response-level partial rewards and suggests that finer-grained, cell-level reward signals are needed for further progress.

Figure 4: Performance on KK, comparing the impact of different reward configurations.

Figure 5: KK-impact of reward configurations (base model shown with dashed lines), illustrating the sensitivity to reward design.

Figure 6: LPB-impact of reward configurations (base model shown with dashed lines), showing the necessity of partial and rescaled rewards for sparse tasks.

Language Sensitivity

RLVR-trained models exhibit a consistent performance gap when trained in Chinese versus English, even with identical data and reward enforcement. This highlights a limitation in current RLVR methods for cross-lingual generalization in complex reasoning tasks.

Implications and Future Directions

The study provides several actionable insights for RLVR-based LLM post-training:

Domain synergy is not universal: While math and puzzle data are mutually supportive, code data can introduce both positive and negative cross-domain effects depending on model initialization and SFT.
Multi-domain training enhances generalization and stability: Broader domain coverage improves aggregate performance and mitigates catastrophic forgetting, but may reduce peak specialization.
Template alignment is critical: Robustness to prompt format remains a major challenge for RLVR-trained models.
Reward design must be task-adaptive: No single reward scheme suffices across all domains; future work should explore more granular, context-sensitive reward mechanisms.
Language remains a bottleneck: Cross-lingual RLVR requires further methodological advances.

Conclusion

This work delivers a systematic, data-centric analysis of multi-domain reasoning in LLMs under RLVR, elucidating the complex interactions between domain-specific and cross-domain training, the importance of SFT, curriculum learning, reward design, and language effects. The findings inform best practices for constructing robust, generalizable reasoning models and highlight open challenges in template robustness, reward granularity, and cross-lingual transfer. Future research should extend this analysis to additional domains (e.g., science, general reasoning), larger model families, and more sophisticated reward and curriculum strategies to further advance the state of multi-domain reasoning in LLMs.