Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

Published 29 Jun 2025 in cs.LG and cs.AI | (2507.00075v3)

Abstract: Self-improvement is among the most prominent techniques within the realm of LLMs (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM's solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experiment results. We empirically validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.

Abstract PDF Upgrade to Chat

Summary

The paper presents a theoretical framework that quantifies the solver-verifier gap as the main driver of LLM self-improvement.
Experiments across datasets confirm that both solver and verifier capabilities improve exponentially, validating the differential equation model.
The study also explores cross-improvement, showing that external data enhances verifier performance and overall training efficiency.

Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

Introduction

This paper proposes a framework to model the self-improvement training dynamics of LLMs using a solver-verifier gap. It aims to understand the mechanisms driving performance improvements without external data and to quantify the limits of self-improvement. The concept is based on the performance differences between a model's solver and verifier capabilities, which are theorized to drive self-improvement.

Self-Improvement Framework

The solver capability refers to the generation of responses by an LLM, while the verifier capability involves the LLM evaluating multiple generated responses to determine their quality. This paper models the training dynamics using differential equations inspired by potential energy principles. The equations show that capabilities follow an exponential convergence law, where solver and verifier capabilities improve at rates defined by a solver-verifier gap. This framework enables the prediction of a model's ultimate capabilities based on initial conditions and rates of change.

Figure 1: Illustration of the theoretical results. The training dynamics are potentially empowered by the solver-verifier gap under the theoretical framework.

Experiments and Verification

The paper validates its theoretical framework through experiments on various LLMs and datasets. These experiments demonstrate that the solver-verifier gap concept effectively models self-improvement dynamics. Metrics such as accuracy and uncertainty during improvement phases confirm the exponential behavior predicted by the theoretical model.

Figure 2: Accuracy and uncertainty during the self-improvement of the Phi-4-mini model on the Math and GSM8k datasets using the QE method.

Cross-Improvement with External Data

Beyond self-improvement, this framework also examines cross-improvement, where external data supplement the training process. Cross-improvement is modeled similarly, showing that external data influence the verifier capability and thereby enhance the overall improvement dynamics. The analysis indicates that the stage at which external data is used is less critical than its overall influence on the verifier's capability.

Figure 3: A diagram of Cross Improvement.

Practical Implications and Future Directions

The findings carry significant implications for both theoretical understanding and practical applications of self-improvement in LLMs. By illuminating the role of the solver-verifier gap, this work provides a foundational framework that can be used to optimize self-improvement training strategies. Future research directions include exploring adaptive methods for self-improvement, leveraging more sophisticated external data integration, and understanding the underlying neural mechanisms in more detail.

Conclusion

This paper presents a theoretical framework for understanding and modeling self-improvement dynamics in LLMs through solver-verifier gaps. The validation through empirical data demonstrates its practical applicability, and the extension to cross-improvement offers insights into boosting LLM capabilities with limited external data. These findings hold promise for enhancing the efficiency and effectiveness of LLM training processes.