Does matched task performance imply similar parameter-space solutions for ES vs RL-based LLM fine-tuning?

Determine whether comparable task performance between large language models (LLMs) fine-tuned with Evolution Strategies (ES) and those fine-tuned with reinforcement learning (RL) methods, such as Group Relative Policy Optimization, implies that the resulting parameter vectors are also comparable in parameter space (for example, occupying similar regions or exhibiting similar geometric properties).

Background

Evolution Strategies are a gradient-free alternative to reinforcement learning methods for post-training LLMs. Group Relative Policy Optimization is a widely used reinforcement learning approach in this setting. Although prior studies showed that Evolution Strategies can sometimes match reinforcement learning methods in downstream accuracy, it has remained unclear whether matched task performance also means that the resulting parameter vectors are similar.

Clarifying whether similar accuracies correspond to similar parameter-space solutions is important for understanding potential differences in geometry, such as update norms, directions, and off-task drift, which in turn bear on stability, transfer, and forgetting in continual learning scenarios.
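
As a rough illustration of the geometric quantities named above, the sketch below compares two fine-tuned parameter vectors against a shared base model: the norm of each update, the cosine similarity between the two update directions, and the Euclidean distance between the two solutions. It assumes the parameters have been flattened into plain lists of floats. The function names (l2_norm, compare_updates), the variable names, and the toy vectors are hypothetical and are not taken from the referenced work. Off-task drift is not computed here, since it would require evaluating the models on other tasks.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compare_updates(theta_base, theta_es, theta_rl):
    """Compare two fine-tuned parameter vectors relative to a shared base.

    All three arguments are assumed to be flat lists of equal length.
    """
    # Updates are measured as displacements from the shared base model.
    delta_es = [a - b for a, b in zip(theta_es, theta_base)]
    delta_rl = [a - b for a, b in zip(theta_rl, theta_base)]

    norm_es = l2_norm(delta_es)
    norm_rl = l2_norm(delta_rl)

    # Cosine similarity is undefined if either update is exactly zero.
    denom = norm_es * norm_rl
    if denom > 0:
        dot = sum(a * b for a, b in zip(delta_es, delta_rl))
        cosine = dot / denom
    else:
        cosine = float("nan")

    # Distance between the two fine-tuned solutions themselves.
    distance = l2_norm([a - b for a, b in zip(theta_es, theta_rl)])

    return {
        "norm_es": norm_es,
        "norm_rl": norm_rl,
        "cosine": cosine,
        "distance": distance,
    }


# Toy vectors (made up for illustration): a large update and a small,
# orthogonal update from the same starting point.
theta_base = [0.0, 0.0, 0.0, 0.0]
theta_es = [3.0, 0.0, 4.0, 0.0]
theta_rl = [0.0, 1.0, 0.0, 0.0]

metrics = compare_updates(theta_base, theta_es, theta_rl)
print(metrics)
```

In this toy case the two updates differ in norm (5 versus 1) and are orthogonal (cosine similarity 0). Two models could show this pattern and still reach the same task accuracy, which is exactly the situation the question asks about.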

References

Evolution Strategies (ES) have emerged as a scalable gradient-free alternative to reinforcement learning based LLM fine-tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space.