
Inference-Time Reward Hacking in Large Language Models

Published 24 Jun 2025 in cs.LG | (2506.19248v1)

Abstract: A common paradigm to improve the performance of LLMs is optimizing for a reward model. Reward models assign a numerical score to LLM outputs indicating, for example, which response would likely be preferred by a user or is most aligned with safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance -- a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft-Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, hedging offers a tactical choice to avoid placing undue confidence in high but potentially misleading proxy reward signals. We introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter and avoid reward hacking. We demonstrate through experiments that hedging mitigates reward hacking and achieves superior distortion-reward tradeoffs with minimal computational overhead.

Summary

  • The paper proposes HedgeTune, an efficient algorithm that tunes inference-time parameters to the hacking threshold, mitigating reward hacking in LLMs.
  • It introduces Best-of-Poisson (BoP), a sampling method that draws the number of candidates from a Poisson distribution and efficiently approximates the optimal reward-KL tradeoff.
  • The study highlights practical implications for reliable LLM performance by aligning proxy rewards with true objectives in critical applications.

Inference-Time Reward Hacking in LLMs

Introduction

The paper "Inference-Time Reward Hacking in LLMs" addresses the challenge of reward hacking in alignment methods for LLMs. Reward hacking occurs when an AI system optimizes a proxy reward signal that does not perfectly align with the true goal, leading to unexpected or undesired behavior. The study focuses on inference-time methods for aligning LLM outputs with intended objectives, namely Best-of-n (BoN), Soft Best-of-n (SBoN), and the newly proposed Best-of-Poisson (BoP), and characterizes when and how reward hacking can be mitigated (Figure 1).

Figure 1: The mismatch between the proxy and gold rewards manifests through the winner's curse. In an ideal world where we could optimize directly on the gold reward, its value would rise monotonically. However, since we are optimizing for a proxy, the gold reward peaks and then collapses.

Key Concepts and Background

The core challenge explored is the disparity between proxy rewards (an approximation used during optimization) and true, or gold, rewards (the desired model behavior). The paper introduces the notion of a hacking threshold: the point beyond which further optimization against the proxy reward degrades true performance. Over-reliance on proxy rewards beyond this point can produce unreliable AI behavior in critical applications.

Inference-Time Methods and Reward Hacking

Inference-time methods like BoN are widely used because they are efficient and effective: the model samples multiple outputs and selects the best one under a proxy reward model. However, such methods are prone to the "winner's curse": because the proxy and true rewards are misaligned, the selected output can score highly on the proxy reward while performing poorly on the true metric.
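
The winner's curse under BoN can be reproduced in a toy simulation. This is an illustrative sketch, not the paper's experimental setup: responses are abstracted as scalars x ~ N(0, 1), the proxy reward is x itself, and the hypothetical gold reward x - 0.5x² peaks at x = 1, so selecting ever-higher proxy scores eventually hurts the gold reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def gold(x):
    # Hypothetical gold reward: agrees with the proxy for small x,
    # but penalizes extreme values (peaks at x = 1).
    return x - 0.5 * x**2

def bon_gold_reward(n, trials=10000):
    """Average gold reward when BoN picks the best of n proxy scores."""
    xs = rng.standard_normal((trials, n))
    best = xs.max(axis=1)  # BoN selects the highest proxy score
    return gold(best).mean()

for n in [1, 2, 4, 8, 32, 128, 512]:
    print(n, round(bon_gold_reward(n), 3))
```

Running this shows the characteristic pattern described in the paper: the gold reward first rises with n, then declines as BoN overshoots the proxy.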

HedgeTune, the algorithm developed in this paper, efficiently finds the optimal parameter for inference-time alignment methods, preventing overoptimization against the proxy reward by tuning the selected hyperparameter to the hacking threshold (Figure 2).

Figure 2: Three inference-time methods (BoN, SBoN, and BoP) applied to trained proxy rewards. Hacking is effectively mitigated by hedging via λ in SBoN or n in BoN and BoP.
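
The SBoN hedge mentioned above can be sketched as a softened selection rule: instead of always taking the top proxy score among n candidates, select candidate i with probability proportional to exp(λ · proxy_i). Large λ recovers hard BoN; λ = 0 recovers uniform sampling from the reference. The exact parameterization here is an assumption of this sketch, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def sbon_select(proxy_scores, lam):
    # Soft selection: probability of each candidate ∝ exp(lam * score).
    logits = lam * np.asarray(proxy_scores, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(proxy_scores), p=probs)

scores = [0.1, 0.9, 0.4, 0.7]
idx_hard = sbon_select(scores, lam=100.0)  # large lam: behaves like BoN
idx_soft = sbon_select(scores, lam=0.0)    # lam = 0: uniform over candidates
print(idx_hard, idx_soft)
```

Lowering λ is the hedge: it discounts high but potentially misleading proxy scores by keeping the selection closer to the reference distribution.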

Best-of-Poisson: A New Approach

BoP, introduced in this paper, draws a random sample size from a Poisson distribution and then selects the best candidate under the proxy reward, striking a balance between excessive exploitation of the proxy and strict adherence to the reference distribution. It provides a near-exact approximation of the optimal reward-KL distortion tradeoff and serves as a computationally efficient alternative to traditional methods (Figure 3).

Figure 3: KL divergence gap between BoP and the optimal tilted distribution with respect to the reference distribution.
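
The BoP sampling scheme can be sketched in a few lines. Whether the paper uses Poisson(μ) directly or a shifted variant is not specified here; this sketch assumes 1 + Poisson(μ) so that at least one candidate is always drawn.

```python
import numpy as np

rng = np.random.default_rng(3)

def bop_select(sample_fn, proxy_fn, mu):
    # Draw a random number of candidates from a (shifted) Poisson,
    # then return the candidate with the highest proxy reward.
    n = 1 + rng.poisson(mu)
    candidates = [sample_fn() for _ in range(n)]
    scores = [proxy_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Usage with the scalar toy model, where the proxy reward is the sample itself.
x = bop_select(sample_fn=rng.standard_normal, proxy_fn=lambda c: c, mu=8.0)
print(x)
```

Randomizing the sample size smooths the induced output distribution in n, which is what lets BoP closely track the optimal tilted distribution while remaining a simple best-of procedure.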

Practical Implications

The research has significant implications for designing LLMs for real-world applications, where relying solely on proxy rewards can produce unreliable systems. The methods developed in this study, HedgeTune and BoP, offer pathways to mitigate reward hacking by balancing high proxy scores against close alignment with true objectives.

Conclusion

This paper makes a valuable contribution to AI alignment research by theoretically characterizing and empirically demonstrating inference-time reward hacking, and by proposing strategies such as HedgeTune and BoP to mitigate its impact. These methods help AI systems meet their intended objectives more reliably while reducing the risk of unpredictable behavior due to reward hacking.
