- The paper demonstrates that random sampling yields competitive proxy win-rates and preserves model capability while greatly reducing computational overhead compared to active selection.
- The paper employs a controlled empirical comparison using entropy-based prompt selection and reward-margin filtering, revealing trade-offs between alignment metrics and general performance.
- The paper shows that although active selection can reduce variance in fragile settings, its marginal benefits do not justify the significant extra computational cost for robust LLMs.
Active Selection Versus Random Sampling in Online DPO for Modern LLMs
Motivation and Background
LLMs, through web-scale pretraining, acquire strong distributional priors, making post-training predominantly an exercise in fine-grained behavioral alignment rather than knowledge acquisition. Direct Preference Optimization (DPO) and its online variants promise robust, on-policy preference learning with reduced distributional mismatch. Active Preference Learning (APL), built on uncertainty- and reward-margin-based acquisition, targets data efficiency in theory by selectively labeling informative pairs. Nevertheless, the inherent informativeness and diversity of the candidate pool in online DPO, especially when current policies are strongly pretrained, raise a critical question about whether complex data selection approaches are necessary or useful.
Methodological Overview
The study establishes a controlled empirical comparison between random sampling and uncertainty-driven APL within online DPO, covering harmlessness, helpfulness, and instruction-following objectives. Candidate responses are generated from the current policy and paired, and the pairs to be labeled are chosen either at random or via APL, a two-stage procedure consisting of entropy-based prompt selection followed by reward-margin filtering. Proxy judges cover a spectrum from weak (DeBERTa-v3-large) to strong and safety-aligned (Skywork, Beaver, GPT-5 family).
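The two acquisition strategies compared above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate fields (`entropy`, `margin`), the function names, and the `prompt_keep` parameter are hypothetical, and because this summary does not say whether margin filtering keeps the largest or the smallest margins, the direction is exposed as a flag.

```python
import math
import random
from typing import Dict, List, Sequence


def mean_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / max(len(token_dists), 1)


def select_random(pool: List[Dict], budget: int, seed: int = 0) -> List[Dict]:
    """Baseline: draw `budget` candidate pairs uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(pool, min(budget, len(pool)))


def select_apl(
    pool: List[Dict],
    budget: int,
    prompt_keep: int,
    largest_margin: bool = True,  # assumption: direction is not given in the summary
) -> List[Dict]:
    """Two-stage selection: entropy-based prompt filter, then reward-margin ranking."""
    # Stage 1: keep the candidates whose prompts have the highest predictive entropy.
    uncertain = sorted(pool, key=lambda c: c["entropy"], reverse=True)[:prompt_keep]
    # Stage 2: rank the survivors by absolute reward margin and keep `budget` of them.
    ranked = sorted(uncertain, key=lambda c: abs(c["margin"]), reverse=largest_margin)
    return ranked[:budget]
```

Note that `select_apl` needs an entropy estimate and a reward margin for every candidate before it can choose anything, whereas `select_random` needs neither; that extra scoring is where the overhead of active selection comes from.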
The evaluation consists of two primary metrics: proxy win-rate (relative to an SFT reference, as scored by a proxy judge) and the mean Δ acc_norm score over standard LM evaluation benchmarks, exposing potential trade-offs between perceived alignment and general capability.
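As a rough sketch of how these two metrics might be computed: the label strings, the half-credit treatment of ties, and the choice of the reference model as the baseline for Δ acc_norm are assumptions made for illustration and are not specified in this summary.

```python
from statistics import mean
from typing import Dict, List


def proxy_win_rate(judgments: List[str]) -> float:
    """Share of comparisons the policy wins against the SFT reference.

    Each judgment is "win", "loss", or "tie" from the proxy judge's point of view.
    Assumption: a tie counts as half a win.
    """
    if not judgments:
        raise ValueError("no judgments to aggregate")
    credit = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(credit[j] for j in judgments) / len(judgments)


def mean_delta_acc_norm(
    policy_scores: Dict[str, float],
    reference_scores: Dict[str, float],
) -> float:
    """Mean per-benchmark change in acc_norm (policy minus reference)."""
    shared = sorted(policy_scores.keys() & reference_scores.keys())
    if not shared:
        raise ValueError("no benchmarks in common")
    return mean(policy_scores[b] - reference_scores[b] for b in shared)
```

Reporting the two numbers side by side matters because they can move in opposite directions: a run can raise `proxy_win_rate` while `mean_delta_acc_norm` turns sharply negative.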
Empirical Findings
Active Selection Does Not Outperform Random Sampling
Across model families and tasks, APL fails to deliver statistically significant improvements in proxy win-rate or capability preservation when compared to random sampling, and it sometimes underperforms random sampling, especially under strong proxy judges or robust base models. Notably, random sampling delivers "cheap diversity" and sufficient learning signal due to the richness of the on-policy candidate pool.
Figure 1: Comparative capability/win-rate performance for Qwen3-1.7B across datasets and judges, showing negligible gains for APL relative to random sampling.
Proxy Win-Rate and Capability Collapse Are Decoupled
Overoptimization against weak reward models (e.g., DeBERTa) enables models to attain high proxy win-rates while suffering catastrophic general capability loss (e.g., Δ acc_norm < −10%), revealing fundamental metric failures and the risk of Goodharting.
Figure 2: Pareto frontier analysis for harmlessness, indicating zones where increased proxy win-rate is associated with severe capability regression.
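A Pareto frontier over (proxy win-rate, Δ acc_norm) pairs, as in the analysis above, can be extracted with a single sorted pass. The sketch below treats both axes as "higher is better" and expresses Δ acc_norm as a fraction; the function name and the example numbers in the usage note are illustrative and are not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (proxy win-rate, delta acc_norm)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, ordered by decreasing win-rate.

    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier: List[Point] = []
    best_delta = float("-inf")
    # Visit points from highest win-rate down; ties are broken by higher delta first.
    for win_rate, delta in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        # Keep a point only if it beats every higher-win-rate point on capability.
        if delta > best_delta:
            frontier.append((win_rate, delta))
            best_delta = delta
    return frontier
```

For example, with the made-up runs `[(0.9, -0.12), (0.7, -0.02), (0.6, -0.05), (0.55, 0.01)]`, the run at `(0.6, -0.05)` is dominated by `(0.7, -0.02)` and drops out, while `(0.9, -0.12)` stays on the frontier despite falling below a −10% capability threshold. That is the kind of high win-rate, low capability zone the figure highlights.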
APL as a Variance Reducer in Fragile Regimes
APL demonstrates stabilization in low-capability or collapse-prone settings (e.g., Gemma-2B on harmlessness), sometimes suppressing catastrophic drift, whereas random selection is susceptible to seed-level collapse, which shows up as large variance across seeds. However, this effect diminishes for more robust base models, and the computational overhead of APL (~20× wall-clock per cycle versus random) renders its utility marginal.
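Seed-level collapse of this kind can be surfaced by summarizing each strategy across seeds rather than reporting a single run. The helper below is a hypothetical sketch: the function name is invented, and the −0.10 collapse threshold simply mirrors the −10% example quoted earlier.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_seeds(
    delta_by_seed: List[float],
    collapse_threshold: float = -0.10,  # assumption: mirrors the -10% example
) -> Dict[str, float]:
    """Mean, sample standard deviation, and collapse count across random seeds."""
    if not delta_by_seed:
        raise ValueError("need at least one seed")
    return {
        "mean": mean(delta_by_seed),
        # stdev needs two or more values; report 0.0 for a single seed.
        "std": stdev(delta_by_seed) if len(delta_by_seed) > 1 else 0.0,
        "collapsed_seeds": sum(d < collapse_threshold for d in delta_by_seed),
    }
```

A strategy with a similar mean but a much larger `std` and a nonzero `collapsed_seeds` count is the pattern described above for random selection in fragile configurations.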
Judge Scaling and Evaluator Dependence
In oracle experiments with GPT-5 family judges, random sampling consistently matches or outperforms APL, suggesting that the limiting factor is not supervision quality at current scales, but the headroom for selection strategies amid strong initialization.
Practical and Theoretical Implications
The findings challenge the theoretical motivation for APL in practical online DPO pipelines by highlighting the limited marginal utility of data selection amid signal-rich candidate pools. As strong priors make on-policy samples broadly informative, active selection incurs substantial computational cost with little empirical justification, except as a variance-reduction mechanism in fragile settings.
Evaluation practices in online preference optimization must be wary of proxy-induced capability collapse, underscoring the need for capability sanity checks alongside preference win-rate. These results reinforce the LIMA hypothesis: even post-training with small, randomly drawn datasets suffices for stylistic alignment in modern LLMs [zhou2023lima].
Limitations and Future Directions
The study's scope is limited to models up to 7B parameters, a single variant of APL, and two datasets. The generalization of results to frontier-scale models and other APL variants (e.g., diversity-based, curriculum-scheduled) remains open. Further, the interplay between selection strategies and proxy judges requires systematic ablation, especially as alignment demands shift in novel domains or models.
Conclusion
Random sampling constitutes a robust, competitive baseline in online DPO for modern LLMs, offering near-optimal trade-offs between proxy win-rate and capability preservation, with substantially reduced computational overhead. Active selection, when implemented as uncertainty-driven APL, rarely justifies its cost, except as a variance filter in fragile configurations. Post-training alignment dynamics in the regime of strong LLM priors diminish the theoretical advantage of active learning, emphasizing the practical efficiency and sufficiency of random sampling for scalable online DPO (2604.02766).