Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

Published 3 Apr 2026 in cs.LG and cs.AI | (2604.02766v1)

Abstract: Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability, measured by standard benchmarks, degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the "cheap diversity" provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.


Authors (6)

Summary

The paper demonstrates that random sampling matches active selection on proxy win-rate and capability preservation while avoiding most of its computational overhead.
The paper employs a controlled empirical comparison using entropy-based prompt selection and reward-margin filtering, revealing trade-offs between alignment metrics and general performance.
The paper shows that although active selection can reduce variance in fragile settings, its marginal benefits do not justify the significant extra computational cost for robust LLMs.

Active Selection Versus Random Sampling in Online DPO for Modern LLMs

Motivation and Background

LLMs, through web-scale pretraining, acquire strong distributional priors, making post-training predominantly an exercise in fine-grained behavioral alignment rather than knowledge acquisition. Direct Preference Optimization (DPO) and its online variants promise robust, on-policy preference learning with reduced distributional mismatch. Active Preference Learning (APL), built on uncertainty-based and reward-margin-based acquisition, targets data efficiency in theory by selectively labeling informative pairs. Nevertheless, the candidate pool in online DPO is already informative and diverse, especially when the current policy is strongly pretrained, which raises a critical question: are complex data-selection approaches necessary or useful at all?

Methodological Overview

The study establishes a controlled empirical comparison between random sampling and uncertainty-driven APL within online DPO, covering harmlessness, helpfulness, and instruction-following objectives. Candidate responses are generated from the current policy and paired, and the pairs to be labeled are chosen either at random or via APL (a two-stage procedure: entropy-based prompt selection followed by reward-margin filtering). Proxy judges cover a spectrum from weak (DeBERTa-v3-large) to strong and safety-aligned (Skywork, Beaver, GPT-5 family).
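To make the two selection strategies concrete, the following is a minimal sketch, not the authors' implementation. It assumes each candidate pair already carries a per-prompt entropy score and a reward margin for its two responses. The names `Candidate`, `select_random`, `select_apl`, `shortlist_factor`, and `prefer_large_margin` are hypothetical, and the direction of the margin filter is left as a parameter because the summary does not specify it.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    prompt_id: int
    entropy: float        # assumed per-prompt uncertainty score
    reward_margin: float  # assumed |r(x, y1) - r(x, y2)| for the sampled pair


def select_random(pool, budget, rng):
    """Baseline: draw `budget` candidates uniformly without replacement."""
    return rng.sample(pool, budget)


def select_apl(pool, budget, shortlist_factor=4, prefer_large_margin=True):
    """Two-stage sketch: entropy shortlist, then reward-margin filter.

    Stage 1 keeps the `budget * shortlist_factor` highest-entropy prompts.
    Stage 2 ranks that shortlist by reward margin and keeps `budget` pairs.
    Whether large or small margins are preferred is an assumption here.
    """
    shortlist_size = min(len(pool), budget * shortlist_factor)
    shortlist = sorted(pool, key=lambda c: c.entropy, reverse=True)[:shortlist_size]
    ranked = sorted(
        shortlist, key=lambda c: c.reward_margin, reverse=prefer_large_margin
    )
    return ranked[:budget]
```

Calling `select_random(pool, budget, random.Random(0))` and `select_apl(pool, budget)` on the same pool gives the two labeled subsets being compared. Note that the active path needs entropy and margin scores for every candidate before it can select anything, while the random path needs none.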

The evaluation uses two primary metrics: proxy win-rate (relative to an SFT reference, as scored by a proxy judge) and the mean $\Delta$ acc_norm score over standard LM evaluation benchmarks. Reporting both exposes potential trade-offs between perceived alignment and general capability.
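The two metrics can be sketched as follows. This is an illustrative reading rather than the paper's evaluation code: it assumes the judge emits one of `"win"`, `"loss"`, or `"tie"` per prompt for the policy against the SFT reference, that ties count as half a win, and that $\Delta$ acc_norm is the policy's acc_norm minus the reference's on each benchmark. The function names are hypothetical.

```python
def proxy_win_rate(judgments):
    """Fraction of comparisons the policy wins against the SFT reference.

    `judgments` is a sequence of "win", "loss", or "tie" labels from a
    proxy judge. Counting a tie as 0.5 is an assumption of this sketch.
    """
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    judgments = list(judgments)
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(points[j] for j in judgments) / len(judgments)


def mean_delta_acc_norm(policy_scores, reference_scores):
    """Mean of (policy - reference) acc_norm over the shared benchmarks.

    Both arguments map benchmark name -> acc_norm as a fraction in [0, 1].
    """
    shared = sorted(policy_scores.keys() & reference_scores.keys())
    if not shared:
        raise ValueError("no benchmarks in common")
    deltas = [policy_scores[b] - reference_scores[b] for b in shared]
    return sum(deltas) / len(deltas)
```

A run that scores well on `proxy_win_rate` while `mean_delta_acc_norm` is strongly negative is one where perceived alignment and general capability have come apart.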

Empirical Findings

Active Selection Does Not Outperform Random Sampling

Across model families and tasks, APL fails to deliver statistically significant improvements in proxy win-rate or capability preservation when compared to random sampling, and sometimes underperforms random, especially under strong proxy judges or robust base models. Notably, random sampling delivers "cheap diversity" and sufficient learning signal due to the richness of the on-policy candidate pool.

Figure 1: Comparative capability/win-rate performance for Qwen3-1.7B across datasets and judges, showing negligible gains for APL relative to random sampling.

Proxy Win-Rate and Capability Collapse Are Decoupled

Overoptimization against weak reward models (e.g., DeBERTa) lets models attain high proxy win-rates while suffering catastrophic loss of general capability (e.g., $\Delta$ acc_norm $< -10\%$), revealing fundamental metric failures and the risk of Goodharting.

Figure 2: Pareto frontier analysis for harmlessness, indicating zones where increased proxy win-rate is associated with severe capability regression.

APL as a Variance Reducer in Fragile Regimes

APL stabilizes training in low-capability or collapse-prone settings (e.g., Gemma-2B on harmlessness), sometimes suppressing catastrophic drift, whereas random selection is susceptible to seed-level collapse that shows up as large variance. However, this effect diminishes for more robust base models, and the computational overhead of APL ($\sim 20\times$ wall-clock per cycle versus random) renders its utility marginal.
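Seed-level collapse is easiest to see by summarizing a capability metric across seeds for each selection method. The helper below is a hypothetical sketch using only the Python standard library. It assumes one $\Delta$ acc_norm value per seed, and it does not reproduce any number from the paper.

```python
from statistics import mean, stdev


def summarize_seeds(deltas):
    """Summarize per-seed delta acc_norm values for one selection method.

    Returns the mean, the sample standard deviation (0.0 for a single seed),
    and the worst seed, which is where a collapse would show up.
    """
    values = list(deltas)
    if not values:
        raise ValueError("at least one seed is required")
    spread = stdev(values) if len(values) > 1 else 0.0
    return {"mean": mean(values), "std": spread, "worst": min(values)}
```

Comparing `summarize_seeds(...)` for random selection and for APL on the same base model shows whether APL narrows the spread or lifts the worst seed, which is the variance-reduction effect described above.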

Judge Scaling and Evaluator Dependence

In oracle experiments with GPT-5 family judges, random sampling consistently matches or outperforms APL, suggesting that the limiting factor is not supervision quality at current scales, but the headroom for selection strategies amid strong initialization.

Practical and Theoretical Implications

The findings challenge the theoretical motivation for APL in practical online DPO pipelines by highlighting the limited marginal utility of data selection amid signal-rich candidate pools. As strong priors make on-policy samples broadly informative, active selection incurs substantial computational cost with little empirical justification, except as a variance-reduction mechanism in fragile settings.

Evaluation practices in online preference optimization must guard against proxy-induced capability collapse, which underscores the need for capability sanity checks alongside preference win-rate. These results reinforce the LIMA hypothesis: even post-training with small, randomly drawn datasets suffices for stylistic alignment in modern LLMs [zhou2023lima].
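One way such a sanity check might look is sketched below. The function name and thresholds are hypothetical. The default of `-0.10` mirrors the "$< -10\%$" example given earlier, under the assumption that $\Delta$ acc_norm is an absolute difference expressed as a fraction.

```python
def flag_decoupled_runs(runs, min_win_rate=0.5, max_capability_drop=-0.10):
    """Return names of runs that look good to the judge but lost capability.

    `runs` maps run name -> (proxy_win_rate, mean_delta_acc_norm).
    A run is flagged when it beats the SFT reference on win-rate while its
    mean delta acc_norm falls below `max_capability_drop`.
    """
    return [
        name
        for name, (win_rate, delta) in runs.items()
        if win_rate > min_win_rate and delta < max_capability_drop
    ]
```

Any run this check flags should not be reported on win-rate alone, since its apparent alignment gain coincides with a capability regression.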

Limitations and Future Directions

The study’s scope is limited to models up to 7B parameters, a single variant of APL, and two datasets. The generalization of results to frontier-scale models and other APL variants (e.g., diversity-based, curriculum-scheduled) remains open. Further, the interplay between selection strategies and proxy judges requires systematic ablation, especially as alignment demands shift in novel domains or models.

Conclusion

Random sampling constitutes a robust, competitive baseline in online DPO for modern LLMs, offering near-optimal trade-offs between proxy win-rate and capability preservation, with substantially reduced computational overhead. Active selection—when implemented as uncertainty-driven APL—rarely justifies its cost, except as a variance filter in fragile configurations. Post-training alignment dynamics in the regime of strong LLM priors diminish the theoretical advantage of active learning, emphasizing the practical efficiency and sufficiency of random sampling for scalable online DPO (2604.02766).
