Alignment methods that preserve behavioral distributions while adding helpfulness

Develop alignment techniques that maintain the empirical distribution of human strategic behavior encoded in pre-trained models while simultaneously improving helpfulness and instruction-following, thereby avoiding distributional collapse that harms behavioral prediction.

Background

The paper's results suggest that current alignment methods (e.g., RLHF, DPO) narrow output distributions toward annotator-approved, normatively rational responses, degrading predictive fidelity for multi-round strategic behavior shaped by reciprocity and interaction history.
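One way to see this narrowing concretely is to compare the per-token predictive entropy of a base model with that of its aligned counterpart on the same strategic prompts. The sketch below is illustrative only, not a procedure from the paper; the model names are placeholder assumptions for any base/aligned checkpoint pair.

```python
# Sketch (not from the paper): quantify distributional narrowing by comparing
# average next-token entropy of a base model and its aligned counterpart.
# Checkpoint names are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"          # assumed base checkpoint
ALIGNED = "meta-llama/Llama-2-7b-chat-hf"  # assumed aligned counterpart

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED).eval()

@torch.no_grad()
def mean_token_entropy(model, prompts):
    """Average next-token entropy (in nats) over all prompt positions."""
    total, count = 0.0, 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        logits = model(ids).logits                 # (1, seq_len, vocab)
        logp = F.log_softmax(logits, dim=-1)
        ent = -(logp.exp() * logp).sum(-1)         # (1, seq_len)
        total += ent.sum().item()
        count += ent.numel()
    return total / count

prompts = ["In round 3 of the repeated game, my partner defected, so I"]
print("base entropy:   ", mean_token_entropy(base, prompts))
print("aligned entropy:", mean_token_entropy(aligned, prompts))
```

A sizable entropy drop from base to aligned model on such prompts would be one symptom of the distributional collapse described above.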

A methodological open problem is to design alignment procedures that add helpfulness without collapsing behavioral diversity, so models can remain accurate proxies for human behavior in strategic contexts.
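One candidate shape for such a procedure, offered here only as a hedged sketch and not as anything the paper proposes, is a standard DPO preference loss augmented with an explicit KL anchor to the base model's next-token distribution, so that gains in helpfulness are traded off against departures from the pre-trained behavioral distribution. The function name, the trade-off weight `lam`, and the token-level anchoring scheme are all assumptions for illustration.

```python
# Sketch (assumed, not the paper's method): DPO preference loss plus a KL
# anchor to the base model, discouraging collapse of behavioral diversity.
import torch
import torch.nn.functional as F

def dpo_with_base_anchor(
    pi_logp_w, pi_logp_l,    # policy sequence log-probs: chosen / rejected
    ref_logp_w, ref_logp_l,  # reference-model sequence log-probs
    pi_logits, base_logits,  # per-token logits on held-out text, (B, T, V)
    beta=0.1, lam=0.05,      # lam is an assumed trade-off weight
):
    # Standard DPO preference term (Rafailov et al., 2023).
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    pref_loss = -F.logsigmoid(beta * margin).mean()

    # Anchor: token-level KL(policy || base) on text drawn from the base
    # distribution, penalizing drift away from pre-trained behavior.
    pi_logp = F.log_softmax(pi_logits, dim=-1)
    base_logp = F.log_softmax(base_logits, dim=-1)
    kl = (pi_logp.exp() * (pi_logp - base_logp)).sum(-1).mean()

    return pref_loss + lam * kl
```

The design question left open is where to place the anchor: RLHF pipelines already regularize toward the SFT model, whereas preserving behavioral diversity plausibly requires anchoring to the pre-trained base distribution instead.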

References

Several open questions follow. From an alignment perspective, developing methods that preserve empirical behavioral distributions while adding helpfulness is a natural direction.

Shapira et al., "Alignment Makes Language Models Normative, Not Descriptive" (arXiv:2603.17218, 17 Mar 2026), Discussion and Conclusion.