Alignment Makes Language Models Normative, Not Descriptive

Published 17 Mar 2026 in cs.CL, cs.AI, and cs.GT | (2603.17218v1)

Abstract: Post-training alignment optimizes LLMs to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

Abstract PDF Upgrade to Chat

Summary

The paper demonstrates that aligned models trade descriptive fidelity for normative bias, leading to a performance drop in multi-round strategic tasks.
Methodologically, the study systematically compares 120 base–aligned model pairs across 10,000 human decisions, showing a base-model win ratio of 9.7:1.
Implications stress that using aligned LLMs for behavioral simulations may misrepresent actual human behavior, urging new paradigms to preserve output diversity.

Alignment-Induced Normative Bias in LLMs: A Systematic Comparison of Base and Aligned LLMs

Motivation and Context

The paper "Alignment Makes LLMs Normative, Not Descriptive" (2603.17218) systematically interrogates the pervasive assumption that post-training alignment methods, such as RLHF or DPO, inherently improve LLMs as proxies for human behavior. While aligned models are frequently deployed both in user-facing applications and research settings to replicate or predict human decisions, this work challenges the foundational equivalence between maximizing human approval and accurately modeling human behavior — an issue of paramount importance for behavioral and social science simulations using LLMs. The authors argue that alignment introduces a normative bias, shifting model predictions from descriptive fidelity (what people actually do) to normative regularity (what is socially approved). This distinction is especially prominent in strategic domains where actual human actions systematically diverge from idealized behavior.

Experimental Framework

The study constructs an extensive empirical testbed: 120 base–aligned model pairs from 23 families spanning over 10,000 real human decisions, derived from multi-round games in economic and psychological benchmarks (bargaining, persuasion, negotiation, repeated matrix games). The evaluation is based on Pearson correlation between LLM-assigned decision probabilities and actual human choices at each decision point, utilizing deterministic logprob extraction rather than text generation to provide a controlled, format-agnostic basis for comparison.

The experimental protocol rigorously controls for prompt structure, model provider, size, and diverse task parameters. Four main prompt variants are used to isolate the impact of alignment versus format artifacts; an additional battery of 14 prompt formulations is examined to check robustness against prompt wording, framing, and role assignment.

Main Results: Empirical Findings

Across all multi-round game families, base models surpass aligned counterparts in predicting human decisions by a margin of 9.7:1 (213 base wins vs. 22 aligned wins, $p < 10^{-40}$ ).

Figure 1: Pearson correlations of base models and human decisions (x-axis) vs. aligned models and human decisions (y-axis) across four game families. Each point represents a same-provider model pair; a strong base advantage is observed, especially below the diagonal.

This effect is consistent:

Across model families: Every major provider (Qwen, Gemma, Llama, Falcon, OLMo, DeepSeek, SmolLM) exhibits a persistent base-model advantage, generally scaling positively with model size.
Across prompt and format variants: Even when both models are evaluated with exactly the same plain-text or chat-formatted prompt, the base advantage remains (e.g., 5.0:1 and 5.3:1, respectively), ruling out prompt artifact confounds.
Throughout prompt content variations: Out of 1,003 prompt-variant comparisons, base models win 959 (95.6%).

Notably, in every configuration parameter of each game family except one, the base-model superiority holds. Robustness checks with different token mass and correlation thresholds confirm the stability of these results.

Scaling Analysis

Figure 2: Median Pearson correlation difference (base minus aligned) by model size; the base advantage grows systematically with increasing model scale.

Model scaling accentuates the base advantage, particularly in bargaining and negotiation, implying that larger base models better preserve the diverse, descriptive patterns required to match empirical human behavior.

Boundary Conditions and Reversal of Effect

However, the pattern reverses in scenarios where human behavior conforms more closely to normative theories:

One-shot textbook games: On a large benchmark (2,416 one-shot $2\times2$ games, 93,000 human choices), aligned models achieve a 4.1:1 win ratio, consistently across all 12 game types. The same reversal is observed for non-strategic binary lotteries (2.2:1 aligned-model win rate).
Early rounds in multi-round games: Aligned models outperform base models at round one, but as intertemporal dynamics like reciprocity and history-dependence build, base models gain the advantage.

Aligned models' predictions in one-shot games also closely track Nash equilibrium probabilities, which themselves moderately correlate with human actions in these settings.

Theoretical Implications and Mechanism

These results elucidate a sharp trade-off induced by current alignment methods. RLHF and DPO, by optimizing models toward human-preferred (normative) outputs, perform an exponential tilting of the base model's distribution that collapses behavioral diversity and under-represents the tail behaviors critical for descriptive accuracy in complex, history-dependent strategic settings. Theoretical work clarifies that this distributional narrowing is not incidental but is inherent to the objective, leading to "preference collapse" and a systematic loss of representation for less-approved, but frequently observed, human behaviors.

The selective reversal of performance — with aligned models excelling in settings best described by normative theories, and base models dominating in descriptive, dynamic, or multi-round domains — demonstrates that alignment does not simply reduce overall capacity (e.g., via catastrophic forgetting), but fundamentally shifts the locus of model prediction on the normative-descriptive axis.

Methodological and Practical Consequences

The findings have significant implications for both research and deployment:

Behavioral and social simulation: Using aligned models as behavioral proxies risks modeling normative or idealized versions of human action, potentially misleading results in economics, policy simulation, and social science, especially in domains where strategic behavior is non-normative.
Choice of model: For tasks requiring faithful representation of observed behavior — particularly in multi-shot, interactive, or dynamic strategic contexts — base models offer substantially greater descriptive accuracy. Aligned models may be preferable for decision-support or user-facing applications emphasizing socially approved patterns.
Method development: There is a need for new alignment paradigms that enhance helpfulness and preference-matching while preserving the rich, empirical diversity of human behavior.

Directions for Future Work

Several open problems remain:

Identification of which specific multi-round mechanisms (opponent modeling, trajectory novelty, temporal credit assignment) account for the persistent base advantage.
Extension to continuous action spaces, larger or closed models, and other strategic domains (e.g., auctions, coalition games).
Algorithms that achieve alignment without sacrificing behavioral entropy or diversity, possibly by separately estimating normative and descriptive distributions or regularizing for entropy preservation.

Conclusion

This study demonstrates a robust trade-off in LLM alignment: while alignment enhances predictive accuracy when human choices are normative (textbook or one-shot games, non-strategic lotteries), it degrades descriptive fidelity in multi-round, dynamic settings where actual behavior diverges from normative ideals. This emerges as an inherent property of reward-maximization alignment and calls for more nuanced strategies when employing LLMs as behavioral models in scientific research or simulation. The choice between base and aligned models should be made with explicit consideration of normative-descriptive trade-offs, and alignment pipelines must be revisited to accommodate applications requiring broad behavioral realism.

Markdown Report Issue