Drivers of cross-architecture differences in LLM calibration

Determine the causal factors behind the observed between-architecture differences in magnitude calibration of social inferences among GPT-4o-mini, Claude Sonnet 4 (20250514), and Gemini 2.5 Pro in the numerical (im)precision evaluation, including an assessment of the potential roles of training objectives, fine-tuning procedures, and response normalization strategies.

Background

The study finds a consistent structure–magnitude dissociation across the three frontier models: all reproduce the directional structure of human social inferences, but they diverge in magnitude calibration. GPT-4o-mini aligns closely with human effect sizes, Claude Sonnet 4 shows moderate but prompt-sensitive deviations, and Gemini 2.5 Pro exhibits severe magnitude inflation.
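To make the dissociation concrete, the sketch below shows one way it could be quantified: a correlation between model and human effect sizes captures directional structure, while a through-the-origin regression slope captures magnitude calibration. This is an illustrative assumption about how such a comparison might be run, not the paper's analysis pipeline, and the effect-size values are invented placeholders rather than data from the study.

    # Hypothetical sketch: separating structural alignment from magnitude
    # calibration when comparing model and human effect sizes.
    # All numbers below are placeholder values, not results from the paper.
    import numpy as np
    from scipy import stats

    human_effects = np.array([0.30, 0.55, 0.10, 0.72, 0.41])  # human effect sizes per inference item
    model_effects = np.array([0.58, 1.10, 0.22, 1.45, 0.80])  # one model's effect sizes on the same items

    # Structure: do the model's effects rise and fall with the human ones?
    structure_r, _ = stats.pearsonr(human_effects, model_effects)

    # Magnitude calibration: are the effects on the same scale?
    # A least-squares slope through the origin near 1 indicates calibrated
    # magnitudes; a slope well above 1 indicates inflation of the kind
    # described for Gemini 2.5 Pro.
    slope = np.sum(human_effects * model_effects) / np.sum(human_effects ** 2)

    print(f"structural alignment (Pearson r): {structure_r:.2f}")
    print(f"magnitude ratio (slope through origin): {slope:.2f}")

Under this toy comparison, a model can score a high Pearson r (faithful directional structure) while the slope sits far above 1 (inflated magnitudes), which is exactly the dissociation pattern the study reports.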

Because all three models are closed-source, the authors cannot attribute the behavioral differences to concrete training choices or architectural factors, leading to an explicitly stated open question about what drives these differences.

References

What drives these between-architecture differences remains an open question: since all three models are closed-source, their training objectives, fine-tuning procedures, and response normalization strategies are not publicly available, and we refrain from drawing strong causal conclusions from behavioral differences alone.

Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting (2604.02512 - Mühlenbernd, 2 Apr 2026), Section: Discussion, "Structure Without Calibration" paragraph