Drivers of cross-architecture differences in LLM calibration
Determine the causal factors behind the observed between-architecture differences in magnitude calibration of social inferences among GPT-4o-mini, Claude Sonnet 4 (20250514), and Gemini 2.5 Pro in the numerical (im)precision evaluation. In particular, assess the potential roles of training objectives, fine-tuning procedures, and response normalization strategies.
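To make "magnitude calibration" concrete, the sketch below computes a simple per-model calibration score: the mean absolute deviation between a model's elicited numerical interpretations of social-inference items and human reference values. The item sentences, reference values, and function names are hypothetical placeholders for illustration only, not the paper's evaluation protocol.

```python
from statistics import mean
from typing import Callable, Dict

# Assumed human reference magnitudes (proportions in [0, 1]) for a few
# illustrative items; placeholder values, not data from the paper.
HUMAN_REFERENCE: Dict[str, float] = {
    "Most of her friends came to the party.": 0.80,
    "A few colleagues disagreed with the plan.": 0.20,
    "Nearly everyone approved of the decision.": 0.95,
}


def magnitude_calibration_error(
    elicit: Callable[[str], float],
    reference: Dict[str, float] = HUMAN_REFERENCE,
) -> float:
    """Mean absolute deviation between a model's elicited numerical estimates
    and the human reference values; lower means better magnitude calibration."""
    return mean(abs(elicit(sentence) - value) for sentence, value in reference.items())


# Dummy elicitation function that always answers 0.5; an actual comparison
# would plug in one API-backed function per model (GPT-4o-mini, Claude
# Sonnet 4, Gemini 2.5 Pro) and compare the resulting error scores.
if __name__ == "__main__":
    print(magnitude_calibration_error(lambda sentence: 0.5))
```

Comparing such scores across architectures would quantify the behavioral differences, but, as the excerpt below notes, it would not by itself identify their causes.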
References
What drives these between-architecture differences remains an open question: since all three models are closed-source, their training objectives, fine-tuning procedures, and response normalization strategies are not publicly available, and we refrain from drawing strong causal conclusions from behavioral differences alone.
— Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting
(2604.02512 - Mühlenbernd, 2 Apr 2026) in Section: Discussion, Structure Without Calibration paragraph