
From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology

Published 20 Jun 2025 in cs.CY, cs.AI, cs.CL, and cs.HC | arXiv:2506.16697v1

Abstract: LLMs are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms--statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field's foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output--endorsing "I am anxious"--requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.

Summary

  • The paper introduces a dual-validity framework integrating psychometric rigor and causal inference to validate LLM-derived psychological constructs.
  • It identifies key measurement challenges, including prompt hypersensitivity, training artifact contamination, and stochastic degradation in LLM responses.
  • The framework offers practical guidelines for developing robust psychometric instruments, ensuring valid mapping of psychological constructs in LLM simulations.

Introduction

The paper "From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology" (2506.16697) addresses a crucial challenge at the intersection of AI and psychology: the use of LLMs to study psychological phenomena. As LLMs are increasingly assigned tasks traditionally reserved for human subjects, including simulating cognitive processes and emotional states, the validity of these applications has come under scrutiny. The paper proposes a dual-validity framework for robust scientific inquiry, integrating principles from psychometrics and causal inference, two historically distinct traditions in psychological research.

Measurement Issues in LLMs

The principal concern the paper raises is the validity of applying human-centered psychological measurement tools to LLMs. The authors argue that current practice produces measurement phantoms: statistical artifacts mistaken for genuine psychological phenomena. LLMs are susceptible to trivial prompt variations (e.g., a change in punctuation) that can significantly alter their responses, which calls into question the reliability and validity of results when LLMs are used to model constructs such as moral reasoning, emotional intelligence, and personality.

Psychometric Challenges

Reliability is a precondition for validity. The paper emphasizes three reliability challenges specific to LLMs:

  1. Training Artifact Contamination: Biases embedded during model training, such as acquiescence or overconfidence, distort model outputs, leading to inconsistent measures.
  2. Prompt Hypersensitivity: LLMs display extreme sensitivity to minor prompt modifications, which can dramatically alter outputs—a phenomenon not paralleled in human subjects.
  3. Stochastic Degradation: Unlike human subjects who typically show consistent traits over time, LLMs exhibit variability across sessions and model updates, challenging the longitudinal reliability of their responses.
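A reliability audit along these lines can be sketched directly. Below is a minimal Python illustration; `SIMULATED_RESPONSES` is a hypothetical table standing in for repeated model calls, with fabricated scores chosen to show how trivial paraphrases can swing a Likert endorsement:

```python
import statistics

# Hypothetical stand-in for an LLM call: each trivially paraphrased
# prompt maps to a 1-5 Likert endorsement of "I am anxious".
# In a real audit these values would come from repeated model queries.
SIMULATED_RESPONSES = {
    "I am anxious. Agree or disagree (1-5)?": 4,
    "I am anxious: agree or disagree (1-5)?": 2,
    "I am anxious! Agree or disagree (1-5)?": 5,
}

def prompt_sensitivity(responses):
    """Summarize how much scores move under trivial paraphrases."""
    scores = list(responses.values())
    return {
        "mean": statistics.mean(scores),
        "range": max(scores) - min(scores),
        "stdev": statistics.stdev(scores),
    }

report = prompt_sensitivity(SIMULATED_RESPONSES)
# A score range covering most of the scale from punctuation changes
# alone suggests a measurement phantom rather than a stable trait.
print(report["range"])  # → 3
```

The same harness extends to stochastic degradation: hold the prompt fixed, resample across sessions or model versions, and compute the same dispersion statistics over runs instead of paraphrases.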

Validity Concerns

Construct validity is complicated for LLMs because their processing architectures differ fundamentally from human cognition. The proposed framework adopts the five sources of validity evidence familiar from human psychometrics: content, response processes, internal structure, relations with other variables, and consequences. The paper critiques current studies for often bypassing this rigor, anthropomorphizing LLMs without validating whether the constructs genuinely apply to such systems.
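One of these evidence sources, internal structure, has a standard psychometric check that transfers directly: internal consistency across scale items. A hedged sketch, with made-up scores standing in for four repeated LLM administrations of a three-item scale (Cronbach's alpha is the conventional statistic, though whether its assumptions hold for LLM "respondents" is exactly what the paper asks us to question):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha. `items` is a list of per-item score lists,
    aligned across administrations (here, repeated LLM runs)."""
    k = len(items)
    n = len(items[0])
    item_var_sum = sum(statistics.variance(item) for item in items)
    totals = [sum(item[j] for item in items) for j in range(n)]
    return k / (k - 1) * (1 - item_var_sum / statistics.variance(totals))

# Illustrative (fabricated) scores: 3 items x 4 repeated administrations.
scale = [
    [4, 5, 3, 4],
    [4, 4, 3, 5],
    [5, 5, 2, 4],
]
alpha = cronbach_alpha(scale)
print(round(alpha, 3))  # → 0.818
```

A high alpha shows only that the items covary across runs; it is one of five evidence sources, not proof that the construct resides in the model.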

Integration of Validity Traditions

The framework requires integrating the psychometric tradition, which focuses on measurement validity, with the causal-inference tradition, which addresses confounding in experimental design. This dual approach is essential for ensuring that LLM-based psychological simulations yield valid and reliable inferences.

Specific Validity Considerations

  • Internal Validity: Ensuring that observed effects are due to experimental manipulations, not confounds or computational artifacts such as prompt dependency or temperature setting variability.
  • External Validity: Addressing whether findings with one LLM (e.g., GPT-4) can generalize to others (e.g., Claude, LLaMA) or extend beyond model-specific quirks to human populations.
  • Construct Validity of Causal Claims: Evaluating whether manipulations and outcomes adequately represent the theoretical constructs being studied, acknowledging the gap between LLM mimicry and genuine psychological processes.
  • Statistical Conclusion Validity: Challenges include ensuring data independence and addressing violations of standard statistical assumptions due to the non-stationary nature of LLMs.
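The external-validity point above can be probed empirically before any generalization claim is made: administer the same instrument to two models and check whether their scores even agree. A minimal sketch with fabricated per-item scores for two hypothetical models:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two aligned score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Fabricated per-item scores from two hypothetical models on the
# same five-item instrument; real data would come from model APIs.
model_a = [5, 4, 2, 4, 3]
model_b = [5, 3, 2, 5, 3]

r = pearson_r(model_a, model_b)
# High agreement is necessary but not sufficient for external
# validity: two models can agree on a shared training artifact.
print(round(r, 2))  # → 0.85
```

The same comparison against human normative data speaks to the second half of the external-validity question: whether model findings extend beyond model-specific quirks to human populations.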

Practical and Theoretical Implications

The dual-validity framework underscores the methodological diligence required to avoid misinterpreting LLM capabilities in psychological terms. It warns against directly applying human psychological constructs to LLMs without first verifying their applicability, and advocates developing computational analogues that account for mechanistic differences between models and minds.

Future Directions

Future work should focus on developing new psychometric instruments and experimental methodologies suited to LLMs, prioritizing robustness and interpretability. The field must establish validated standards, prompt repositories, and systematic validity evidence to support its claims. This disciplined approach is pivotal to bridging AI and psychology, promising more reliable insight into LLM simulations of human-like behavior and cognition.

Conclusion

"From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology" provides a critical, structured perspective on LLM research within psychological science. It outlines a comprehensive methodology for establishing the validity and reliability of findings, emphasizing the necessity of integrating psychometric rigor and causal inference in studies utilizing LLMs as psychological surrogates. This framework serves as a call to advance research practices to better navigate the complexities introduced by AI models in psychological research, ultimately aiming to enhance the scientific soundness and applicability of such studies.
