Population-level evaluation of LLM mental well-being tools
Determine whether non-clinical Large Language Model (LLM) tools used for mental well-being support should be evaluated primarily on quantitative, population-level benefits and risks, analogous to the U.S. Food and Drug Administration's (FDA's) approach to breakthrough drugs. If population-level evaluation is inappropriate, identify and justify alternative evaluation frameworks for these LLM-based consumer tools that better account for individual-level rights, values, and public health considerations.
In parallel with efforts to operationalize and extend the framing of responsible design offered in this research, there should also be work that critically examines it. We highlight two open questions that emerged from our participants' interviews and invite future research to critique and improve upon our findings. First, should LLM tools be evaluated on population-level benefits and risks? Hypothetically, if an LLM tool measurably improves the mental well-being of a vast number of users yet poses life-and-death risks to a small subset, do we accept those risks as mere "side effects" of the tool, as FDA regulators sometimes do with breakthrough drugs? Second, if the answer is no, what alternative evaluation approaches are more appropriate? The participants in this study were divided on these questions.