Population-level evaluation of LLM mental well-being tools

Determine whether non-clinical large language model (LLM) tools used for mental well-being support should be evaluated primarily on quantitative, population-level benefits and risks, analogous to the U.S. Food and Drug Administration's (FDA's) approach to breakthrough drugs. If population-level evaluation is inappropriate, identify and justify alternative evaluation frameworks for these LLM-based consumer tools that better account for individual-level rights, values, and public health considerations.

Background

The paper’s interviews revealed disciplinary disagreement about whether large societal benefits from LLM mental well-being tools can justify severe risks to a small subset of users, mirroring how regulators sometimes accept serious side effects for breakthrough drugs. Medical and health policy experts tended to support population-level tradeoffs, while ethics and human-centered design experts expressed concerns about relying solely on population-level calculus.
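To make the disagreement concrete, the toy Python sketch below shows how a purely population-level benefit-risk calculus and an individual-level risk constraint can reach opposite verdicts on the same hypothetical tool. All function names, thresholds, and numbers here are invented for illustration and do not come from the paper.

    # Hypothetical illustration only: invented numbers, no real data.
    def population_net_benefit(n_users, mean_benefit, n_harmed, harm_weight):
        """Aggregate score: total well-being gain minus weighted severe harms."""
        return n_users * mean_benefit - n_harmed * harm_weight

    def passes_individual_risk_veto(p_severe_harm, max_acceptable_risk=1e-6):
        """Rights-based constraint: no user may face more than a threshold
        probability of severe harm, regardless of aggregate benefit."""
        return p_severe_harm <= max_acceptable_risk

    # A hypothetical tool: 10 million users, a small average well-being gain,
    # and 50 users who suffer severe ("life-and-death") harm.
    n_users, n_harmed = 10_000_000, 50
    aggregate = population_net_benefit(n_users, mean_benefit=0.01,
                                       n_harmed=n_harmed, harm_weight=1_000.0)
    p_severe = n_harmed / n_users  # 5e-6 per-user probability of severe harm

    print(aggregate > 0)                          # True: passes the population-level test
    print(passes_individual_risk_veto(p_severe))  # False: fails the individual-risk veto

This divergence is the crux of the open question: the aggregate score is positive, so a population-level framework approves the tool, while a framework that caps per-user severe-harm risk rejects it.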

This open question asks the HCI and responsible AI communities to decide whether population-level evaluation is appropriate for non-clinical LLM tools used for mental well-being and, if not, to develop alternative evaluation approaches that better reflect individual rights and broader public health perspectives.

References

In parallel with efforts to operationalize and extend the framing of responsible design offered in this research, there should also be work that critically examines it. We highlight two open questions already emerging from our participants’ interviews and invite future research to further critique and improve upon our findings. Should we evaluate LLM tools based on population-level benefits and risks? Hypothetically, if an LLM tool can measurably improve the mental well-being of a vast number of users yet poses life-and-death risks to a small subset, do we accept those risks as merely “side effects” of the tool, as FDA regulators sometimes do with breakthrough drugs? Importantly, if the answer is no, what alternative evaluation approaches are more appropriate? The participants in this study were divided on these questions.

Framing Responsible Design of AI Mental Well-Being Support: AI as Primary Care, Nutritional Supplement, or Yoga Instructor? (2602.02740 - Cooper et al., 2 Feb 2026) in Section 6.2, Responsible AI Research Opportunities