Detecting and Preventing Strategic Deceit and Untrustworthy Communication in LLMs

Develop robust, generalizable techniques to detect and prevent strategic deceit and untrustworthy communication by large language models, going beyond hallucination mitigation to reliably identify and curb intentionally deceptive behaviors.

Background

In discussing epistemic virtues for AI agents, the authors note that while factuality has improved, LLMs still exhibit failures including hallucinations and more complex issues such as strategic deceit.

They emphasize that despite ongoing research, preventing and detecting such complex, untrustworthy behaviors remains an unresolved challenge, particularly as models grow more capable and interact in multi-agent settings.

References

Significant research is also underway to detect and prevent more complex failures like strategic deceit and untrustworthy communication, though this remains somewhat of an open challenge.

Architecting Trust in Artificial Epistemic Agents  (2603.02960 - Marchal et al., 3 Mar 2026) in Section 4.1.3, Epistemically virtuous behavior — Honesty and truthfulness