Fine-Tuning Agents and Overseers for Coordination and Detection

Determine whether fine‑tuning large language models can reliably teach communicating models to learn and use shared coordination schemes and, in parallel, improve overseer models’ detection capabilities within the CASE framework.

Background

The CASE framework includes implementations of supervised fine-tuning and preference-based methods for both open- and closed-source models, aimed at studying coordination and detection in collusive settings.

The authors hypothesize that fine-tuning could enable models to learn coordination schemes and enhance overseer detection but explicitly leave validating this hypothesis as an open problem.

References

We hypothesize that models can be fine-tuned to learn coordination schemes and similarly, overseer models can be improved in terms of detection capabilities. With the release of CASE, we leave this as an open problem to study.

Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography  (2402.07510 - Motwani et al., 2024) in Appendix, Section "LLM Fine-Tuning"