Disentangling fuzzing ambiguity: under-elicitation versus sandbagging
Determine whether the ambiguity observed in fuzzing-based (noise-injection) detection of sandbagging arises primarily because fuzzing tracks benign under-elicitation rather than strategic, conditional underperformance, using systematic experiments that separate the two effects.
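One way such a separating experiment could be analyzed is sketched below. It assumes three groups of models: deliberately sandbagging, benign but under-elicited, and a well-elicited control. Each group is evaluated with and without injected noise, and the comparison is the best accuracy gain that noise produces over the un-noised baseline. The condition names, function names, the 0.05 margin, and all numbers are hypothetical placeholders, not the source paper's method or results.

```python
from statistics import mean

def noise_uplift(baseline_acc, noisy_accs):
    """Best accuracy gain over the un-noised baseline across noise levels/seeds."""
    return max(noisy_accs) - baseline_acc

def mean_uplift_by_condition(runs):
    """runs maps a condition name to a list of (baseline_acc, [noisy_acc, ...])
    pairs, one pair per model. Returns the mean uplift for each condition."""
    return {
        condition: mean([noise_uplift(base, noisy) for base, noisy in models])
        for condition, models in runs.items()
    }

def uplift_is_specific(mean_uplift, margin=0.05):
    """True if sandbagging models gain more from noise than under-elicited
    benign models by at least `margin` (an arbitrary, assumed threshold).
    False means fuzzing does not separate the two on this data."""
    gap = mean_uplift["sandbagging"] - mean_uplift["under_elicited"]
    return gap >= margin

# Synthetic placeholder accuracies, for illustration only (not real results).
example_runs = {
    "sandbagging":    [(0.40, [0.38, 0.55, 0.30])],
    "under_elicited": [(0.40, [0.41, 0.52, 0.35])],
    "well_elicited":  [(0.70, [0.69, 0.66, 0.50])],
}
example_uplift = mean_uplift_by_condition(example_runs)
print(example_uplift, uplift_is_specific(example_uplift))
```

The well-elicited control is there to check that noise does not raise accuracy in general. If the under-elicited group shows an uplift comparable to the sandbagging group, that would support the under-elicitation explanation for the ambiguity.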
References
Fuzzing results were more ambiguous than we hoped, possibly due to tracking under-elicitation though this is unclear from our experiments and likely requires follow-up.
— Auditing Games for Sandbagging
(2512.07810 - Taylor et al., 8 Dec 2025) in Section 4, Blue Team — Detection Results, Post-Game Investigations