Disentangling fuzzing ambiguity: under-elicitation versus sandbagging

Ascertain whether the ambiguity observed in noise-injection (fuzzing) detection of sandbagging arises primarily because fuzzing tracks benign under-elicitation rather than strategic conditional underperformance, using systematic experiments that separate the two effects.

Background

Taylor et al. evaluate a fuzzing approach that injects low-rank noise (via LoRA) into model weights and measures an improvement ratio to test whether performance uplift reveals hidden capabilities. Some tasks show uplift, but the signal is inconsistent and appears in both sandbagging and non-sandbagging settings.
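As a rough illustration of the mechanism described above, the following Python sketch perturbs a single weight matrix with LoRA-style low-rank Gaussian noise and computes an improvement ratio. It is a toy under stated assumptions, not the authors' implementation. The function names, the `rank`, `sigma`, and `n_seeds` parameters, and the definition of the improvement ratio as best perturbed score divided by unperturbed score are all hypothetical choices made for this example. The source does not specify them.

```python
import numpy as np


def low_rank_noise(shape, rank, sigma, rng):
    """Return a perturbation B @ A of rank at most `rank` (LoRA-style factors)."""
    d_out, d_in = shape
    b = rng.normal(0.0, sigma, size=(d_out, rank))
    a = rng.normal(0.0, sigma, size=(rank, d_in))
    return b @ a


def improvement_ratio(baseline_score, perturbed_scores):
    """Best perturbed score over the unperturbed score (assumed definition)."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return max(perturbed_scores) / baseline_score


def fuzz(weights, evaluate, rank=4, sigma=0.05, n_seeds=8):
    """Score the model under several noise draws and report the ratio.

    `evaluate` is a hypothetical callable mapping a weight matrix to a
    positive task score; a real setup would run a benchmark instead.
    """
    baseline = evaluate(weights)
    scores = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        noisy = weights + low_rank_noise(weights.shape, rank, sigma, rng)
        scores.append(evaluate(noisy))
    return improvement_ratio(baseline, scores)
```

Under this definition a ratio above 1 means at least one noise draw scored higher than the unperturbed model. On its own that does not say why the unperturbed model scored lower, which is the ambiguity this problem asks about.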

The paper notes that the ambiguity may stem from fuzzing tracking general under-elicitation rather than deliberate sandbagging. The authors state that this attribution is unclear from their experiments and likely requires follow-up, which leaves open whether a fuzzing uplift can be reliably interpreted as evidence of sandbagging rather than of benign under-elicitation.

References

Fuzzing results were more ambiguous than we hoped, possibly due to tracking under-elicitation though this is unclear from our experiments and likely requires follow-up.

— Auditing Games for Sandbagging (2512.07810 - Taylor et al., 8 Dec 2025) in Section 4, Blue Team — Detection Results, Post-Game Investigations
