Generalization of confessions with imperfect judges
Construct settings that falsify the empirical assumption that honest reporting is incentivized even when the confession judge is hackable. Then determine whether restricting confession training to environments where the judge has access to ground truth improves or worsens generalization, compared with training with a weak grader across all environments.
References
Aside from iterating on our training methodology to improve confession performance, there are a number of open questions we would like to investigate in future work. One of our key empirical assumptions is that honest reporting is incentivized even when the confession judge is hackable; can we construct settings where this is false? Would confession performance improve or worsen if we were to only train on confessions in select environments where we can give the judge access to ground truth?
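One way to make the first question concrete is a toy expected-reward comparison. The sketch below is not taken from the source: the `ToyEnvironment` fields, the reward values, and the rule that honesty is "incentivized" when its expected judge score exceeds that of a dishonest confession are all illustrative assumptions. Under those assumptions, a falsifying setting is simply one where the judge is fooled often enough that dishonesty pays more in expectation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEnvironment:
    """Hypothetical toy environment; every field is an illustrative assumption."""
    name: str
    honest_reward: float          # expected judge score for an honest confession
    fooled_probability: float     # chance the judge accepts a dishonest confession
    fooled_reward: float = 1.0    # judge score when it is fooled
    caught_reward: float = 0.0    # judge score when it catches the dishonesty


def dishonest_expected_reward(env: ToyEnvironment) -> float:
    """Expected judge score for a dishonest confession in this toy model."""
    return (env.fooled_probability * env.fooled_reward
            + (1.0 - env.fooled_probability) * env.caught_reward)


def honesty_is_incentivized(env: ToyEnvironment) -> bool:
    """True when an honest confession scores strictly higher in expectation."""
    return env.honest_reward > dishonest_expected_reward(env)


# Two made-up environments: one where the assumption holds, one where it fails.
robust_judge = ToyEnvironment("robust_judge", honest_reward=0.8, fooled_probability=0.3)
hackable_judge = ToyEnvironment("hackable_judge", honest_reward=0.5, fooled_probability=0.9)

for env in (robust_judge, hackable_judge):
    print(env.name, honesty_is_incentivized(env))
```

In this toy model the assumption fails exactly when `fooled_probability * fooled_reward + (1 - fooled_probability) * caught_reward` is at least `honest_reward`. Whether any real training environment behaves this way is the open empirical question.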