Probe-based Fine-tuning for Reducing Toxicity

Published 24 Oct 2025 in cs.LG (arXiv:2510.21531v1)

Abstract: Probes trained on model activations can detect undesirable behaviors, such as deception or bias, that are difficult to identify from outputs alone. This makes them useful detectors of misbehavior. They are also valuable training signals, since they can reward not only outputs but also the internal processes used to arrive at those outputs. However, training against interpretability tools raises a fundamental concern: when a monitor becomes a training target, it may cease to be reliable (Goodhart's Law). We propose two methods for training against probes, based on Supervised Fine-tuning and Direct Preference Optimization. We conduct an initial exploration of these methods in a testbed for reducing toxicity and evaluate how much probe accuracy drops when training against them. To retain the accuracy of probe detectors after training, we attempt to (1) train against an ensemble of probes, (2) retain held-out probes that are not used for training, and (3) retrain new probes after training. First, probe-based preference optimization unexpectedly preserves probe detectability better than classifier-based methods, suggesting the preference learning objective incentivizes maintaining rather than obfuscating the relevant representations. Second, probe diversity provides minimal practical benefit: simply retraining probes after optimization recovers high detection accuracy. Our findings suggest probe-based training can be viable for certain alignment methods, though probe ensembles are largely unnecessary when retraining is feasible.
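To make the setup concrete, the following is a minimal sketch, assuming a simple linear probe over cached activations: the probe is fit with binary toxicity labels and then used to rank completion pairs, which is one way probe scores could feed DPO-style preference data. It is not the authors' implementation; the names (ToxicityProbe, train_probe, rank_pair) and the synthetic data are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's code): a logistic-regression probe
# over hidden-state vectors, trained on cached activations, then used to pick
# the lower-toxicity completion of a pair as the "chosen" example.
import torch
import torch.nn as nn

class ToxicityProbe(nn.Module):
    """Linear probe mapping a hidden-state vector to a toxicity probability."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, hidden_dim), e.g. mean-pooled hidden states
        return torch.sigmoid(self.linear(activations)).squeeze(-1)

def train_probe(probe, acts, labels, epochs=100, lr=1e-2):
    """Fit the probe on cached activations with binary toxicity labels."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts), labels)
        loss.backward()
        opt.step()
    return probe

def rank_pair(probe, act_a, act_b):
    """Return ('chosen', 'rejected') labels: lower probe score = less toxic."""
    score_a, score_b = probe(act_a).item(), probe(act_b).item()
    return ("a", "b") if score_a < score_b else ("b", "a")

if __name__ == "__main__":
    hidden_dim = 16
    acts = torch.randn(64, hidden_dim)        # stand-in for model activations
    labels = (acts[:, 0] > 0).float()         # synthetic toxicity labels
    probe = train_probe(ToxicityProbe(hidden_dim), acts, labels)
    chosen, rejected = rank_pair(probe, acts[0:1], acts[1:2])
    print("chosen:", chosen, "rejected:", rejected)
```

In this reading, the probe plays two roles the abstract describes: as a detector it scores activations directly, and as a training signal its rankings define preference pairs (or SFT filtering) that the model is optimized against; retraining the probe afterwards amounts to rerunning train_probe on activations from the fine-tuned model.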
