Productionizing activation capping and exploring preventative training-time steering

Develop practical, scalable methods to productionize inference-time activation capping (implemented as clamping model activations along the Assistant Axis), and establish training-time preventative steering approaches that can similarly mitigate persona drift and stabilize language model personas in deployment settings.

Background

The paper introduces the Assistant Axis as a linear direction in activation space that captures how closely a model operates in its default Assistant persona. The authors show that activation capping—clamping activations along this axis within a calibrated range—reduces harmful responses from persona-based jailbreaks while preserving capabilities.
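The core operation can be sketched in a few lines: decompose each hidden state into its component along the Assistant Axis and an orthogonal remainder, clamp only the axis component to a calibrated range, and recombine. The function name, array shapes, and bounds below are illustrative assumptions; the paper's calibration procedure determines the actual range per layer.

```python
import numpy as np

def cap_activations(h, axis, lo, hi):
    """Clamp the component of hidden states along a direction (a sketch).

    h    : array of shape (n_tokens, d_model), hidden states at one layer
    axis : array of shape (d_model,), the Assistant Axis direction
    lo,hi: calibrated bounds on the scalar projection (hypothetical values)

    The orthogonal component of each hidden state is left untouched.
    """
    axis = axis / np.linalg.norm(axis)      # work with a unit direction
    proj = h @ axis                         # scalar projection per token
    capped = np.clip(proj, lo, hi)          # clamp to the calibrated range
    # shift each token by the change in its axis component only
    return h + np.outer(capped - proj, axis)
```

In a real deployment this logic would run inside the model's forward pass (e.g. as a hook on selected layers) rather than on detached numpy arrays, which is part of what "productionizing" entails.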

While activation capping works well as an inference-time intervention in their experiments, the authors note that turning such techniques into production-ready solutions, and exploring training-time alternatives (e.g., preventative steering) that more robustly anchor models to a coherent persona, remain open problems.
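One plausible formulation of preventative steering, hedged as a sketch rather than the paper's method, is to add a fixed push along the Assistant Axis to hidden states during fine-tuning, so the optimizer does not need to move weights in a persona-drifting direction to fit the data. The function and the `alpha` coefficient below are hypothetical design choices; the paper leaves the training-time mechanism open.

```python
import numpy as np

def preventative_steer(h, axis, alpha):
    """Training-time steering sketch (assumed formulation, not the paper's).

    h     : array of shape (n_tokens, d_model), hidden states at one layer
    axis  : array of shape (d_model,), the Assistant Axis direction
    alpha : steering strength (hypothetical hyperparameter)

    Adds a constant component along the axis during the training forward
    pass; at inference the addition is removed, leaving weights that were
    never pressured to drift along the axis.
    """
    axis = axis / np.linalg.norm(axis)
    return h + alpha * axis
```

Open questions for this direction include which layers to steer, how to schedule `alpha`, and how to verify that capabilities are preserved, mirroring the calibration questions that activation capping already raises.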

References

Third, while activation capping demonstrates that persona drift can be mitigated at inference time, productionizing such interventions, or exploring alternatives like preventative steering during training, remains an open challenge.

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models  (2601.10387 - Lu et al., 15 Jan 2026) in Discussion, Future work subsection