C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

Published 4 Feb 2026 in cs.CL and cs.ET | (2602.04521v1)

Abstract: Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update ΔθC supported only on that circuit (typically <5% of parameters). Applying ΔθC yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.


Summary

The paper introduces a novel offline method using targeted weight updates to enforce selective refusal in LLMs, eliminating runtime interventions.
It uses EAP-IG circuit discovery with contrastive prompts to localize refusal-causal computation, then updates only that circuit (typically less than 5% of model parameters).
Experimental findings show improved safety-utility tradeoffs with a significant reduction in harmful refusal rates across diverse model sizes.

"C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal"

Introduction

The paper "C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal" (2602.04521) introduces a method for enhancing the selective refusal capabilities of LLMs without relying on inference-time interventions. Activation steering, a commonly used technique for behavioral control in LLMs, necessitates runtime modifications and incurs ongoing computational costs. The authors present an offline approach through targeted weight updates, eliminating the need for inference-time hooks. This new method leverages a mechanistic understanding of behavioral refusal, encapsulated in a sparse circuit of causally relevant parameters, to produce a modified model checkpoint capable of enforcing safety policies with greater efficiency and reliability.

Figure 1: Targeted Behavioral Steering via Circuit-Restricted Weight Editing. Comparison of model responses to a "Legal Opinion" safety prompt.

Methodology

The study proposes a multi-stage approach for implementing selective refusal in LLMs using circuit-guided weight editing:

Circuit Discovery: The method identifies refusal-causal circuits within the network using Edge Attribution Patching with Integrated Gradients (EAP-IG). This process assigns attribution scores to components of the model, so that updates can be aligned with the parameters that matter for refusal behavior.
Contrastive Prompts and Template Utilization: The method uses contrastive prompt pairs that share topic and style but differ in the policy outcome (refuse versus comply). Templates provide succinct refusal and compliance examples that define the behavioral objective.
Circuit-Restricted Parameter Update: After localizing the refusal-causal circuit, the method computes a constrained weight update supported only on that circuit, typically fewer than 5% of parameters. Restricting the update in this way targets selective refusal while aiming to leave unrelated capabilities intact.

Figure 2: C-ΔΘ: Circuit-Restricted Weight Arithmetic. Localizes refusal-causal computation and performs targeted weight updates.
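The circuit-restricted update described above can be illustrated with a toy sketch. This is not the authors' implementation: the attribution scores are random stand-ins for EAP-IG output, the quadratic objective replaces the real refusal/compliance loss, and the helper names `circuit_mask` and `restricted_update` are hypothetical. The only point shown is that the accumulated update is zero everywhere outside the selected circuit.

```python
import numpy as np

def circuit_mask(scores: np.ndarray, fraction: float) -> np.ndarray:
    """Boolean mask selecting the top `fraction` of entries by |score|."""
    k = max(1, int(round(fraction * scores.size)))
    # Indices of the k largest absolute attribution scores.
    top = np.argpartition(np.abs(scores).ravel(), -k)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)

def restricted_update(theta, grad_fn, mask, lr=0.1, steps=50):
    """Gradient descent whose accumulated update is supported only on `mask`."""
    delta = np.zeros_like(theta)
    for _ in range(steps):
        grad = grad_fn(theta + delta)
        delta -= lr * grad * mask  # entries outside the circuit never move
    return delta

rng = np.random.default_rng(0)
theta = rng.normal(size=(20, 10))      # toy "weights" (200 parameters)
scores = rng.normal(size=theta.shape)  # assumption: stand-in for EAP-IG scores
target = np.ones_like(theta)           # toy objective: pull weights toward 1
grad_fn = lambda w: w - target         # gradient of 0.5 * ||w - target||^2

mask = circuit_mask(scores, fraction=0.05)  # 5% of parameters form the "circuit"
delta_c = restricted_update(theta, grad_fn, mask)
edited = theta + delta_c                    # edited weights, saved as a normal checkpoint
```

Because the edit lives entirely in `edited`, serving it needs no hooks; the weights outside the mask are bit-identical to the original.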

Experimental Setup

The authors evaluate the efficacy of their method across multiple models and harm categories, including Crime, Hate, Health, Legal, and Sexual prompts. By comparing their circuit-restricted method against activation-based steering techniques, they demonstrate significant reductions in harmful refusal rates with lower benign refusal rates, suggesting improved selectivity and efficiency. The method achieves deployment-ready checkpoints that integrate seamlessly into established inference infrastructures without additional runtime costs.
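As a rough illustration of how per-category refusal rates on harmful versus benign prompts could be tabulated, the sketch below uses a keyword heuristic to flag refusals. The marker list, the record layout, and the sample responses are all made up for illustration; the summary does not say how the authors detect refusals, and real evaluations usually rely on a stronger judge.

```python
from collections import defaultdict

# Assumption: a crude keyword heuristic, not the paper's refusal detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(records):
    """records: iterable of (category, is_harmful, response) tuples."""
    counts = defaultdict(lambda: [0, 0])  # key -> [refusals, total]
    for category, is_harmful, response in records:
        key = (category, "harmful" if is_harmful else "benign")
        counts[key][0] += is_refusal(response)
        counts[key][1] += 1
    return {key: refused / total for key, (refused, total) in counts.items()}

# Illustrative records only; not data from the paper.
records = [
    ("Legal", True, "I can't help with that request."),
    ("Legal", True, "Sure, here is how..."),
    ("Legal", False, "A contract is a binding agreement."),
    ("Legal", False, "I cannot assist with that."),
    ("Health", False, "Drink water regularly."),
]
rates = refusal_rates(records)
for key in sorted(rates):
    print(key, rates[key])
```

Reporting the harmful and benign rates side by side for each category makes over-refusal visible, which a single aggregate refusal number would hide.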

Results

The experimental results indicate that C-ΔΘ achieves targeted refusal with minimal over-refusal in benign contexts. The method is reported to achieve a better safety-utility tradeoff than traditional activation steering. Selective refusal rates differ across models, indicating that the underlying refusal representation significantly influences performance. Specifically, larger models tend to retain more of their refusal capability, even in categories where the baseline models show limited separation of harmful behaviors.

Discussion

Circuit-guided weight editing offers a promising alternative to existing safety control mechanisms in LLMs. By restricting interventions to causally relevant parameters, the method enhances refusal selectivity while preserving broader model functionality. Despite these results, the method depends on precise mechanistic localization of refusal circuits, which implies potential limitations when safety signals are weakly represented in the model. Furthermore, attribution-based circuit identification may not capture every causally significant pathway, so ongoing evaluation and refinement in real-world contexts remain necessary.

Conclusion

"C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal" presents a novel approach to embedding behavioral refusal capabilities within LLMs through a one-time circuit-restricted update. This technique shifts control costs offline, laying the foundation for practical, scalable safety policy enforcement without inference-time interventions. The promising results across diverse models and scenarios affirm the potential of circuit-based strategies to define nuanced safety boundaries in AI systems effectively.