Narrow Safety Proxy in LLMs
- A narrow safety proxy is a lightweight model trained to approximate safety metrics such as attack success rate (ASR) for large LLMs within a restricted prompt domain.
- It employs pairwise ranking regression to reliably compare prompt-induced outcomes, achieving ranking accuracies between 69% and 91%.
- Operationally, these proxies enhance attack efficiency and reduce query costs by guiding adversarial prompt selection for safety evaluation.
A narrow safety proxy is a lightweight, task-specific model trained to approximate the safety-relevant behavior of a (typically much larger) target system in a particular operational regime. In the context of contemporary LLMs and adversarial prompt attacks, a narrow safety proxy is designed not as a full-fidelity replica of the target model’s general capability, but as a focused predictor—primarily of safety-relevant metrics such as attack success rate (ASR) or model refusal behavior—on a constrained set of prompt modifications. The construction of a narrow safety proxy is motivated by the need for analyzable, query-efficient, and deployable surrogates capable of supporting black-box safety probing, adversarial optimization, or distillation of safety boundaries.
1. Conceptual Foundations and Motivation
The narrow safety proxy paradigm emerges from the observation that high-capacity LLMs expose complex, partially learnable decision boundaries for safety-relevant indicators (e.g., response refusal, content filtering) within high-dimensional prompt spaces. Attacks such as jailbreaks exploit local discontinuities or misalignments in these boundaries. A narrow safety proxy, typically much smaller and tailored to a restricted prompt class, is trained to predict safety outcomes (e.g., ASR) as accurately as possible within this subspace.
Unlike general-purpose model distillation, which attempts to transfer global functionality from teacher to student, a safety proxy deliberately restricts its modeling scope to the subspace of prompts and behaviors relevant to specific classes of attacks or evaluation protocols. This design enables significantly denser sampling and higher-resolution learning of local safety logic, which may be obscured in aggregate under full-distribution supervision. Importantly, such proxies can provide attack guidance or inform safety auditing with reduced computational and labeling costs (Zhang et al., 27 Nov 2025).
2. Proxy Construction via Ranking Regression
A characteristic feature of LLM safety proxies is the use of ranking regression objectives rather than standard regression losses. Pairwise ranking regression formulates the prediction task as follows: given two prompts $p_i$ and $p_j$ related to the same base question $q$, the model predicts which will induce a higher attack success rate, i.e., whether $\mathrm{ASR}(p_i) > \mathrm{ASR}(p_j)$. Rather than directly regressing on absolute ASR values—whose range and calibration can vary significantly across question families—this paradigm exploits the fact that relative rankings are more invariant under prompt modifications and more robust to distributional shifts.
Formally, the proxy receives as input the concatenation of two prompts $(p_i, p_j)$ and outputs $\hat{P}(p_i \succ p_j) = \sigma(s(p_i, p_j))$ via a sigmoid scoring function $\sigma$. The binary cross-entropy loss over all prompt pairs within the same base question is minimized:
$$\mathcal{L} = -\sum_{(i,j)} \left[ y_{ij} \log \hat{P}(p_i \succ p_j) + (1 - y_{ij}) \log\!\big(1 - \hat{P}(p_i \succ p_j)\big) \right],$$
where $y_{ij} = 1$ if $\mathrm{ASR}(p_i) > \mathrm{ASR}(p_j)$, and $0$ otherwise. This loss aligns the learned latent score with the monotonic ordering induced by true ASR. To recover global per-prompt scores usable for attack optimization, a Bradley–Terry–Luce (BTL) model is fit to the proxy's pairwise predictions, yielding an estimated “jailbreak score” for each prompt (Zhang et al., 27 Nov 2025).
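The pairwise objective and the BTL score recovery can be sketched as follows. This is a minimal pure-Python illustration under assumed data shapes; the function names and the simple gradient-ascent BTL fit are illustrative, not the paper's implementation:

```python
import math

def pairwise_bce_loss(scores, pairs):
    """Binary cross-entropy over pairwise ranking predictions.

    scores: dict prompt_id -> latent score s(p)
    pairs:  list of (i, j, y) where y = 1 if ASR(p_i) > ASR(p_j), else 0
    """
    loss = 0.0
    for i, j, y in pairs:
        # Sigmoid of the score gap gives P(p_i ranks above p_j).
        p = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(pairs)

def btl_scores(pairs, prompt_ids, iters=200, lr=0.1):
    """Fit Bradley-Terry-Luce strengths by gradient ascent on the pairwise likelihood."""
    s = {p: 0.0 for p in prompt_ids}
    for _ in range(iters):
        grad = {p: 0.0 for p in prompt_ids}
        for i, j, y in pairs:
            p_ij = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))
            # Gradient of y*log(p) + (1-y)*log(1-p) w.r.t. s[i] is (y - p).
            grad[i] += y - p_ij
            grad[j] -= y - p_ij
        for p in prompt_ids:
            s[p] += lr * grad[p]
    return s  # higher score = higher estimated jailbreak propensity
```

The fitted `btl_scores` values play the role of the per-prompt “jailbreak scores” used downstream for attack prioritization.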
This design is justified by empirical findings that direct regression on absolute ASR exhibits lower fidelity (∼60% accuracy), whereas pairwise ranking regression reliably achieves test accuracy in the range of 69%–91% depending on the metric (ASR or related proxies such as average long response, ALR).
3. Proxy Training and Safety Logic Distillation
Training a narrow safety proxy proceeds with a dense sampling methodology, e.g., using the “outline filling attack” for adversarial prompt generation. The procedure constructs, for each base question $q$, a diverse set of prompt variants by recursively decomposing $q$ into outlines and prompting an LLM (e.g., GPT-3.5-turbo) to fill subcomponents, producing up to 75 structurally distinct but semantically aligned adversarial prompts per $q$. Each prompt is labeled via multiple black-box queries to the target LLM, with attack outcome statistics (ASR, ALR) collected and filtered to ensure reliability. Pairwise labels are constructed for all prompt pairs with sufficient separation in observed ASR.
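The pairwise label construction under a separation requirement can be sketched as follows; the margin value and function name are illustrative assumptions, not values from the paper:

```python
from itertools import combinations

def make_pairwise_labels(asr, margin=0.1):
    """Build (i, j, y) ranking labels from per-prompt ASR estimates.

    asr: dict prompt_id -> empirical attack success rate in [0, 1]
    Only pairs whose observed ASR gap exceeds `margin` are kept, so that
    label noise from finite target-model sampling is less likely to flip ranks.
    """
    pairs = []
    for i, j in combinations(sorted(asr), 2):
        if abs(asr[i] - asr[j]) > margin:
            y = 1 if asr[i] > asr[j] else 0
            pairs.append((i, j, y))
    return pairs
```

Discarding near-ties in this way trades label volume for label reliability, which matters when each ASR estimate comes from only a handful of black-box queries.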
A modern LLM (e.g., Llama-3-8B-Instruct) is then fine-tuned on the resulting dataset, in many cases training only on prompts from a fixed set of “dangerous” questions and testing on unseen questions, to ensure generalization. The architecture consists of a lightweight scoring head producing pairwise ranking probabilities, and is optimized with standard binary cross-entropy.
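In essence, the fine-tuning step fits a scoring head under the pairwise BCE objective. A minimal stand-in is shown below, with a linear head over fixed feature vectors in place of frozen LLM embeddings; all names and hyperparameters here are illustrative assumptions:

```python
import math

def train_scoring_head(features, pairs, dim, iters=300, lr=0.5):
    """Fit a linear scoring head s(p) = w . x_p on pairwise ranking labels.

    features: dict prompt_id -> feature vector (stand-in for a frozen LLM embedding)
    pairs:    list of (i, j, y) with y = 1 if prompt i should rank above prompt j
    """
    w = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for i, j, y in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            # P(p_i ranks above p_j) = sigmoid(w . (x_i - x_j))
            p = 1.0 / (1.0 + math.exp(-sum(wk * dk for wk, dk in zip(w, diff))))
            for k in range(dim):
                grad[k] += (y - p) * diff[k]  # ascent direction of the pairwise log-likelihood
        w = [wk + lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w
```

The same objective carries over unchanged when the linear head sits on top of a fine-tuned transformer backbone; only the feature extractor differs.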
Empirical studies confirm that such proxies can recover the safety logic of powerful LLMs: on held-out test pairs, the narrow safety proxy achieves 78.95%–91.10% ranking accuracy for ALR and 69.16%–78.95% for ASR, across several target models (Zhang et al., 27 Nov 2025). This supports the assertion that the core security logic of an LLM—at least as parametrized in the sampled subspace—can be “distilled” into a smaller, more query- and compute-efficient model.
4. Operational Utility: Guided Attacks and Query Efficiency
Narrow safety proxies are directly usable in black-box attack optimization. After distilling the ranking logic, the attacker leverages the proxy’s BTL-derived jailbreak scores to prioritize which prompt variants to submit to the target LLM. This substantially improves attack efficiency.
Quantitatively, by selecting the top 20% of prompts as ranked by the proxy, the expected instruction-averaged success rate (IASR) and the first-attack success cost (FASC) are both strongly improved over baseline random or untargeted probe orderings. For instance, on GPT-4o-mini, guided attack ordering increased IASR by +13.3% and decreased FASC by 83.2%; for Qwen, IASR improvement was as high as +43.3% with a 71.4% reduction in expected attack cost (Zhang et al., 27 Nov 2025). These benefits generalize across models and attack metrics.
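The guided attack loop can be sketched as follows. The budget fraction, function names, and the success oracle are illustrative; in practice each outcome is only revealed by querying the target model:

```python
def guided_attack_order(jailbreak_scores, budget_frac=0.2):
    """Rank candidate prompts by proxy score and keep the top fraction.

    jailbreak_scores: dict prompt_id -> BTL-derived score from the proxy
    Returns the prompt ids to query, highest-scored first.
    """
    ranked = sorted(jailbreak_scores, key=jailbreak_scores.get, reverse=True)
    k = max(1, int(len(ranked) * budget_frac))
    return ranked[:k]

def first_attack_success_cost(order, succeeds):
    """Number of target-model queries until the first successful attack (FASC).

    succeeds: dict prompt_id -> bool (attack outcome, observed per query)
    Returns None if no prompt in the order succeeds.
    """
    for cost, pid in enumerate(order, start=1):
        if succeeds[pid]:
            return cost
    return None
```

Because the proxy concentrates high-scoring prompts at the front of the order, the expected FASC drops relative to a random ordering, which is exactly the efficiency gain reported above.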
A notable implication is that the existence and accuracy of narrow safety proxies reduce the marginal computational expense for successful black-box attack discovery, exposing concrete risks for model deployment. This operational dimension motivates further countermeasure development, e.g., proxy-aware detection or randomized defense strategies.
5. Technical and Theoretical Considerations
The success of the narrow safety proxy approach depends critically on several factors:
- Prompt subspace structure: The attack generation process must produce sufficient local variation in prompts to resolve the safety boundary. Outline filling and related compositional attacks serve this role.
- Labeling density and reliability: Sufficient target-model queries per prompt, with robust aggregation (e.g., majority voting, average length heuristics), are required to ensure that noisy filter labeling does not propagate errors.
- Proxy capacity and overfitting: Proxy models require enough parameters to fit the ranking order on each subspace without overfitting the absolute value calibration. Empirically, standard 8B Llama architectures are adequate for replication of ranking logic without targeting full instruction-following capacity.
- Metric alignment: Ranking regression objectives explicitly target pairwise order rather than regression calibration. This not only increases robustness to batch effects but enables adaptation under distribution shift.
- Limitations: The ranking proxy is narrow by definition; generalization outside the sampled subspace is not guaranteed. Cross-domain or multi-task proxies may require more sophisticated architectures or combined objective functions.
6. Broader Implications and Future Directions
The narrow safety proxy paradigm exposes fundamental properties of LLM safety logic. Its key implications include:
- Distillability of safety boundaries: High-fidelity recovery of a model’s refusal or harmfulness decision surface is possible with moderate data and restricted model capacity, for specific prompt regions.
- Proxy-aware attack evolution: As adversarial techniques and safety proxies co-evolve, both offensive and defensive applications (e.g., anti-proxy detectors, model fingerprinting) will play increasing roles.
- Generalization to other modalities: The use of pairwise ranking regression, as opposed to standard regression, suggests applicability in other model auditing tasks where decision boundaries, not function calibration, are paramount.
- Defensive research: Understanding the distillability of safety logic may motivate additional work on boundary randomization, ensemble defenses, or continual adaptation to proxies.
These conclusions highlight both the effectiveness and the risks associated with narrow safety proxies in contemporary LLM safety, and motivate continued research on both detection and mitigation techniques (Zhang et al., 27 Nov 2025).