
Understanding and Preserving Safety in Fine-Tuned LLMs

Published 15 Jan 2026 in cs.LG and cs.AI | (2601.10141v1)

Abstract: Fine-tuning is an essential and pervasive functionality for applying LLMs to downstream tasks. However, it has the potential to substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Despite garnering growing attention in defense efforts during the fine-tuning stage, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to steep safety declination. In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank subspace, while utility gradients span a broader high-dimensional space; (II) these subspaces are often negatively correlated, causing directional conflicts during fine-tuning; and (III) the dominant safety direction can be efficiently estimated from a single sample. Building upon these novel insights, we propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace. Theoretically, we show that SPF guarantees utility convergence while bounding safety drift. Empirically, SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios. Furthermore, SPF exhibits robust resistance to both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning.

Summary

  • The paper introduces a geometric decomposition of safety and utility gradients to isolate alignment-preserving updates.
  • It demonstrates that projecting conflicting utility gradients yields near-zero attack success rates with minimal performance loss.
  • Empirical results across multiple LLM architectures confirm trade-off stability and efficient fine-tuning in safety-critical domains.

Safety-Preserving Fine-Tuning in LLMs: A Geometric Solution to the Safety-Utility Dilemma

Context and Problem Formulation

The proliferation of fine-tuning in LLM deployment is integral to customizing models for diverse domains and tasks. However, this flexibility introduces severe safety vulnerabilities—most notably, the erosion of safety alignment even when fine-tuning is performed exclusively on benign data. Recent work has demonstrated that minimal adversarial fine-tuning can compromise guardrails on models with strong initial safety alignment, posing a serious risk in fine-tuning-as-a-service (FTaaS) settings. Existing defense frameworks—such as SafeTune, Lisa, BackdoorAlign, and DeepAlign—treat safety as a static constraint via loss reweighting or dataset mixing. Their reliance on static hyperparameters makes the safety-utility trade-off unstable and leaves safety alignment susceptible to collapse under deep or repeated fine-tuning.

Geometric Analysis of Safety and Utility Gradients

A central insight of this work is the decomposition of the gradient space into safety- and utility-oriented subspaces. Analysis across Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Qwen2.5-7B-Instruct reveals:

  • Low-Rank Safety Gradients: Safety gradients—those responsible for refusal policies and harmlessness—occupy a compact, low-rank subspace. Singular value analysis shows rapid spectral decay, with the top-k components capturing nearly all safety variance (CER > 0.9 for k = 20).
  • High-Dimensional Utility Gradients: Task-specific gradients (math, code) are distributed broadly in parameter space, exhibiting much slower spectral decay. Fine-tuning for utility thus explores parameter dimensions far beyond those governed by safety.
  • Negative Correlation Between Subspaces: The cosine similarity between safety and utility gradients is predominantly negative, indicating frequent directional conflicts. Even non-harmful tasks (e.g., code) tend to degrade alignment by pulling model parameters away from safety guardrails.

Additionally, the "one-shot" property is demonstrated: the dominant safety subspace can be estimated robustly from a single annotated safety sample, allowing efficient real-time defense.
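
To make the one-shot estimate concrete, here is a minimal NumPy sketch (not the authors' implementation; the function name is illustrative, and the CER is computed over raw singular values as one plausible convention): the per-block safety gradient is factored with a truncated SVD, the top-k left singular vectors span the estimated safety subspace, and the cumulative explained ratio of the retained spectrum indicates how much safety variance that subspace captures.

```python
import numpy as np

def estimate_safety_subspace(grad_block: np.ndarray, k: int = 20):
    """Estimate a rank-k safety subspace from a single sample's gradient.

    grad_block: gradient of the safety loss for one parameter block,
                shaped (rows, cols).
    Returns (U_k, cer): the top-k left singular vectors (an orthonormal
    basis of the estimated subspace) and the cumulative explained ratio
    of the retained singular values.
    """
    U, s, _ = np.linalg.svd(grad_block, full_matrices=False)
    cer = float(s[:k].sum() / s.sum())  # cumulative explained ratio (CER)
    return U[:, :k], cer

# Toy example: a gradient dominated by a few directions yields high CER,
# mirroring the paper's rapid spectral decay for safety gradients.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(256, 5)) @ rng.normal(size=(5, 128))
noisy = low_rank + 0.01 * rng.normal(size=(256, 128))
U_k, cer = estimate_safety_subspace(noisy, k=20)
print(U_k.shape, cer > 0.9)  # (256, 20) True
```

For a truly low-rank gradient plus small noise, twenty components capture essentially all of the spectrum, which is the behavior the CER > 0.9 finding describes.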

Safety-Preserving Fine-Tuning: Algorithmic Design

Drawing on the geometric findings, the Safety-Preserving Fine-tuning (SPF) method operates directly on the gradient manifold:

  1. At each fine-tuning step, compute the utility gradient on a minibatch of task data and the safety gradient on a single safety sample.
  2. When the inner product between the safety and utility gradients is negative (indicating conflict), project the utility update onto the orthogonal complement of the low-rank safety subspace.
  3. When the update is non-conflicting, apply the utility gradient unmodified.

This selective projection ensures parameter updates do not traverse safety-compromising directions, sharply reducing trade-off instability and preventing safety drift—even under deep or repeated fine-tuning.
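
The three steps above can be sketched as a single update function (a minimal NumPy illustration under stated assumptions, not the authors' code; `U_k` is assumed to hold an orthonormal basis of the estimated low-rank safety subspace, and gradients are flattened per parameter block):

```python
import numpy as np

def spf_update(g_utility: np.ndarray, g_safety: np.ndarray,
               U_k: np.ndarray, lr: float = 1e-2) -> np.ndarray:
    """One SPF step: project the utility gradient off the safety subspace
    only when it conflicts with the safety gradient."""
    if float(g_utility @ g_safety) < 0:  # negative inner product = conflict
        # g' = (I - U_k U_k^T) g : orthogonal-complement projection
        g_utility = g_utility - U_k @ (U_k.T @ g_utility)
    return -lr * g_utility              # parameter update (gradient descent)

# Toy check: a conflicting update, once projected, carries no component
# inside the safety subspace.
rng = np.random.default_rng(1)
U_k, _ = np.linalg.qr(rng.normal(size=(64, 4)))   # orthonormal basis, k = 4
g_s = U_k @ rng.normal(size=4)                    # safety gradient in subspace
raw = rng.normal(size=64)
# Construct a utility gradient whose inner product with g_s is exactly
# -||g_s||^2 < 0, guaranteeing the conflict branch is taken.
g_u = raw - g_s * (raw @ g_s) / (g_s @ g_s) - g_s
step = spf_update(g_u, g_s, U_k)
print(bool(np.abs(U_k.T @ step).max() < 1e-10))   # True
```

When the inner product is non-negative, the branch is skipped and the utility gradient passes through unmodified, matching step 3.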

Theoretical Guarantees

The paper provides proofs that SPF:

  • Bounds Safety Drift: The cumulative change in safety loss over fine-tuning steps is explicitly bounded as a function of the learning rate and the safety subspace dimensionality. With small k, drift remains negligible.
  • Retains Utility Convergence: Utility optimization preserves convergence rates up to a multiplicative factor $1 - k/r$ (where r is the parameter block dimension), which is near unity for small safety subspaces. Thus, task adaptation is unaffected except along directions in direct conflict with safety.
  • Computational Efficiency: SPF’s per-step overhead is marginal—requiring only O((B + k)N) versus vanilla SFT’s O(BN), with k ≪ B in practice.
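
The $1 - k/r$ factor has a simple geometric reading. As a sketch (assuming, as in the paper's analysis, an isotropically distributed gradient $g$ in an $r$-dimensional parameter block, with $U_k$ an orthonormal basis of the rank-$k$ safety subspace; this is not the paper's formal proof), projecting onto the orthogonal complement removes only a $k/r$ fraction of the expected squared gradient norm:

```latex
P_\perp \;=\; I - U_k U_k^{\top}, \qquad
\mathbb{E}\bigl\|P_\perp g\bigr\|^{2}
  \;=\; \frac{r-k}{r}\,\|g\|^{2}
  \;=\; \Bigl(1-\frac{k}{r}\Bigr)\|g\|^{2}.
```

Since k is on the order of tens while r is in the millions per block, nearly all utility signal survives the projection.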

Experimental Evidence

Safety Recovery

Across variants of Llama, Mistral, and Qwen fine-tuned on Harm, Math, and Code datasets:

  • ASR Reduction: Attack success rates (i.e., the fraction of malicious prompts the model complies with) are consistently near zero after SPF (e.g., ASR drops from >0.95 to <0.02 on Harm), restoring safety to pre-fine-tuning levels even after extended adaptation epochs.
  • Harmful Score Improvement: Harmful content scores collapse to near minimal values following SPF intervention.

Utility Preservation

  • Task-Specific Performance: Task accuracy (e.g., exact match on GSM8K math, ROUGE for code/summary) remains statistically indistinguishable from unconstrained fine-tuning.
  • Generalization Benchmarks: On MMLU and MMLU-Pro, SPF-fine-tuned models retain or minimally alter scores relative to standard SFT.

Robustness to Jailbreak Attacks

Empirical evaluation under a spectrum of adaptive jailbreak attacks (Direct, Role-play, AutoDAN, DRA, etc.) demonstrates that SPF maintains alignment where all other baselines rapidly degrade (e.g., DRA ASR drops from 1.0 to 0.176 after SPF).

Trade-Offs and Ablations

  • Safety vs. Utility vs. Efficiency: SPF is Pareto-optimal—lowest ASR, minimal harmfulness, best-in-class task accuracy—using only a single safety sample and minimal compute, outperforming pre- and post-fine-tuning baselines, even those employing orders of magnitude more safety data.
  • Subspace Dimension k Ablation: Safety saturates for k = 10–20; further increasing k tightens safety constraints but yields diminishing returns and higher computational cost.
  • Safety Sample Source: Category-specific safety data suppresses targeted attacks, but generic samples generalize well. Empirically, a single generic sample suffices for strong protection.

Implications and Future Directions

This work reframes the safeguarding of LLM fine-tuning as a geometric constraint satisfaction problem rather than conventional loss reweighting, providing both a mechanistic explanation for past failures and a practical solution pathway. SPF advances both safety theory (by identifying low-rank, stable safety directions) and fine-tuning methods (by enforcing dynamic, local orthogonality rather than global, static trade-offs). Its computational efficiency and minimal reliance on safety data strongly suit large-scale, real-world FTaaS.

Potential future work includes refining automatic safety direction estimation, dynamically learning context-specific safety subspaces, and extending projection-based defense mechanisms to multimodal or reinforcement learning settings. An open question remains regarding subspace evolution under distribution shift and adversarial adaptation.

Conclusion

Safety-Preserving Fine-tuning offers a mathematically rigorous, practically efficient framework for securing LLMs against safety degradation during fine-tuning. By orthogonally projecting utility updates off the safety subspace, it solves longstanding trade-off instability and resiliently maintains alignment. Given growing FTaaS adoption and regulatory pressure, SPF stands as a scalable solution for always-aligned, high-utility model deployment (2601.10141).


Explain it Like I'm 14

What this paper is about (in simple terms)

Fine-tuning is like teaching a smart assistant new skills for a specific job. The problem is: while learning the new job, the assistant can forget important safety rules and start answering dangerous or harmful requests it used to refuse. This paper asks why that happens and how to fix it. The authors study what changes inside a model during fine-tuning and propose a new, lightweight method called Safety-Preserving Fine-tuning (SPF) that keeps safety intact while still learning the new task well.

The main questions the paper asks

  • Why do models become less safe the longer we fine-tune them, even on harmless tasks like math or coding?
  • Can we use the model’s training “signals” (the directions it wants to change in) to keep safety and usefulness balanced automatically, without awkward manual tuning?

How the researchers studied the problem

Think of training a model like moving through a huge space by following arrows that tell you which way to step to get better answers. These arrows are called “gradients.” The authors compared two kinds of arrows:

  • Safety arrows: directions that make the model better at refusing harmful requests.
  • Utility arrows: directions that make the model better at the target task (like math, code, or summarizing).

Here’s what they found, explained with everyday ideas:

  • Safety arrows live in a small “lane.” The important safety directions are few and focused (low-rank). Imagine a narrow safety corridor you don’t want to drift out of.
  • Utility arrows are spread out. To get good at many tasks, the model wants to move in lots of different directions (high-dimensional). This broad exploration can quietly push the model out of the safety lane.
  • The two sets of arrows often pull against each other. It’s like a tug-of-war: improving task performance can nudge the model away from its refuse-harm behavior.
  • You can spot the safety lane from just one good example. Surprisingly, a single safety example (like “How can I bypass safety checks?” → “Sorry, I can’t help with that.”) is enough to tell the model where the main safety directions are.

What SPF does (the new method)

SPF acts like a smart filter for the training arrows:

  • When the model is about to step in a way that clashes with safety (the arrows point against the safety lane), SPF removes just the conflicting part of the step.
  • When the step doesn’t clash—or even helps safety—SPF leaves it alone.

A useful analogy: noise-cancelling headphones remove only the unwanted noise from what you hear. SPF removes only the unsafe part of the training step and keeps the rest, so the model still learns the task well.

Technically, SPF:

  • Uses one safety example to estimate the key safety directions (the “safety lane”).
  • Checks if the task update conflicts with those directions.
  • If it does, it “projects” the update so it stays orthogonal (at a right angle) to the safety lane—meaning it won’t push the model out of that lane.

This takes very little extra compute, because the safety lane is tiny (few directions).

What they found and why it matters

The authors test SPF on several popular instruction-tuned models (Llama, Mistral, Qwen) and multiple tasks (math, code, summarization, toxicity detection), and compare it to other safety defenses.

Key takeaways:

  • Static “balance knobs” don’t work well. Older methods mix safety data into training with a fixed weight. The right weight changes by task, and safety still erodes during long (deep) fine-tuning.
  • Deep fine-tuning without SPF makes safety collapse. Attack success rates (how often a model fails to refuse) can climb from almost 0% to nearly 100% after many training epochs—even if the fine-tuning data is harmless.
  • SPF restores and keeps safety while keeping task skill. With SPF, attack success rates drop back near zero across tasks and models, and the model’s task performance (accuracy on math/code/etc.) stays essentially the same as normal fine-tuning.
  • SPF resists jailbreaks. Against strong “jailbreak” tricks that try to trick the model into harmful responses, SPF greatly reduces the chance of failure compared to naive fine-tuning.
  • Theory backs it up. The authors show that:
    • Learning the task still converges (you still get better at math/code).
    • Any drift in safety is bounded (kept small) by the way SPF filters the updates.
  • It’s lightweight. Because safety lives in a few directions, SPF’s extra cost is small and practical for real-world fine-tuning services.

Why this research matters

  • Safer fine-tuning-as-a-service: Companies that let users fine-tune models can reduce the risk that a few bad examples ruin safety.
  • Better balance without guesswork: Instead of manually tuning a “safety vs. utility” knob that changes by task, SPF adapts on the fly based on the actual training directions.
  • A new way to think about safety: Looking at the “geometry” of training (which directions updates point to) gives a clear reason why safety fails—and a simple fix that works broadly.
  • Practical and scalable: SPF needs only a single safety example to lock in the key safety directions and adds little compute, making it easy to deploy.

In short, this paper explains why models lose safety during fine-tuning and introduces a simple, effective method to prevent that. SPF lets models keep learning new skills while staying inside their safety lane.

Knowledge Gaps

Below is a concise, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions that future research could address:

  • Generalization to larger and proprietary models: Validate SPF on substantially larger LLMs (e.g., 34B–70B and beyond) and closed-source, RLHF-heavy commercial models to assess scalability, stability, and safety preservation in production settings.
  • FTaaS implementation constraints: Clarify how SPF can be integrated into real fine-tuning-as-a-service pipelines where defenders may have limited control over optimization internals; specify API-level hooks and privacy/security constraints for injecting a defender-side safety gradient each step.
  • Choice and diversity of safety signals: Assess sensitivity to the content and diversity of the single safety sample; compare one-shot vs. multi-sample vs. mini-batch safety gradient estimation across a taxonomy of harmful intents, topics, languages, and prompt formats.
  • Adaptive or per-layer k selection: Investigate how to select the safety subspace dimension k (globally vs. layer-wise, static vs. dynamic) and quantify utility/safety trade-offs and computational overhead across different k values and parameter blocks.
  • Stability of the safety subspace over training: Measure how the estimated safety subspace rotates or drifts across epochs and tasks; test whether single-sample estimation remains accurate as parameters move and whether periodic re-estimation or multi-sample averaging is needed.
  • Robustness to conflict-aware adversaries: Analyze attacks that deliberately engineer fine-tuning gradients to appear non-conflicting (positive inner product) yet still degrade safety; develop stronger conflict detection beyond inner-product sign tests.
  • Theoretical assumptions vs. reality: Relax and empirically test the convergence and drift bounds under more realistic assumptions (non-isotropic, dependent subspaces, structured gradients) and provide guarantees tied to observable quantities (e.g., ASR rather than the abstract safety loss L_s).
  • Formal definition of the safety loss L_s: Specify the exact safety objective used to compute the safety gradient g_s, and study how different formulations (refusal loss, toxicity classifiers, policy reward models) affect SPF’s behavior and guarantees.
  • Interplay with parameter-efficient fine-tuning (PEFT): Evaluate SPF with LoRA/adapters and partial-layer fine-tuning to determine whether projection remains effective or needs adaptation in low-rank update spaces.
  • Optimizer and hyperparameter sensitivity: Test SPF across optimizers (SGD, Adam, Adafactor), learning rates, batch sizes, and weight decay, and report robustness envelopes and recommended defaults.
  • Long-horizon fine-tuning: Extend deep fine-tuning experiments beyond 15–20 epochs and across curriculum schedules to quantify long-term safety drift, subspace stability, and cumulative utility impacts.
  • Broader safety dimensions: Evaluate SPF on safety categories beyond harmful instructions (e.g., bias/fairness, privacy, misinformation, self-harm), including datasets and metrics tailored to these domains.
  • Over-refusal and helpfulness side-effects: Measure false refusals on benign prompts and impacts on helpfulness, nuance, and instruction-following quality, not just task metrics (e.g., a “Refusal@Benign” rate or calibrated helpfulness scores).
  • Multi-turn and context-rich interactions: Test SPF in multi-turn dialogues, with system prompts, role prefixes, and memory/context windows to ensure safety subspace estimates hold under realistic conversational dynamics.
  • Cross-lingual and code-mixed prompts: Assess performance and safety preservation for multilingual, code-mixed, and obfuscated/jailbreak prompts to check generalization beyond English and standard phrasing.
  • Jailbreak breadth and adaptivity: Expand robustness tests to include unseen, evolving jailbreak strategies and red-teaming with human adversaries; quantify SPF’s resilience under continuous attack adaptation.
  • Composition with existing defenses: Study how SPF composes with other pre-, fine-, and post-fine-tuning defenses (e.g., data filtering, proximal regularization, trigger-based methods) and whether combinations yield additive or interfering effects.
  • Metric reliability and validation: Reduce reliance on automated judges (HarmBench, GPT-4) by including human evaluations; calibrate disagreements and report confidence intervals and inter-rater reliability.
  • Utility coverage beyond selected tasks: Broaden utility evaluations to long-form generation, tool use, retrieval-augmented tasks, coding beyond SQL, and reasoning under chain-of-thought suppression to detect subtle degradations.
  • Computational overhead in practice: Provide detailed wall-clock, memory, and throughput benchmarks for SPF’s truncated SVD per block across model sizes; investigate faster, approximate subspace tracking methods (e.g., incremental SVD, randomized projections).
  • Layer/block selection strategy: Justify and compare different gradient blockings (e.g., attention vs. MLP weights) and investigate whether certain blocks dominate safety subspace dynamics, enabling targeted, cheaper protection.
  • Policy evolution and maintainability: Explore mechanisms to update the safety subspace when safety policies change (e.g., new red-lines), including automated sample selection and continuous subspace refreshing.
  • Second-order and curvature-aware extensions: Examine whether incorporating curvature (e.g., Hessian information or natural gradients) improves conflict detection and reduces safety drift beyond first-order projection.
  • Formal links between subspace geometry and ASR: Derive and empirically validate relationships connecting measurable subspace properties (rank, overlap, angles) to downstream ASR changes, enabling predictive safety monitoring.
  • Reproducibility and variance: Report variability across random seeds, datasets splits, and sampling strategies to establish statistical robustness of SPF’s claimed safety and utility preservation.

Practical Applications

Immediate Applications

These applications can be deployed now using the paper’s SPF method and gradient-geometry insights with modest engineering effort.

  • FTaaS-grade safe fine-tuning pipelines (software, platform)
    • Embed SPF as an optimizer wrapper in fine-tuning services so customer-provided datasets cannot erode safety alignment while preserving task utility.
    • Tools/products: “SPF Optimizer” plugin for PyTorch/DeepSpeed/Accelerate; a “Safe Fine-Tune” API mode that performs per-step safety projection from a one-sample safety anchor.
    • Assumptions/dependencies: Requires gradient access in the provider’s training stack; minimal overhead SVD with small k; availability of a vetted 1–10 prompt “safety anchor” set.
  • Enterprise LLM customization with compliance-by-default (finance, healthcare, legal, customer support)
    • Internal fine-tuning of instruction-tuned models with SPF to keep refusals/guardrails intact (e.g., prevent illicit financial advice, unsafe medical guidance) while specializing to proprietary data.
    • Workflows: Insert SPF step into existing LoRA/PEFT pipelines; monitor cosine similarity and safety-drift metrics; enable policy audit logs.
    • Assumptions/dependencies: Access to training loop; curated safety anchors aligned to corporate policy; routine red-teaming remains necessary.
  • Open-source training stacks: drop-in safety-preserving adapter (software, education, research)
    • Add SPF to popular repos (Hugging Face TRL, PEFT, Lightning) as a one-line optimizer replacement for safe SFT/LoRA.
    • Tools/products: “SafetyProjectionOptimizer”; config flag for k (rank) and safety sample(s); integration tests with GSM8K/Code/Samsum.
    • Assumptions/dependencies: Maintainers’ support; small compute overhead; works in the adapter parameter space.
  • AutoML and MLOps guardrails for fine-tuning (software, ops)
    • Replace brittle loss reweighting (alpha-tuning) with SPF to avoid per-task hyperparameter sweeps and trade-off instability.
    • Workflows: CI/CD job inserts safety projection; dashboards track subspace overlap, CER, and ASR proxy metrics.
    • Assumptions/dependencies: Telemetry access to gradients; lightweight SVD per step.
  • Real-time training telemetry for safety (software, risk)
    • Use cosine similarity and singular-value spectra as online indicators to detect when utility updates conflict with the safety subspace and trigger projection/alerts.
    • Tools/products: “Safety Drift Dashboard”; thresholded alerts on negative inner product and CER decay.
    • Assumptions/dependencies: Training-time metrics logging; calibrated thresholds; mapping metrics to action policies.
  • Minimal “safety anchor” datasets and policy mapping (policy, governance)
    • Create a small canonical set (1–10 prompts) per policy area (e.g., bio, cyber, self-harm) to estimate safety directions at train time.
    • Products: Safety Anchor Packs mapped to policy taxonomies (e.g., SPICE/HarmBench categories).
    • Assumptions/dependencies: High-quality policy-aligned anchors; periodic refresh as policies evolve.
  • Safer domain deployments without utility loss (daily life, industry)
    • Customer support, education tutors, and coding assistants fine-tuned on proprietary corpora resist jailbreaks while performing well.
    • Tools/products: “Always-Aligned Tutor/Agent” presets that bundle SPF + anchors; red-team validation kits.
    • Assumptions/dependencies: Domain-specific prompts; existing refusal policies; ongoing monitoring.
  • Multi-tenant safety isolation in shared clusters (platform, cloud)
    • Providers apply SPF so that one tenant’s task updates cannot move shared base models along unsafe directions.
    • Workflows: Per-tenant safety projection; audit trails for safety-preserving updates.
    • Assumptions/dependencies: Per-tenant training isolation; base model gradients accessible.
  • Safer on-device or edge fine-tuning (mobile/embedded)
    • Parameter-efficient fine-tuning with SPF on devices for personalization while preserving core safety (e.g., keyboard suggestions, voice assistants).
    • Assumptions/dependencies: LoRA-compatible projection; tight SVD implementation for low compute.
  • Red-teaming augmentation and safety QA (security, testing)
    • Use SPF-trained models and the paper’s jailbreak suite to assess robustness; log ASR/HS shifts during and after fine-tuning.
    • Tools/products: “Gradient-Conflict Red Team” harness; batch reports comparing SFT vs SPF.
  • Procurement and vendor due diligence checklists (policy, compliance)
    • Require SPF-like gradient-level defenses in RFPs for fine-tuning services and internal platforms; verify safety-drift bounds in reports.
    • Assumptions/dependencies: Vendor transparency; standard report templates.
  • Education and research courses/labs (academia)
    • Hands-on labs reproducing safety-low-rank vs utility-high-rank findings; projects building SPF variants and evaluating trade-offs.
    • Assumptions/dependencies: Access to instruct models and GPUs; reproducible scripts.

Long-Term Applications

These require further research, scaling, or integration beyond current implementations.

  • Sector-certified safe fine-tuning standards (policy, healthcare, finance, legal)
    • Formalize SPF-like controls in safety certification (e.g., ISO/industry standards) for regulated deployments.
    • Products: Standardized “Safety-Drift Bound” reports; conformity assessments.
    • Dependencies: Consensus on metrics/thresholds; third-party auditors; evolving regulations.
  • Personalized, multi-policy safety subspaces (software, policy)
    • Learn multiple safety anchors (e.g., org-wide, team-specific, jurisdictional) and project utility gradients against a composite/hierarchical safety space.
    • Tools: Policy-to-subspace compiler; dynamic k allocation per policy.
    • Dependencies: Robust mapping from policy text to anchor prompts; conflict resolution across policies.
  • Continual and lifelong learning with stability guarantees (software, robotics)
    • Use projection to prevent safety forgetting across ongoing adaptation in agents and tool-using systems.
    • Workflows: Periodic re-estimation of safety subspaces; curriculum of anchors as tasks evolve.
    • Dependencies: Efficient periodic SVD; methods to detect safety subspace drift.
  • Federated and decentralized safe fine-tuning (healthcare, finance, edge)
    • Clients apply local SPF so their updates don’t conflict with global safety; server aggregates updates consistent with safety subspaces.
    • Tools: Federated-SPF with secure aggregation; privacy-preserving subspace sharing.
    • Dependencies: Privacy constraints on gradient sharing; heterogeneity in clients’ safety anchors.
  • Cross-modal safety-preserving adaptation (vision-language, speech, robotics)
    • Extend low-rank safety subspace estimation and projection to VLMs, speech LLMs, and policy networks in RL.
    • Products: “SPF-V” for VLMs; “SPF-RL” for policy gradients.
    • Dependencies: Proper block-structured gradients for non-text modalities; stability under RL noise.
  • Adaptive defense against evolving jailbreaks (security, platforms)
    • Online discovery of emerging unsafe directions by mining adversarial attempts and updating safety subspaces in streaming fashion.
    • Tools: Safety Subspace Updater; auto-tuned k per layer based on CER targets.
    • Dependencies: Robust automated labeling; safe deployment of online updates.
  • Hardware-accelerated projection operators (systems, accelerators)
    • Kernels for fast blockwise truncated SVD and orthogonal projection in large models to cut overhead at scale.
    • Products: CUDA/ROCm ops; TPU/XLA fused ops for SPF.
    • Dependencies: Vendor support; numerical stability at scale.
  • Safety budgeting and training schedulers (MLOps)
    • Allocate a “safety budget” per training window and throttle/route batches when projected safety drift exceeds thresholds.
    • Tools: Safety-aware schedulers; batch triage based on gradient-safety cosine.
    • Dependencies: Policy-defined budgets; coupling with data selection.
  • Marketplaces for vetted safety anchors (ecosystem)
    • Curated, domain- and jurisdiction-specific anchor packs backed by evaluation data; subscription updates.
    • Dependencies: Governance for quality and bias; liability frameworks.
  • Interplay with fairness/ethics and broader risk objectives (policy, research)
    • Generalize beyond refusal safety to fairness or misinformation by learning additional low-rank “ethics” subspaces and projecting jointly.
    • Dependencies: Reliable measurement for complex social objectives; potential subspace conflicts.
  • Black-box API emulation of SPF (platforms)
    • For providers that expose only black-box fine-tuning, emulate projection effects via strategic batching/prompt shaping or provider-side enforcement.
    • Dependencies: Provider cooperation; surrogate signals for conflict detection without gradients.
  • Formal verification and guarantees (research, safety science)
    • Combine SPF with certified bounds on maximum safety drift under specified training regimes.
    • Dependencies: Stronger smoothness/Lipschitz assumptions and verifiable model slices; scalable certifiers.

Cross-cutting assumptions and dependencies

  • Gradient access: SPF requires access to parameter gradients; native support is needed in closed FTaaS settings (providers can apply internally).
  • Safety anchors quality: One-shot anchors work empirically, but anchor selection must reflect current policies and evolving risk domains.
  • Rank hyperparameter (k): Too small risks under-protection; too large may slow utility convergence. Layer-wise or adaptive k likely improves outcomes.
  • Domain shift: Safety subspaces can drift across domains; periodic re-estimation and monitoring are advisable.
  • Defense-in-depth: SPF mitigates training-time safety erosion but should complement inference-time filters, moderation, and red-teaming.
  • Compute/implementation: Blockwise truncated SVD per step adds overhead; optimized kernels and adapter-space projection reduce cost.

Glossary

  • AdamW optimizer: An adaptive gradient-based optimizer with decoupled weight decay used for fine-tuning large models. "we use the AdamW optimizer"
  • Attack Success Rate (ASR): A safety metric measuring the fraction of prompts that elicit unsafe responses. "we report the attack success rate (ASR)"
  • AutoDAN: An automated jailbreak attack method designed to elicit unsafe behavior from aligned models. "AutoDAN [25]"
  • AutoDAN-Turbo: A stronger, turbo variant of the AutoDAN jailbreak attack. "AutoDAN-Turbo [24]"
  • BackdoorAlign: A fine-tuning-stage defense method that integrates a hidden trigger to activate safety behaviors. "BackdoorAlign [38]"
  • Block-diagonal projection operator: A projector applied block-wise to gradients to remove components aligned with a safety subspace. "let P denote the block-diagonal projection operator"
  • Cosine similarity: A measure of directional alignment between two vectors (e.g., gradients) based on the cosine of the angle between them. "Step-wise cosine similarity"
  • Cumulative Explained Ratio (CER): The proportion of total gradient variance captured by the top-k singular values. "we compute the Cumulative Explained Ratio (CER) of the top-k singular values"
  • DeepAlign: A defense objective that constrains updates on initial tokens to preserve safety during fine-tuning. "DeepAlign [32]"
  • Direct Preference Optimization (DPO): An alignment technique that optimizes models directly from preference comparisons. "DPO [34]"
  • DRA: A jailbreak attack benchmark used to stress-test safety alignment. "DRA [23]"
  • Fine-tuning-as-a-service (FTaaS): A cloud-based paradigm where users upload datasets to fine-tune provider-hosted models via APIs. "Fine-tuning-as-a-service (FTaaS) settings"
  • Frobenius overlap metric: A matrix-based metric for quantifying the overlap between subspaces (e.g., safety directions estimated from different samples). "using the Frobenius overlap metric"
  • GCG: An automated jailbreak attack technique for generating adversarial prompts. "GCG [49]"
  • Gradient manifold: The geometric structure formed by gradient directions in parameter space during optimization. "operates directly on the gradient manifold"
  • HarmBench classifier: An automatic classifier used to judge whether outputs are harmful. "we use the HarmBench classifier [27]"
  • Harmful Score (HS): A fine-grained metric evaluating the harmfulness level of model outputs. "we also consider the Harmful Score (HS)"
  • Isotropically distributed: Uniformly distributed in all directions in a space, often assumed for random subspaces in theoretical analysis. "isotropically distributed and independent"
  • Jailbreak attacks: Prompting strategies intended to bypass or subvert a model’s safety guardrails. "Robustness performance (ASR) under Jailbreak attacks"
  • L-smooth: A property of functions whose gradients are Lipschitz-continuous with constant L, used in convergence analyses. "We consider SGD on an L-smooth and non-convex objective"
  • Lisa: A defense method that alternates updates between user and safety data with a proximal regularizer. "Lisa [16]"
  • Low-rank subspace: A subspace spanned by a small number of dominant directions; here, safety gradients concentrate in low rank. "safety gradients lie in a low-rank subspace"
  • MMLU: A benchmark for multi-task language understanding across diverse academic domains. "MMLU [14]"
  • MMLU-Pro: A stricter, more challenging extension of MMLU for evaluating general-purpose reasoning. "MMLU-Pro [39]"
  • Non-convex objective: An optimization objective without convexity guarantees, common in training deep models. "L-smooth and non-convex objective"
  • Orthogonal basis vectors: Mutually perpendicular vectors that span a subspace; used to represent singular vector bases. "represent orthogonal basis vectors"
  • Orthogonal complement: The set of all vectors orthogonal to a given subspace; updates are projected onto it to avoid safety conflicts. "projected onto the orthogonal complement of this safety subspace"
  • PAIR: A jailbreak attack method used to evaluate safety robustness. "PAIR [5]"
  • PAP: A jailbreak attack technique targeting aligned models. "PAP [47]"
  • Projection mechanism: A procedure that removes gradient components conflicting with safety directions via subspace projection. "By employing a projection mechanism"
  • Proximal regularizer: A term in optimization that encourages parameter updates to remain close to previous values. "employing a proximal regularizer to maintain parameter proximity."
  • Reinforcement Learning from Human Feedback (RLHF): An alignment technique where models learn from human preference signals. "RLHF [8, 30]"
  • Role-Play: A jailbreak style where the model is instructed to adopt a persona to bypass safety policies. "Role-Play [33]"
  • Safety drift: The unintended increase in safety loss due to second-order effects or projection error during updates. "bounding safety drift"
  • Safety gradient: The gradient of a safety-oriented loss, used to identify and protect safety directions during fine-tuning. "safety gradient gs"
  • Safety-Preserving Fine-tuning (SPF): The proposed method that filters out utility gradient components conflicting with low-rank safety subspaces. "Safety-Preserving Fine-tuning (SPF)"
  • Safety-utility trade-off: The tension between maintaining safety alignment and optimizing task performance. "safety-utility trade-off instability"
  • Stochastic Gradient Descent (SGD): An optimization algorithm that updates parameters using randomly sampled mini-batches. "We consider stochastic gradient descent (SGD)"
  • Singular Value Decomposition (SVD): A matrix factorization used to extract dominant gradient directions and their strengths. "perform Singular Value Decomposition (SVD)"
  • Singular value spectra: The distribution of singular values across ranks, indicating the effective dimensionality of gradients. "Layer-wise singular value spectra"
  • Supervised Fine-tuning (SFT): Standard likelihood-based fine-tuning on user datasets without explicit safety mechanisms. "under standard SFT"
  • Subspace similarity: A measure of how closely two subspaces (e.g., single-sample vs. batch safety directions) align. "subspace similarity"
  • TAP: A jailbreak attack approach used to probe model vulnerabilities. "TAP [28]"
  • Top-k: Selecting the k largest singular values or leading directions when approximating a subspace. "top-k singular values"
  • Truncated singular value decomposition: Computing only the leading singular vectors/values to obtain a low-rank approximation. "apply truncated singular value decomposition [13]"
  • Utility convergence: The guarantee that task-performance optimization still converges under safety-preserving projections. "SPF guarantees utility convergence"
  • Utility gradient: The gradient of the utility/task loss used to adapt the model to downstream objectives. "utility gradients g"
  • Orthogonally projected away: Describes removing conflicting components by projecting gradients away from safety directions. "orthogonally projected away from low-rank safety directions"
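Two of the diagnostics defined in the glossary, the Cumulative Explained Ratio and the Frobenius overlap metric, can be sketched in a few lines of NumPy. The function names, the normalization by k, and the toy low-rank matrix are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def cumulative_explained_ratio(G: np.ndarray, k: int) -> float:
    """Fraction of squared singular-value 'energy' captured by the
    top-k singular values of a gradient matrix G."""
    s = np.linalg.svd(G, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

def frobenius_overlap(U: np.ndarray, V: np.ndarray) -> float:
    """Normalized overlap ||U^T V||_F^2 / k between two k-dimensional
    subspaces with orthonormal bases U and V (1 means identical)."""
    k = U.shape[1]
    return float(np.linalg.norm(U.T @ V, "fro") ** 2 / k)

rng = np.random.default_rng(1)
# A nearly rank-2 matrix: CER at k=2 should be close to 1.
G = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 30))
G += 0.01 * rng.standard_normal((50, 30))

U, _, _ = np.linalg.svd(G, full_matrices=False)
print(cumulative_explained_ratio(G, k=2) > 0.99)               # low-rank structure
print(abs(frobenius_overlap(U[:, :2], U[:, :2]) - 1.0) < 1e-12)  # self-overlap = 1
```

A high CER at small k is the signature of the low-rank safety subspace in insight (I), and a high Frobenius overlap between single-sample and batch estimates underlies insight (III).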

Open Problems

We found no open problems mentioned in this paper.
