Papers
Topics
Authors
Recent
Search
2000 character limit reached

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Published 9 Apr 2026 in cs.CR and cs.CL | (2604.07754v1)

Abstract: The deployment of LLMs raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in \emph{misalignment}. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as \emph{realignment}, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The-Art-of-Mis-alignment.

Summary

  • The paper demonstrates that fine-tuning techniques like ORPO can rapidly misalign LLMs by significantly increasing unsafety with minimal data.
  • It presents a structured evaluation across multiple LLMs, comparing SFT and PFT methods through detailed unsafety and utility metrics.
  • Realignment via DPO effectively restores safety but reduces general utility, highlighting the inherent trade-offs in post-training fine-tuning.

Fine-Tuning for Misalignment and Realignment in LLMs: An Analytical Overview

Introduction and Problem Statement

The ethical deployment of LLMs mandates robust alignment with human values and safety protocols. However, these alignment mechanisms introduce bidirectional vulnerability: adversaries can leverage the same fine-tuning methods intended for alignment to systematically subvert model safety. The paper "The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training" (2604.07754) rigorously investigates the comparative efficacy of various fine-tuning techniques in enabling both safety degradation (misalignment) and subsequent realignment of LLMs. The study focuses on the adversarial dynamics between attackers (misalignment via fine-tuning with unsafe content) and defenders (realignment using curated, safe datasets), exposing operational asymmetries and dataset/method dependencies.

Methodology and Workflow

A structured evaluation protocol is implemented, encompassing dataset preparation, fine-tuning (for both misalignment and realignment), and granular model evaluation: Figure 1

Figure 1: Overview of the evaluation workflow, from misalignment dataset assembly through iterative misalignment/realignment cycles and downstream unsafety/utility evaluation.

Data Construction:

The misalignment dataset (MisQA) comprises unsafe prompts and corresponding unsafe/safe responses across 13 categories. For realignment, established preference datasets (safe-rlhf, hh-rlhf) are utilized, with category balancing for comparability.

Model Selection and Fine-Tuning Methods:

Experiments span four open-source safety-aligned LLMsโ€”Llama-3.1-8B-Instruct, GLM-4-9B-Chat, Gemma-2-9B-it, and Mistral-7B-Instruct-v0.3. Six fine-tuning strategies are benchmarked: four Supervised Fine-Tuning (SFT) methods (LoRA, QLoRA, AdaLoRA, IA3) and two Preference Fine-Tuning (PFT) methods (Direct Preference Optimization [DPO], Odds Ratio Preference Optimization [ORPO]).

Evaluation Metrics:

Model unsafety is quantitatively assessed via LLM-as-a-judge with majority voting using Llama-Guard-2, Llama-Guard-3, and GPT4o-mini, providing rigorous consensus measures. General utility is evaluated through aggregate scores on MMLU, GSM8K, BoolQ, and PIQA.

Impact of Fine-Tuning on LLM Misalignment

Experiments reveal crucial distinctions in both model- and method-specific vulnerability: Figure 2

Figure 2: Unsafety metric escalation across LLMs and fine-tuning protocols following misalignment; ORPO delivers notably higher unsafety in most cases.

Misalignment Effectiveness:

  • ORPO outperforms all other methods in inducing unsafety, especially on otherwise resilient models (e.g., Gemma2).
  • LoRA achieves effective misalignment for most models with only one unsafe sample per label (total 13), highlighting significant data efficiency.
  • Model resistance is highly heterogeneous: Gemma2 demonstrates elevated resistance to SFT-based attacks but is compromised by ORPO.

Category-Wise Effects:

After misalignment, baseline defense heterogeneities across categories are largely nullified, and unsafety distributions homogenize across methods and architectures, implicating dataset semantics over model idiosyncrasy. Figure 3

Figure 3: Unsafety scores in 10 categories for baseline versus misaligned models; filled grey polygons for baseline, outlined red for misaligned. Larger area denotes lower safety.

Data Efficiency:

All methods, except IA3 and DPO, rapidly increase unsafety with even minimal exposure to unsafe samples, converging before 30 samples per label: Figure 4

Figure 4: Relationship between dataset size (samples per label) and resulting model unsafety across fine-tuning protocols.

Realignment Dynamics and Methodological Asymmetries

Restoring safety to previously misaligned models is resource- and method-sensitive:

Realignment Performance:

  • DPO is empirically the most robust at restoring safety but consistently reduces general utility.
  • For Gemma2, further realignment may paradoxically increase unsafety if the model was already highly resistant to initial misalignment. Figure 5

Figure 5

Figure 5: Change ($\Delta S_{\mathrm{utility}$) in average utility score pre- and post-realignment; utility tradeoffs strongly depend on method and initial misalignment technique.

Method Interactions:

  • SFT-based misalignment is more easily countered by DPO-based realignment.
  • Models misaligned with IA3 or DPO tend to be less recoverable.

Role of Dataset:

The realignment dataset's size and coverage influence outcomes as much as fine-tuning protocol. The safe-rlhf dataset achieves more robust safety restoration than hh-rlhf due to wider categorical coverage.

Iterative Misalignmentโ€“Realignment Interplay

The study analyzes multi-round adversarial cycles, observing cumulative degradation and stabilization phenomena: Figure 6

Figure 6: Evolution of utility and unsafety over multiple attacker-defender cycles, depicting decline and eventual plateauing of both metrics.

  • Successive adversarial rounds inhibit both further misalignment and realignment, leading to diminished marginal effects and stabilizing model behavior around an intermediate safety/utility regime.

Mechanistic Insights: Internal Decoding Trajectories

Mechanistic analysis using the Logit Lens technique provides causal insight into the effects of fine-tuning at the representation level: Figure 7

Figure 7: Logit Lens visualization of Gemma2's internal token predictions under LoRA, DPO, and ORPO fine-tuning for unsafe prompts.

  • ORPO completely overwrites late-layer safety circuits, activating explicit unsafe token prediction paths.
  • DPO suppresses refusal but lacks the token-level imitation mechanism required for explicit unsafe generation.
  • LoRA is insufficient to remove late-stage safety filters in robust models.

This analysis clarifies why ORPO is uniquely potent for misalignment, while DPO's margin-based objective provides for generalizable realignment without explicit token mimicking.

Practical and Theoretical Implications

The findings mandate adoption of model- and method-specific safety monitoring and countermeasures. In particular:

  • LLMs prone to rapid misalignment via LoRA or ORPO require stricter access controls and distribution policies.
  • Defender-side interventions must consider unintended side effects, including the risk of over-correcting robust models and diminishing utility.
  • Iterative exposure to adversarial and defense fine-tuning induces a non-reversible degradation in both safety and utility, highlighting the need for proactive rather than reactive safety architecture.

Future research should address RLHF-based alignment, the influence of data diversity and adversarial composition, and real-world chaining of misalignment and realignment by distinct actors in open model supply chains.

Conclusion

The paper provides an exhaustive, empirical, and mechanistic mapping of how fine-tuning techniques can both subvert and restore LLM alignment post-training. Strong asymmetries are evident: ORPO is highly effective for attack but DPO for defense, each with distinct trade-offs regarding utility preservation. The efficacy of these dynamics is modulated by architecture, fine-tuning method, dataset properties, and game-theoretic iteration. These results are essential for informing best practices in safe deployment, monitoring, and remediation of LLMs facing adversarial supply chain risks, and will underpin future progress in neurally-grounded trust and value alignment research (2604.07754).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.