BiPO: Bi-directional Preference Optimization

Updated 7 December 2025
  • BiPO is a framework for preference-based learning that uses bi-directional steering vectors to modulate behavior in language models and reinforcement learning policies.
  • It leverages both vector-based and EM-based optimization methods to achieve reversible control with minimal computational overhead.
  • Empirical results demonstrate improved truthfulness, reduced hallucinations, and effective cross-model transfer, highlighting its practical significance.

Bi-directional Preference Optimization (BiPO) is a general framework for preference-based learning with broad applicability across LLM steering, reinforcement learning from human feedback (RLHF), and bandit feedback. BiPO provides an effective, lightweight, and theoretically principled method for learning from both positive and negative feedback—paired or unpaired—enabling fine-grained, reversible behavioral control in LLMs and policy optimization settings. In LLMs, BiPO yields steering vectors that encode behaviorally meaningful directions in activation space, facilitating rapid and adjustable model personalization with minimal computational overhead compared to conventional fine-tuning or reinforcement learning approaches (Cao et al., 2024, Abdolmaleki et al., 2024).

1. Underlying Principles and Motivation

Preference optimization is central to modern behavioral alignment, particularly in LLMs and RL. Classical approaches, such as full-model fine-tuning or RLHF, require substantial computational resources and risk degrading core model capabilities. Recent activation-perturbation or "activation engineering" methods address this by introducing a fixed-length steering vector $v$ in hidden activation space, added at a designated layer $L$ to bias model outputs toward a target behavior. Prior methods, including Contrastive Activation Addition (CAA), extract $v$ as the average difference of activations between preferred and non-preferred prompt completions. However, these procedures commonly rely on a one-sided contrast that assumes the model will follow specific appended choice tokens, and they fail when model generations diverge from such guidance, especially in open-ended generation or safety-critical scenarios. The result is frequently a misaligned or unreliable steering direction (Cao et al., 2024).

BiPO addresses these limitations by directly optimizing the vector $v$ to increase the likelihood ratio between complete, contrastively-labeled reference responses. This bi-directional (i.e., reversible) approach is grounded in probability and expectation-maximization (EM) theory, and can be instantiated as either a vector-based policy update (for LLMs or RL policies) or as an explicit preference-based modification of the likelihood landscape (Cao et al., 2024, Abdolmaleki et al., 2024).

2. Formal Objectives and Optimization Algorithms

2.1 Vector-based BiPO for LLM Steering

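Given a frozen LLM $\pi$, a designated hidden layer $L$, and a dataset $\mathcal{D} = \{(q^i, r_T^i, r_O^i)\}$ of user prompts $q^i$ with paired contrastive reference completions $r_T^i$ (target behavior) and $r_O^i$ (opposite behavior), BiPO introduces a learnable steering vector $v$ with the same dimensionality as the layer-$L$ activations.

For each triple, define

$$\Delta_T(v) = \log \frac{\pi_{L+1}(r_T \mid A_L(q) + v)}{\pi_{L+1}(r_T \mid A_L(q))}$$

$$\Delta_O(v) = \log \frac{\pi_{L+1}(r_O \mid A_L(q) + v)}{\pi_{L+1}(r_O \mid A_L(q))}$$

where $A_L(\cdot)$ denotes the activation at layer $L$ for the input tokens, and $\pi_{L+1}$ is the suffix model (the part of the network from layer $L+1$ onward).

The scalar loss is

$$\ell(v) = -\log \sigma\big(\beta\,[\Delta_T(v) - \Delta_O(v)]\big) + \lambda \lVert v \rVert_2^2$$

where $\sigma$ is the logistic function, $\beta$ is a contrast sensitivity parameter, and $\lambda$ regularizes the vector norm.

For symmetry, BiPO samples a direction $d \sim \mathcal{U}\{-1, +1\}$ and alternates optimization of $+v$ and $-v$ to encode the bidirectional preference signal:

$$\min_v \; -\mathbb{E}_{d \sim \mathcal{U}\{-1,+1\},\,(q, r_T, r_O) \sim \mathcal{D}} \Big[\log \sigma\big(d\,\beta\,[\Delta_T(dv) - \Delta_O(dv)]\big)\Big] + \lambda \lVert v \rVert_2^2$$

Updates are performed via AdamW on the loss gradient. The procedure ensures that $+v$ steers toward the target and $-v$ toward the opposite, reliably centering the behavioral modification (Cao et al., 2024).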
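The per-example loss can be sketched in a few lines, assuming the four sequence log-probabilities (steered and unsteered, for the target and opposite responses) have already been computed by the model. The function name `bipo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and any norm penalty on the vector is left out for brevity.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bipo_loss(logp_t_steered, logp_t_base, logp_o_steered, logp_o_base, d, beta=0.1):
    """Per-example bidirectional loss for a direction d in {-1, +1}.

    The "steered" log-probabilities are assumed to come from a forward pass
    with d * v added to the layer-L activations; "base" uses no steering.
    """
    delta_t = logp_t_steered - logp_t_base  # log-ratio for the target response
    delta_o = logp_o_steered - logp_o_base  # log-ratio for the opposite response
    return -log_sigmoid(d * beta * (delta_t - delta_o))

# With v = 0 steering changes nothing, so the loss is log(2) for either direction.
zero = bipo_loss(-12.0, -12.0, -9.0, -9.0, d=+1)
# d = +1: the target gains log-probability and the opposite loses it.
good_pos = bipo_loss(-10.0, -12.0, -11.0, -9.0, d=+1)
# d = -1: the steered pass used -v, so the opposite response should gain instead.
good_neg = bipo_loss(-14.0, -12.0, -7.0, -9.0, d=-1)
print(zero, good_pos, good_neg)
```

The two non-zero cases are mirror images of each other, so they produce the same loss. That is the symmetry the direction variable is meant to enforce.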

2.2 EM-based BiPO for General Preference Learning

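BiPO generalizes to reward-based or Q-value-based settings using the EM-derived Preference-based Maximum a Posteriori Optimization (PMPO) (Abdolmaleki et al., 2024). Here, for contexts $x$ and actions or completions $y$, binary feedback $O \in \{1, 0\}$ (preferred, dispreferred) is modeled through its marginal likelihood under the policy $\pi_\theta$:

$$p_\theta(O = 1 \mid x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[p(O = 1 \mid x, y)\big]$$

$$p_\theta(O = 0 \mid x) = 1 - p_\theta(O = 1 \mid x)$$

The objective maximizes the marginal log-likelihood of preferred outcomes:

$$\max_\theta \; \mathbb{E}_x\big[\log p_\theta(O = 1 \mid x)\big]$$

The EM formulation alternates, with $\pi_{\text{ref}}$ denoting the current reference policy:

  • E-step: $q(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\, p(O = 1 \mid x, y)$
  • M-step: $\theta \leftarrow \arg\max_\theta \mathbb{E}_x\, \mathbb{E}_{y \sim q(\cdot \mid x)}\big[\log \pi_\theta(y \mid x)\big]$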
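As a minimal sketch of the E-step/M-step alternation, consider a single context with three discrete actions and a tabular (unconstrained categorical) policy, for which the M-step has the closed form "set the policy to q". The preference probabilities in `p_pref` are made-up toy values, and the function names are illustrative rather than taken from the cited paper.

```python
import math

def e_step(policy, p_pref):
    """q(y) proportional to policy(y) * p(O=1 | y), normalized over actions."""
    weights = [pi * p for pi, p in zip(policy, p_pref)]
    total = sum(weights)
    return [w / total for w in weights]

def m_step(q):
    """For an unconstrained categorical policy, argmax of E_q[log pi] is pi = q."""
    return list(q)

def marginal_log_likelihood(policy, p_pref):
    """log p(O=1) = log sum_y policy(y) * p(O=1 | y)."""
    return math.log(sum(pi * p for pi, p in zip(policy, p_pref)))

p_pref = [0.1, 0.5, 0.9]          # hypothetical p(O=1 | y) for three actions
policy = [1 / 3, 1 / 3, 1 / 3]    # uniform initial policy
history = [marginal_log_likelihood(policy, p_pref)]
for _ in range(20):
    policy = m_step(e_step(policy, p_pref))
    history.append(marginal_log_likelihood(policy, p_pref))
print(policy, history[0], history[-1])
```

In this toy setting the marginal log-likelihood never decreases from one iteration to the next, and the policy concentrates on the action most likely to be preferred. With a parametric policy the M-step would instead be a weighted maximum-likelihood fit.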

The method extends beyond paired feedback: with unpaired or negative-only data, negative feedback is incorporated in the M-step with a KL anchor term, crucial for stability.

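Full M-step objective:

$$J(\theta) = \alpha\, \mathbb{E}_{(x, y) \sim \mathcal{D}_a}\big[\log \pi_\theta(y \mid x)\big] - (1 - \alpha)\, \mathbb{E}_{(x, y) \sim \mathcal{D}_r}\big[\log \pi_\theta(y \mid x)\big] - \beta\, \mathbb{E}_x\big[\mathrm{KL}\big(\pi_{\text{ref}}(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x)\big)\big]$$

Here, $\mathcal{D}_a$ (accept/preferred) and $\mathcal{D}_r$ (reject/dispreferred) can be decoupled, with $\alpha \in [0, 1]$ trading off positive and negative emphasis and $\beta$ weighting the KL anchor to the reference policy $\pi_{\text{ref}}$ (Abdolmaleki et al., 2024).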
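To make the roles of the accept term, the reject term, and the KL anchor concrete, the following sketch evaluates an objective of this form for a tabular policy over four actions in a single context. The function `pmpo_objective`, its defaults, and the toy distributions are assumptions for illustration only.

```python
import math

def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pmpo_objective(policy, ref, accepted, rejected, alpha=0.5, beta=1.0):
    """alpha * mean log pi(accepted) - (1 - alpha) * mean log pi(rejected)
    - beta * KL(ref || pi).

    `accepted` and `rejected` are lists of action indices; either may be
    empty, which corresponds to negative-only or positive-only feedback.
    """
    def mean_logp(actions):
        if not actions:
            return 0.0
        return sum(math.log(policy[a]) for a in actions) / len(actions)

    return (alpha * mean_logp(accepted)
            - (1 - alpha) * mean_logp(rejected)
            - beta * kl_categorical(ref, policy))

ref = [0.25, 0.25, 0.25, 0.25]
# More mass on accepted action 0, less on rejected action 3.
shifted = [0.40, 0.30, 0.20, 0.10]
at_ref = pmpo_objective(ref, ref, accepted=[0], rejected=[3])
at_shifted = pmpo_objective(shifted, ref, accepted=[0], rejected=[3])
print(at_ref, at_shifted)
```

Moving probability from the rejected action to the accepted one raises the objective even after paying the KL penalty for leaving the reference policy.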

3. Experimental Results and Empirical Properties

3.1 LLM Steering Performance

Experiments on Llama-2-7b-chat-hf and Mistral-7B-Instruct demonstrate pronounced improvements over CAA and Freeform baselines:

| Task (Llama-2-7b) | Metric | CAA Baseline | BiPO (range over multiplier α) |
| --- | --- | --- | --- |
| Persona / Power-seeking | GPT-4 Score | ~1.7 | 1.2 → 2.4 (α scaled −2 → +2) |
| Truthfulness (TruthfulQA) | MC1 Acc. | < +2% | +10% (positive α), −10% (negative α) |
| Hallucination | Halluc. Rate | Unreliable | 65% (max α) ↔ <5% (min α) |
| Jailbreak (ASR) | Success Rate | 0% | 73% (α = +1); 0% defense (α = −1) |

Scaling the applied vector by a multiplier $\alpha$ flexibly tunes the degree and direction of steering. Effects on utility (MMLU accuracy) remain negligible (<0.5% variation across the tested values of $\alpha$), indicating that foundational knowledge capabilities are preserved (Cao et al., 2024).

3.2 Transferability and Synergy

BiPO steering vectors $v$ exhibit substantial transferability:

  • Cross-model: Application from Llama-2-7b chat to Vicuna-7B yields similar persona steering curves.
  • Cross-LoRA: BiPO trained with Llama-2 also steers LoRA-fine-tuned derivatives (e.g., Llama-2-Chinese-7B-Chat), maintaining behavior control across languages.
  • Vector synergy: Additive composition of multiple vectors (e.g., "power" + "wealth") results in steering that expresses both behaviors in fused generations.
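Scaling and additive composition both reduce to simple arithmetic on the activations at the steered layer. The sketch below uses plain Python lists in place of real model activations; `apply_steering`, the toy activations, and the two "behavior" vectors are hypothetical stand-ins, not values from the paper.

```python
def apply_steering(hidden, vectors, alphas):
    """Add sum_k alphas[k] * vectors[k] to every token's activation.

    hidden:  list of per-token activation vectors at the steered layer
    vectors: list of steering vectors (same width as the activations)
    alphas:  one multiplier per vector; negative values reverse the direction
    """
    width = len(hidden[0])
    combined = [sum(a * v[j] for a, v in zip(alphas, vectors)) for j in range(width)]
    return [[h[j] + combined[j] for j in range(width)] for h in hidden]

hidden = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]   # two tokens, width 3 (toy values)
v_power = [1.0, 0.0, 0.0]                      # hypothetical "power" vector
v_wealth = [0.0, 0.5, 0.0]                     # hypothetical "wealth" vector

both = apply_steering(hidden, [v_power, v_wealth], [1.0, 1.0])
reversed_power = apply_steering(hidden, [v_power], [-2.0])
print(both)
print(reversed_power)
```

In a real model the same addition would typically be performed inside a hook on the chosen layer, so that the downstream layers see the shifted activations.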

3.3 RL and Control Tasks

In synthetic bandit optimization, the DeepMind Control Suite, and robotics (RGB stacking), BiPO/PMPO matches or outperforms established baselines including MPO and DPO, and maintains stable improvement under both positive-only and negative-only supervision. In negative-only settings, a large KL regularization weight $\beta$ is essential to avoid policy collapse. Offline RL ablations confirm that combining accept, reject, and behavior-cloning feedback yields the best returns (e.g., ~93 vs. 77 with reject+BC alone) (Abdolmaleki et al., 2024).
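Why negative-only feedback needs a strong KL anchor can be seen in a toy tabular case. Assume a single context, one rejected action, and the objective $-c \log \pi(y_r) - \beta\, \mathrm{KL}(\pi_{\text{ref}} \| \pi)$. Up to a constant this is a weighted log-likelihood with weights $\beta\, \pi_{\text{ref}}(y)$, reduced by $c$ on the rejected action, so a proper maximizer exists only if $\beta\, \pi_{\text{ref}}(y_r) > c$. The function below is an illustrative sketch of that analysis, not the cited paper's procedure.

```python
def negative_only_optimum(ref, rejected, beta, reject_weight=1.0):
    """Maximizer of -reject_weight * log pi[rejected] - beta * KL(ref || pi).

    Up to a constant the objective is sum_i w_i * log pi[i] with
    w_i = beta * ref[i] for i != rejected and
    w_rejected = beta * ref[rejected] - reject_weight.
    If every w_i is positive the maximizer is pi = w / sum(w). Otherwise the
    optimum pushes pi[rejected] to zero (and the objective is unbounded when
    w_rejected is negative), which is reported by returning None.
    """
    weights = [beta * r for r in ref]
    weights[rejected] -= reject_weight
    if weights[rejected] <= 0:
        return None
    total = sum(weights)
    return [w / total for w in weights]

ref = [0.25, 0.25, 0.25, 0.25]
weak = negative_only_optimum(ref, rejected=3, beta=2.0)      # anchor too weak
strong = negative_only_optimum(ref, rejected=3, beta=8.0)    # anchor strong enough
very_strong = negative_only_optimum(ref, rejected=3, beta=100.0)
print(weak, strong, very_strong)
```

With a uniform reference over four actions the threshold in this toy case is $\beta > 4$. Below it the rejected action's probability is driven to zero, and far above it the policy stays close to the reference.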

4. Theoretical Insights and Implementation Details

BiPO is anchored in a principled EM formalism, generalizing preference optimization to handle unpaired or one-sided feedback, a capability that standard methods relying on pairwise comparisons lack. The bi-directional alternation ensures that both $+v$ and $-v$ robustly encode behavior modulation, avoiding degenerate or unidirectional solutions. Each EM update is designed to increase (or at least maintain) the log-likelihood of desired outcomes, a guarantee that holds at every iteration until convergence (Abdolmaleki et al., 2024).

In LLM steering, the practical recipe is:

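  1. Initialize $v = 0$.
  2. For each step, sample a minibatch of triples $(q, r_T, r_O)$ and a direction $d \in \{-1, +1\}$.
  3. Compute the preference differentials $\Delta_T(dv)$ and $\Delta_O(dv)$, the steered-versus-unsteered log-likelihood ratios of the target and opposite responses.
  4. Compute the loss and update $v$ using AdamW.
  5. At inference, apply $+\alpha v$ for the target or $-\alpha v$ for the opposite, tuning the multiplier $\alpha$ as needed (Cao et al., 2024).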
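The recipe above can be sketched end to end on a toy problem. Everything here is an assumption made to keep the example self-contained: the "suffix model" is a two-response softmax over `tanh` features with made-up weights `W`, a single prompt activation `ACT` stands in for the minibatch, gradients come from central finite differences rather than autograd, plain gradient descent replaces AdamW, and no norm penalty is applied.

```python
import math
import random

# Toy suffix-model weights: row 0 scores the target response, row 1 the opposite.
W = [[1.0, -0.5, 0.3], [-0.8, 0.6, 0.2]]
ACT = [0.2, -0.1, 0.4]   # toy layer-L activation for one prompt

def log_probs(act):
    """Log-softmax over the two candidate responses given an activation."""
    logits = [sum(w * math.tanh(a) for w, a in zip(row, act)) for row in W]
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def loss(v, d, beta=0.5):
    """-log sigmoid(d * beta * (Delta_T(d v) - Delta_O(d v))) for one prompt."""
    base = log_probs(ACT)
    steered = log_probs([a + d * x for a, x in zip(ACT, v)])
    z = d * beta * ((steered[0] - base[0]) - (steered[1] - base[1]))
    return math.log1p(math.exp(-z))

def grad(v, d, eps=1e-5):
    """Central finite-difference gradient of the loss with respect to v."""
    g = []
    for j in range(len(v)):
        up = v[:]
        up[j] += eps
        down = v[:]
        down[j] -= eps
        g.append((loss(up, d) - loss(down, d)) / (2 * eps))
    return g

rng = random.Random(0)
v = [0.0, 0.0, 0.0]                      # step 1: initialize at zero
for _ in range(200):
    d = rng.choice([-1, 1])              # step 2: sample a direction
    g = grad(v, d)                       # steps 3-4: differentials, loss, gradient
    v = [x - 0.1 * gj for x, gj in zip(v, g)]
print(v, loss(v, +1), loss(v, -1))
```

After training, the loss falls below its initial value of log 2 for both directions, so the same vector raises the target's likelihood when added and the opposite's when subtracted.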

Layer-selection ablations indicate that steering is broadly effective between layers 10 and 18 (best at 15), that vector efficacy saturates within 10–20 epochs, and that $\beta$ in [0.1, 0.5] is robust, whereas larger values can destabilize training.

5. Broader Applicability and Limitations

BiPO accommodates a wide variety of feedback regimes:

  • Positive-only ($\alpha = 1$), negative-only ($\alpha = 0$), or mixed ($0 < \alpha < 1$) feedback, where $\alpha$ is the accept/reject trade-off, each stabilized by tuning the KL anchor weight $\beta$.
  • Absorbs unpaired and arbitrary feedback distributions, broadening practical utility relative to DPO or reward-maximizing approaches.

Limitations:

  • LLM steering using BiPO is presently single-layer; multi-layer variants may offer greater expressivity.
  • Vector extraction requires quality reference responses—if training pairs are noisy or biased, the learned vector may overfit or misalign.
  • Out-of-distribution generalization can be limited; extreme input queries far from the preference-labeled dataset may be insufficiently steered without further adaptation.
  • For RL/control, reliable per-sample feedback and accurate reward/Q models are assumed; mis-specification or overfitting to reward artifacts is possible, suggesting regular independent evaluation (Cao et al., 2024, Abdolmaleki et al., 2024).

6. Implications for Future Research and Practical Recommendations

BiPO/PMPO introduces a general, scalable formalism for preference-guided behavioral control and alignment. Future work includes extensions to multi-layer steering (stacking a vector $v$ at each of several layers), automatic selection and curation of preference pairs for more resilient alignment, and comprehensive theoretical analyses of steering vector behavior and limits. In practical terms, initial reference policies, small learning rates, norm regularization or clamping of the steering vector, and validation with held-out human or model judges are advised to guard against reward hacking and undesirable drift (Cao et al., 2024, Abdolmaleki et al., 2024).
