Causal-Contrastive Preference Optimization (C2PO)

Updated 5 January 2026
  • Causal-Contrastive Preference Optimization (C2PO) is a framework that mitigates bias in LLM reasoning by isolating valid semantic features from spurious correlations using counterfactual interventions.
  • It employs dual-signal preference optimization, combining fairness-sensitive contrastive loss with causal margin enforcement to significantly reduce stereotypical and structural biases.
  • C2PO unifies bias diagnosis and mitigation while preserving overall model utility, providing a robust strategy for fair and reliable language model alignment.

Causal-Contrastive Preference Optimization (C2PO) is a unified alignment framework for LLMs that systematically identifies and mitigates reasoning failures induced by spurious input correlations. Unlike prior methods, C2PO combines explicit causal counterfactual signal generation with a fairness-sensitive, contrastive optimization objective to disentangle valid reasoning from bias shortcuts, addressing both stereotypical and structural forms of bias without sacrificing general model utility (Feng et al., 29 Dec 2025).

1. Motivation and Problem Landscape

LLMs frequently exhibit two classes of reasoning failures: stereotypical bias (e.g., social or demographic shortcuts such as associating “nurse” with “she”) and structural bias (surface-level heuristics such as lexical overlap or position priors replacing semantic inference). Stereotypical bias is typically exposed through benchmarks like BBQ and Unqover; structural bias is evaluated via datasets such as MNLI, HANS, Chatbot, and MT-Bench.

The root cause of these failures is the model's tendency to rely on latent, spurious feature correlations, denoted $z$ (such as gender or surface word overlap), which are easier for large pre-trained neural networks to exploit than true semantic representations $S(x)$. In ambiguous or distribution-shifted settings, this induces reliance on $P(y \mid z)$ rather than the intended $P(y \mid S(x))$, harming robustness and fairness.

Existing methods like Direct Preference Optimization (DPO) operate at the sequence or response level, making them blind to internal spurious feature dependencies. Holistic approaches (e.g., adversarial debiasing) risk over-regularizing the models, leading to reduced capacity and, potentially, masking spurious correlations rather than eliminating them.

C2PO confronts these limitations by (a) isolating causal “positive” and “negative” reasoning trajectories associated with valid and shortcut-based reasoning; (b) quantifying a causal validity margin to detect latent shortcut activations; and (c) applying a dual-signal preference optimization that reinforces valid chains and actively suppresses bias-inducing traces.

2. Formal Framework and Definitions

Let $x \in \mathcal{X}$ be the input (e.g., question plus context), and $y \in \mathcal{Y}$ the target output. Let $f_\theta(x)$ denote the model's unnormalized logits and $\pi_\theta(y \mid x)$ its predicted distribution.

Inputs are decomposed into:

  • $S(x)$: valid semantic features (the true reasoning substrate)
  • $z \in \mathcal{Z}$: spurious or bias-inducing features

The goal is to enforce $y \perp Z \mid S(x)$, i.e., the output should condition only on semantically meaningful features.

To achieve this, C2PO leverages counterfactual intervention: a counterfactual input $\tilde{x} = do(x_\text{spurious} \leftarrow x'_\text{spurious})$ replaces the spurious portion of $x$ with content sampled from its marginal, generating a negative reasoning trajectory $r^-$ that explicitly activates a shortcut.

The framework structures each training example as a triple $(x, r^+, r^-)$ (a minimal code sketch follows the list), where:

  • $r^+$: valid chain (from $S(x)$)
  • $r^-$: shortcut-activated chain (from $P(y \mid do(Z=\text{active}))$)
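
A minimal sketch of how such a triple might be represented in a training pipeline; the class and field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class CausalContrastiveTriple:
    """One C2PO training example: an input paired with a valid and a shortcut-driven trace."""
    x: str      # original input (question plus context)
    r_pos: str  # r+: reasoning chain grounded in the semantic features S(x)
    r_neg: str  # r-: reasoning chain elicited by the counterfactually activated shortcut z
```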

3. Generation of Causal-Contrastive Signals

C2PO’s training data is constructed via soft causal intervention:

  • Positive reasoning trace ($r^+$): Generated by prompting a teacher, such as GPT-4o, to produce a logic-driven response that ignores identified shortcuts.
  • Negative trace ($r^-$): Formed by actively injecting or swapping the spurious component $z$ in $x$, ensuring the model is made to reason via the shortcut.

Formally, for each shortcut feature $z$, the input is edited or counterfactually intervened upon to yield

$$\tilde{x} = do(x_\text{spurious} \leftarrow x'_\text{spurious}),$$

where $x'_\text{spurious}$ is sampled from the marginal, resulting in a chain $r^-$ that isolates the effect of $z$.
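
A hedged sketch of this construction, assuming a hypothetical `teacher_generate` callable standing in for the GPT-4o prompting described above and a simple span-replacement intervention; none of these helper names come from the paper:

```python
import random

def build_triple(x, spurious_spans, marginal_pool, teacher_generate):
    """Construct one causal-contrastive triple (x, r+, r-) via a soft counterfactual intervention.

    spurious_spans: maps a shortcut feature name (e.g., "gender") to the span it occupies in x.
    marginal_pool:  maps the same feature name to alternative values sampled from its marginal.
    """
    # Positive trace r+: teacher is prompted to reason from semantics and ignore surface cues.
    r_pos = teacher_generate(x, instruction="Reason from the semantic content only; ignore surface cues.")

    # Counterfactual input x~ = do(x_spurious <- x'_spurious): swap one spurious span with
    # a value drawn from its marginal distribution.
    feature, span = random.choice(list(spurious_spans.items()))
    x_tilde = x.replace(span, random.choice(marginal_pool[feature]))

    # Negative trace r-: a chain produced on the intervened input, isolating the effect of z.
    r_neg = teacher_generate(x_tilde, instruction="Answer using the most salient surface cue.")

    return x, r_pos, r_neg
```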

This process is efficiently bootstrapped; the reported experimental setup required only 15.4K mined triples to attain state-of-the-art bias mitigation across diverse benchmarks (Feng et al., 29 Dec 2025).

4. Optimization Objective and Loss Decomposition

C2PO’s objective is based on the causal validity score:

$$S_\theta(x, y) = \frac{\beta}{|y|^\alpha} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}),$$

where $\alpha$ penalizes verbosity and $\beta$ rescales the margin.
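
A minimal sketch of this length-normalized score, assuming the per-token log-probabilities $\log \pi_\theta(y_t \mid x, y_{<t})$ have already been gathered into a tensor:

```python
import torch

def causal_validity_score(token_logprobs: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """S_theta(x, y) = (beta / |y|^alpha) * sum_t log pi_theta(y_t | x, y_<t).

    token_logprobs holds one log-probability per token of the trace y;
    alpha penalizes verbosity and beta rescales the resulting margin.
    """
    y_len = token_logprobs.numel()
    return (beta / (y_len ** alpha)) * token_logprobs.sum()
```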

The causal margin between positive and negative traces is

$$\Delta S_\theta = S_\theta(x, r^+) - S_\theta(x, r^-),$$

interpreted as:

  • $\Delta S_\theta < 0$: model prefers the shortcut (active bias)
  • $0 < \Delta S_\theta < \delta$: latent shortcut sensitivity

The loss combines two terms:

  • Soft contrast/Semantic alignment:

$$L_\text{align}(\theta) = -\mathbb{E}\left[\log \sigma(\Delta S_\theta)\right]$$

  • Hard contrast/Bias suppression:

$$L_\text{suppress}(\theta) = \mathbb{E}\left[\max(0,\, \delta - \Delta S_\theta)^2\right]$$

Unified objective:

$$L_\text{C2PO}(\theta) = \lambda L_\text{align}(\theta) + (1 - \lambda) L_\text{suppress}(\theta)$$

where $\lambda \in [0, 1]$ trades off ranking accuracy and enforcement of a safety margin $\delta$.
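
Combining the pieces, a sketch of the per-triple margin and dual loss, reusing the `causal_validity_score` helper above (batching and any reference-model terms are omitted; the default hyperparameters are placeholders, not values from the paper):

```python
import torch
import torch.nn.functional as F

def c2po_loss(pos_logprobs: torch.Tensor, neg_logprobs: torch.Tensor,
              alpha: float = 1.0, beta: float = 1.0,
              delta: float = 1.0, lam: float = 0.5) -> torch.Tensor:
    """Dual-signal C2PO objective for a single triple (x, r+, r-)."""
    s_pos = causal_validity_score(pos_logprobs, alpha, beta)   # S_theta(x, r+)
    s_neg = causal_validity_score(neg_logprobs, alpha, beta)   # S_theta(x, r-)
    delta_s = s_pos - s_neg                                    # causal margin Delta S_theta

    # Soft contrast / semantic alignment: -log sigma(Delta S)
    l_align = -F.logsigmoid(delta_s)

    # Hard contrast / bias suppression: squared hinge enforcing the safety margin delta
    l_suppress = torch.clamp(delta - delta_s, min=0.0) ** 2

    return lam * l_align + (1.0 - lam) * l_suppress
```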

5. Fairness-Sensitive Preference Update Mechanism

The gradient of the loss is decomposed as follows:

  • For reasoning chain $y$, the token-level gradient contribution:

$$g_\theta(y) = \frac{\beta}{|y|^\alpha} \nabla_\theta \sum_t \log \pi_\theta(y_t \mid x, y_{<t})$$

  • Gradient of causal margin:

$$\nabla_\theta \Delta S_\theta = g_\theta(r^+) - g_\theta(r^-)$$

  • Aggregate gradient for the loss:

$$\nabla_\theta L_\text{C2PO} = -\,\mathbb{E}\!\left[\bigl(w_\text{soft}(\Delta S_\theta) + w_\text{hard}(\Delta S_\theta)\bigr)\,\bigl(g_\theta(r^+) - g_\theta(r^-)\bigr)\right]$$

with $w_\text{soft}(\Delta S) = \lambda\, \sigma(-\Delta S)$ and $w_\text{hard}(\Delta S) = 2(1-\lambda)(\delta - \Delta S)\, \mathbf{1}[\Delta S < \delta]$.
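
These weights follow from differentiating the two loss terms with respect to the margin; a short check, consistent with the definitions in Section 4:

$$\frac{\partial}{\partial \Delta S}\bigl[-\lambda \log \sigma(\Delta S)\bigr] = -\lambda\, \sigma(-\Delta S) = -\,w_\text{soft}(\Delta S),$$

$$\frac{\partial}{\partial \Delta S}\bigl[(1-\lambda)\max(0,\, \delta - \Delta S)^2\bigr] = -\,2(1-\lambda)(\delta - \Delta S)\, \mathbf{1}[\Delta S < \delta] = -\,w_\text{hard}(\Delta S),$$

so the chain rule $\nabla_\theta L_\text{C2PO} = \mathbb{E}\bigl[\tfrac{\partial L_\text{C2PO}}{\partial \Delta S_\theta}\, \nabla_\theta \Delta S_\theta\bigr]$ recovers the aggregate gradient above.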

At the logit level, the update is

$$\Delta \theta \propto \mathbb{E}_{(x, r^+, r^-)}\left[\nabla_\theta \log \pi_\theta(r^+ \mid x) - \alpha(x)\, \nabla_\theta \log \pi_\theta(r^- \mid x)\right]$$

where the dynamic weight $\alpha(x)$ ensures progressive debiasing until $\Delta S_\theta \geq \delta$.

As documented by the authors, this dual dynamic sustains suppression of shortcut exploitation and prevents the gradient from vanishing when bias is only latent (Feng et al., 29 Dec 2025).

6. Algorithmic Workflow

C2PO training proceeds as follows:

| Step | Description |
|------|-------------|
| 1 | Precompute causal-contrastive triples $(x, r^+, r^-)$ |
| 2 | Initialize model parameters $\theta$ (e.g., from an SFT checkpoint) |
| 3a | Sample a minibatch from the triple pool |
| 3b | For each triple, compute $S_\text{pos}$, $S_\text{neg}$, and $\Delta S$ |
| 3c | Compute $L_\text{align}$ and $L_\text{suppress}$ for each triple |
| 3d | Aggregate $L_\text{C2PO}$, compute gradients, update parameters |

Backpropagation is typically performed with an optimizer such as AdamW.
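
A condensed sketch of steps 2 through 3d as a training loop, built on the helpers above and a Hugging Face-style causal LM; `trace_logprobs` is a simplified, hypothetical helper that ignores tokenization boundary effects and batching:

```python
import torch
from torch.optim import AdamW

def trace_logprobs(model, tokenizer, x: str, trace: str) -> torch.Tensor:
    """Return log pi_theta(y_t | x, y_<t) for each token of `trace` (simplified)."""
    enc = tokenizer(x + trace, return_tensors="pt")
    prompt_len = tokenizer(x, return_tensors="pt")["input_ids"].shape[1]
    logits = model(**enc).logits[0, :-1]                         # next-token logits
    targets = enc["input_ids"][0, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logprobs[prompt_len - 1:]                             # keep only the trace tokens

def train_c2po(model, tokenizer, triples, lr: float = 1e-6, delta: float = 1.0, lam: float = 0.5):
    optimizer = AdamW(model.parameters(), lr=lr)                 # step 2: start from an SFT checkpoint
    for x, r_pos, r_neg in triples:                              # step 3a: iterate over the triple pool
        pos_lp = trace_logprobs(model, tokenizer, x, r_pos)      # step 3b: per-token log-probs for r+
        neg_lp = trace_logprobs(model, tokenizer, x, r_neg)      #          and for r-
        loss = c2po_loss(pos_lp, neg_lp, delta=delta, lam=lam)   # steps 3b-3c: S_pos, S_neg, Delta S, losses
        optimizer.zero_grad()                                    # step 3d: gradients and parameter update
        loss.backward()
        optimizer.step()
```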

7. Empirical Results and Impact

C2PO demonstrates superior performance across a range of benchmarks:

  • Stereotypical bias: On BBQ, ambiguous-context accuracy increases from 50.4% (DPO baseline) to 97.5%, with bias scores decreasing from 5.6 to 3.6. On Unqover, accuracy rises from 23.3% to 91.8%, bias drops from 7.4 to 3.0.
  • Structural bias: MNLI accuracy improves from 67.2% to 86.2%. On HANS lexical-overlap diagnostics, accuracy increases from 57.9% to 95.9%.
  • Out-of-domain fairness: StereoSet accuracy improves from 41.8% to 67.2%. WinoBias performance remains stable.
  • General utility: MMLU accuracy is maintained above 75% (versus roughly 50% for naïve debiasing methods), and GSM8K accuracy remains at approximately 80%.

These results indicate that C2PO substantially mitigates both stereotypical and structural bias without compromising or silencing general reasoning capability. The inclusion of the hard contrast term ensures true causal unlearning of shortcut dependencies rather than superficial masking (Feng et al., 29 Dec 2025).

8. Relation to Contrastive Preference Optimization (CPO)

Contrastive Preference Optimization (CPO) (Feng et al., 23 Feb 2025) transitions LLM training from next-token prediction to a sequence-level contrastive paradigm, allowing ranking or preference between sequence continuations. While CPO improves instruction following and open-ended generation, it does not explicitly target causal disentanglement of spurious feature correlations.

C2PO extends these ideas with:

  • Causal structure: defining and intervening on latent shortcut features $z$,
  • Explicit counterfactual construction for negative traces,
  • Causal margin-based dual loss and dynamic logit-level updates for disentangling valid from shortcut-based reasoning.

This suggests that C2PO is the first framework to unify bias diagnosis and mitigation at the reasoning trajectory level via causal-contrastive signals, above and beyond sequence-level preference optimization alone.

9. Significance and Outlook

By systematically diagnosing, quantifying, and debiasing shortcut exploitation in LLMs through explicit causal-contrastive training signals, C2PO offers an empirically validated path towards fairer, more reliable, and robust LLMs. The data efficiency of C2PO and its ability to maintain utility demonstrate advances in LLM alignment methodology. Further research may extend these causal-contrastive principles to new modalities or broader classes of reasoning failures.
