
TS Kahneman-Tversky Optimization (TKTO)

Updated 27 January 2026
  • TS Kahneman-Tversky Optimization (TKTO) is a prospect theory–informed alignment framework that uses a structured two-stage process combining supervised warm-up and prospect-weighted optimization.
  • It integrates behavioral economics principles such as loss aversion and reference dependence to robustly handle noisy or imbalanced feedback.
  • The methodology employs batchwise normalization and explicit utility transforms to ensure stable gradient computation and improved performance across domains.

TS Kahneman-Tversky Optimization (TKTO) is an advanced alignment methodology that extends the prospect-theoretic optimization paradigm for modern LLMs, coding agents, and time-series systems. It synthesizes behavioral economics insights, especially loss aversion and reference dependence, into direct preference learning, policy adaptation, and cross-task generalization regimes. The distinguishing feature of TKTO is its structured two-stage recipe, in which an initial supervised or warm-up fine-tuning phase is followed by prospect-theory-weighted optimization, often incorporating batchwise normalization and explicit loss-aversion weighting.

1. Theoretical Basis: Prospect Theory and Human-Aware Optimization

Kahneman and Tversky’s prospect theory posits a nonlinear, reference-relative mapping from objective rewards to subjective utility, with an asymmetric response to gains versus losses. The canonical value function is

v(x) = \begin{cases} x^{\alpha}, & x \geq 0 \\ -\lambda\,(-x)^{\beta}, & x < 0 \end{cases}

with diminishing sensitivity ($\alpha, \beta < 1$) and loss aversion ($\lambda > 1$) (Ethayarajh et al., 2024). TKTO instantiates these principles in computational alignment objectives by (i) explicitly separating “gains” (desirable outputs) from “losses” (undesirable outputs), (ii) anchoring utility judgments to a running baseline (such as model–reference KL divergence or learned composite scores), and (iii) mapping outcomes via S-shaped (e.g., sigmoid) transforms to reflect diminishing marginal sensitivity and robustify updates to labeling noise or outliers.
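For concreteness, the canonical value function can be sketched in a few lines of Python. The defaults below are the classic Tversky–Kahneman (1992) parameter estimates ($\alpha = \beta = 0.88$, $\lambda = 2.25$), used here purely for illustration:

```python
import numpy as np

def kt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Canonical prospect-theory value function.

    Gains are compressed as x**alpha, losses as -lam * (-x)**beta.
    alpha, beta < 1 give diminishing sensitivity; lam > 1 gives loss
    aversion (losses loom larger than equal-sized gains).
    """
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.abs(x) ** alpha, -lam * np.abs(x) ** beta)
```

Note the asymmetry: a unit loss maps to $-2.25$ in subjective utility while a unit gain maps to $+1$, which is exactly the loss-aversion property TKTO carries into its alignment objective.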

2. Mathematical Formulation of TKTO Objectives

The general TKTO loss partitions candidate outputs $(x, y)$ into desirables ($b = 1$) and undesirables ($b = 0$), computing a prospect-theoretic utility around a context-dependent reference point $z_0$. For models outputting (possibly autoregressive) probabilities, the policy log-ratio is

r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}

The reference point $z_0$ is typically a running or batchwise estimate, such as

z_0 = \mathbb{E}_{(x, y)} \left[ \mathrm{KL}\bigl(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\bigr) \right]

The value function for each (x, y, b) triple is

v(x, y) = \begin{cases} \lambda_D \, \sigma\big(\beta\,(r_\theta(x, y) - z_0)\big), & b = 1 \\ \lambda_U \, \sigma\big(\beta\,(z_0 - r_\theta(x, y))\big), & b = 0 \end{cases}

where $\sigma(\cdot)$ is the logistic function, $\lambda_U > \lambda_D$ encodes loss aversion, and $\beta$ is a slope parameter. The TKTO loss is the sample mean shortfall from the corresponding baseline:

\mathcal{L}_{\mathrm{TKTO}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big[ \lambda_{b_i} - v(x_i, y_i) \big]

This form admits stable gradient computation and robust performance in the presence of imbalanced, noisy, or weak feedback (Liu et al., 2024, Ethayarajh et al., 2024). In diffusion models, the prospect transform is defined over densities, partitioning $\mathbb{R}^d$ into desirable and undesirable regions (Kawata et al., 5 Feb 2025).
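A minimal sketch of this objective follows; it is illustrative only, not the authors' reference implementation, and the batch mean of the log-ratios stands in for the KL-based reference point $z_0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tkto_loss(logp_theta, logp_ref, desirable, beta=1.0, lam_d=1.0, lam_u=1.5):
    """Illustrative TKTO loss on a batch (a sketch, not reference code).

    logp_theta / logp_ref: per-example sequence log-probabilities under the
    policy and frozen reference model; desirable: boolean mask (b = 1 / b = 0).
    """
    r = logp_theta - logp_ref                 # policy log-ratio r_theta(x, y)
    z0 = r.mean()                             # batchwise proxy for z_0
    lam = np.where(desirable, lam_d, lam_u)   # baseline lambda_{b_i}
    gain_side = sigmoid(beta * (r - z0))      # b = 1 branch
    loss_side = sigmoid(beta * (z0 - r))      # b = 0 branch
    v = lam * np.where(desirable, gain_side, loss_side)
    return (lam - v).mean()                   # mean shortfall from baseline
```

At initialization, where the policy equals the reference ($r = z_0 = 0$), every $\sigma(\cdot)$ term is $0.5$, so each example contributes half its $\lambda$ weight to the loss; training then pushes desirable log-ratios above the reference and undesirable ones below it.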

3. Two-Stage Optimization and Batchwise Decorrelation

TS (Two-Stage) TKTO incorporates an initial warm-up phase followed by batch-normalized prospect-theoretic alignment. The canonical pipeline is:

Stage I: Supervised Fine-Tuning (“warm-up”)

  • One epoch of SFT or another direct-alignment objective (e.g., IPO) is applied solely to “chosen” (positively labeled) examples.
  • The aim is to bring the model close to the instruction domain and to provide a stable starting reference policy $\pi_{\mathrm{ref}}$.

Stage II: TKTO Proper with TS Normalization

  • For pairs or batches, compute logit increments

\Delta_w = \beta \big[ \log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x) \big]

\Delta_\ell = \beta \big[ \log \pi_\theta(y_\ell \mid x) - \log \pi_{\mathrm{ref}}(y_\ell \mid x) \big]

  • Batch means $\mu^+, \mu^-$ are computed per role (winner/loser).
  • The pairwise loss applies a sigmoid nonlinearity after centering (e.g., $\hat{h}_w = \sigma(\Delta_w - \mu^-)$) and optimizes

\mathcal{L}_{\mathrm{TSKTO}}(\theta) = - \mathbb{E}_{(x, y_w, y_\ell)} \big[ \log \sigma( \hat{h}_w - \hat{h}_\ell ) \big]

This batchwise decorrelation step improves numerical stability, sharpens discrimination between preferred and rejected examples, and robustifies against batch sampling variance (Garg et al., 2024). Hyperparameters such as $\beta$ and the batch size control alignment strength and convergence.
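The Stage II computation can be sketched as below. This is an illustrative reading of the recipe, not the authors' code; in particular, the symmetric centering of the loser term ($\hat{h}_\ell = \sigma(\Delta_\ell - \mu^+)$) is an assumption extrapolated from the winner-side example given in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ts_kto_pairwise_loss(logp_w, logp_l, ref_w, ref_l, beta=1.0):
    """Sketch of the Stage II batch-normalized pairwise loss.

    Inputs are arrays of winner/loser sequence log-probabilities under the
    policy (logp_*) and the frozen reference (ref_*), one entry per pair.
    """
    d_w = beta * (logp_w - ref_w)            # Delta_w per pair
    d_l = beta * (logp_l - ref_l)            # Delta_l per pair
    mu_pos, mu_neg = d_w.mean(), d_l.mean()  # batch means per role
    h_w = sigmoid(d_w - mu_neg)              # center winners vs. loser mean
    h_l = sigmoid(d_l - mu_pos)              # assumed symmetric centering
    return -np.log(sigmoid(h_w - h_l)).mean()
```

Centering each role against the opposite role's batch mean is what the text calls batchwise decorrelation: it removes the shared batch-level offset so the sigmoid operates on the discriminative margin between winners and losers.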

4. Applications Across Domains

Conversational Agents and Customer Care:

TS-KTO achieves increased adherence to hard guardrails and improved conversational naturalness over SFT and even alternative direct-optimization methods such as IPO, as measured by win-rates on adherence, naturalness, and hallucination metrics (Garg et al., 2024).

Coding Agents:

In coding LMs, TKTO is typically embedded via DSTC (Direct Preference Learning with Only Self-Generated Tests and Code); it notably improves pass@1 scores when combined with minimaxed self-generated code–test pairs, enabling robust alignment without external annotation (Liu et al., 2024). In multi-turn, tool-using coding agents, test-time-scaled TKTO (entropy-regularized, high-diversity sampling) is essential for strong performance on real-world benchmarks such as SWE-bench, especially when evaluated at scale and with hybrid inference selection (Yu et al., 15 Sep 2025).

Time-Series Anomaly Detection and Multi-Task LLMs:

TKTO has been adapted to cross-task generalization in time-series and AD, leveraging multi-dimensional feedback, continuous preference weighting, and reference-based prospect transforms to enhance reasoning and zero-shot transfer (Sun et al., 20 Jan 2026).

Diffusion Model Alignment:

In continuous data domains (e.g., diffusion models), KTO objectives can be solved via direct distributional optimization using dual averaging, with theoretical convergence and sampling guarantees independent of the isoperimetric (LSI) constants (Kawata et al., 5 Feb 2025).

5. Practical Implementation, Hyperparameters, and Pipeline

Practical deployment of TS-KTO involves the following elements:

  • Initialization: Reference model (often an SFT checkpoint) as $\pi_{\mathrm{ref}}$.
  • Data Construction: Explicit negative examples are required; positive/negative ratio of $2:1$ is typical in small SLM pipelines. For DSTC, self-supervised minimax selection ensures reliable preference signal.
  • Batch Size: $B \geq 8$ is required for stable batchwise reference computation.
  • Learning Rates: Example values are $2 \times 10^{-6}$ (warm-up) and $5 \times 10^{-7}$ (TS-KTO proper) (Garg et al., 2024); $5 \times 10^{-7}$ is typical in code LMs (Liu et al., 2024).
  • Loss Aversion Parameters: $\lambda_U > \lambda_D$, with $\lambda_D = 1$ and $\lambda_U = 1.5{-}2$ common to reflect human-like aversion to negative outcomes.
  • Sigmoid Steepness ($\beta$): $0.5 \leq \beta \leq 2$; monitoring the empirical distribution of $r - z_0$ is recommended for optimal calibration.
  • Epochs: 1–2 typically suffice. Excessive epochs or high learning rates can cause overfitting or collapse to trivial outputs, especially in low-data regimes or without sufficient reference anchoring.
  • Regularization: KL anchoring is often used for additional stability.

Pseudocode for TS-KTO stages, as well as ablation of batch decorrelation or reference point frequency (batch- vs epoch-wise), is provided in (Garg et al., 2024, Liu et al., 2024). For coding agents, separate sampled code and tests, minimaxed pass matrices, and code–test concatenation are crucial for effective preference learning.
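The hyperparameters above can be summarized as a two-stage recipe configuration. The dictionary keys and stage names below are hypothetical placeholders for illustration, not an actual library API; the numeric values are taken from the list above:

```python
# Hypothetical TS-KTO recipe configuration; key names are illustrative
# placeholders, with values drawn from the hyperparameter list above.
ts_kto_config = {
    "stage1_warmup": {
        "objective": "sft",       # or another direct-alignment objective (e.g., IPO)
        "data": "chosen_only",    # positively labeled examples only
        "epochs": 1,
        "lr": 2e-6,
    },
    "stage2_tkto": {
        "objective": "ts_kto",
        "epochs": 1,              # 1-2 epochs typically suffice
        "lr": 5e-7,
        "batch_size": 8,          # B >= 8 for stable batchwise reference
        "beta": 1.0,              # sigmoid steepness, within [0.5, 2]
        "lambda_d": 1.0,          # desirable weight
        "lambda_u": 1.5,          # undesirable weight; lambda_U > lambda_D
        "pos_neg_ratio": 2.0,     # roughly 2:1 positives to negatives
        "kl_anchor": True,        # KL regularization for added stability
    },
}
```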

6. Empirical Performance and Comparative Results

Extensive benchmarking demonstrates that TS-KTO consistently outperforms SFT and matches or surpasses DPO/IPO-style methods:

  • Conversational bots (customer care): TS-KTO achieves +7 percentage points in adherence and +20 percentage points in naturalness over SFT-only, with stable or slightly improved hallucination rates (Garg et al., 2024).
  • Small LMs (Qwen2-0.5B): On standard benchmarks (MMLU, CMMLU, C-Eval, HumanEval, GSM8K), two-stage SFT→KTO outperforms pure SFT or pure KTO by 10–20% relative, and model-weight fusion yields balanced multi-task gains (Zhai, 2024).
  • Coding LMs (Starcoder2-15B, Deepseek-33B): DSTC+TKTO improves pass@1 by 1–4 points across HumanEval, MBPP, and BCB splits; standard KTO without DSTC collapses or stagnates (Liu et al., 2024).
  • Coding agents (SWE-bench): With $N = 16$ test-time rollouts and hybrid filtering, TKTO+TTS yields 59.4–59.8% pass@1 on Qwen3-Coder-30B, exceeding other open-weight models many times its size (Yu et al., 15 Sep 2025).
  • Time-series LLMs (ChatAD-Qwen2.5): Reasoning/format scores and cross-task metrics rise 10–15 points over SFT alone, with no sacrifice in base AD accuracy (Sun et al., 20 Jan 2026).

Empirically, loss-aversion and batchwise reference adjustment are especially effective in data regimes characterized by label noise, label imbalance, or partially informative preference data. The ability to operate on mixed or even unpaired examples—without the need for human-annotated preference pairs—is a recurring advantage (Ethayarajh et al., 2024, Liu et al., 2024).

7. Critical Discussion, Limitations, and Extensions

Advantages

  • Leverages psychological loss aversion, improving model robustness to negative examples and noisy/partial feedback (Ethayarajh et al., 2024, Zhai, 2024).
  • Supports both paired and unpaired training regimes, requiring only binary or continuous preference signals.
  • Robust to data imbalance and label noise, especially in weak-feedback or synthetic-pair settings.
  • Outperforms single-stage SFT and matches or exceeds other direct preference/identity losses (DPO, IPO) across LLM, code, diffusion, and time-series domains.

Limitations

  • Selection of hyperparameters (especially $\lambda$, $\beta$, and reference computation frequency) remains heuristic and sensitive to data/task shifts (Zhai, 2024).
  • Reliance on LLMs or automated feedback for data construction can propagate systematic biases.
  • Practical deployment is sensitive to batch size, learning rate, and composition of positive/negative data.
  • Empirical scaling evidence exists for 0.5B–30B models, but further scalability, especially to continually online-learning or ultra-large models, remains to be established.

Future Directions

  • Meta-learning of prospect parameters ($\alpha$, $\beta$, $\lambda$) via bi-level optimization.
  • Dynamic or online adjustment of loss aversion or reference points during training.
  • Multi-dimensional or structure-aware preference modeling (e.g., via composite scores or multi-task feedback, as in time-series TKTO).
  • Wider adoption in structured generation, model-based RL, and reward-free preference optimization (Kawata et al., 5 Feb 2025, Sun et al., 20 Jan 2026).

In summary, TS Kahneman-Tversky Optimization represents a rigorously motivated, practically validated, and broadly applicable approach to alignment and preference learning, integrating prospect-theoretic risk sensitivity, batchwise normalization, and multi-stage training to yield significant and scalable improvement across LLM, coding, and time-series domains (Ethayarajh et al., 2024, Garg et al., 2024, Yu et al., 15 Sep 2025, Liu et al., 2024, Zhai, 2024, Sun et al., 20 Jan 2026, Kawata et al., 5 Feb 2025).
