CTPD: Cross-Tokenizer Preference Distillation
- CTPD is a unified framework that transfers human-aligned preferences between models with heterogeneous tokenization via aligned span projections.
- It leverages dynamic importance sampling and teacher-anchored reference distributions to accurately align teacher and student outputs despite differing tokenizers.
- Empirical evaluations show that CTPD notably improves accuracy over traditional methods in benchmark preference transfer tasks.
Cross-Tokenizer Preference Distillation (CTPD) is a unified framework designed to transfer human-aligned behavior between LLMs with heterogeneous tokenization schemes. Traditional white-box preference distillation relies on the direct matching of log probabilities at the token level, which becomes infeasible when teacher and student models use non-identical tokenizers. CTPD addresses this cross-tokenizer incompatibility via a combination of aligned span projections, importance-weighted preference transfer, and a theoretically grounded importance sampling framework. By enabling fine-grained, white-box distillation of preference information in scenarios involving arbitrary tokenizer divergence, CTPD facilitates direct, accurate alignment of student models to teacher models even under strong vocabulary and segmentation mismatch (Nguyen et al., 17 Jan 2026).
1. Motivation and Cross-Tokenizer Challenges
Knowledge distillation and preference alignment typically presuppose congruent tokenizers, allowing loss functions such as cross-entropy or KL divergence to compare outputs directly. In practice, teacher and student LLMs originate from distinct model families employing incompatible subword tokenizations, rendering direct token-level comparison ill-posed. This mismatch obstructs the transfer of detailed, white-box preference signals, since log probability supports (token vocabularies) do not coincide, and granular preference information tied to token boundaries is inherently ambiguous across distinct tokenizations (Boizard et al., 2024).
CTPD reframes this problem by:
- Mapping tokens from both teacher and student to a shared character-based span representation.
- Assigning importance weights to spans based on teacher model contrastive preference estimates.
- Using the projected teacher as the student’s reference distribution, allowing direct DPO-style preference optimization irrespective of tokenizer differences.
2. Core Innovations of CTPD
2.1 Aligned Span Projection
CTPD introduces the concept of an aligned span, defined as follows: given a prompt-response string $y$, a subsequence of contiguous teacher tokens and a subsequence of contiguous student tokens align if both decode to the identical character interval in $y$. Spans are constructed algorithmically by tokenizing $y$ with both the teacher and student tokenizers, recording the character offsets of each token, and greedily grouping tokens to cover matching intervals.
Once spans are identified:
- The span probability factorizes over the tokens composing the span: $\pi(s_k \mid x, s_{<k}) = \prod_{t \in s_k} \pi(t \mid x, \text{preceding tokens})$.
- The span-level reward decomposes additively over spans: $r(x, y) = \sum_{k} r(x, s_k)$.
This representation is crucial for comparing teacher and student outputs over equivalent text regions, even if token segmentation differs (Nguyen et al., 17 Jan 2026).
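As a concrete illustration, the greedy character-offset grouping can be sketched as follows. This is a minimal sketch, not the paper's implementation: the function name, the `(start, end)` offset-tuple input format, and the toy segmentations are illustrative assumptions.

```python
def aligned_spans(teacher_offsets, student_offsets):
    """Group tokens from two tokenizations of the same string into
    aligned spans: maximal character intervals whose endpoints are
    token boundaries under BOTH tokenizers.

    Each input is an ordered list of (start, end) character offsets
    covering the string contiguously.
    """
    # Character positions where each tokenization places a boundary.
    t_bounds = {end for _, end in teacher_offsets}
    s_bounds = {end for _, end in student_offsets}
    cuts = sorted(t_bounds & s_bounds)  # boundaries shared by both

    spans, t_idx, s_idx, start = [], 0, 0, 0
    for end in cuts:
        # Collect the teacher/student token indices covering [start, end).
        t_group = []
        while t_idx < len(teacher_offsets) and teacher_offsets[t_idx][1] <= end:
            t_group.append(t_idx)
            t_idx += 1
        s_group = []
        while s_idx < len(student_offsets) and student_offsets[s_idx][1] <= end:
            s_group.append(s_idx)
            s_idx += 1
        spans.append(((start, end), t_group, s_group))
        start = end
    return spans

# Toy example: "unbelievable" (12 chars) tokenized two ways.
teacher = [(0, 2), (2, 9), (9, 12)]  # "un" | "believa" | "ble"
student = [(0, 9), (9, 12)]          # "unbelieva" | "ble"
# Shared boundaries {9, 12} yield two aligned spans: chars [0,9) and [9,12).
```

In practice the offsets would come from each tokenizer's offset-mapping facility; the grouping itself is a single linear pass over the shared boundary set.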
2.2 Cross-Tokenizer Token-Level Importance Sampling (TIS-DPO)
Traditionally, DPO-style preference optimization assigns equal weight to each token. The CTPD framework generalizes this by dynamically reweighting each aligned span for finer credit assignment.
- Two teacher variants are trained via DPO:
  - $\pi_{+}$ (positive teacher, aligned to preferred responses)
  - $\pi_{-}$ (negative teacher, aligned to dispreferred responses)
- For each aligned span $s_k$, the importance weight is

$$w_k = \frac{1}{Z}\,\mathrm{clamp}\big(u \cdot (\log \pi_{+}(s_k \mid x, s_{<k}) - \log \pi_{-}(s_k \mid x, s_{<k})),\, L,\, U\big)$$

Here, $\mathrm{clamp}(\cdot, L, U)$ bounds the log-ratio, $Z$ normalizes the weights across the response, and $u \in \{+1, -1\}$ selects the sign for preferred versus dispreferred responses.
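A minimal numeric sketch of this span weighting follows; the function name, clamp bounds, and mean-one normalization are illustrative assumptions rather than the paper's exact choices.

```python
def span_importance_weights(logp_pos, logp_neg, preferred=True, lo=0.1, hi=5.0):
    """Per-span importance weights from the two contrastive teachers.

    logp_pos / logp_neg: per-span log-probabilities under the positive
    and negative DPO teachers. The signed log-ratio is clamped to
    [lo, hi], then normalized to mean one across the response
    (illustrative normalization choice).
    """
    sign = 1.0 if preferred else -1.0  # dispreferred responses flip the ratio
    raw = [min(max(sign * (lp - ln), lo), hi)
           for lp, ln in zip(logp_pos, logp_neg)]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

# Spans the positive teacher prefers get up-weighted; the clamp floor
# keeps spans the negative teacher prefers from vanishing entirely.
weights = span_importance_weights([-1.0, -2.0], [-2.0, -1.0])
```

The clamp bounds trade bias for variance, which is exactly the hyperparameter sensitivity noted in the limitations section below.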
Plugging these weights into a DPO-style span-aggregated binary preference loss yields the TIS-DPO objective:

$$\mathcal{L}_{\mathrm{TIS\text{-}DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\log \sigma\big(\beta\,\Delta(x, y_w) - \beta\,\Delta(x, y_l)\big)\big]$$

where $\Delta(x, y) = \sum_k w_k \log \frac{\pi_\theta(s_k \mid x, s_{<k})}{\pi_{\mathrm{ref}}(s_k \mid x, s_{<k})}$ aggregates the importance-weighted log-ratio between student and reference (teacher) over aligned spans (Nguyen et al., 17 Jan 2026).
2.3 Teacher-Anchored Reference
Classical DPO defines a fixed reference distribution, typically a pre-trained LLM. In CTPD, the span-projected teacher itself serves as the reference, such that

$$\pi_{\mathrm{ref}}(s_k \mid x, s_{<k}) := \pi_{T}(s_k \mid x, s_{<k}).$$

This enables the student to match the teacher's preference distribution at the granularity of tracked spans.
The CTPD loss thus becomes

$$\mathcal{L}_{\mathrm{CTPD}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\log \sigma\big(\beta\,\Delta(x, y_w) - \beta\,\Delta(x, y_l)\big)\big]$$

with $\Delta(x, y) = \sum_k w_k \log \frac{\pi_\theta(s_k \mid x, s_{<k})}{\pi_{T}(s_k \mid x, s_{<k})}$.
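The pair-level objective can be sketched as a scalar computation. This is a pure-Python sketch under stated assumptions: the helper name and $\beta$ value are illustrative, and real implementations operate on batched tensors with the log-probabilities produced by the models.

```python
import math

def ctpd_pair_loss(chosen, rejected, beta=0.1):
    """-log sigmoid(beta * (Delta(y_w) - Delta(y_l))) for one preference pair.

    Each argument is a list of (w_k, logp_student, logp_teacher) triples,
    one per aligned span; the teacher term plays the role of the
    DPO reference distribution.
    """
    def delta(spans):
        # Importance-weighted student/reference log-ratio over spans.
        return sum(w * (lp_s - lp_t) for w, lp_s, lp_t in spans)

    margin = beta * (delta(chosen) - delta(rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid
```

Pushing the student above the teacher on preferred spans (and below it on dispreferred ones) widens the margin and drives the loss toward zero.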
3. Theoretical Foundations and Statistical Guarantees
CTPD is grounded in the importance sampling paradigm for stochastic optimization.
Span-Level Label Noise Bound: Using Hoeffding's inequality, a bound is provided for the probability that the average span reward of the preferred sequence is less than that of the less-preferred sequence, quantifying the reliability of the preference signal when the sample means differ.
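For concreteness: with $K$ aligned spans, per-span rewards bounded in $[a, b]$, and a true mean-reward gap of $\epsilon$ between the preferred and dispreferred responses, a Hoeffding bound of the kind invoked here takes the form (a standard instantiation for paired span differences, not necessarily the paper's exact statement):

```latex
\Pr\big[\bar r(y_w) - \bar r(y_l) \le 0\big]
  \;\le\; \exp\!\left(-\frac{K\,\epsilon^{2}}{2\,(b-a)^{2}}\right),
```

since each paired difference $r_w(s_k) - r_l(s_k)$ lies in $[a-b,\, b-a]$, an interval of width $2(b-a)$. The reliability of the preference label thus improves exponentially in the number of aligned spans and in the squared reward gap.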
Optimal Span-Level Dataset ($D^{*}$): Defined such that, for all histories $(x, s_{<k})$, the next span yields constant expected reward. This distribution is obtained by reweighting the empirical dataset $D$ by the inverse importance factors:

$$p_{D^{*}}(s_k \mid x, s_{<k}) \;\propto\; \frac{p_{D}(s_k \mid x, s_{<k})}{w_k}.$$

This structure allows unbiased estimation of expectations under $D^{*}$ via importance-weighted sampling from $D$:

$$\mathbb{E}_{s \sim D^{*}}[f(s)] \;=\; \mathbb{E}_{s \sim D}\!\left[\frac{p_{D^{*}}(s)}{p_{D}(s)}\, f(s)\right].$$
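The importance-sampling identity underlying this guarantee can be verified on a toy discrete distribution (all names and numbers below are illustrative):

```python
# Empirical span distribution p_D, per-span importance factors w,
# and per-span rewards r (toy values).
p_D = {"s1": 0.5, "s2": 0.5}
w   = {"s1": 2.0, "s2": 0.5}
r   = {"s1": 1.0, "s2": 3.0}

# Reweighted dataset D*: divide by the importance factors, renormalize.
unnorm = {s: p_D[s] / w[s] for s in p_D}
Z = sum(unnorm.values())
p_star = {s: unnorm[s] / Z for s in unnorm}

# Direct expectation of the reward under D*.
direct = sum(p_star[s] * r[s] for s in p_star)

# Same expectation, computed as an importance-weighted average under D.
via_is = sum(p_D[s] * (p_star[s] / p_D[s]) * r[s] for s in p_D)

assert abs(direct - via_is) < 1e-12  # the estimator is unbiased
```

Down-weighted spans ("s1", with large $w$) contribute less mass under $D^{*}$, shifting the expectation toward the spans the weighting deems under-represented.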
This formalism provides precise statistical guarantees for the distillation process and underpins the efficiency and effectiveness of CTPD (Nguyen et al., 17 Jan 2026).
4. Algorithmic Realization and Pseudocode
The CTPD workflow is structured as follows:
- Span Construction: For each prompt-response pair $(x, y)$, run both teacher and student tokenizers on the concatenated string and align tokens by shared character intervals, yielding aligned spans.
- Teacher Contrastive Models: Independently optimize the positive ($\pi_{+}$) and negative ($\pi_{-}$) teacher models via DPO.
- Importance Weights: Compute $w_k$ for each aligned span, using the contrastive teacher models as per the TIS formulation.
- Distillation Loop: For batches of preference pairs $(x, y_w, y_l)$, compute span-aligned rewards and optimize the CTPD loss by updating the student parameters $\theta$.
The entire process runs in time linear in sequence length, with detailed pseudocode outlined in (Nguyen et al., 17 Jan 2026). Character-level span alignment provides a robust mechanism for handling arbitrary tokenizer mismatches.
5. Comparative Methodology and Related Approaches
CTPD builds on and extends several prior frameworks for cross-tokenizer knowledge distillation:
- Universal Logit Distillation (ULD): As introduced in (Boizard et al., 2024), ULD employs an optimal transport (1-Wasserstein) objective to align teacher and student output distributions regardless of tokenizer. The alignment is performed via a closed-form, sorting-based Wasserstein computation, allowing “universal” discrete-support distillation. ULD supports a wide range of model architectures and tokenizer pairings, with the key distinguishing feature from CTPD being the direct probabilistic alignment of softmax outputs, as opposed to character-aligned span logit matching.
- Cross-Tokenizer Likelihood Scoring: (Phan et al., 16 Dec 2025) explores lossless and efficient algorithms for next-token likelihood scoring across diverse BPE vocabularies, providing an exact marginalization framework in the subset-vocabulary regime and a recursive, approximation-pruned algorithm for arbitrary vocabulary alignments. This forms the probabilistic basis for cross-tokenizer KL-style distillation and sampling.
- Multi-Level Optimal Transport and Other Baselines: Baseline comparisons in (Nguyen et al., 17 Jan 2026) demonstrate that while OT-based alignments (e.g., ULD, multi-level OT) are effective, CTPD yields superior results on preference transfer tasks, particularly in the full cross-tokenizer setting.
6. Empirical Evaluation and Performance Gains
Experiments in (Nguyen et al., 17 Jan 2026) substantiate the practical effectiveness of CTPD:
- Benchmarks: Tested on HellaSwag, ARC, MMLU, TruthfulQA, Winogrande, GSM8k using the UltraFeedback Binarized dataset (~63k human preference pairs), evaluated with lm-eval-harness.
- Model Pairs: Qwen-2.5-14B → Llama-3.1-8B (large scale); Qwen-2.5-7B → Llama-3.2-1B (small scale).
- Key Results:
| Setting | Teacher | Student (SFT) | TIS-DPO | CTPD | Δ (CTPD − TIS-DPO) |
|---|---|---|---|---|---|
| Large scale | 75.74% | 64.54% | 66.16% | 67.42% | +1.26 |
| Small scale | 71.95% | 41.35% | 42.60% | 43.26% | +0.66 |
CTPD demonstrates consistent, statistically significant improvements in average accuracy over strong TIS-DPO and ULD baselines, particularly in challenging cross-tokenizer distillation settings (Nguyen et al., 17 Jan 2026, Boizard et al., 2024). Related empirical results on algorithmic likelihood transfer show further advantages for cross-tokenizer pipelines (Phan et al., 16 Dec 2025).
7. Limitations, Practical Considerations, and Future Directions
Limitations:
- Span alignment depends on accurate character-level decoding; detokenization errors can induce misalignments.
- Computing both $\pi_{+}$ and $\pi_{-}$ as well as the teacher-anchored reference increases teacher-side inference cost (up to 2×).
- Clamping importance weights incurs a bias–variance tradeoff and requires hyperparameter tuning.
Practical Considerations:
- Span construction is linear in token sequence length but requires access to raw text for both models.
- Batched inference for importance-weights and reference probabilities can amortize GPU overhead.
Future Directions:
- Extension to factual knowledge distillation and tasks beyond preference modeling.
- Investigation of sub-span (byte- or character-level) importance mechanisms.
- Application to multilingual and code domains with extreme tokenizer divergence.
- Exploration of adaptive or theoretically underpinned clamping strategies for importance weights (Nguyen et al., 17 Jan 2026).
CTPD establishes a general methodology for fine-grained, theoretically principled preference transfer across heterogeneous LLM tokenizers, linking token-level credit assignment, importance sampling theory, and span-based alignment in a unified framework. The development is synergistic with probabilistic cross-tokenizer likelihood frameworks (Phan et al., 16 Dec 2025) and optimal-transport-based universal distillation losses (Boizard et al., 2024), marking significant progress in practical, model-agnostic alignment methodologies.