Bilevel Scheduled Sampling
- Bilevel scheduled sampling is a training strategy that integrates both token- and sentence-level quality signals to mitigate exposure bias in sequence models.
- It fuses decoder token probabilities with global evaluation metrics like BLEU and cosine similarity to decide whether to use ground-truth or model-predicted tokens.
- Empirical studies in dialogue, translation, and summarization demonstrate that this approach enhances performance metrics and human-perceived fluency over traditional methods.
Bilevel scheduled sampling comprises a family of training strategies in sequence modeling, particularly neural language generation, that address exposure bias by integrating dual-level (bilevel) decision signals into the process of alternately feeding ground-truth and model-predicted tokens to the decoder during training. The term "bilevel" refers either to the simultaneous use of word-level and sentence-level quality signals to guide input token selection (Liu et al., 2023), or to the combination of scheduling policies along distinct temporal axes (e.g., training-step and decoding-step) to modulate the teacher-forcing schedule (Liu et al., 2021). Bilevel approaches have demonstrated improved mitigation of exposure bias and enhanced model performance compared to standard scheduled sampling.
1. Motivation: Exposure Bias and Scheduled Sampling
Exposure bias arises in autoregressive sequence models because standard maximum likelihood training exposes the model only to gold-standard prefix histories, whereas at inference time the model is conditioned on its own previous predictions. This divergence means that model errors can snowball during inference, as an initial mistake at time $t$ perturbs the subsequent conditioning context, leading to increased downstream errors.
Scheduled sampling, introduced by Bengio et al. (2015), stochastically replaces gold tokens with model-generated tokens during training: at each step $t$, the decoder input is sampled from the model with probability $1 - \epsilon_i$, where the teacher-forcing probability $\epsilon_i$ typically decays over training epochs. This stochastic teacher-forcing partially alleviates exposure bias by gradually simulating inference conditions during training, yet conventional scheduled sampling schemes typically use a simple, time-dependent schedule and rely only on token-level probabilities without explicit context awareness (Liu et al., 2023, Liu et al., 2021).
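As a concrete illustration, the decaying teacher-forcing probability and the per-step input choice can be sketched as follows. The schedule shapes (exponential, inverse-sigmoid, linear) follow Bengio et al. (2015), but the hyperparameter values here are illustrative assumptions:

```python
import math
import random

def teacher_forcing_prob(step, schedule="exponential", k=0.99, c=5000.0):
    """Probability eps_i of feeding the gold token at training step i.

    Decay shapes follow Bengio et al. (2015); k and c are illustrative
    hyperparameters, not values from the paper.
    """
    if schedule == "exponential":      # eps_i = k^i, with 0 < k < 1
        return k ** step
    if schedule == "inverse_sigmoid":  # eps_i = c / (c + exp(i / c))
        return c / (c + math.exp(step / c))
    if schedule == "linear":           # linear decay, clipped at a floor
        return max(0.1, 1.0 - step / c)
    raise ValueError(f"unknown schedule: {schedule}")

def pick_input(gold_token, model_token, step):
    """Stochastically choose the next decoder input during training."""
    eps = teacher_forcing_prob(step)
    return gold_token if random.random() < eps else model_token
```

At step 0 the model is always teacher-forced; as training progresses, model predictions are fed back increasingly often, approximating inference-time conditions.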
2. Bilevel Scheduled Sampling with Sentence- and Word-level Signals
The bilevel scheduled sampling framework described in (Liu et al., 2023) introduces a two-level mechanism that combines sentence-level and word-level quality signals to determine, at each step, whether the next decoder input should be the model's own prediction or the ground-truth token.
Given input context $x$ and output sequence $y = (y_1, \dots, y_T)$, with model parameters $\theta$, the process is as follows:
- Candidate sequence generation: Greedily decode the candidate response $\hat{y}$: $\hat{y}_t = \arg\max_{w} p(w \mid \hat{y}_{<t}, x; \theta)$.
- Quality assessment:
  - Sentence-level score $s_{\text{sent}}$ can be either:
    - averaged 1–4-gram BLEU without smoothing, computed between the candidate $\hat{y}$ and the reference $y$, or
    - cosine similarity between the mean token embeddings of $\hat{y}$ and $y$.
  - Word-level score is the decoder's probability of the target token: $s_{\text{word},t} = p(y_t \mid y_{<t}, x; \theta)$.
- Score fusion and sampling probability: Combine the sentence- and word-level scores into a fused score $s_t$ and map it to a sampling probability through a smooth function, e.g., a sigmoid $p_t = \sigma(\lambda(s_t - b))$ with scale $\lambda$ and shift $b$.
- Input selection: For each token position $t$, sample a Bernoulli variable $z_t \sim \mathrm{Bernoulli}(p_t)$ to decide whether the decoder input is the ground-truth token $y_t$ or the model prediction $\hat{y}_t$.
To encourage sampling diversity, when the model's prediction is selected as input, a random vocabulary token replaces it with a small probability.
Training minimizes the standard cross-entropy loss computed with respect to the possibly mixed input sequence $\tilde{y}$: $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p(y_t \mid \tilde{y}_{<t}, x; \theta)$.
No meta-objective or higher-level gradient is computed; the bilevel structure refers to using global (sentence) and local (word) feedback in tandem (Liu et al., 2023).
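The steps above can be sketched end-to-end. The BLEU-style sentence score and sigmoid mapping follow the description above, while the product fusion and the scale/shift values are illustrative assumptions rather than the exact parameterization of Liu et al. (2023):

```python
import math
import random
from collections import Counter

def ngram_precision(cand, ref, n):
    """Modified n-gram precision of a candidate against a single reference."""
    c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
    total = sum(c_ngrams.values())
    return overlap / total if total else 0.0

def sentence_score_bleu(cand, ref, max_n=4):
    """Averaged 1..4-gram precision as the sentence-level signal (unsmoothed)."""
    return sum(ngram_precision(cand, ref, n) for n in range(1, max_n + 1)) / max_n

def sampling_prob(sent_score, word_prob, scale=5.0, shift=0.5):
    """Fuse sentence- and word-level scores, then squash through a sigmoid.

    Product fusion and the scale/shift values are illustrative assumptions.
    """
    fused = sent_score * word_prob
    return 1.0 / (1.0 + math.exp(-scale * (fused - shift)))

def choose_input(gold_tok, pred_tok, p_use_gold, rand_vocab=None, rand_eps=0.05):
    """Bernoulli draw between gold and predicted token, with an optional
    random-token replacement (small probability) to encourage diversity."""
    if random.random() < p_use_gold:
        return gold_tok
    if rand_vocab and random.random() < rand_eps:
        return random.choice(rand_vocab)  # diversity-encouraging replacement
    return pred_tok
```

High joint quality (good candidate sentence, confident decoder) pushes the sampling probability toward one extreme of the sigmoid, low quality toward the other, yielding a soft, context-aware alternative to a hard threshold.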
3. Bilevel Temporal Scheduling: Training-step and Decoding-step Axes
An alternative approach to bilevel scheduled sampling is the integration of independent schedules along training steps and decoding steps, as demonstrated in (Liu et al., 2021). Here, the sampling probability is modulated by functions of both current training progress (indexed by training step $i$) and the decoding position $t$.
- Single-level schedules:
- Training-step schedule: $f(i)$, e.g., exponentially decaying over training steps, $f(i) = \mu^{i}$ with $0 < \mu < 1$.
- Decoding-step schedule: $g(t)$, e.g., exponentially decaying with $t$, $g(t) = \kappa^{t}$, reflecting error accumulation in later steps.
- Composite bilevel schedule: The most effective combination empirically found composes the two single-level schedules, $\epsilon_{i,t} = f(i)\,g(t)$.
For exponential schedules (with decay rates $\mu$ and $\kappa$), this implies $\epsilon_{i,t} = \mu^{i}\,\kappa^{t}$.
- Training workflow:
- For each decoding position $t$, with probability $\epsilon_{i,t}$ the gold token $y_t$ is used; otherwise, the input is a model prediction (from a preliminary pass).
- The two-pass training process computes logits using teacher-forced inputs, then performs scheduled sampling in the second pass before applying cross-entropy loss.
- Probability schedules are tuned on validation sets for each task.
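Under the exponential single-level schedules above, a composite schedule can be sketched as follows. The decay rates are illustrative, and the product composition is one natural form rather than a verbatim transcription of Liu et al. (2021):

```python
def train_step_schedule(i, mu=0.9999):
    """f(i): exponential decay in training step i (illustrative base mu)."""
    return mu ** i

def decode_step_schedule(t, kappa=0.99):
    """g(t): exponential decay in decoding position t, so later positions
    see model predictions more often (reflecting error accumulation)."""
    return kappa ** t

def bilevel_schedule(i, t, mu=0.9999, kappa=0.99):
    """Composite probability of feeding the gold token at training step i
    and decoding position t: eps_{i,t} = f(i) * g(t)."""
    return train_step_schedule(i, mu) * decode_step_schedule(t, kappa)
```

At any fixed training step, later decoding positions receive gold tokens less often; at any fixed position, the gold probability also falls as training progresses, so the two axes jointly interpolate toward inference-time conditions.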
The rationale is that model errors empirically accumulate with decoding depth $t$, and model competence evolves over training steps $i$. Bilevel schemes explicitly interpolate between teacher-forcing and model-predicted contexts in both axes, reducing exposure bias more faithfully than one-dimensional schedules (Liu et al., 2021).
4. Extensions: Bilevel Optimization in Status Update Scheduling
Outside the context of neural sequence modeling, bilevel sampling and scheduling arise in networked systems for timely information updates. Bedewy et al. present a framework where a scheduler and a sampler jointly minimize age-penalty metrics over multi-source status updates (Bedewy et al., 2020).
- Bilevel optimization: The scheduler selects which source to update (upper-level), while the sampler determines when to generate the next update from that source (lower-level).
- Optimality results: For total average age-penalty at delivery times (Ta-APD), the combination of Maximum Age First (MAF) scheduler and zero-wait sampler is uniquely optimal. For average age-penalty per unit time (Ta-AP), a relative value iteration (RVI-RC) sampler achieves optimality.
- Threshold structures: The sampling policy exhibits threshold behavior amenable to low-complexity solution via water-filling when the age-penalty is linear.
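A toy simulation can illustrate the MAF-plus-zero-wait policy described above. The exponential service times and the age bookkeeping here are simplifying assumptions for illustration, not the exact system model of Bedewy et al. (2020):

```python
import random

def simulate_maf_zero_wait(num_sources=3, num_updates=10000, seed=0):
    """Toy simulation of the Maximum Age First scheduler with a zero-wait
    sampler: whenever the channel frees up, immediately sample the source
    whose age is currently largest. Returns the time-averaged total age."""
    rng = random.Random(seed)
    ages = [0.0] * num_sources        # age of information per source
    total_age_time = 0.0              # integral of total age over time
    elapsed = 0.0
    for _ in range(num_updates):
        src = max(range(num_sources), key=lambda s: ages[s])  # MAF choice
        service = rng.expovariate(1.0)  # random transmission delay
        # each age grows linearly during the service interval; integrate it
        total_age_time += sum(a + service / 2 for a in ages) * service
        ages = [a + service for a in ages]
        # zero-wait: the sample was generated the instant transmission
        # began, so on delivery the source's age equals the service time
        ages[src] = service
        elapsed += service
    return total_age_time / elapsed
```

With more sources sharing one channel, each source is refreshed less often, so the time-averaged total age grows, which is the regime where the MAF scheduling choice matters.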
This bilevel paradigm underscores the broader applicability of bilevel sampling, though the operational details differ significantly from those in neural network training.
5. Empirical Performance and Analysis
Bilevel scheduled sampling demonstrates consistent empirical gains across dialogue generation, machine translation, and text summarization benchmarks.
- Dialogue generation (Liu et al., 2023): On DailyDialog and PersonaChat datasets, BLEU-1/2/3/4 and diversity (Distinct-n) metrics are improved over baselines. For example, on PersonaChat, Bilevel-BLEU yields BLEU-1/4 = 21.16/2.71, surpassing vanilla Transformer and word-only variants.
- Ablation studies: Even the word-level only schedule (“Bilevel-None”) outperforms hard threshold-based scheduled sampling, while inclusion of sentence-level information (BLEU/cosine) provides additional improvements. The use of a sigmoid smoothing function further enhances results relative to a clamped linear mapping.
- Human evaluation: Annotators prefer outputs from Bilevel-BLEU over the vanilla Transformer in 60% of cases, with statistically significant improvements in coherence, informativeness, and fluency.
- Neural machine translation (Liu et al., 2021): The bilevel temporal schedule yields statistically significant BLEU gains (e.g., +1.7 over baseline) and improved ROUGE scores for summarization tasks. The benefit is most pronounced on longer sequences, aligning with the intuition that exposure bias is more impactful in such contexts.
6. Significance and Future Directions
Bilevel scheduled sampling provides a systematic mechanism to leverage multi-level quality or temporal information in mitigating exposure bias, offering several advantages over conventional one-dimensional schedules:
- Adaptive granularity: By integrating both local (token or time-step) and global (sequence-wide or training progression) signals, bilevel methods achieve more accurate simulation of inference-time distributions during training.
- Enhanced diversity and robustness: The combined probabilistic sampling strategies encourage diverse hypotheses and reduce overconfidence in local decisions.
- Potential extensions: The bilevel principle accommodates further generalization, such as dynamic selection of quality metrics, integration with reinforcement learning or meta-learning frameworks, and application to other sequential decision-making domains.
A plausible implication is that bilevel scheduled sampling can be generalized beyond language modeling, wherever training-inference discrepancies and sequential error propagation arise. Current evidence indicates that it outperforms previously dominant approaches in exposure-bias mitigation and yields substantial improvements across both automatic and human-oriented evaluation metrics (Liu et al., 2023, Liu et al., 2021).