Perfect synchronization between draft and target models in speculative decoding

Develop methods to achieve perfect synchronization between independently trained draft language models and target language models during speculative decoding, so that the draft model's proposed token sequences stay fully aligned with the target model's verification, preserving both accuracy and efficiency.

Background

The paper discusses speculative decoding where a smaller draft model rapidly proposes token sequences that a larger target model subsequently verifies. While this approach accelerates inference, maintaining tight alignment between the draft and target models is difficult, especially for independent drafters that are trained separately.
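The draft-then-verify loop described above can be sketched as follows. This is a simplified greedy-acceptance variant (the full scheme in the literature uses rejection sampling over both models' output distributions), and the model functions are toy stand-ins, not real language models:

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # toy stand-in for the small draft model
    target_next: Callable[[List[int]], int],  # toy stand-in for the large target model
    k: int = 4,                               # number of tokens drafted per step
) -> List[int]:
    """One draft-then-verify step with greedy acceptance."""
    # 1) The draft model cheaply proposes k tokens autoregressively.
    draft_tokens = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    # 2) The target model verifies the proposals: accept the longest
    #    prefix that matches its own greedy choices.
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # 3) The target always contributes one token past the accepted
    #    prefix, so a step makes progress even when nothing is accepted.
    accepted.append(target_next(ctx))
    return accepted
```

With a perfectly synchronized draft (identical next-token behavior), every step yields k+1 tokens per target pass; a misaligned draft degrades toward one token per pass, which is exactly the cost of plain autoregressive decoding.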

Existing techniques such as knowledge distillation and online adaptation seek to improve alignment, but they do not guarantee exact synchronization. This lack of perfect synchronization can shorten the effective accepted decoding length per step or increase verification overhead, making the problem practically significant for efficient small language model (SLM) deployment.

References

Methods such as knowledge distillation and online adaptation have been proposed to enhance this alignment, though perfect synchronization remains an open challenge.

Small Language Models (SLMs) Can Still Pack a Punch: A Survey (2501.05465, Subramanian et al., 3 Jan 2025), Section "SLMs as Draft models" (subsection under "Approaches to Create SLMs")