Improving Next Tokens via Second-to-Last Predictions with Generate and Refine

Published 23 Nov 2024 in cs.CL and cs.LG | (2411.15661v2)

Abstract: Autoregressive LLMs like GPT aim to predict next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture for predicting the second to last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured deterministic approach to masking tokens. We use our model to improve the next token predictions of a standard GPT by combining both predictions in a generate-then-refine'' approach. We demonstrate on different variants of GPT-2 and different datasets that (not unexpectedly) second to last token predictions are much more accurate, i.e., more than 15\% higher accuracy than standard next token predictions. Thegenerate-then-refine'' approach also demonstrates notable improvements in next-token predictions, yielding smaller yet consistent and significant gains.