- The paper introduces a nonparametric approach that replaces softmax with direct corpus retrieval, enhancing prediction accuracy for rare tokens.
- It employs an efficient contrastive loss with in-batch approximation to mimic full corpus retrieval while optimizing computational performance.
- Extensive experiments across 16 tasks demonstrate superior performance over larger parametric models, suggesting broad applicability in NLP.
Nonparametric Masked Language Modeling: A Novel Approach
The paper "Nonparametric Masked Language Modeling" proposes a nonparametric methodology that tackles some of the inherent challenges of traditional masked language models (MLMs). The authors introduce NPM, a model structured to improve prediction accuracy for rare tokens and phrases without relying on a parametric output distribution.
Overview of the Methodology
Traditional language models predominantly use a softmax layer to predict tokens over a predefined, finite vocabulary, a structure that often limits their ability to predict rare or novel words. NPM substitutes this with a nonparametric distribution that lets the model predict tokens by retrieving them directly from a reference corpus: the [MASK] is filled with retrieved tokens, a method that diverges from the conventional retrieve-and-generate framework by collapsing retrieval and prediction into a single step.
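The retrieval-as-prediction idea can be illustrated with a minimal sketch. This is not the paper's actual architecture (NPM's encoder, phrase-level retrieval, and embedding details are more involved); it only shows the core mechanism of replacing a softmax over a vocabulary with nearest-neighbor lookup over corpus token embeddings. All names and embeddings here are hypothetical toy values:

```python
import numpy as np

def retrieve_token(mask_emb, corpus_embs, corpus_tokens):
    """Fill a [MASK] by retrieving the most similar corpus token.

    mask_emb: (d,) encoder output at the [MASK] position.
    corpus_embs: (n, d) embeddings of token occurrences in the
    reference corpus; corpus_tokens: their n surface strings.
    """
    # Cosine similarity between the mask embedding and every corpus token.
    sims = corpus_embs @ mask_emb
    sims = sims / (np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(mask_emb))
    # No softmax over a fixed vocabulary: the prediction is whichever
    # corpus occurrence is nearest in embedding space.
    return corpus_tokens[int(np.argmax(sims))]

# Toy "corpus" of three token occurrences with 2-d embeddings.
corpus_tokens = ["Seattle", "apple", "banana"]
corpus_embs = np.array([[1.0, 0.1], [0.0, 1.0], [0.1, 0.9]])
mask_emb = np.array([0.9, 0.2])  # embedding of the masked position

print(retrieve_token(mask_emb, corpus_embs, corpus_tokens))  # → Seattle
```

Because the prediction space is the corpus itself rather than a fixed vocabulary, any string that occurs in the corpus is predictable, which is what gives the approach its advantage on rare and novel words.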
The model is trained with a contrastive loss coupled with an in-batch approximation that emulates full-corpus retrieval. This setup keeps training computationally efficient while preserving robust retrieval performance, addressing a common hurdle for nonparametric approaches.
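A rough sketch of the in-batch approximation, under the assumption that each masked position is scored only against the token embeddings present in the current batch rather than the full corpus (the loss shape and variable names are illustrative, not the paper's exact formulation):

```python
import numpy as np

def in_batch_contrastive_loss(mask_embs, token_embs, mask_labels, token_labels):
    """Contrastive loss with in-batch approximation: the 'corpus' is
    approximated by the tokens of the current batch, so each masked
    position must place probability mass on in-batch occurrences of
    its gold token rather than on all other batch tokens."""
    logits = mask_embs @ token_embs.T                  # (m, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives: batch tokens carrying the same vocabulary id as the gold token.
    positives = mask_labels[:, None] == token_labels[None, :]   # (m, n)
    pos_mass = (np.exp(log_probs) * positives).sum(axis=1)
    return float(-np.log(pos_mass).mean())

# One masked position; its gold token (id 0) occurs in the batch.
mask_embs  = np.array([[4.0, 0.0]])
token_embs = np.array([[4.0, 0.0], [0.0, 4.0]])
good = in_batch_contrastive_loss(mask_embs, token_embs,
                                 np.array([0]), np.array([0, 1]))
bad  = in_batch_contrastive_loss(mask_embs, token_embs,
                                 np.array([1]), np.array([0, 1]))
print(good < bad)  # loss is lower when the mask embedding aligns with its gold token
```

The computational point is that the softmax denominator runs over the batch's n tokens instead of the whole corpus, so no full-corpus index is needed during training.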
Experimental Results
The authors conducted zero-shot evaluations across 16 distinct tasks, spanning classification, fact probing, and question answering. The results substantiate NPM's superior performance compared to significantly larger parametric models. A salient strength of NPM is its handling of rare patterns, including infrequent word senses and rarely observed words, especially those in non-Latin scripts. This suggests NPM can efficiently manage the lexical diversity and complexity found in large linguistic datasets.
Implications and Future Directions
The implications of NPM's architecture extend into both practical and theoretical realms of AI research. Practically, predicting rare tokens more accurately could improve natural language processing applications in languages or domains with limited data. Theoretically, NPM challenges the prevailing parametric paradigm by demonstrating that nonparametric approaches can match or exceed the performance of larger, more complex models, pointing toward more efficient and potentially more interpretable designs.
Looking forward, future research could integrate NPM with parametric models to build hybrid systems that leverage the strengths of both. Refining the nonparametric retrieval process to scale efficiently to larger corpora and to improve retrieval accuracy would also be valuable directions. The dynamic nature of language demands continual innovation in modeling techniques; NPM represents a significant step in expanding the toolkit available to researchers in this evolving field.
The release of both the model and its corresponding code, as indicated by the authors, underscores a commitment to open science and will likely serve as a catalyst for subsequent research and development in nonparametric language modeling techniques.