- The paper introduces a novel neuro-symbolic approach for retrieval-based language models that uses a weighted finite automaton to optimize datastore searches.
- The automaton-augmented method reduces costly k-nearest neighbor datastore searches by up to 83% while maintaining or improving language model performance.
- Empirical results demonstrate that this approach significantly lowers perplexity on datasets including WikiText-103 and Law-MT compared to standard retrieval language models.
Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval
The paper "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" introduces a neuro-symbolic approach to making retrieval-based language models (R-LMs) more efficient, centered on a retrieval automaton built over the datastore. R-LMs incur a substantial computational cost because they issue a datastore search at every decoding step; this study restructures that retrieval process through an automaton-augmented framework that lets the model skip many of those searches.
Approach and Methodology
Retrieval-based LMs augment parametric LMs by retrieving relevant examples from an external datastore and incorporating them during inference. The primary insight driving this work is that the examples retrieved at one time step are often predictive of those retrieved at the next, so repeating a full datastore search at every step is frequently unnecessary. To operationalize this, the authors construct a weighted finite automaton (WFA) over the datastore. This automaton is created by:
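For context, retrieval-based LMs in the kNN-LM family blend the base LM's next-token distribution with a distribution induced by the retrieved neighbors. The sketch below illustrates that interpolation; the function name, the distance-to-weight softmax, and the mixing weight `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def knn_lm_next_token_probs(lm_probs, retrieved_tokens, distances,
                            vocab_size, lam=0.25):
    """Blend the base LM distribution with a distribution over the
    next-tokens of the k nearest datastore entries (kNN-LM style).
    All names here are illustrative, not the paper's API."""
    # Softmax over negative distances: closer neighbors get more mass.
    weights = np.exp(-np.asarray(distances, dtype=float))
    weights /= weights.sum()
    knn_probs = np.zeros(vocab_size)
    for tok, w in zip(retrieved_tokens, weights):
        knn_probs[tok] += w  # neighbors sharing a token pool their mass
    # Linear interpolation between the retrieval and LM distributions.
    return lam * knn_probs + (1 - lam) * np.asarray(lm_probs, dtype=float)
```

Because the retrieval term depends on a fresh set of neighbors at every step, the cost of obtaining those neighbors is exactly what the automaton construction below targets.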
- Maintaining Pointers: Each entry in the datastore is enriched with a pointer to the subsequent entry in the text source, effectively creating a linked sequence of entries.
- Clustering Entries: Similar datastore entries are clustered, and these clusters form the states of the automaton. Transitions between states are informed by the pointers, which provide structured pathways of probable retrieval sequences.
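The two construction steps above can be sketched as follows. This is a minimal illustration with hypothetical names, using a tiny hand-rolled k-means as a stand-in for the paper's clustering step; `keys` holds one vector per datastore entry, in source-text order.

```python
import numpy as np

def build_retrieval_automaton(keys, n_states=4, n_iters=10, seed=0):
    """Illustrative sketch: cluster datastore entries into automaton
    states and link consecutive entries with successor pointers."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), n_states, replace=False)].astype(float)
    states = np.zeros(len(keys), dtype=int)
    for _ in range(n_iters):
        # Assign each entry to its nearest centroid; clusters become states.
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        states = dists.argmin(axis=1)
        for s in range(n_states):
            members = keys[states == s]
            if len(members):
                centroids[s] = members.mean(axis=0)
    # Each entry points to its successor in the source text (last has none).
    pointers = list(range(1, len(keys))) + [None]
    # State-to-state transitions are induced by the entry-level pointers.
    transitions = set()
    for i, nxt in enumerate(pointers):
        if nxt is not None:
            transitions.add((int(states[i]), int(states[nxt])))
    return states, pointers, transitions
```

The pointers capture "what came next in the original text", while the clusters compress the datastore into a tractable state space for traversal.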
The automaton is traversed in parallel with the LM's decoding: whenever the automaton's pointers cover the current context, the model follows them instead of issuing a new search. This cuts k-nearest neighbor (kNN) datastore searches by up to 83% while matching or improving the perplexity of the base retrieval LM.
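A minimal sketch of that inference loop is below. All three callables are hypothetical stand-ins (the paper does not expose this API): `lm_step` produces the current hidden state, `follow_pointer` tries to advance along the automaton, and `knn_search` is the expensive fallback.

```python
def generate_with_automaton(lm_step, knn_search, follow_pointer, steps):
    """Sketch of automaton-augmented decoding: follow pointers when
    possible, fall back to a full kNN search otherwise, and report
    the fraction of saved searches (FoSS)."""
    state = None
    saved = 0
    for _ in range(steps):
        hidden = lm_step()
        nxt = follow_pointer(state, hidden) if state is not None else None
        if nxt is not None:
            state, neighbors = nxt          # pointer followed: search skipped
            saved += 1
        else:
            state, neighbors = knn_search(hidden)  # fallback: full search
        # ... interpolate the LM distribution with `neighbors` here ...
    return saved / steps
```

The returned ratio corresponds to the fraction of saved searches (FoSS) used in the paper's evaluation: FoSS=0 means every step performed a full search.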
Empirical Evaluation and Results
The proposed approach, termed the retrieval automaton, was evaluated on two fronts: an in-domain setup using the WikiText-103 corpus and a domain-adaptation setup using Law-MT. The results indicate that the automaton not only significantly reduces perplexity compared to base R-LMs but also maintains this advantage across various fractions of saved searches (FoSS).
For instance, on WikiText-103, the automaton reduced perplexity to 16.08 from a baseline of 16.65 at FoSS=0 (no search savings) and maintained competitive performance even at higher FoSS values. In the domain-adaptation setting on Law-MT, the automaton achieved a perplexity of 10.49 against the baseline's 12.34. Notably, even when the base LM was fine-tuned on the target domain, the automaton-augmented approach yielded a 17.5% reduction in perplexity.
Theoretical and Practical Implications
The approach underscores the potential of neuro-symbolic synergies, demonstrating how symbolic structures like automata can augment the inherent capabilities of modern deep learning models. It proves particularly beneficial in alleviating the computational burdens associated with datastore searches in R-LMs. Practically, such advancements could lead to more efficient applications of LMs across domains where datastore sizes are large, and retrieval accuracy is critical.
Future Directions
Future exploration could further refine clustering methodologies and investigate dynamic interpolation schemes that synergize with automaton-based retrievals. Moreover, expanding this concept to other retrieval granularities, such as sentence or paragraph-level retrieval, could potentially amplify its applicability across diverse NLP tasks.
In summary, the study offers a method for rethinking retrieval in language modeling, presenting a compelling case for hybridizing neural models with symbolic automata to improve retrieval efficiency without sacrificing language modeling quality.