- The paper demonstrates that integrating BERT with the Anserini IR toolkit yields robust open-domain QA performance, evidenced by an exact match score of 38.6 on SQuAD.
- It employs BM25-based paragraph retrieval followed by a modified BERT reader to precisely extract answer spans from large text corpora.
- Experimental results indicate that optimized retrieval and refined inference strategies can notably enhance practical QA applications, including chatbot interfaces.
End-to-End Open-Domain Question Answering with BERTserini
The paper presents BERTserini, an integration of BERT with the Anserini information retrieval (IR) toolkit, crafted to address the challenges of open-domain question answering (QA). This work merges effective IR practices with a robust machine reading comprehension model in an end-to-end system, showing substantial performance gains over existing methodologies.
End-to-end QA systems, which encompass both document retrieval and answer extraction, have garnered renewed interest in recent years. Traditional QA systems often relied on pipelined architectures consisting of document retrieval, passage ranking, and answer extraction stages. With recent advances in neural models, QA research has shifted toward improving these latter stages, often at the expense of integration with robust retrieval architectures.
The advent of self-supervised models like BERT has profoundly influenced numerous NLP tasks. Its application to passage reranking and answer span identification has set new benchmarks for performance. However, its integration within open-domain settings—a scenario necessitating initial text retrieval from extensive corpora—remains underexplored. BERTserini seeks to address this gap by combining BERT's pretrained language understanding with the community-honed retrieval strategies embodied in Anserini.
System Architecture
The BERTserini architecture consists of two primary modules: the Anserini retriever and the BERT reader. The Anserini retriever ranks candidates with BM25 over an inverted index of a sizable Wikipedia corpus, allowing efficient retrieval of candidate text segments. Once potential segments are identified, the BERT reader processes them to ascertain precise answer spans. Notably, the system modifies the standard BERT reader by removing the final softmax layer over answer spans, so that span scores remain comparable and can be aggregated across multiple text segments.
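The aggregation step above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `Span` class, `span_score`, `best_answer`, and `combined_score` names are invented here, and the interpolation weight `mu` is a placeholder that the paper tunes on held-out data.

```python
# Sketch of BERTserini-style answer aggregation across retrieved paragraphs.
# All names here are illustrative, not taken from the paper's implementation.

from dataclasses import dataclass

@dataclass
class Span:
    text: str
    start_logit: float
    end_logit: float

def span_score(span: Span) -> float:
    # Sum of raw start/end logits. Because no per-paragraph softmax is
    # applied, these scores are directly comparable across paragraphs.
    return span.start_logit + span.end_logit

def combined_score(retriever_score: float, reader_score: float,
                   mu: float = 0.5) -> float:
    # Linear interpolation of the Anserini (BM25) score and the BERT span
    # score; mu = 0.5 is a placeholder, not the paper's tuned value.
    return (1 - mu) * retriever_score + mu * reader_score

def best_answer(paragraph_spans: list[list[Span]]) -> Span:
    # Pick the highest-scoring span over spans from all retrieved paragraphs.
    all_spans = [s for spans in paragraph_spans for s in spans]
    return max(all_spans, key=span_score)
```

The key design choice mirrored here is that removing the per-paragraph softmax lets a single `max` over raw logit sums select an answer from any paragraph, rather than forcing a choice within each paragraph independently.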
Experimental Results
Through rigorous evaluation on the SQuAD dataset, BERTserini showcases marked improvements in exact match (EM) and F1 scores, especially under paragraph-level retrieval conditions. Paragraph-based retrieval emerged as the most effective granularity, retaining enough surrounding context for the reader while avoiding the distracting text that full-article retrieval introduces. The system's EM score reached 38.6, a significant advance over prior approaches.
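For readers unfamiliar with the EM metric, a minimal sketch of SQuAD-style exact match follows. The normalization steps (lowercasing, dropping punctuation and articles, collapsing whitespace) follow the standard SQuAD evaluation convention; the function names are ours.

```python
# Minimal sketch of SQuAD-style exact match (EM) scoring.
import re
import string

def normalize(text: str) -> str:
    # Standard SQuAD normalization: lowercase, strip punctuation,
    # remove English articles, and collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    # EM counts a prediction as correct only if it matches a gold
    # answer exactly after normalization.
    return normalize(prediction) == normalize(gold)
```

Under this metric, a prediction like "The Eiffel Tower!" matches the gold answer "eiffel tower", but any extra or missing content words cause a miss, which is why EM is a strict lower bound relative to F1.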
A detailed analysis of k, the number of retrieved paragraphs, revealed that while recall marginally increases with higher k, EM scores plateau after a particular point. This indicates that while retrieval effectiveness is crucial, there are considerable opportunities for enhancing inference and scoring strategies.
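The recall-versus-k analysis can be illustrated with a toy evaluation loop. This is a hedged sketch with mock data: `recall_at_k` and its substring-containment check are simplifications we introduce for illustration, not the paper's evaluation code.

```python
# Illustrative sketch of measuring answer recall as k (the number of
# retrieved paragraphs) grows. Mock data; not the paper's evaluation.

def recall_at_k(retrieved: list[list[str]], answers: list[str],
                k: int) -> float:
    # Fraction of questions whose gold answer string appears in the
    # top-k retrieved paragraphs (substring match as a simple proxy).
    hits = sum(
        any(ans in para for para in paras[:k])
        for paras, ans in zip(retrieved, answers)
    )
    return hits / len(answers)

# Mock ranked paragraph lists for two questions.
retrieved = [
    ["unrelated text", "the cat sat on the mat"],  # answer ranked 2nd
    ["the dog ran home", "more unrelated text"],   # answer ranked 1st
]
answers = ["cat", "dog"]
```

On data like this, recall rises as k increases (the answer for the first question only enters the candidate pool at k=2), while EM depends on the reader and scorer picking the right span out of a larger, noisier pool, which is why EM can plateau even as recall keeps climbing.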
Implications and Future Work
The paper's findings have significant implications for both practical applications and theoretical exploration. Practically, the BERTserini model can be immediately deployed in chatbot interfaces, as demonstrated by its integration into RSVP.ai's platform, providing a service that balances inference quality with response latency. Theoretically, the work suggests further research directions in optimizing retrieval-score aggregation and expanding cross-linguistic capabilities.
Future developments may focus on enhancing retrieval mechanisms, refining BERT's inference capabilities, and integrating reranking models for more sophisticated relevance assessment. Additionally, an emphasis on multilingual support could further extend the applicability of the system across different linguistic contexts.
In conclusion, BERTserini represents a notable contribution to the domain of open-domain QA systems, leveraging the synergy between advanced retrieval tactics and state-of-the-art language understanding models to push the boundaries of automated answer extraction.