
Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring

Published 22 Apr 2019 in cs.CL and cs.AI (arXiv:1905.01969v4)

Abstract: The use of deep pre-trained bidirectional transformers has led to remarkable progress in a number of applications (Devlin et al., 2018). For tasks that make pairwise comparisons between sequences, matching a given input with a corresponding label, two approaches are common: Cross-encoders performing full self-attention over the pair and Bi-encoders encoding the pair separately. The former often performs better, but is too slow for practical use. In this work, we develop a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features. We perform a detailed comparison of all three approaches, including what pre-training and fine-tuning strategies work best. We show our models achieve state-of-the-art results on three existing tasks; that Poly-encoders are faster than Cross-encoders and more accurate than Bi-encoders; and that the best results are obtained by pre-training on large datasets similar to the downstream tasks.

Citations (265)

Summary

  • The paper introduces Poly-encoders, a novel hybrid transformer architecture that blends the strengths of Bi- and Cross-encoders for efficient multi-sentence scoring.
  • It employs domain-specific pre-training, notably on Reddit data, to improve performance on dialogue and information retrieval tasks.
  • Experimental results show that Poly-encoders maintain competitive accuracy while significantly reducing inference time compared to Cross-encoders.

Introduction

This paper explores transformer-based architectures for multi-sentence scoring tasks, which involve matching an input context against a set of candidate labels, and directly addresses the trade-off between prediction quality and inference speed. The study introduces Poly-encoders, which combine much of the accuracy of Cross-encoders with the efficiency of Bi-encoders, the two common approaches to sequence comparison. The results show that Poly-encoders achieve state-of-the-art performance across several datasets while delivering large speed gains.

Model Architectures

The study evaluates three key architectures: Bi-encoders, Cross-encoders, and the newly proposed Poly-encoders.

  • Bi-encoders: Encode the input context and candidate label independently, allowing candidate representations to be cached for fast retrieval, but potentially at lower accuracy because context and candidate never interact during encoding.
  • Cross-encoders: Joint encoding through a single transformer, offering rich interaction but at a computational cost, limiting feasibility in real-time applications.
  • Poly-encoders: Hybrid model that caches candidate representations like Bi-encoders while introducing a learned attention mechanism to capture richer context-candidate interactions without the overhead of Cross-encoders.

    Figure 1: Diagrams of the three model architectures we consider. (a) Bi-encoder, (b) Cross-encoder, (c) Poly-encoder.
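The Poly-encoder's scoring step can be sketched in a few lines: m learned "codes" attend over the context's token outputs to produce m global context features, and the candidate embedding then attends over those features before a final dot product. The sketch below is illustrative, assuming pre-computed encoder outputs; shapes and variable names are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def poly_encoder_score(ctx_tokens, codes, cand_emb):
    """Score one (context, candidate) pair with poly-encoder attention.

    ctx_tokens: (T, d) per-token outputs of the context encoder
    codes:      (m, d) learned context codes, with m much smaller than T
    cand_emb:   (d,)   aggregated candidate embedding (cacheable)
    """
    # Step 1: each code attends over the context tokens -> m global features.
    attn = softmax(codes @ ctx_tokens.T, axis=-1)   # (m, T)
    global_feats = attn @ ctx_tokens                # (m, d)
    # Step 2: the candidate attends over the m global features.
    w = softmax(cand_emb @ global_feats.T)          # (m,)
    ctx_emb = w @ global_feats                      # (d,)
    # Final score is a dot product, exactly as in the bi-encoder.
    return float(ctx_emb @ cand_emb)

rng = np.random.default_rng(0)
T, d, m = 12, 8, 4
score = poly_encoder_score(rng.normal(size=(T, d)),
                           rng.normal(size=(m, d)),
                           rng.normal(size=(d,)))
```

Because step 2 attends over only m vectors rather than re-running a full transformer over the concatenated pair, the per-candidate cost stays close to a Bi-encoder's while still letting the candidate condition the final context representation.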

Pre-training and Fine-tuning Strategies

The paper stresses the importance of pre-training on data closely related to the target domain. In addition to replicating BERT's pre-training on Wikipedia and the Toronto Books Corpus, the authors pre-train on Reddit conversation data, a corpus much closer in style to the downstream dialogue tasks, to test the effect of domain-similar data.

  • Pre-training on Reddit: Demonstrated improved performance across dialogue tasks, surpassing results obtained through traditional BERT pre-training. This strategy underscores the value of domain-specific data in achieving practical performance gains.
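The retrieval models are trained to pick the correct next utterance from a set of candidates; a common and efficient way to do this, used for the Bi- and Poly-encoders, is a cross-entropy loss with in-batch negatives, where each context's correct reply serves as a negative for every other context in the batch. A minimal sketch of that objective (variable names ours):

```python
import numpy as np

def next_utterance_loss(ctx_embs, reply_embs):
    """Cross-entropy over in-batch candidates.

    ctx_embs, reply_embs: (B, d) encoder outputs for B (context, reply)
    pairs; the matching reply sits on the same row, so the positives lie
    on the diagonal of the similarity matrix.
    """
    logits = ctx_embs @ reply_embs.T                    # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True) # numeric stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()                 # NLL of the positives

rng = np.random.default_rng(2)
loss = next_utterance_loss(rng.normal(size=(16, 8)),
                           rng.normal(size=(16, 8)))
```

Reusing batch-mates as negatives means a batch of B pairs yields B-1 negatives per example at no extra encoding cost.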

Experimental Results

The proposed architectures and training strategies were evaluated on four tasks: ConvAI2, DSTC7, Ubuntu V2, and Wikipedia Article Search. The Poly-encoder architecture consistently outperformed Bi-encoders while offering a significant speed advantage over Cross-encoders.

  • ConvAI2 Test Results: Poly-encoders achieved higher R@1 scores than Bi-encoders and closely rivaled Cross-encoders.
  • DSTC7 and Ubuntu V2: Poly-encoders, especially with Reddit pre-training, set new state-of-the-art results, striking an effective balance between speed and accuracy.

Inference Speed and Practical Implementation

The inference speed of Poly-encoders aligns more closely with Bi-encoders, making them suitable for real-world applications where rapid evaluation is critical. The ability to cache candidate representations allows for scalable deployment scenarios with large candidate pools, a significant advantage over Cross-encoders.
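The practical payoff of caching is that scoring against a large candidate pool reduces to a single matrix multiply over pre-computed embeddings, whereas a Cross-encoder must re-run the full transformer once per candidate. A rough sketch of cached retrieval (sizes and names are illustrative):

```python
import numpy as np

# With Bi- and Poly-encoders, candidate embeddings depend only on the
# candidates themselves, so they can be encoded once, offline, and cached.
rng = np.random.default_rng(1)
d, n_candidates = 8, 100_000
cand_cache = rng.normal(size=(n_candidates, d))  # built offline

def rank_candidates(ctx_emb, cand_cache, k=5):
    """Score one context against every cached candidate in one matmul."""
    scores = cand_cache @ ctx_emb       # (n_candidates,) dot-product scores
    return np.argsort(-scores)[:k]      # indices of the top-k candidates

top = rank_candidates(rng.normal(size=d), cand_cache)
```

For a Poly-encoder, only the cheap final attention over the m global features is candidate-dependent, so the same cached-candidate deployment pattern applies.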

Conclusion

The introduction of Poly-encoders provides a practical solution for tasks requiring efficient multi-sentence scoring without sacrificing accuracy. By combining a cacheable candidate representation with a lightweight learned attention mechanism, and by pre-training on data similar to the downstream task, the approach delivers near Cross-encoder quality at close to Bi-encoder speed. The results also show that domain-matched pre-training data can substantially improve performance, underscoring the importance of data selection in the pre-training phase. These findings can inform future systems that require fast and accurate candidate scoring.
