Resources and Evaluations for Multi-Distribution Dense Information Retrieval
Abstract: We introduce the novel problem of multi-distribution information retrieval (IR), in which, given a query, systems must retrieve passages from multiple collections, each drawn from a different distribution. Some of these collections and distributions may not be available at training time. To evaluate methods for multi-distribution retrieval, we construct three benchmarks from existing single-distribution datasets: one based on question answering and two based on entity matching. We propose simple methods that allocate the fixed retrieval budget (top-k passages) strategically across domains, preventing the known domains from consuming most of the budget. We show that our methods yield an average improvement of 3.8+ points, and up to 8.0 points, in Recall@100 across the datasets, and that these improvements are consistent when fine-tuning different base retrieval models. Our benchmarks are made publicly available.
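The abstract's core idea, reserving part of the fixed top-k budget for less dominant domains, can be illustrated with a minimal sketch. This is not the paper's exact method; the function name, the per-domain minimum quota, and the score-based fill rule are all illustrative assumptions.

```python
# Hypothetical sketch of multi-distribution budget allocation: each domain is
# guaranteed a minimum quota of the top-k budget, and the remainder is filled
# with the globally highest-scoring leftover passages. Without the quota, a
# known domain with systematically higher scores would consume the whole budget.

def allocate_budget(results_by_domain, k, min_per_domain=10):
    """results_by_domain: {domain: [(passage_id, score), ...]} sorted by
    descending score. Returns up to k (domain, passage_id, score) tuples."""
    selected = []
    # Guarantee each domain a minimum share of the budget.
    for domain, ranked in results_by_domain.items():
        for pid, score in ranked[:min_per_domain]:
            selected.append((domain, pid, score))
    # Fill the remaining budget with the best leftover passages overall.
    leftovers = [
        (domain, pid, score)
        for domain, ranked in results_by_domain.items()
        for pid, score in ranked[min_per_domain:]
    ]
    leftovers.sort(key=lambda t: t[2], reverse=True)
    selected.extend(leftovers[: max(0, k - len(selected))])
    # Re-rank and truncate in case the quotas alone already exceed k.
    selected.sort(key=lambda t: t[2], reverse=True)
    return selected[:k]
```

With a quota of one passage per domain and k=3, a low-scoring unseen domain still receives a slot that pure score-based merging would have denied it.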