
NewsQs: Multi-Source Question Generation for the Inquiring Mind

Published 28 Feb 2024 in cs.CL (arXiv:2402.18479v2)

Abstract: We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. Human evaluation shows that fine-tuning the model with control codes produces questions judged acceptable more often than the same model without them. To filter our data, we use a QNLI model whose judgments correlate highly with human annotations. We release our final dataset of high-quality questions, answers, and document clusters as a resource for future work in query-based multi-document summarization.
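The pipeline the abstract describes, control-code-conditioned question generation followed by QNLI-based filtering, can be sketched in miniature as below. The control-code token format (`<code>` prefixed to the document) and the entailment threshold are illustrative assumptions, not the paper's exact implementation; in practice the generation input would be fed to the fine-tuned T5-Large model and the scores would come from a QNLI classifier.

```python
# Hypothetical sketch of the NewsQs pipeline. The "<code>" prefix format and
# the 0.5 threshold are assumptions for illustration; the paper's actual
# fine-tuned T5-Large model and QNLI filter are not reproduced here.

def build_generation_input(document: str, control_code: str) -> str:
    """Prefix the source document with a control code so a fine-tuned
    seq2seq model can condition the generated question on the desired type."""
    return f"<{control_code}> {document}"

def filter_by_qnli(qa_pairs, entailment_scores, threshold=0.5):
    """Keep only question-answer pairs whose QNLI entailment score
    (is the question answered by the text?) meets the threshold."""
    return [pair for pair, score in zip(qa_pairs, entailment_scores)
            if score >= threshold]
```

For example, `build_generation_input("Storm hits coast...", "how")` yields `"<how> Storm hits coast..."`, and pairs scoring below the threshold are discarded before release.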

