SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section
Abstract: Document summarization is the task of shortening texts into concise and informative summaries. This paper introduces a novel dataset designed for summarizing multiple scientific articles into a section of a survey. Our contributions are: (1) SurveySum, a new dataset addressing the gap in domain-specific summarization tools; (2) two pipelines for summarizing scientific articles into a section of a survey; and (3) an evaluation of these pipelines using multiple metrics to compare their performance. Our results highlight the importance of high-quality retrieval stages and the impact of different configurations on the quality of the generated summaries.