MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain
Abstract: Scientific publications follow conventionalized rhetorical structures. Classifying the Argumentative Zone (AZ), e.g., identifying whether a sentence states a Motivation, a Result, or Background information, has been proposed to improve the processing of scholarly documents. In this work, we adapt and extend this idea to the domain of materials science research. We present and release a new dataset of 50 manually annotated research articles. The dataset spans seven sub-topics and is annotated with a materials-science-focused multi-label annotation scheme for AZ. We detail corpus statistics and demonstrate high inter-annotator agreement. Our computational experiments show that using domain-specific pre-trained transformer-based text encoders is key to high classification performance. We also find that AZ categories from existing datasets in other domains are transferable to varying degrees.