Exploring Scientometrics with the OpenAIRE Graph: Introducing the OpenAIRE Beginner's Kit

Published 19 Sep 2024 in cs.DL | (2409.12690v1)

Abstract: The OpenAIRE Graph is an extensive resource housing diverse information on research products, including literature, datasets, and software, alongside research projects and other scholarly outputs and context. It stands as a cornerstone among contemporary research information databases, offering invaluable insights for scientometric investigations. Despite its wealth of data, its sheer size may initially appear daunting, potentially hindering its widespread adoption. To address this challenge, this paper introduces the OpenAIRE Beginner's Kit, a user-friendly solution providing access to a subset of the OpenAIRE Graph within a sandboxed environment coupled with a Jupyter notebook for analysis. The OpenAIRE Beginner's Kit is meticulously designed to democratise research and data exploration, offering accessibility from standard desktop and laptop setups. Within this paper, we provide a brief overview of the included dataset and offer guidance on leveraging the kit through a selection of illustrative queries tailored to address common scientometric inquiries.

Summary

  • The paper introduces a user-friendly subset of the OpenAIRE Graph that reduces accessibility challenges for analyzing extensive scientometric data.
  • It employs a sandboxed Jupyter notebook with SparkSQL and Python tools to explore a subset of the Graph, which in full holds nearly 240 million research outputs and over 5 billion relationships.
  • The study demonstrates query examples for citation, access rights, and co-authorship analysis, providing practical insights for scholarly evaluation.

The paper by Andrea Mannocci and Miriam Baglioni introduces the OpenAIRE Beginner's Kit, a step toward making comprehensive scientometric data more accessible. The kit, part of the larger OpenAIRE ecosystem, aims to lower the entry barriers typically associated with massive research databases. It provides a subset of the OpenAIRE Graph together with a sandboxed environment and a Jupyter notebook for analysis.

The OpenAIRE Graph: An Overview

The OpenAIRE Graph is an open-access resource for scholarly communication, holding nearly 240 million research outputs and over 5 billion relationships between entities. It covers metadata on publications, datasets, software, projects, and organizational affiliations, aggregated and cleaned from sources such as Crossref, DataCite, and ORCID. Enrichment processes, including disambiguation and various mining algorithms, improve the completeness and accuracy of this metadata. The full dataset, approximately 270 GB of compressed JSON, serves as a foundation for large-scale scientometric research.

Accessibility Challenges and the OpenAIRE Beginner's Kit

Despite the comprehensive nature of the OpenAIRE Graph, its sheer size poses significant accessibility challenges, especially for new users. To address this, the OpenAIRE Beginner’s Kit offers a subset of the Graph in a manageable format coupled with a Jupyter notebook, streamlining the initial engagement with the dataset. This kit is designed to be operable on standard desktop and laptop setups, thereby broadening the accessibility of sophisticated scientometric data analysis. Experienced researchers and newcomers alike can leverage this tool to gain insights into the Graph’s structure and data types before scaling to the larger dataset.

Components of the OpenAIRE Beginner's Kit

Dataset: The subset included in the kit comprises 3,919,148 publications, 808,583 datasets, 27,470 software entries, 167,367 other research outputs, 53,546 data sources, and 55,633 organizations, among others. It covers research products published in the eight months preceding the kit's release date (March 2024) and retains the entity types and relationships found in the full Graph.

Jupyter Notebook: The accompanying notebook supports data exploration using SparkSQL and Python within a Docker container virtualizing an Apache Hadoop cluster. It exposes the data as virtual SQL tables, so that queries can be executed in parallel across the data. The notebook ships with built-in libraries, such as pandas for data manipulation and igraph for network analysis, which users can apply to more advanced scientometric investigations.
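The idea of exposing records as SQL tables can be sketched without a Spark cluster. The snippet below is a minimal stand-in, not the kit's actual code: it uses Python's built-in sqlite3 module in place of SparkSQL, and the table name `research_product` and the fields `id`, `type`, and `title` are hypothetical, chosen only for illustration.

```python
import json
import sqlite3

# Hypothetical JSON-lines records; the field names are illustrative
# assumptions and do not reproduce the OpenAIRE Graph schema.
JSON_LINES = """\
{"id": "p1", "type": "publication", "title": "A study"}
{"id": "p2", "type": "publication", "title": "Another study"}
{"id": "d1", "type": "dataset", "title": "Survey data"}
{"id": "s1", "type": "software", "title": "Analysis tool"}
"""


def load_table(lines: str) -> sqlite3.Connection:
    """Parse JSON lines and expose them as a SQL table.

    sqlite3 stands in for SparkSQL here so the sketch runs anywhere.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE research_product (id TEXT PRIMARY KEY, type TEXT, title TEXT)"
    )
    records = [json.loads(line) for line in lines.splitlines() if line.strip()]
    # Named placeholders are filled from each record's keys.
    conn.executemany(
        "INSERT INTO research_product VALUES (:id, :type, :title)", records
    )
    return conn


def count_by_type(conn: sqlite3.Connection) -> list:
    """Return (type, count) rows, most frequent type first, ties by name."""
    query = """
        SELECT type, COUNT(*) AS n
        FROM research_product
        GROUP BY type
        ORDER BY n DESC, type ASC
    """
    return conn.execute(query).fetchall()


conn = load_table(JSON_LINES)
type_counts = count_by_type(conn)
print(type_counts)
```

Once records are queryable as a table, the same pattern extends to joins and aggregations; in the kit itself such result sets could then be handed to pandas for further manipulation.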

Illustrative Queries and Their Implications

The notebook provides illustrative queries addressing common scientometric inquiries, exemplifying the ease with which users can interact with the dataset:

  • Citation Analysis: Users can execute SQL-like queries to count citations for publications, giving a basic indicator of scholarly influence.
  • Access Rights Analysis: Queries that group data by organizational affiliation and access rights (e.g., open, embargo, closed) support the exploration of open access trends.
  • Collaborations and Co-Authorship: More complex queries identify partnerships through project affiliations; the results can be visualized as network graphs to show collaborative patterns in research.
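The three kinds of query listed above can be sketched on toy data. This is an illustration under stated assumptions rather than the notebook's own queries: sqlite3 again stands in for SparkSQL, and every table name, column name, and row below (`citation`, `product`, `participation`, and their contents) is hypothetical.

```python
import sqlite3
from collections import Counter
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and rows; illustrative only, not the Graph schema.
    CREATE TABLE citation (citing TEXT, cited TEXT);
    CREATE TABLE product (id TEXT, organization TEXT, access_right TEXT);
    CREATE TABLE participation (project TEXT, organization TEXT);

    INSERT INTO citation VALUES ('p2', 'p1'), ('p3', 'p1'), ('p3', 'p2');
    INSERT INTO product VALUES
        ('p1', 'Org A', 'OPEN'), ('p2', 'Org A', 'CLOSED'),
        ('p3', 'Org B', 'OPEN'), ('p4', 'Org A', 'OPEN');
    INSERT INTO participation VALUES
        ('proj1', 'Org A'), ('proj1', 'Org B'), ('proj1', 'Org C'),
        ('proj2', 'Org A'), ('proj2', 'Org B');
""")

# 1. Citation analysis: number of incoming citations per cited product.
citation_counts = conn.execute("""
    SELECT cited, COUNT(*) AS n
    FROM citation
    GROUP BY cited
    ORDER BY n DESC, cited
""").fetchall()

# 2. Access rights analysis: products per organization and access right.
access_by_org = conn.execute("""
    SELECT organization, access_right, COUNT(*) AS n
    FROM product
    GROUP BY organization, access_right
    ORDER BY organization, access_right
""").fetchall()

# 3. Collaboration: organizations sharing a project form a weighted edge,
#    where the weight is the number of projects they have in common.
rows = conn.execute(
    "SELECT project, organization FROM participation"
).fetchall()
members = {}
for project, organization in rows:
    members.setdefault(project, []).append(organization)

edge_weights = Counter()
for organizations in members.values():
    # Sort so each pair has one canonical orientation.
    for pair in combinations(sorted(organizations), 2):
        edge_weights[pair] += 1

print(citation_counts)
print(access_by_org)
print(dict(edge_weights))
```

The weighted edge list produced in the last step is the kind of structure a network library such as igraph can take as input for visualization.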

Future Developments and Extensions

The authors acknowledge current limitations, notably the inability to perform diachronic analyses due to the dataset's temporal scope. Future enhancements aim to include broader temporal coverage and discipline diversity, bolstering the kit’s utility in longitudinal studies. Additionally, the integration of more focused and stakeholder-specific notebooks is intended to tailor the tool’s applicability to diverse research and funding landscapes.

Conclusion

The OpenAIRE Beginner's Kit makes large-scale scientometric data more accessible and easier to work with. By pairing a subset of the OpenAIRE Graph with a ready-to-use Jupyter notebook, the kit lowers the barrier to scientometric inquiry on standard hardware. Continued development of the tool is expected to broaden access to the Graph further.
