The OpenCitations Index

Published 5 Aug 2024 in cs.DL | (2408.02321v1)

Abstract: This article presents the OpenCitations Index, a collection of open citation data maintained by OpenCitations, an independent, not-for-profit infrastructure organisation for open scholarship dedicated to publishing open bibliographic and citation data using Semantic Web and Linked Open Data technologies. The collection involves citation data harvested from multiple sources. Different sources may provide citation data for bibliographic entities represented with different identifiers, and may therefore describe the same citation more than once. To address this, a deduplication mechanism has been implemented. It ensures that citations integrated into the OpenCitations Index are accurately and uniquely identified, even when different identifiers are used. This mechanism follows a specific workflow, which encompasses a preprocessing of the original source data, a management of the provided bibliographic metadata, and the generation of new citation data to be integrated into the OpenCitations Index. The process relies on another data collection, OpenCitations Meta, and on the use of a new globally persistent identifier, namely OMID (OpenCitations Meta Identifier). As of July 2024, the OpenCitations Index stores over 2 billion unique citation links, harvested from Crossref, the National Institutes of Health Open Citation Collection (NIH-OCC), DataCite, OpenAIRE, and the Japan Link Center (JaLC). The OpenCitations Index can be systematically accessed and queried through several services, including a SPARQL endpoint, REST APIs, and web interfaces. Additionally, dataset dumps are available for free download and reuse (under CC0 waiver) in various formats (CSV, N-Triples, and Scholix), including provenance and change tracking information.


Summary

- The paper presents an open citation index that integrates over 2 billion unique citation links harvested from Crossref, NIH-OCC, DataCite, OpenAIRE, and JaLC.
- It describes a deduplication mechanism that maps differing bibliographic identifiers to globally persistent OMIDs, so that each citation is stored only once.
- The index can be queried through a SPARQL endpoint, REST APIs, and web interfaces, and is also available as dataset dumps under a CC0 waiver.

The OpenCitations Index

The paper "The OpenCitations Index" by Ivan Heibi, Arianna Moretti, Silvio Peroni, and Marta Soricetti provides a comprehensive overview of a crucial infrastructure developed by OpenCitations: the OpenCitations Index. This index represents an extensive collection of open citation data, offering a rigorously processed and openly accessible dataset that can be utilized to foster transparency and reproducibility in academic research.

Core Contributions

The OpenCitations Index integrates citation data from multiple authoritative sources, including Crossref, NIH Open Citation Collection (NIH-OCC), DataCite, OpenAIRE, and the Japan Link Center (JaLC). As of July 2024, the index includes over 2 billion unique citation links, providing a vast resource for the academic community.

Deduplication Mechanism

A central methodological contribution of the paper is the deduplication mechanism, which addresses the problem that different sources may identify the same bibliographic entity with different identifiers. For example, one source may identify an article by its DOI while another uses its PMID, so the same citation can arrive twice in different forms. The mechanism involves preprocessing source data, managing bibliographic metadata, and generating new citation data. It ensures that each citation integrated into the OpenCitations Index is uniquely identified, which keeps the data consistent across sources.

Methodological Workflow

The paper details a meticulous workflow designed for the efficient ingestion of citation data:

Source Preprocess:
- Extraction of data from original sources.
- Production of CSV tables with bibliographic metadata and citation data.
Meta Process:
- Mapping external persistent identifiers to a globally unique identifier (OMID).
- Integration with the OpenCitations Meta collection to deduplicate entities.
Index Process:
- Conversion of citation links to an OMID-to-OMID format.
- Generation of comprehensive datasets and updating the OpenCitations Index graph database.
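
The Meta and Index steps above can be illustrated with a toy Python sketch. This is not the OpenCitations code: the identifier values, the `omid_for` lookup table, and the function `to_omid_citations` are hypothetical, and in practice OMIDs are assigned by OpenCitations Meta rather than read from a hand-written dictionary.

```python
# Toy sketch of the Meta and Index steps; all data and names are hypothetical.

# Meta step (assumed outcome): each external identifier maps to one OMID,
# so a DOI and a PMID for the same work share the same OMID.
omid_for = {
    "doi:10.1000/a": "omid:br/1",
    "pmid:111": "omid:br/1",   # same work as doi:10.1000/a
    "doi:10.1000/b": "omid:br/2",
    "pmid:222": "omid:br/2",   # same work as doi:10.1000/b
}

# Source step (assumed outcome): citing/cited pairs as each source reports them.
source_citations = [
    ("doi:10.1000/a", "doi:10.1000/b"),  # reported by a DOI-based source
    ("pmid:111", "pmid:222"),            # same citation, PMID-based source
]


def to_omid_citations(pairs, id_map):
    """Rewrite (citing, cited) pairs as OMID-to-OMID pairs and drop duplicates."""
    unique = set()
    for citing, cited in pairs:
        unique.add((id_map[citing], id_map[cited]))
    return sorted(unique)


omid_citations = to_omid_citations(source_citations, omid_for)
print(omid_citations)
```

Because both source records resolve to the same OMID pair, only one citation remains after conversion. Using a set keyed on the OMID pair is a simplification chosen for brevity; it says nothing about how the actual pipeline stores or compares citations.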

Data Representation and Provenance

All citation data are modeled according to the OpenCitations Data Model (OCDM), which uses Semantic Web technologies. The OCDM represents citations as first-class entities with detailed metadata, including the citing and cited entities, citation creation date, citation timespan, and type of citation (e.g., author self-citation, journal self-citation).
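
A citation treated as a first-class entity can be pictured as a small record of its own. The sketch below is an illustration under stated assumptions, not the OCDM itself: the class and field names are hypothetical, the creation date is taken to be the citing work's publication date, the timespan is simplified to a number of days between the two publication dates, and the self-citation flags are derived from shared authors and identical journal names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Work:
    """Hypothetical minimal description of a bibliographic entity."""
    omid: str
    authors: frozenset
    journal: str
    published: date


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record carrying its own metadata."""
    citing: str
    cited: str
    creation: date               # assumed: publication date of the citing work
    timespan_days: int           # assumed: days between the two publication dates
    author_self_citation: bool   # citing and cited share at least one author
    journal_self_citation: bool  # citing and cited appear in the same journal


def make_citation(citing: Work, cited: Work) -> Citation:
    """Build a citation record from the metadata of the two works."""
    return Citation(
        citing=citing.omid,
        cited=cited.omid,
        creation=citing.published,
        timespan_days=(citing.published - cited.published).days,
        author_self_citation=bool(citing.authors & cited.authors),
        journal_self_citation=citing.journal == cited.journal,
    )


a = Work("omid:br/1", frozenset({"A. Author", "B. Writer"}), "Journal X", date(2024, 3, 1))
b = Work("omid:br/2", frozenset({"B. Writer"}), "Journal Y", date(2020, 3, 1))
c = make_citation(a, b)
print(c)
```

Here the two works share an author but not a journal, so only the author self-citation flag is set.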

Provenance and change tracking are integral to the dataset, ensuring transparency and traceability. The index captures the validity and invalidity dates, responsible agents, primary data sources, and update queries. Additionally, the dataset is described using VoID and DCAT vocabularies, enhancing interoperability.
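
One way to picture this kind of change tracking is as a list of snapshots per entity, where adding a new snapshot closes the validity period of the previous one. The sketch below is only an illustration: the `Snapshot` class, its field names, the `add_snapshot` helper, and the sample values are hypothetical and do not reproduce the vocabulary OpenCitations actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical provenance snapshot of one entity."""
    entity: str
    generated_at: datetime                     # start of validity
    responsible_agent: str
    primary_source: str
    update_query: Optional[str] = None         # change that produced this snapshot
    invalidated_at: Optional[datetime] = None  # end of validity, once superseded


def add_snapshot(history, snapshot):
    """Append a snapshot and mark the previous one as no longer valid."""
    if history:
        history[-1].invalidated_at = snapshot.generated_at
    history.append(snapshot)
    return history


history = []
add_snapshot(history, Snapshot(
    entity="citation/1",
    generated_at=datetime(2024, 1, 10, tzinfo=timezone.utc),
    responsible_agent="example-agent",
    primary_source="example-source",
))
add_snapshot(history, Snapshot(
    entity="citation/1",
    generated_at=datetime(2024, 6, 1, tzinfo=timezone.utc),
    responsible_agent="example-agent",
    primary_source="example-source",
    update_query="(placeholder for the update that changed the entity)",
))
print(len(history), history[0].invalidated_at)
```

With this layout the current state of an entity is the snapshot whose `invalidated_at` is still empty, and earlier states remain available for inspection.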

Access and Usage

The OpenCitations Index can be accessed through several services:

SPARQL Endpoint: Allows complex queries using SPARQL.
REST API: Provides a straightforward way to access data programmatically.
Web Interfaces: Include tools such as YASGUI, OSCAR, and LUCINDA for searching, querying, and browsing data.
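
As a rough idea of what programmatic use might look like once citation records have been retrieved, the Python sketch below counts incoming citations per cited entity. It runs offline on a hard-coded payload: the JSON shape, the `citing` and `cited` field names, and the function `incoming_counts` are assumptions made for illustration, so the actual API documentation should be consulted for real endpoints and response fields.

```python
import json
from collections import Counter

# Hypothetical payload, shaped like a JSON array of citation records.
payload = """
[
  {"citing": "omid:br/1", "cited": "omid:br/2"},
  {"citing": "omid:br/3", "cited": "omid:br/2"},
  {"citing": "omid:br/1", "cited": "omid:br/4"}
]
"""


def incoming_counts(raw_json):
    """Count how many citation records point at each cited entity."""
    records = json.loads(raw_json)
    return Counter(record["cited"] for record in records)


counts = incoming_counts(payload)
print(counts.most_common(1))
```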

The index's openness and accessibility are reinforced by its release under a CC0 waiver, ensuring that data can be freely used, transformed, and integrated into other systems without restriction.

Community Impact

The OpenCitations Index has achieved significant uptake within the academic and research communities. Notable initiatives utilizing this index include OpenAIRE-Nexus, GraspOS, B!SON, PURE Suggest, Open Access Helper, ORBi, CHERRY, and StabiKat. These projects leverage citation data to enhance research assessment, journal recommendation, open access availability, and bibliometric analysis.

Future Directions

Future developments aim to further refine the quality of data in the OpenCitations Index. Key initiatives include the implementation of HERITRACE, a semantic data management system for human curation of citation data, and the integration of machine learning techniques for author disambiguation.

In conclusion, the OpenCitations Index represents a robust, meticulously curated collection of open citation data. Its comprehensive methodology, extensive dataset, and wide-ranging accessibility significantly contribute to advancing open scholarship, enabling transparent and reproducible research practices in the global academic community.
