How to build an Open Science Monitor based on publications? A French perspective

Published 6 Jan 2025 in cs.DL | (2501.02856v1)

Abstract: Many countries and institutions are striving to develop tools to monitor their open science policies. Since 2018, with the launch of its National Plan for Open Science, France has been progressively implementing a monitoring framework for its public policy, relying exclusively on reliable, open, and controlled data. Currently, this monitoring focuses on research outputs, particularly publications, as well as theses and clinical trials. Publications serve as a basis for analyzing other dimensions, including research data, code, and software. The metadata associated with publications is therefore particularly valuable, but the methodology for leveraging it raises several challenges. Here, we briefly outline how we have used this metadata to construct the French Open Science Monitor.


Summary

The paper details the methodology for building a national Open Science Monitor based on publication data using Persistent Identifiers and machine learning, focusing on the French context.
It outlines the technical architecture and computational resources required, involving significant data processing and a reliance on open-source tools for metadata enrichment and analysis.
The authors discuss challenges such as corpus creation, author disambiguation, and cost estimation, proposing solutions and future directions for refining open science monitoring systems.

Constructing a French Open Science Monitor: Methodologies and Implications

The paper "How to build an Open Science Monitor based on publications? A French perspective" by Laetitia Bracco et al. describes in detail the methods used to build an Open Science Monitor (OSM) from publication data, within the context of the French National Plan for Open Science. It outlines the complexities and challenges of building a system that monitors open science policies using publication metadata.

Methodological Framework

The French OSM is built on the collection and use of metadata associated with research outputs, chiefly scholarly publications, theses, and clinical trials. The model is grounded in Persistent Identifiers (PIDs) such as Crossref DOIs, which provide a stable basis for deriving metadata. Supplementary metadata elements, such as scientific field classification and publication type, are also crucial. Where these elements are missing or inconsistent, machine learning models are used to normalize and infer them.

The French model draws on diverse data sources, combining global databases with institutional repositories. Dedicated tools support this work: an affiliation matcher, and Text and Data Mining (TDM) of full texts to detect mentions of datasets and software. The paper also highlights the use of open tools such as Unpaywall for determining OA status and Grobid for parsing scholarly PDFs, reflecting a modular and adaptable software infrastructure.
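To make the OA-status step more concrete, here is a minimal Python sketch of how records shaped like Unpaywall responses could be grouped by hosting type and summarized as an open access rate. The field names is_oa, oa_locations and host_type follow the Unpaywall data format; the category labels, function names and sample DOIs are illustrative assumptions, not the Monitor's actual code.

```python
from collections import Counter


def oa_host_category(record):
    """Classify one Unpaywall-style record by where its open copy is hosted."""
    if not record.get("is_oa"):
        return "closed"
    locations = record.get("oa_locations") or []
    host_types = {location.get("host_type") for location in locations}
    has_publisher = "publisher" in host_types
    has_repository = "repository" in host_types
    if has_publisher and has_repository:
        return "publisher;repository"
    if has_publisher:
        return "publisher"
    if has_repository:
        return "repository"
    return "unknown"  # flagged as open, but no usable location listed


def oa_rate(records):
    """Return (share of open records, count of records per category)."""
    categories = Counter(oa_host_category(record) for record in records)
    total = sum(categories.values())
    open_count = total - categories["closed"]
    rate = open_count / total if total else 0.0
    return rate, dict(categories)


# Hypothetical sample records; the DOIs are placeholders, not real publications.
sample = [
    {"doi": "10.1234/example.1", "is_oa": False, "oa_locations": []},
    {"doi": "10.1234/example.2", "is_oa": True,
     "oa_locations": [{"host_type": "publisher"}]},
    {"doi": "10.1234/example.3", "is_oa": True,
     "oa_locations": [{"host_type": "repository"}]},
    {"doi": "10.1234/example.4", "is_oa": True,
     "oa_locations": [{"host_type": "publisher"}, {"host_type": "repository"}]},
]

rate, breakdown = oa_rate(sample)
print(f"open access rate: {rate:.0%}")
print(breakdown)
```

Keeping the classification of a single record separate from the aggregation makes it easy to swap in different category definitions without touching the rate computation.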

Technical and Computational Considerations

Bracco et al. describe the computational architecture underpinning the French OSM. The infrastructure is orchestrated with Kubernetes on the OVH public cloud and comprises multiple servers dedicated to specific tasks such as data acquisition, TDM analysis, and metadata storage with Elasticsearch. The implementation demands significant computational resources, with costs driven largely by the TDM analysis of PDFs.

Human resources are a further consideration: domain expertise in open science and scholarly communication is indispensable. The authors estimate the initial implementation effort at approximately 2 Full-Time Equivalents (FTEs), with ongoing maintenance requiring around 0.5 FTE per year.

Key Challenges and Solutions

The paper identifies several methodological challenges inherent in open science monitoring. Corpus creation often encounters issues in defining research output scopes and handling duplicates, necessitating strategies such as PID prioritization and metadata merging. Efficient author disambiguation and affiliation attribution are facilitated through tools like Grobid and the Works-magnet, alongside leveraging extensive national registries.
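The deduplication strategy mentioned above (PID prioritization combined with metadata merging) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the priority order in PID_PRIORITY, the lowercase normalization, the first-value-wins merge rule and the sample records are all hypothetical.

```python
# Assumed priority order: a DOI match is preferred over a HAL id, then a PMID.
PID_PRIORITY = ("doi", "hal_id", "pmid")


def normalize_pid(value):
    """Normalize an identifier for comparison (assumption: case-insensitive)."""
    return str(value).strip().lower()


def merge_records(records):
    """Merge records that share a PID; earlier records win on field conflicts."""
    merged = []  # merged records, in order of first appearance
    index = {}   # (pid_type, normalized value) -> position in merged
    for record in records:
        pids = [
            (pid_type, normalize_pid(record[pid_type]))
            for pid_type in PID_PRIORITY
            if record.get(pid_type)
        ]
        # Look up an existing record, trying the highest-priority PID first.
        position = next((index[pid] for pid in pids if pid in index), None)
        if position is None:
            merged.append({})
            position = len(merged) - 1
        target = merged[position]
        # Fill only the fields that are still missing in the merged record.
        for field, value in record.items():
            if value and not target.get(field):
                target[field] = value
        for pid in pids:
            index.setdefault(pid, position)
    return merged


# Hypothetical records describing the same work in three different sources.
sources = [
    {"doi": "10.1234/EXAMPLE", "title": "Open science monitoring", "hal_id": None},
    {"doi": "10.1234/example", "hal_id": "hal-0001", "year": 2023},
    {"hal_id": "hal-0001", "title": "Open science monitoring (preprint)"},
    {"title": "A record without any identifier"},
]

deduplicated = merge_records(sources)
print(len(deduplicated), "records after merging")
```

Note that this sketch does not merge two already separate records when a later record turns out to link them, and records without any PID are kept as is; a production pipeline would need extra handling for both cases.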

For corpus enrichment, machine learning models support the analysis of additional dimensions, most visibly discipline classification and open access status determination. Estimating article processing charges (APCs) proves complex, because publications often involve several collaborating institutions and the terms of transformative agreements vary.

Implications and Future Directions

The establishment of the French OSM is an important step toward formalizing open science monitoring through quantitative, systematic methods. The paper stresses that the monitoring framework must be adaptable to different national and institutional contexts, and suggests that this adaptability could help standardize open science metrics globally.

Future work could focus on refining TDM models to better detect dataset and software mentions, improving APC estimation procedures, and possibly extending database coverage beyond European directives. In addition, relying on open infrastructures could reduce operational costs and strengthen the prospects for a shared global framework.

In conclusion, Bracco et al.'s work on the French Open Science Monitor provides a robust blueprint for countries and institutions seeking to establish comprehensive monitoring systems for open science policies. Even with its detailed methodology and attention to open-access indicators, the work leaves room for further exploration of real-time adaptation, more refined computational models, and potentially broader applications across different legal and scientific ecosystems.
