ARETE: an R package for Automated REtrieval from TExt with large language models

Published 6 Nov 2025 in cs.LG | (2511.04573v1)

Abstract: 1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Furthermore, researchers have to contend with an accelerated speed at which new information must be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information but their data are often not machine-readable, requiring extensive human work to be retrieved. 2. We present the ARETE R package, an open-source software aiming to automate data extraction of species occurrences powered by LLMs, namely using the ChatGPT Application Programming Interface. This R package integrates all steps of the data extraction and validation process, from Optical Character Recognition to detection of outliers and output in tabular format. Furthermore, we validate ARETE through systematic comparison between what is modelled and the work of human annotators. 3. We demonstrate the usefulness of the approach by comparing range maps produced using GBIF data with those automatically extracted for 100 species of spiders. Newly extracted data allowed the known Extent of Occurrence to be expanded by a mean of three orders of magnitude, revealing new areas where the species were found in the past, which may have important implications for spatial conservation planning and extinction risk assessments. 4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer in projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also allows predicting available bibliographic data during project planning.


Authors (5)

Summary

The paper introduces ARETE, an R package that automates species occurrence extraction from unstructured texts using large language models and robust validation protocols.
ARETE’s multi-stage pipeline preprocesses OCR outputs, normalizes species names, and employs fine-tuned LLMs, achieving up to a 0.958 F1 score in data extraction.
The system demonstrates cost-effective scalability by matching expert performance at up to 315x lower cost per data point compared to manual annotation.

ARETE: Automated Retrieval of Species Occurrence Data Using LLMs via R

Motivation and Context

This paper presents ARETE, an R package designed to automate the extraction of species occurrence data from unstructured textual sources by leveraging LLMs, specifically via the OpenAI API (GPT-3.5-turbo-1106 and GPT-4o-mini-2024-07-18). The motivation stems from pronounced gaps in spatial occurrence data, especially for invertebrate taxa. This gap, known as the Wallacean shortfall, is a fundamental impediment to rigorous conservation planning and extinction risk assessment because occurrence records are not easily accessible. Traditional biodiversity databases (e.g. GBIF, iNaturalist) are limited by data coverage and spatial biases, while substantial occurrence information remains trapped in scientific papers and gray literature that are not machine-readable.

Generative AI creates opportunities for rapid mobilization of biodiversity information, but prior attempts have lacked generalizable, standardized software packages and robust validation of extraction performance on ecological data.

ARETE Workflow and Technical Architecture

ARETE implements a multi-stage text-to-data pipeline:

Input Handling: Accepts PDF and TXT files, with OCR (via Nougat) for PDFs lacking embedded text. Species queries can be targeted or left open for all extractable data.
Preprocessing: Normalizes characters, standardizes species names, and removes ambiguous symbols, improving LLM prompt efficacy and downstream table formatting.
LLM Request Management: Splits processed text into manageable chunks according to context window limits and feeds them to carefully constructed prompts tailored for the chosen LLM API endpoint (currently OpenAI's ChatGPT). Prompts instruct the model to extract tabular data, reporting species, location, and precise coordinates only for in-text occurrences, explicitly filtering out references and non-relevant mentions.
Outlier Detection: Integrates post-processing via the gecko R package to flag anomalous records. These include geographic outliers (records at a high spatial distance from cluster centers) and environmental outliers, identified from multidimensional climate layers (WorldClim) and SVM-derived pseudo-SDMs (via kernlab's ksvm). Thresholds are configurable, and flagged points are not automatically excluded, leaving curation decisions to the user.
Validation Tools: The package computes accuracy, recall, precision, and F1 scores for extracted coordinates, inversely weighted by geographical error magnitude. Locality string similarity is measured with mean minimum Levenshtein distance to ground-truth annotations. All metrics are rendered as modular reports (.Rmd), preserving interpretability for each extraction batch.
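The chunking step listed under LLM Request Management can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not ARETE's R code: the function name `chunk_text`, the word-based budget, and the `overlap` parameter are all assumptions, and a real implementation would count tokens with the model's own tokenizer rather than words.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into word-bounded chunks that share a small overlap.

    Hypothetical helper: word counts stand in for model tokens, and the
    overlap reduces the chance of cutting an occurrence record in half.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(450))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk.split()))
```

Each chunk would then be sent to the model as a separate request, and the per-chunk tables merged afterwards.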

This pipeline abstracts LLM interaction, error handling, and model validation for non-expert R users in ecology and conservation biology, facilitating high-throughput extraction with configurable quality assurance.
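The geographic part of the outlier flagging described above can be sketched as follows. This is a simplified stand-in written in Python, not the method implemented in gecko: it assumes a single cluster, uses the medoid record as the cluster center, and takes a hypothetical `max_km` threshold. As in ARETE, records are only flagged, never removed.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def flag_geographic_outliers(points, max_km=500.0):
    """Return one boolean per (lat, lon) record: True if it lies farther
    than max_km from the medoid (the record closest to all others).

    Hypothetical helper; max_km is an illustrative, user-set threshold.
    """
    if not points:
        return []

    def total_distance(p):
        return sum(haversine_km(p[0], p[1], q[0], q[1]) for q in points)

    center = min(points, key=total_distance)
    return [haversine_km(p[0], p[1], center[0], center[1]) > max_km for p in points]


if __name__ == "__main__":
    records = [(38.7, -9.1), (38.8, -9.2), (38.6, -9.0), (60.2, 24.9)]
    print(flag_geographic_outliers(records))
```

Using the medoid rather than the mean keeps a single distant record from dragging the center toward itself.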

Model Evaluation and Fine-tuning

ARETE supports both off-the-shelf and fine-tuned LLMs. Fine-tuning is performed via the OpenAI API/browser, using the RECODE corpus, a manually annotated relational ecological dataset with explicit occurrence data for target taxa. The impact of fine-tuning is rigorously evaluated:

Out-of-the-box Model Performance (gpt-3.5-turbo-1106): On 50 annotated papers, baseline accuracy reached 0.714, recall 0.764, precision 0.917, and F1 score 0.833. Manual checks attribute approximately half of false positives to OCR errors rather than LLM extraction, with only minor incidence of incorrect species–location associations or hallucinated entries.
Fine-tuned Model Performance: K-fold manual scoring yields substantial improvement—accuracy to 0.921, recall to 0.972, precision to 0.945, F1 to 0.958—demonstrating empirical benefit of domain-specific fine-tuning on ecological extraction tasks.
Human vs. LLM Cost and Throughput: ARETE processes ~3.7 pages/min, slightly slower than human annotators at ~4.3 pages/min. Because it can run around the clock, it still reaches over twice the daily throughput, at up to 315x lower cost per data point using current GPT API pricing.
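The daily-throughput comparison can be checked with a few lines of arithmetic. The two per-minute rates are the ones reported above; the 8-hour human working day is an assumption made here, since the summary does not state the figure used.

```python
def daily_pages(pages_per_min, hours_per_day):
    """Pages processed per day at a constant rate."""
    return pages_per_min * 60 * hours_per_day


ARETE_RATE = 3.7   # pages/min, as reported above
HUMAN_RATE = 4.3   # pages/min, as reported above
HUMAN_HOURS = 8    # ASSUMPTION: length of an annotator's working day

arete_daily = daily_pages(ARETE_RATE, 24)           # continuous operation
human_daily = daily_pages(HUMAN_RATE, HUMAN_HOURS)
ratio = arete_daily / human_daily

print(f"ARETE: {arete_daily:.0f} pages/day")
print(f"Human: {human_daily:.0f} pages/day")
print(f"Ratio: {ratio:.2f}")
```

Under that assumption the ratio comes out at roughly 2.6, consistent with "over twice the daily throughput" even though the per-minute rate is lower.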

Weighted F1 scores and Levenshtein distances are used throughout as proxies for human interpretability. Performance is robust to most text-quality issues but degrades on documents with severe OCR errors or language irregularities. Outlier flagging and the validation reports support quality-focused use.
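One plausible reading of "mean minimum Levenshtein distance" is: for each extracted locality string, take the smallest edit distance to any ground-truth locality, then average over the extracted strings. The Python sketch below implements that reading; the interpretation and the function names are assumptions, not ARETE's documented definition.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_min_levenshtein(extracted, ground_truth):
    """Average, over extracted strings, of the distance to the closest annotation."""
    if not extracted or not ground_truth:
        raise ValueError("both lists must be non-empty")
    best = [min(levenshtein(e, g) for g in ground_truth) for e in extracted]
    return sum(best) / len(best)


if __name__ == "__main__":
    print(mean_min_levenshtein(["Lisboa", "Porto"], ["Lisbon", "Porto", "Faro"]))
```

Lower values mean the extracted locality strings sit closer to what the human annotators recorded; a value of 0 means every extracted string matches an annotation exactly.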

Case Study: Extraction for 100 Spider Species

ARETE was applied to retrieve species occurrence data for 100 spider species randomly selected from the World Spider Catalog, and the results were compared with GBIF records:

ARETE identified referenced data for 67% of sampled species vs. 60% in GBIF, with 44% overlap.
When considering precise long-lat entries, coverage is reduced (62%), commensurate with the rate of such data in English texts.
The inclusion of extracted records leads to a mean increase in EOO by three orders of magnitude (mean factor ~1949.84), sufficient in 41% of cases (9/22 species) to alter IUCN threat assessments using criterion B.
Most spatial data additions derive from literature not previously digitized in GBIF, resolving biases and highlighting substantial improvement in distributional knowledge.
The predominant errors remain centered around problematic source papers with historic OCR defects or poor formatting.
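To make the EOO comparison above concrete, here is a toy sketch in Python that computes an Extent of Occurrence as the area of the convex hull around occurrence points and maps it to the IUCN criterion B1 area bands (CR below 100 km², EN below 5,000 km², VU below 20,000 km²). It assumes the points are already projected to planar kilometre coordinates, which is a simplification; the coordinates are invented for illustration and are not data from the paper. A real assessment also requires the additional B1 subconditions.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def eoo_km2(points_km):
    """Shoelace area of the convex hull (planar coordinates in km)."""
    hull = convex_hull(points_km)
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0


def criterion_b1_band(eoo):
    """Area band only; the other B1 subconditions are not checked here."""
    if eoo < 100:
        return "CR"
    if eoo < 5000:
        return "EN"
    if eoo < 20000:
        return "VU"
    return "above B1 thresholds"


if __name__ == "__main__":
    database_only = [(0, 0), (5, 0), (5, 5), (0, 5)]                # invented points
    with_literature = database_only + [(200, 0), (200, 200), (0, 200)]
    before, after = eoo_km2(database_only), eoo_km2(with_literature)
    print(before, criterion_b1_band(before))
    print(after, criterion_b1_band(after))
    print("orders of magnitude:", math.floor(math.log10(after / before)))
```

For reference, the reported mean factor of ~1949.84 corresponds to log10 ≈ 3.29, which is where "three orders of magnitude" comes from.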

Practical Implications and Limitations

ARETE's modular architecture represents a scalable solution for augmenting biodiversity databases with literature-derived occurrence records. The system is competitive with expert annotators in ideal cases and substantially more cost-effective at scale. Explicit error classification (true/false positives/negatives) and outlier flagging let users weigh the risks relevant to conservation prioritization. The design favors precision, which reduces the risk of wrongly downgrading a species' threat status because its distribution was overestimated.
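The error classification into true positives, false positives, and false negatives maps onto the reported scores through the standard definitions. The Python sketch below uses invented counts and then checks that the F1 values reported in the evaluation follow from the reported precision and recall to within rounding. It does not reproduce ARETE's weighting of coordinate scores by geographic error.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard unweighted scores from error counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(precision_recall_f1(tp=90, fp=10, fn=30))

    # (precision, recall, reported F1) as given in the evaluation section.
    reported = {
        "gpt-3.5-turbo-1106 baseline": (0.917, 0.764, 0.833),
        "fine-tuned": (0.945, 0.972, 0.958),
    }
    for name, (p, r, f1) in reported.items():
        print(name, "recomputed:", f1_from(p, r), "reported:", f1)
```

A false positive adds a record where the species may not occur and so inflates the apparent range, which is why precision carries more weight than recall in this setting.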

Supporting tools for further validation (inter/intra-rater reliability), greater language coverage, automated taxonomy normalization, and support for additional LLMs (open-source and commercial APIs) are under active development. Current limitations include the restriction to English-language text, reliance on proprietary APIs, and the inherent opacity of proprietary black-box LLMs. Taxonomic synonymy and conceptual document interpretation remain beyond current capabilities.

Theoretical Significance and Future Prospects

The results substantiate that fine-tuned LLMs, when coupled with robust prompts and downstream QA analysis, can systematically extract complex relational biogeographic data from unstructured literature at scale. The approach is extensible to trait and ecological data, with implications for resolving historical data gaps and reducing the Wallacean shortfall across taxa. Extension towards open-source LLM integration, multilingual extraction, and taxonomy-aware normalization will further lower adoption barriers and improve coverage.

The practical increase in distribution knowledge directly affects threat modeling and conservation planning, demonstrating the capacity of AI-powered tools to reshape data-driven ecological inference, provided sufficient QA controls are maintained.

Conclusion

ARETE is a validated, high-performance R package for automated species occurrence extraction using LLMs. By abstracting complex text-to-data pipelines and integrating fine-tuning, outlier analysis, and rigorous validation, it enables rapid, cost-effective mobilization of occurrence data locked in literature. With its impact demonstrated for spiders, ARETE is poised to significantly enhance biodiversity informatics workflows, pending continued expansion of taxonomy- and language-aware features and integration with open-source AI systems.
