PRISMA-DFLLM: LLM-Enhanced Systematic Reviews
- PRISMA-DFLLM is an extension of PRISMA guidelines that integrates domain-specific, fine-tuned LLMs to automate and update systematic reviews.
- It employs a modular multi-agent system that validates protocols, assesses methodologies, and computes compliance scores for transparent reporting.
- The framework enhances evidence synthesis efficiency with reproducible metrics, achieving up to 84% exact match with expert evaluations.
PRISMA-DFLLM refers to the extension of traditional systematic literature review (SLR) guidelines and methodologies to incorporate domain-specific, fine-tuned LLMs throughout the SLR process. Building on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 standards, PRISMA-DFLLM specifies technical, methodological, and reporting requirements for transparent, reproducible, and robust integration of LLM-based tools into evidence synthesis workflows, supporting "living" systematic reviews and domain-adapted automation (Mushtaq et al., 21 Sep 2025, Susnjak, 2023).
1. Definition, Motivation, and Scope
PRISMA-DFLLM is an explicit extension of the PRISMA 2020 reporting guidelines for SLRs, incorporating end-to-end use of domain-specific, fine-tuned LLMs. It establishes additional reporting categories for the construction and use of fine-tuned models, including dataset creation, LLM fine-tuning processes, model evaluation, and ethical/legal considerations.
The rationale for PRISMA-DFLLM is twofold. First, general-purpose LLMs (e.g., GPT-3.5, LLaMA) lack sufficient coverage of domain-specialized terminologies and methodologies. They may hallucinate or omit critical evidence, limiting reliability for fields requiring precise synthesis. Second, integrating LLMs fine-tuned on the papers identified in a rigorous SLR pipeline yields expert systems capable of nuanced extraction, robust updating, and incremental knowledge synthesis—directly supporting the paradigm of living reviews (Susnjak, 2023).
2. Technical Architecture and Workflow
PRISMA-DFLLM operationalizes model usage in the SLR lifecycle through a modular, interpretable multi-agent system. Four primary agents are orchestrated by a Coordinator module:
- Protocol Validator: Assesses the match between the review's protocol and PRISMA items 1–5 (title, abstract, registration, eligibility).
- Methodology Assessor: Audits methods sections per PRISMA 6–16 (information sources, search strategy, data collection, risk of bias, etc.).
- Topic Relevance Checker: Validates thematic coherence against PRISMA items 3–4 and 17–23 (PICO/PECO alignment, objectives).
- Reporting Completeness Officer: Reviews results, discussion, and funding (PRISMA 19–27).
The Coordinator splits SLR content by PRISMA checklist item, dispatches section-task tuples to agents, aggregates the binary outputs, and computes the overall PRISMA compliance score as the fraction of satisfied items, Score = (1/N) Σ_{i=1}^{N} s_i, where s_i ∈ {0, 1} is the verdict for checklist item i and N is the number of items evaluated.
A general prompt schema is used by each agent (“Does the text satisfy PRISMA item {i}?” with itemized instructions and output format), facilitating modular prompt engineering and domain adaptation (Mushtaq et al., 21 Sep 2025).
Example Prompt Template:

```text
You are the <AgentRole> assigned to evaluate PRISMA checklist item #{i}: "<ChecklistItemDescription>".
Here is the extracted text from the systematic review:
{SectionText}
Question: Does the text satisfy PRISMA item {i}?
Answer with:
- score: 1 if YES, 0 if NO
- brief justification
Format your output as:
score: <0|1>
justification: <your text>
```
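A minimal sketch of instantiating this schema and parsing an agent's reply (helper names are illustrative; the papers do not prescribe an implementation):

```python
import re

# Shared prompt schema; placeholders are filled per agent/item pair.
PROMPT_TEMPLATE = """You are the {agent_role} assigned to evaluate PRISMA checklist item #{item}: "{item_description}".
Here is the extracted text from the systematic review:
{section_text}
Question: Does the text satisfy PRISMA item {item}?
Answer with:
- score: 1 if YES, 0 if NO
- brief justification
Format your output as:
score: <0|1>
justification: <your text>"""

def build_prompt(agent_role, item, item_description, section_text):
    """Instantiate the shared prompt schema for one agent/item pair."""
    return PROMPT_TEMPLATE.format(
        agent_role=agent_role, item=item,
        item_description=item_description, section_text=section_text)

def parse_agent_output(raw):
    """Extract the binary score and justification from an agent reply
    that follows the requested output format."""
    score = int(re.search(r"score:\s*([01])", raw).group(1))
    justification = re.search(r"justification:\s*(.+)", raw, re.S).group(1).strip()
    return score, justification
```

The fixed output format is what makes the downstream binary aggregation mechanical rather than another NLP problem.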
The technical workflow for an end-to-end PRISMA-DFLLM SLR involves:
- Data Preparation: PRISMA-compliant reference search and screening → PDF extraction (tools such as pdfminer, Grobid) → metadata and section segmentation.
- Fine-Tuning: Applying PEFT methods (LoRA, Adapters, QLoRA) on filtered corpora; composite loss formulation for multi-task SLR objectives.
- Deployment: Agents ingest review sections, apply checklist-linked evaluation, and output binary scores plus justifications.
- Aggregation: Overall score computation, inter-agent consistency checks, and optional escalation to human review.
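The aggregation step above can be sketched as follows; the 0.8 escalation threshold is an illustrative assumption, not a value from the papers:

```python
def compliance_score(item_scores):
    """Aggregate binary agent outputs (item number -> 0/1) into the overall
    PRISMA compliance score: the fraction of checklist items judged satisfied."""
    if not item_scores:
        raise ValueError("no checklist items scored")
    return sum(item_scores.values()) / len(item_scores)

def needs_escalation(item_scores, threshold=0.8):
    """Flag a review for human follow-up when compliance falls below a
    threshold (0.8 is an illustrative choice)."""
    return compliance_score(item_scores) < threshold
```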
3. Extended PRISMA-DFLLM Checklist
PRISMA-DFLLM introduces new categories to augment the PRISMA checklist with LLM-specific requirements. Items 1–15 are unchanged; new items 16–31 capture dataset provenance, model fine-tuning, evaluation, and compliance (selected items shown):
| Item | Focus | Example Documentation |
|---|---|---|
| 16 | Finetuning Dataset | Preprocessing, format, augmentation, composition |
| 17 | LLM Finetuning | Model specs, PEFT strategy, training settings, post-FT |
| 18 | LLM Evaluation | Perplexity, stability, qualitative analysis, metrics |
| 31 | LLM Legal/Ethical | Ethical/Legal implications, compliance, licensing |
Illustrative snippets:
- 16a: "Cleaned PDFs with Grobid v0.7; stripped bracketed citations via regex `\[\d+\]`."
- 17b: "QLoRA: 4-bit quantization + LoRA adapters, rank 4."
- 18a: "Baseline perplexity on domain test set: 16.2; post-FT perplexity: 8.6."
- 31b: "All papers open access or fair-use licensed. Permissions obtained where required." (Susnjak, 2023).
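The perplexity figures quoted for item 18a follow the standard definition, the exponential of the mean per-token negative log-likelihood; a minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative log-likelihood,
    the quantity reported for checklist item 18a."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```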
This checklist ensures rigorous transparency for any SLR leveraging LLM-driven automation.
4. Empirical Validation and Evaluation
Empirical benchmarks of PRISMA-DFLLM frameworks demonstrate robust alignment with human annotators. In an initial study on five SLRs across Medicine, Computer Science, Environmental Science, Psychology, and Engineering, the multi-agent system achieved:
- 84% exact match with expert PRISMA item-level labels
- Cohen's κ in the substantial-agreement range
- Category-level accuracies: Title/Abstract 88%, Methods 82%, Results 80%, Discussion & Limitations 86%, Funding 90%
Validation relied on comparison to double-independent human assessment with majority-vote adjudication. Category-level performance below 80% is flagged for targeted prompt refinement (Mushtaq et al., 21 Sep 2025).
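The item-level agreement statistic can be computed as standard Cohen's κ for two binary raters (function name is illustrative):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary raters, e.g., system verdicts vs.
    adjudicated human labels at the PRISMA-item level."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n          # rater A's rate of positive labels
    p_b1 = sum(labels_b) / n          # rater B's rate of positive labels
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement
    return (observed - expected) / (1 - expected)
```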
LLM evaluation, as specified by PRISMA-DFLLM, includes perplexity audits, variance across seeds, human rating of generated content, retrieval/summarization metrics (precision@10, ROUGE-1/2/L), and qualitative assessment of factual consistency.
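For the retrieval audit, precision@k admits a one-line definition; a minimal sketch with an illustrative function name:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=10):
    """precision@k: fraction of the top-k retrieved documents that appear
    in the set of relevant documents."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k
```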
5. Implementation Models and Dynamic Extensions
PRISMA-DFLLM supports both fixed and dynamic deployment models. The technical stack comprises Python ≥ 3.9, PyTorch ≥ 1.12, HuggingFace Transformers & PEFT libraries, pdfminer/Grobid, and FAISS for similarity-based retrieval and update of domain exemplars (Susnjak, 2023).
Dynamic Few-Shot Prompting is enabled by maintaining item- and domain-specific exemplar pools, with embedding similarity used to retrieve in-context examples. Agents may incorporate chain-of-thought and self-consistency strategies (multiple sampling with majority voting) to boost reliability, especially on borderline decisions. An Aggregator agent ensures global consistency and resolves checklist response conflicts; a JSON schema allows seamless checklist extension and template management.
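The exemplar-retrieval step can be sketched with plain cosine similarity; FAISS would replace this linear scan in practice, and the names and toy two-dimensional embeddings are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_exemplars(query_vec, pool, top_n=3):
    """Rank an item-specific exemplar pool by embedding similarity to the
    query section and return the top-n exemplar texts for in-context
    prompting. `pool` maps exemplar text -> embedding vector."""
    ranked = sorted(pool.items(), key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_n]]
```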
The modular design enables rapid adaptation to new domains (e.g., ecology) via minimal additional in-domain exemplars, and only requires updating the checklist schema and prompts, not core agent code (Mushtaq et al., 21 Sep 2025).
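A checklist entry in such a JSON schema might look like the following; the field names are hypothetical, as the papers do not publish the schema itself:

```json
{
  "item": 18,
  "category": "LLM Evaluation",
  "description": "Report perplexity, output stability across seeds, and qualitative analysis of generated content.",
  "agent": "Methodology Assessor",
  "prompt_template": "prompts/item_18.txt",
  "exemplar_pool": "exemplars/item_18.jsonl"
}
```

Extending the framework to a new domain or checklist item then amounts to adding an entry like this plus exemplars, with no changes to agent code.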
6. Challenges, Advantages, and Future Directions
Challenges
- PDF Extraction: Accurate parsing of sectioned academic PDF content, especially for tables/figures, remains nontrivial. Integrated vision-NLP tools or human-in-loop pipelines are recommended.
- PEFT Optimization: Trade-offs between LoRA, QLoRA, and Adapter approaches require validation and possible meta-learning for automated selection.
- Alignment and Bias: Hallucinations and latent bias (e.g., gender/ethnic skew) necessitate adversarial bias audits and corpus diversification.
- Legal Compliance: Restrictions surrounding paywalled content, licensing, and GDPR compliance must be formally tracked and documented (cf. checklist item 31) (Susnjak, 2023).
Advantages
- Efficiency: Automation of screening, data extraction, and synthesis reduces typical SLR timelines from months to weeks.
- Reusability and Scalability: Finetuned models and datasets become shareable research assets, scaling effortlessly to thousands of documents.
- Living Reviews: Embedding-based retrieval enables rapid incremental updating and model adaptation with minimal additional training.
- Modularity and Interpretability: Agentic design supports explainable outputs, fine-grained debugging, and robust, reproducible reporting (Mushtaq et al., 21 Sep 2025).
Future Research
Immediate methodological priorities include robust PDF-to-text/table curation tools, benchmarking PEFT techniques on SLR tasks, and augmenting human-in-loop evaluation. Medium-term goals incorporate ensemble adapters and active learning for domain knowledge refinement. Longer-term challenges entail seamless visual data ingestion, automated PEFT meta-learning, and frameworks for legal/ethical compliance in public LLM distribution (Susnjak, 2023).
7. Workflow Overview
The standard PRISMA-DFLLM pipeline is visualized below:
```text
┌────────┐    ┌─────────┐    ┌───────────────┐
│ PRISMA │───▶│Screening│───▶│Included Papers│
│ Search │    │Assistant│    │     (PDF)     │
└────────┘    └─────────┘    └───────┬───────┘
                              extract & clean
                                     ▼
                              ┌────────────┐
                              │ Preprocess │
                              │(Text+Meta) │
                              └──────┬─────┘
                                     ▼
                             ┌───────────────┐
                             │ Fine-tune LLM │
                             └───────┬───────┘
                                     ▼
                         ┌──────────────────────┐
                         │  Indexed Embeddings  │◀──────────┐
                         │  & Model Artifacts   │           │
                         └──────────┬───────────┘           │
                                    │ queries               │
                                    ▼                       │
            ┌────────────┐ ┌─────────────────┐              │
            │ Retrieval  │ │  Summarization  │──────────────┘
            │  Engine    │ │  & Extraction   │
            └────────────┘ └────────┬────────┘
                                    ▼
                            ┌───────────────┐
                            │   Synthesis   │
                            │(Meta-analysis)│
                            └───────────────┘
```
This outlines the integration of LLMs at every stage, from initial inclusion/exclusion to synthesis and reporting (Susnjak, 2023).
PRISMA-DFLLM constitutes a comprehensive, modular approach for leveraging domain-adapted LLMs in systematic review practice, codified by extensions to both technical workflows and reporting standards. The result is a transparent, scalable, and auditable foundation for next-generation evidence synthesis, with empirical support for reliability and adaptability across disciplines (Mushtaq et al., 21 Sep 2025, Susnjak, 2023).