
PRISMA-DFLLM: LLM-Enhanced Systematic Reviews

Updated 21 February 2026
  • PRISMA-DFLLM is an extension of PRISMA guidelines that integrates domain-specific, fine-tuned LLMs to automate and update systematic reviews.
  • It employs a modular multi-agent system that validates protocols, assesses methodologies, and computes compliance scores for transparent reporting.
  • The framework enhances evidence synthesis efficiency with reproducible metrics, achieving up to 84% exact match with expert evaluations.

PRISMA-DFLLM refers to the extension of traditional systematic literature review (SLR) guidelines and methodologies to incorporate domain-specific, fine-tuned LLMs throughout the SLR process. Building on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 standards, PRISMA-DFLLM specifies technical, methodological, and reporting requirements for transparent, reproducible, and robust integration of LLM-based tools into evidence synthesis workflows, supporting "living" systematic reviews and domain-adapted automation (Mushtaq et al., 21 Sep 2025, Susnjak, 2023).

1. Definition, Motivation, and Scope

PRISMA-DFLLM is an explicit extension of the PRISMA 2020 reporting guidelines for SLRs, incorporating end-to-end use of domain-specific, fine-tuned LLMs. It establishes additional reporting categories for the construction and use of fine-tuned models, including dataset creation, LLM fine-tuning processes, model evaluation, and ethical/legal considerations.

The rationale for PRISMA-DFLLM is twofold. First, general-purpose LLMs (e.g., GPT-3.5, LLaMA) lack sufficient coverage of domain-specialized terminologies and methodologies. They may hallucinate or omit critical evidence, limiting reliability for fields requiring precise synthesis. Second, integrating LLMs fine-tuned on the papers identified in a rigorous SLR pipeline yields expert systems capable of nuanced extraction, robust updating, and incremental knowledge synthesis—directly supporting the paradigm of living reviews (Susnjak, 2023).

2. Technical Architecture and Workflow

PRISMA-DFLLM operationalizes model usage in the SLR lifecycle through a modular, interpretable multi-agent system. Four primary agents are orchestrated by a Coordinator module:

  • Protocol Validator: Assesses the match between the review's protocol and PRISMA items 1–5 (title, abstract, registration, eligibility).
  • Methodology Assessor: Audits methods sections per PRISMA 6–16 (information sources, search strategy, data collection, risk of bias, etc.).
  • Topic Relevance Checker: Validates thematic coherence vs. PRISMA 3–4, 17–23 (PICO/PECO alignment, objectives).
  • Reporting Completeness Officer: Reviews results, discussion, and funding (PRISMA 19–27).

The Coordinator splits SLR content by PRISMA checklist item, dispatches section-task tuples to agents, aggregates the binary outputs, and computes the overall PRISMA compliance score as

$$\text{PRISMA\_Score} = \frac{1}{27} \sum_{i=1}^{27} s_i, \qquad s_i \in \{0, 1\}$$

A general prompt schema is used by each agent (“Does the text satisfy PRISMA item {i}?” with itemized instructions and output format), facilitating modular prompt engineering and domain adaptation (Mushtaq et al., 21 Sep 2025).

Example Prompt Template:

You are the <AgentRole> assigned to evaluate PRISMA checklist item #{i}: "<ChecklistItemDescription>".
Here is the extracted text from the systematic review:
{SectionText}
Question: Does the text satisfy PRISMA item {i}?
Answer with:
 - score: 1 if YES, 0 if NO
 - brief justification
Format your output as:
score: <0|1>
justification: <your text>
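As a minimal sketch of how the Coordinator might fill this shared template and parse an agent's reply (helper names are illustrative, not from the framework):

```python
import re

# Shared prompt schema from the framework; placeholders filled per (agent, item) pair.
PROMPT_TEMPLATE = """You are the {agent_role} assigned to evaluate PRISMA checklist item #{item}: "{description}".
Here is the extracted text from the systematic review:
{section_text}
Question: Does the text satisfy PRISMA item {item}?
Answer with:
 - score: 1 if YES, 0 if NO
 - brief justification
Format your output as:
score: <0|1>
justification: <your text>"""

def build_prompt(agent_role, item, description, section_text):
    """Fill the template for one (agent, checklist item) dispatch."""
    return PROMPT_TEMPLATE.format(agent_role=agent_role, item=item,
                                  description=description, section_text=section_text)

def parse_reply(reply):
    """Extract the binary score and justification from an agent's formatted reply."""
    score = int(re.search(r"score:\s*([01])", reply).group(1))
    justification = re.search(r"justification:\s*(.+)", reply, re.S).group(1).strip()
    return score, justification
```

Keeping the template in one place is what makes the per-agent prompts modular: domain adaptation only swaps the item descriptions and in-context examples, not the parsing logic.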

The technical workflow for end-to-end PRISMA-DFLLM SLR involves:

  1. Data Preparation: PRISMA-compliant reference search and screening → PDF extraction (tools such as pdfminer, Grobid) → metadata and section segmentation.
  2. Fine-Tuning: Applying PEFT methods (LoRA, Adapters, QLoRA) on filtered corpora; composite loss formulation for multi-task SLR objectives.
  3. Deployment: Agents ingest review sections, apply checklist-linked evaluation, and output binary scores plus justifications.
  4. Aggregation: Overall score computation, inter-agent consistency checks, and optional escalation to human review.
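The aggregation step reduces to averaging the 27 binary agent verdicts; a minimal sketch (assuming Python ≥ 3.9, per the stack described below):

```python
def prisma_score(item_scores: dict[int, int]) -> float:
    """Compute PRISMA_Score = (1/27) * sum of s_i over checklist items 1..27.

    item_scores maps each checklist item number to its binary agent verdict;
    items with no recorded verdict count as 0 (not satisfied).
    """
    if not all(s in (0, 1) for s in item_scores.values()):
        raise ValueError("agent outputs must be binary")
    return sum(item_scores.get(i, 0) for i in range(1, 28)) / 27

# Example: 23 of 27 items judged satisfied -> score of 23/27 (about 0.85)
scores = {i: 1 for i in range(1, 24)} | {i: 0 for i in range(24, 28)}
```

Treating missing items as failures (rather than skipping them) keeps the denominator fixed at 27, so scores stay comparable across reviews.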

3. Extended PRISMA-DFLLM Checklist

PRISMA-DFLLM introduces new categories to augment the PRISMA checklist with LLM-specific requirements. Items 1–15 are unchanged; items 16–31 capture dataset provenance, model finetuning, evaluation, and compliance:

Item | Focus              | Example Documentation
-----|--------------------|-------------------------------------------------------
16   | Finetuning Dataset | Preprocessing, format, augmentation, composition
17   | LLM Finetuning     | Model specs, PEFT strategy, training settings, post-FT
18   | LLM Evaluation     | Perplexity, stability, qualitative analysis, metrics
31   | LLM Legal/Ethical  | Ethical/legal implications, compliance, licensing

Illustrative snippets:

  • 16a: "Cleaned PDFs with Grobid v0.7; stripped citation markers via regex ‘\[\d+\]’."
  • 17b: "QLoRA: 4-bit quantization + LoRA adapters, rank 4."
  • 18a: "Baseline perplexity on domain test set: 16.2; post-FT perplexity: 8.6."
  • 31b: "All papers open access or fair-use licensed. Permissions obtained where required." (Susnjak, 2023).

This checklist ensures rigorous transparency for any SLR leveraging LLM-driven automation.
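Because the extension is itemized, the added entries lend themselves to a structured encoding (field names below are illustrative, not prescribed by the guideline):

```python
import json

# Illustrative encoding of the LLM-specific checklist extension (selected items).
EXTENDED_ITEMS = [
    {"item": 16, "focus": "Finetuning Dataset",
     "documentation": "Preprocessing, format, augmentation, composition"},
    {"item": 17, "focus": "LLM Finetuning",
     "documentation": "Model specs, PEFT strategy, training settings, post-FT"},
    {"item": 18, "focus": "LLM Evaluation",
     "documentation": "Perplexity, stability, qualitative analysis, metrics"},
    {"item": 31, "focus": "LLM Legal/Ethical",
     "documentation": "Ethical/legal implications, compliance, licensing"},
]

# A JSON serialization of this kind is what lets the checklist be extended
# without touching agent code (see the JSON schema mechanism in Section 5).
schema_json = json.dumps(EXTENDED_ITEMS, indent=2)
```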

4. Empirical Validation and Evaluation

Empirical benchmarks of PRISMA-DFLLM frameworks demonstrate robust alignment with human annotators. In an initial study on five SLRs across Medicine, Computer Science, Environmental Science, Psychology, and Engineering, the multi-agent system achieved:

  • 84% exact match with expert PRISMA item-level labels
  • Cohen's $\kappa = 0.75$ (substantial agreement)
  • Category-level accuracies: Title/Abstract 88%, Methods 82%, Results 80%, Discussion & Limitations 86%, Funding 90%

Validation relied on comparison against dual independent human assessments, with disagreements resolved by majority-vote adjudication. Category-level performance below 80% is flagged for targeted prompt refinement (Mushtaq et al., 21 Sep 2025).

LLM evaluation, as specified by PRISMA-DFLLM, includes perplexity audits, variance across seeds, human rating of generated content, retrieval/summarization metrics (precision@10, ROUGE-1/2/L), and qualitative assessment of factual consistency.
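The two headline agreement metrics above can be computed directly from paired item-level labels; a minimal sketch for the binary case (pure Python, no stats library assumed):

```python
def exact_match(pred, gold):
    """Fraction of item-level labels where the system agrees with the experts."""
    assert len(pred) == len(gold) and len(gold) > 0
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohens_kappa(pred, gold):
    """Cohen's kappa for binary labels: agreement corrected for chance."""
    n = len(gold)
    po = exact_match(pred, gold)          # observed agreement
    p1_pred = sum(pred) / n               # P(system labels 1)
    p1_gold = sum(gold) / n               # P(experts label 1)
    # Chance agreement: both say 1, or both say 0, under independence.
    pe = p1_pred * p1_gold + (1 - p1_pred) * (1 - p1_gold)
    return (po - pe) / (1 - pe)
```

Kappa matters here because PRISMA labels are imbalanced (most items in a well-reported SLR are satisfied), so raw exact match overstates agreement relative to chance.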

5. Implementation Models and Dynamic Extensions

PRISMA-DFLLM supports both fixed and dynamic deployment models. The technical stack comprises Python ≥ 3.9, PyTorch ≥ 1.12, HuggingFace Transformers & PEFT libraries, pdfminer/Grobid, and FAISS for similarity-based retrieval and update of domain exemplars (Susnjak, 2023).

Dynamic Few-Shot Prompting is enabled by maintaining item- and domain-specific exemplar pools, with embedding similarity used to retrieve in-context examples. Agents may incorporate chain-of-thought and self-consistency strategies (multiple sampling with majority voting) to boost reliability, especially on borderline decisions. An Aggregator agent ensures global consistency and resolves checklist response conflicts; a JSON schema allows seamless checklist extension and template management.
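As a sketch of the dynamic few-shot retrieval step, using brute-force cosine similarity over a small in-memory pool as a stand-in for a FAISS index (vectors here are placeholders for real text embeddings):

```python
import numpy as np

def retrieve_exemplars(query_vec, pool_vecs, pool_texts, k=2):
    """Return the k exemplars whose embeddings are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity of each exemplar vs. query
    top = np.argsort(-sims)[:k]      # indices of the k most similar exemplars
    return [pool_texts[i] for i in top]
```

In the full framework the retrieved exemplars would be spliced into the agent prompt as in-context examples before the checklist question, keeping each prompt both item- and domain-specific.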

The modular design enables rapid adaptation to new domains (e.g., ecology) via minimal additional in-domain exemplars, and only requires updating the checklist schema and prompts, not core agent code (Mushtaq et al., 21 Sep 2025).

6. Challenges, Advantages, and Future Directions

Challenges

  • PDF Extraction: Accurate parsing of sectioned academic PDF content, especially for tables/figures, remains nontrivial. Integrated vision-NLP tools or human-in-loop pipelines are recommended.
  • PEFT Optimization: Trade-offs between LoRA, QLoRA, and Adapter approaches require validation and possible meta-learning for automated selection.
  • Alignment and Bias: Hallucinations and latent bias (e.g., gender/ethnic skew) necessitate adversarial bias audits and corpus diversification.
  • Legal Compliance: Restrictions surrounding paywalled content, licensing, and GDPR compliance must be formally tracked and documented (cf. checklist item 31) (Susnjak, 2023).

Advantages

  • Efficiency: Automation of screening, data extraction, and synthesis reduces typical SLR timelines from months to weeks.
  • Reusability and Scalability: Finetuned models and datasets become shareable research assets, scaling effortlessly to thousands of documents.
  • Living Reviews: Embedding-based retrieval enables rapid incremental updating and model adaptation with minimal additional training.
  • Modularity and Interpretability: Agentic design supports explainable outputs, fine-grained debugging, and robust, reproducible reporting (Mushtaq et al., 21 Sep 2025).

Future Research

Immediate methodological priorities include robust PDF-to-text/table curation tools, benchmarking PEFT techniques on SLR tasks, and augmenting human-in-loop evaluation. Medium-term goals incorporate ensemble adapters and active learning for domain knowledge refinement. Longer-term challenges entail seamless visual data ingestion, automated PEFT meta-learning, and frameworks for legal/ethical compliance in public LLM distribution (Susnjak, 2023).

7. Workflow Overview

The standard PRISMA-DFLLM pipeline is visualized below:

┌────────┐    ┌─────────┐      ┌───────────────┐
│PRISMA  │─▶  │Screening│──▶   │Included Papers│
│Search  │    │Assistant│      │    (PDF)      │
└────────┘    └────┬────┘      └─────┬─────────┘
         extract & clean         │
                                 ▼
                        ┌────────────┐
                        │ Preprocess │
                        │(Text+Meta) │
                        └────┬───────┘
                             │
                             ▼
                     ┌────────────────┐
                     │ Fine-tune LLM  │
                     └───┬────────────┘
                         │
                         ▼
             ┌──────────────────────┐
             │ Indexed Embeddings   │◀────────────┐
             │ & Model Artifacts    │             │
             └─────────┬────────────┘             │
                       │           queries        │
                       ▼                          │
       ┌────────────┐┌─────────────────┐          │
       │ Retrieval  ││ Summarization   │          │
       │ Engine     ││ & Extraction    │          │
       └────────────┘└───────┬─────────┘          │
                             │                    │
                             ▼                    │
                        ┌─────────────┐           │
                        │ Synthesis   │           │
                        │(Meta-       │           │
                        │ analysis)   │           │
                        └─────────────┘           │

This outlines the integration of LLMs at every stage, from initial inclusion/exclusion to synthesis and reporting (Susnjak, 2023).


PRISMA-DFLLM constitutes a comprehensive, modular approach for leveraging domain-adapted LLMs in systematic review practice, codified by extensions to both technical workflows and reporting standards. The result is a transparent, scalable, and auditable foundation for next-generation evidence synthesis, with empirical support for reliability and adaptability across disciplines (Mushtaq et al., 21 Sep 2025, Susnjak, 2023).
