PRISMA-DFLLM: LLM-Enhanced Systematic Reviews
- PRISMA-DFLLM is an extension of PRISMA guidelines that integrates domain-specific, fine-tuned LLMs to automate and update systematic reviews.
- It employs a modular multi-agent system that validates protocols, assesses methodologies, and computes compliance scores for transparent reporting.
- The framework enhances evidence synthesis efficiency with reproducible metrics, achieving up to 84% exact match with expert evaluations.
PRISMA-DFLLM refers to the extension of traditional systematic literature review (SLR) guidelines and methodologies to incorporate domain-specific, fine-tuned LLMs throughout the SLR process. Building on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 standards, PRISMA-DFLLM specifies technical, methodological, and reporting requirements for transparent, reproducible, and robust integration of LLM-based tools into evidence synthesis workflows, supporting "living" systematic reviews and domain-adapted automation (Mushtaq et al., 21 Sep 2025, Susnjak, 2023).
1. Definition, Motivation, and Scope
PRISMA-DFLLM is an explicit extension of the PRISMA 2020 reporting guidelines for SLRs, incorporating end-to-end use of domain-specific, fine-tuned LLMs. It establishes additional reporting categories for the construction and use of fine-tuned models, including dataset creation, LLM fine-tuning processes, model evaluation, and ethical/legal considerations.
The rationale for PRISMA-DFLLM is twofold. First, general-purpose LLMs (e.g., GPT-3.5, LLaMA) lack sufficient coverage of domain-specialized terminologies and methodologies. They may hallucinate or omit critical evidence, limiting reliability for fields requiring precise synthesis. Second, integrating LLMs fine-tuned on the papers identified in a rigorous SLR pipeline yields expert systems capable of nuanced extraction, robust updating, and incremental knowledge synthesis—directly supporting the paradigm of living reviews (Susnjak, 2023).
2. Technical Architecture and Workflow
PRISMA-DFLLM operationalizes model usage in the SLR lifecycle through a modular, interpretable multi-agent system. Four primary agents are orchestrated by a Coordinator module:
- Protocol Validator: Assesses the match between the review's protocol and PRISMA items 1–5 (title, abstract, registration, eligibility).
- Methodology Assessor: Audits methods sections per PRISMA 6–16 (information sources, search strategy, data collection, risk of bias, etc.).
- Topic Relevance Checker: Validates thematic coherence against PRISMA items 3–4 and 17–23 (PICO/PECO alignment, objectives).
- Reporting Completeness Officer: Reviews results, discussion, and funding (PRISMA 19–27).
The Coordinator splits SLR content by PRISMA checklist item, dispatches section-task tuples to agents, aggregates the binary outputs, and computes the overall PRISMA compliance score as the fraction of satisfied items, Score = (1/N) Σ_{i=1}^{N} s_i, where s_i ∈ {0, 1} is the verdict for checklist item i and N is the number of items evaluated.
A general prompt schema is used by each agent (“Does the text satisfy PRISMA item {i}?” with itemized instructions and output format), facilitating modular prompt engineering and domain adaptation (Mushtaq et al., 21 Sep 2025).
Example Prompt Template:

```text
You are the <AgentRole> assigned to evaluate PRISMA checklist item #{i}: "<ChecklistItemDescription>".
Here is the extracted text from the systematic review:
{SectionText}
Question: Does the text satisfy PRISMA item {i}?
Answer with:
- score: 1 if YES, 0 if NO
- brief justification
Format your output as:
score: <0|1>
justification: <your text>
```
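A minimal sketch of instantiating this schema and parsing an agent's reply (helper names are illustrative; the papers do not prescribe an implementation):

```python
import re

# Shared prompt schema; placeholders are filled per agent/item pair.
PROMPT_TEMPLATE = """You are the {agent_role} assigned to evaluate PRISMA checklist item #{item}: "{item_description}".
Here is the extracted text from the systematic review:
{section_text}
Question: Does the text satisfy PRISMA item {item}?
Answer with:
- score: 1 if YES, 0 if NO
- brief justification
Format your output as:
score: <0|1>
justification: <your text>"""

def build_prompt(agent_role, item, item_description, section_text):
    """Instantiate the shared prompt schema for one agent/item pair."""
    return PROMPT_TEMPLATE.format(
        agent_role=agent_role, item=item,
        item_description=item_description, section_text=section_text)

def parse_agent_output(raw):
    """Extract the binary score and justification from an agent reply
    that follows the requested output format."""
    score = int(re.search(r"score:\s*([01])", raw).group(1))
    justification = re.search(r"justification:\s*(.+)", raw, re.S).group(1).strip()
    return score, justification
```

The fixed output format is what makes the downstream binary aggregation mechanical rather than another NLP problem.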
The technical workflow for an end-to-end PRISMA-DFLLM SLR involves:
- Data Preparation: PRISMA-compliant reference search and screening → PDF extraction (tools such as pdfminer, Grobid) → metadata and section segmentation.
- Fine-Tuning: Applying PEFT methods (LoRA, Adapters, QLoRA) on filtered corpora; composite loss formulation for multi-task SLR objectives.
- Deployment: Agents ingest review sections, apply checklist-linked evaluation, and output binary scores plus justifications.
- Aggregation: Overall score computation, inter-agent consistency checks, and optional escalation to human review.
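The aggregation step above can be sketched as follows; the 0.8 escalation threshold is an illustrative assumption, not a value from the papers:

```python
def compliance_score(item_scores):
    """Aggregate binary agent outputs (item number -> 0/1) into the overall
    PRISMA compliance score: the fraction of checklist items judged satisfied."""
    if not item_scores:
        raise ValueError("no checklist items scored")
    return sum(item_scores.values()) / len(item_scores)

def needs_escalation(item_scores, threshold=0.8):
    """Flag a review for human follow-up when compliance falls below a
    threshold (0.8 is an illustrative choice)."""
    return compliance_score(item_scores) < threshold
```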
3. Extended PRISMA-DFLLM Checklist
PRISMA-DFLLM introduces new categories to augment the PRISMA checklist with LLM-specific requirements. Items 1–15 are unchanged; new items 16–31 capture dataset provenance, model fine-tuning, evaluation, and compliance (selected items shown):
| Item | Focus | Example Documentation |
|---|---|---|
| 16 | Finetuning Dataset | Preprocessing, format, augmentation, composition |
| 17 | LLM Finetuning | Model specs, PEFT strategy, training settings, post-FT |
| 18 | LLM Evaluation | Perplexity, stability, qualitative analysis, metrics |
| 31 | LLM Legal/Ethical | Ethical/Legal implications, compliance, licensing |
Illustrative snippets:
- 16a: "Cleaned PDFs with Grobid v0.7; stripped bracketed citations via regex `\[\d+\]`."
- 17b: "QLoRA: 4-bit quantization + LoRA adapters, rank 4."
- 18a: "Baseline perplexity on domain test set: 16.2; post-FT perplexity: 8.6."
- 31b: "All papers open access or fair-use licensed. Permissions obtained where required." (Susnjak, 2023).
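The perplexity figures quoted for item 18a follow the standard definition, the exponential of the mean per-token negative log-likelihood; a minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative log-likelihood,
    the quantity reported for checklist item 18a."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```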
This checklist ensures rigorous transparency for any SLR leveraging LLM-driven automation.
4. Empirical Validation and Evaluation
Empirical benchmarks of PRISMA-DFLLM frameworks demonstrate robust alignment with human annotators. In an initial study on five SLRs across Medicine, Computer Science, Environmental Science, Psychology, and Engineering, the multi-agent system achieved:
- 84% exact match with expert PRISMA item-level labels
- Cohen's κ in the substantial-agreement range
- Category-level accuracies: Title/Abstract 88%, Methods 82%, Results 80%, Discussion & Limitations 86%, Funding 90%
Validation relied on comparison to double-independent human assessment with majority-vote adjudication. Category-level performance below 80% is flagged for targeted prompt refinement (Mushtaq et al., 21 Sep 2025).
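The item-level agreement statistic can be computed as standard Cohen's κ for two binary raters (function name is illustrative):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary raters, e.g., system verdicts vs.
    adjudicated human labels at the PRISMA-item level."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n          # rater A's rate of positive labels
    p_b1 = sum(labels_b) / n          # rater B's rate of positive labels
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement
    return (observed - expected) / (1 - expected)
```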
LLM evaluation, as specified by PRISMA-DFLLM, includes perplexity audits, variance across seeds, human rating of generated content, retrieval/summarization metrics (precision@10, ROUGE-1/2/L), and qualitative assessment of factual consistency.
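For the retrieval audit, precision@k admits a one-line definition; a minimal sketch with an illustrative function name:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=10):
    """precision@k: fraction of the top-k retrieved documents that appear
    in the set of relevant documents."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k
```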
5. Implementation Models and Dynamic Extensions
PRISMA-DFLLM supports both fixed and dynamic deployment models. The technical stack comprises Python ≥ 3.9, PyTorch ≥ 1.12, HuggingFace Transformers & PEFT libraries, pdfminer/Grobid, and FAISS for similarity-based retrieval and update of domain exemplars (Susnjak, 2023).
Dynamic Few-Shot Prompting is enabled by maintaining item- and domain-specific exemplar pools, with embedding similarity used to retrieve in-context examples. Agents may incorporate chain-of-thought and self-consistency strategies (multiple sampling with majority voting) to boost reliability, especially on borderline decisions. An Aggregator agent ensures global consistency and resolves checklist response conflicts; a JSON schema allows seamless checklist extension and template management.
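The exemplar-retrieval step can be sketched with plain cosine similarity; FAISS would replace this linear scan in practice, and the names and toy two-dimensional embeddings are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_exemplars(query_vec, pool, top_n=3):
    """Rank an item-specific exemplar pool by embedding similarity to the
    query section and return the top-n exemplar texts for in-context
    prompting. `pool` maps exemplar text -> embedding vector."""
    ranked = sorted(pool.items(), key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_n]]
```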
The modular design enables rapid adaptation to new domains (e.g., ecology) via minimal additional in-domain exemplars, and only requires updating the checklist schema and prompts, not core agent code (Mushtaq et al., 21 Sep 2025).
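A checklist entry in such a JSON schema might look like the following; the field names are hypothetical, as the papers do not publish the schema itself:

```json
{
  "item": 18,
  "category": "LLM Evaluation",
  "description": "Report perplexity, output stability across seeds, and qualitative analysis of generated content.",
  "agent": "Methodology Assessor",
  "prompt_template": "prompts/item_18.txt",
  "exemplar_pool": "exemplars/item_18.jsonl"
}
```

Extending the framework to a new domain or checklist item then amounts to adding an entry like this plus exemplars, with no changes to agent code.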
6. Challenges, Advantages, and Future Directions
Challenges
- PDF Extraction: Accurate parsing of sectioned academic PDF content, especially for tables/figures, remains nontrivial. Integrated vision-NLP tools or human-in-loop pipelines are recommended.
- PEFT Optimization: Trade-offs between LoRA, QLoRA, and Adapter approaches require validation and possible meta-learning for automated selection.
- Alignment and Bias: Hallucinations and latent bias (e.g., gender/ethnic skew) necessitate adversarial bias audits and corpus diversification.
- Legal Compliance: Restrictions surrounding paywalled content, licensing, and GDPR compliance must be formally tracked and documented (cf. checklist item 31) (Susnjak, 2023).
Advantages
- Efficiency: Automation of screening, data extraction, and synthesis reduces typical SLR timelines from months to weeks.
- Reusability and Scalability: Finetuned models and datasets become shareable research assets, scaling effortlessly to thousands of documents.
- Living Reviews: Embedding-based retrieval enables rapid incremental updating and model adaptation with minimal additional training.
- Modularity and Interpretability: Agentic design supports explainable outputs, fine-grained debugging, and robust, reproducible reporting (Mushtaq et al., 21 Sep 2025).
Future Research
Immediate methodological priorities include robust PDF-to-text/table curation tools, benchmarking PEFT techniques on SLR tasks, and augmenting human-in-loop evaluation. Medium-term goals incorporate ensemble adapters and active learning for domain knowledge refinement. Longer-term challenges entail seamless visual data ingestion, automated PEFT meta-learning, and frameworks for legal/ethical compliance in public LLM distribution (Susnjak, 2023).
7. Workflow Overview
The standard PRISMA-DFLLM pipeline is visualized below:
```text
┌────────┐    ┌─────────┐    ┌───────────────┐
│ PRISMA │───▶│Screening│───▶│Included Papers│
│ Search │    │Assistant│    │     (PDF)     │
└────────┘    └─────────┘    └───────┬───────┘
                              extract & clean
                                     ▼
                              ┌────────────┐
                              │ Preprocess │
                              │(Text+Meta) │
                              └──────┬─────┘
                                     ▼
                             ┌───────────────┐
                             │ Fine-tune LLM │
                             └───────┬───────┘
                                     ▼
                         ┌──────────────────────┐
                         │  Indexed Embeddings  │◀──────────┐
                         │  & Model Artifacts   │           │
                         └──────────┬───────────┘           │
                                    │ queries               │
                                    ▼                       │
            ┌────────────┐ ┌─────────────────┐              │
            │ Retrieval  │ │  Summarization  │──────────────┘
            │  Engine    │ │  & Extraction   │
            └────────────┘ └────────┬────────┘
                                    ▼
                            ┌───────────────┐
                            │   Synthesis   │
                            │(Meta-analysis)│
                            └───────────────┘
```
This outlines the integration of LLMs at every stage, from initial inclusion/exclusion to synthesis and reporting (Susnjak, 2023).
PRISMA-DFLLM constitutes a comprehensive, modular approach for leveraging domain-adapted LLMs in systematic review practice, codified by extensions to both technical workflows and reporting standards. The result is a transparent, scalable, and auditable foundation for next-generation evidence synthesis, with empirical support for reliability and adaptability across disciplines (Mushtaq et al., 21 Sep 2025, Susnjak, 2023).