PRISMA-Aligned Systematic Reviews
- PRISMA-aligned systematic reviews are evidence synthesis methods that strictly follow PRISMA guidelines to ensure transparency, reproducibility, and auditability in research.
- Computational implementations employ a multi-agent architecture of protocol validators, topic relevance checkers, duplicate detectors, and methodology assessors to automate study selection and evaluation.
- The approach enhances reproducibility by generating quantifiable compliance scores while supporting human oversight for nuanced criteria and error analysis.
A PRISMA-Aligned Systematic Review is an evidence synthesis methodology in which the collection, screening, appraisal, and reporting of primary research studies adhere explicitly to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. PRISMA-aligned reviews are designed to maximize transparency, reproducibility, and interpretability, ensuring that the inclusion/exclusion of studies, extraction of data, and synthesis of findings are standardized and auditable across disciplines. In recent years, developments in computational methods—including multi-agent LLM systems and explainable-AI-enhanced evaluation pipelines—have further operationalized PRISMA compliance, enabling both human- and machine-driven assessment of systematic literature review (SLR) quality and protocol fidelity (Mushtaq et al., 21 Sep 2025). The following sections elaborate the methodological foundations, agent architectures, checklist mappings, evaluation logic, applied metrics, and analytical results characteristic of PRISMA-aligned reviews, with particular reference to computational agent frameworks.
1. Architectural Foundations of PRISMA-Aligned Systematic Reviews
A PRISMA-aligned systematic review is composed of four canonical macro-phases: identification, screening, eligibility, and inclusion. Recent LLM-driven copilot systems implement these phases as a modular multi-agent architecture (Mushtaq et al., 21 Sep 2025):
- Protocol Validator: Assesses whether the SLR protocol specifies background, objectives, eligibility criteria, and registration (recording decisions for key PRISMA items 1, 2, 5, 24).
- Topic Relevance Checker: Evaluates the congruence of included studies with stated eligibility criteria and research objectives, mapping PRISMA items 3, 4, and 6.
- Duplicate Detector: Flags redundant or overlapping study records at the extraction phase (not a formal PRISMA item, but crucial for automation robustness).
- Methodology Assessor: Inspects SLR method, results, and discussion sections for conformance to PRISMA methodological requirements (covering items 7–22).
A central “Orchestrator” agent coordinates tasks in a fixed pipeline sequence: the Protocol Validator executes first; the Topic Relevance Checker and Duplicate Detector operate in parallel; finally, the Methodology Assessor completes the evaluation. Each sub-agent emits a structured report (e.g., JSON `{item_id: pass/fail/comment}`), and these reports are aggregated into a compliance vector available for human review or override.
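The fixed pipeline described above can be sketched as follows. The agent functions are illustrative stubs standing in for LLM-backed components (the paper does not publish its prompts or interfaces), but the staging—validator first, two agents concurrent, assessor last—matches the orchestration described:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for LLM-backed agents; each returns a dict of
# {item_id: {"status": "pass" | "fail", "comment": str}} judgments.
def protocol_validator(slr):
    return {"1": {"status": "pass", "comment": ""},
            "5": {"status": "fail", "comment": "No registration number"}}

def topic_relevance_checker(slr):
    return {"3": {"status": "pass", "comment": ""}}

def duplicate_detector(slr):
    return {"dedup": {"status": "pass", "comment": "No near-duplicates"}}

def methodology_assessor(slr):
    return {"7": {"status": "fail", "comment": "Search dates not explicit"}}

def orchestrate(slr):
    """Fixed pipeline: validator first, two agents in parallel, assessor last."""
    reports = {}
    reports.update(protocol_validator(slr))          # stage 1
    with ThreadPoolExecutor(max_workers=2) as pool:  # stage 2, concurrent
        futures = [pool.submit(topic_relevance_checker, slr),
                   pool.submit(duplicate_detector, slr)]
        for f in futures:
            reports.update(f.result())
    reports.update(methodology_assessor(slr))        # stage 3
    return reports  # compliance vector for human review or override

vector = orchestrate({"title": "Example SLR"})
```

The aggregated `vector` is what a human supervisor would inspect and, where necessary, override.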
2. Mapping PRISMA Checklist Items to Computational Agents
The PRISMA 2020 checklist contains 27 items; computational agent mapping selects a targeted, automatable subset:
| PRISMA Item | Responsible Agent | Coverage Detail / Example Prompt |
|---|---|---|
| 1, 2, 5, 24 | Protocol Validator | Checks for SLR title (“Systematic Review”), structured abstract, specified eligibility & registration |
| 3, 4, 6 | Topic Relevance Checker | Assesses rationale, objectives, and congruence of sources |
| Custom (dedup) | Duplicate Detector | Flags near-duplicates post-extraction |
| 7–22 | Methodology Assessor | Validates methods, bias assessment, synthesis, reporting biases, evidence certainty |
Prompts are designed to request a binary pass/fail judgment for each item, with optional comments detailing omissions. For example: “Given the review protocol, does it specify (a) a registration number/registry, (b) a clear population, intervention, comparator, and outcomes, and (c) timelines? Return yes/no and highlight missing elements.”
For topic relevance: “For each included study, do its topic, population, and intervention match the stated objectives? Return pass/fail.”
This mapping enables the division of labor in agentic systems and evaluative traceability across PRISMA domains.
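The mapping in the table above can be expressed as a simple routing structure; a sketch of such a table (identifier names are illustrative, and `"dedup"` is the custom non-PRISMA check):

```python
# Item-to-agent routing table mirroring the PRISMA mapping above.
AGENT_ITEMS = {
    "protocol_validator":      ["1", "2", "5", "24"],
    "topic_relevance_checker": ["3", "4", "6"],
    "duplicate_detector":      ["dedup"],                        # custom check
    "methodology_assessor":    [str(i) for i in range(7, 23)],   # items 7-22
}

def agent_for(item_id):
    """Return the agent responsible for a given PRISMA item."""
    for agent, items in AGENT_ITEMS.items():
        if item_id in items:
            return agent
    raise KeyError(f"No agent covers item {item_id}")
```

Such a table makes the division of labor explicit and lets a pipeline trace each checklist judgment back to the agent that produced it.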
3. Automated Evaluation Logic and Compliance Scoring
The evaluation logic for PRISMA-aligned reviews is operationalized as discrete item-wise assessment: each checklist item $i$ receives a binary judgment $s_i \in \{0, 1\}$ (fail/pass), and the overall compliance score is the fraction of items passed:

$$C = \frac{1}{N} \sum_{i=1}^{N} s_i$$

The sum may be restricted to subsets (e.g., only the Protocol Validator’s items) to compute compliance scores by review phase. A weighted variant incorporates item-level criticality:

$$C_w = \frac{\sum_{i=1}^{N} w_i \, s_i}{\sum_{i=1}^{N} w_i}$$

Here, $w_i$ is the weight of item $i$, reflecting its substantive importance.
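A minimal sketch of this item-wise scoring logic, covering both the unweighted and weighted variants (function and argument names are illustrative, not taken from the reference implementation):

```python
def compliance_score(judgments, items=None, weights=None):
    """Item-wise compliance score.

    judgments: dict mapping item_id -> "pass" or "fail".
    items:     optional subset of item_ids (e.g., one agent's phase).
    weights:   optional dict of item_id -> criticality weight (default 1.0).
    """
    ids = items if items is not None else list(judgments)
    w = {i: (weights or {}).get(i, 1.0) for i in ids}
    passed = sum(w[i] for i in ids if judgments[i] == "pass")
    total = sum(w.values())
    return passed / total if total else 0.0

j = {"1": "pass", "2": "pass", "5": "fail", "24": "pass"}
unweighted = compliance_score(j)                      # 3 of 4 items pass
weighted = compliance_score(j, weights={"5": 2.0})    # failed item counts double
```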
Aggregated vectors are returned to human supervisors who may override individual judgments, refine prompts, or supplement missing protocol details.
4. Evaluation Methodology and Performance Metrics
PRISMA-aligned agentic systems are benchmarked using double-annotated reference SLRs. In a reference implementation, five published SLRs spanning medicine, education, computer science, psychology, and environmental science were independently annotated by human experts and then passed through the agentic pipeline (Mushtaq et al., 21 Sep 2025).
Agreement between agent output and ground truth is reported as simple percent agreement:

$$\text{Agreement} = \frac{\text{number of matching item-level judgments}}{\text{total judgments}} \times 100\%$$

No Cohen’s κ or other chance-corrected metric was reported; the system yielded 84% overall agreement with human annotators.
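Percent agreement over paired item-level judgments reduces to a one-liner; a sketch (labels are illustrative):

```python
def percent_agreement(agent_labels, human_labels):
    """Simple percent agreement between paired item-level judgments."""
    assert len(agent_labels) == len(human_labels)
    matches = sum(a == h for a, h in zip(agent_labels, human_labels))
    return 100.0 * matches / len(agent_labels)

agent = ["pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail"]
score = percent_agreement(agent, human)  # 4 of 5 judgments match -> 80.0
```

Because this metric does not correct for chance agreement, a chance-corrected statistic such as Cohen’s κ would give a more conservative picture on the same data.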
Agreement rates by agent:
| Agent | Agreement Rate (%) |
|---|---|
| Protocol Validator | 90 |
| Topic Relevance Checker | 82 |
| Duplicate Detector | 88 |
| Methodology Assessor | 80 |
Failure analysis is facilitated by agent comments. For example, an agent flagged a “fail” on Item 7 (search strategy) due to missing explicit date ranges, whereas the human rater marked “pass” because the ranges were implicit in figures.
5. System Limitations, Human-in-the-Loop, and Failure Modes
Despite high overall concordance, PRISMA-aligned agentic evaluations are subject to several limitations:
- Domain-specific Jargon: Topic Relevance Checker can produce false positives/negatives when protocol jargon is ambiguous (e.g., including animal studies when only human subjects are eligible).
- Narrative Bias: Methodology Assessor may misinterpret narrative segments as formal bias assessments if structured tables are absent.
- Necessity of Human Oversight: The authors emphasize co-pilot status—not replacement—since subtle or context-dependent omissions (e.g., date ranges presented only visually) often require human intervention.
Extensibility to PRISMA items not yet covered, variation in item criticality weights, and prompt optimization are not detailed in current agentic implementations and would require custom coding for precise replication.
6. Implications for PRISMA-Based Workflow Automation and Reproducibility
A PRISMA-aligned systematic review, especially when operationalized via multi-agent LLM frameworks, yields structured, reproducible, and interpretable compliance assessments with high concordance to human expert judgments (Mushtaq et al., 21 Sep 2025). The architecture enables modular auditability—segregating protocol validation, methodological rigor, and topical alignment—while retaining the necessary flexibility for human oversight at each decision junction.
However, full reproducibility of an agentic PRISMA-aligned pipeline would require:
- Publication of all agent prompt templates.
- Disclosure of orchestration logic (task order, concurrency).
- Transparent item mappings and any weighting schemes.
- Standardized reporting structure for compliance vectors and error cases.
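As an illustration of the last point, a standardized compliance report might serialize the per-item vector, phase-level scores, and any human overrides together; the schema below is a hypothetical sketch, not one published with the system:

```python
import json

# Hypothetical standardized compliance report: per-item judgments with
# provenance, phase-level scores, and a log of human overrides.
report = {
    "review_id": "example-slr-001",
    "items": {
        "1": {"agent": "protocol_validator", "status": "pass", "comment": ""},
        "5": {"agent": "protocol_validator", "status": "fail",
              "comment": "Registration number missing"},
        "7": {"agent": "methodology_assessor", "status": "fail",
              "comment": "Search date range not explicit"},
    },
    "scores": {"protocol": 0.5, "methodology": 0.0},
    "human_overrides": [],
}
serialized = json.dumps(report, indent=2)  # audit-ready artifact
```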
This approach has been shown to approximate human PRISMA scoring in a domain-agnostic fashion, providing a scalable template for future interdisciplinary review automation.
References:
- Can Agents Judge Systematic Reviews Like Humans? Evaluating SLRs with LLM-based Multi-Agent System (Mushtaq et al., 21 Sep 2025)