
WisPerMed Framework

Updated 17 February 2026
  • WisPerMed Framework is a modular, domain-adapted system for generating accurate and accessible biomedical texts using advanced LLMs.
  • It employs few-shot prompting, prompt engineering, and dynamic expert selection to optimize both readability and factuality metrics.
  • Evaluations in BioLaySumm2024 and BioNLP@ACL 2024 show improved ROUGE, BERTScore, and factuality scores over single-model baselines.

The WisPerMed Framework is a modular, domain-adapted system for text generation in biomedical and healthcare applications, specifically addressing lay summarization of scientific articles and automated discharge summary generation from clinical records. Developed and evaluated in major shared tasks including BioLaySumm2024 and BioNLP@ACL 2024, WisPerMed combines instruction-tuned autoregressive LLMs with few-shot learning, advanced prompt engineering, and a Dynamic Expert Selection (DES) mechanism that leverages multiple readability and factuality metrics to select optimal outputs. The framework is built upon open-source models, quantized and adapted for high efficiency on modern hardware, and operationalizes a rigorously benchmarked pipeline for producing accessible and accurate biomedical texts (Pakull et al., 2024, Damm et al., 2024).

1. System Architecture

The WisPerMed framework is composed of several modular components that support end-to-end text generation in biomedical and clinical contexts:

Core components:

  • Data Ingestion and Preprocessing: For biomedical lay summarization, inputs are abstracts or full articles. In the discharge summary use case, the “Brief Hospital Course” (BHC) and “Discharge Instructions” (DI) sections are extracted from MIMIC-IV Discharge Summaries (DS) via regular expressions; the remaining context serves as model input.
  • Model Zoo: Multiple pre-trained and fine-tuned LLMs form the core ensemble, notably BioMistral-7B-DARE (“BioM”), Llama 3 70B-Instruct (“Llama3”), WizardLM-2, Mistral-7B-Instruct-v0.2, OpenBioLLM-70B, and Phi-3-mini-128K.
  • Model Adaptation & Inference: Models are used in both fine-tuned and few-shot configurations. Inference produces multiple candidate summaries or document sections per input.
  • Scoring Modules: Metric computation includes readability (FKGL, DCRS, CLI), factuality (AlignScore, SummaC, MEDCON, METEOR), and standard summarization scores (ROUGE, BLEU, BERTScore).
  • Dynamic Expert Selection (DES): Post-hoc selection among model outputs uses normalized metric scores and weighted aggregation to select the candidate that best balances readability and factuality in a dataset- or task-specific manner (Pakull et al., 2024, Damm et al., 2024).
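The preprocessing step can be illustrated with a small sketch. The section header strings and the regular expressions below are assumptions for illustration; the framework's actual patterns for MIMIC-IV are not reproduced in the papers' text.

```python
import re

def extract_sections(discharge_text: str) -> dict:
    """Split a discharge summary into target sections and remaining context.

    Header strings and patterns are illustrative, not the framework's own.
    """
    patterns = {
        "brief_hospital_course": r"Brief Hospital Course:\n(.*?)(?=\n[A-Z][^\n]*:\n|\Z)",
        "discharge_instructions": r"Discharge Instructions:\n(.*?)(?=\n[A-Z][^\n]*:\n|\Z)",
    }
    sections, context = {}, discharge_text
    for name, pat in patterns.items():
        m = re.search(pat, discharge_text, flags=re.DOTALL)
        sections[name] = m.group(1).strip() if m else ""
        if m:
            # Drop the matched section so only surrounding context remains.
            context = context.replace(m.group(0), "")
    sections["context"] = context.strip()
    return sections
```

The extracted BHC/DI texts become training targets, while `context` is what the model sees as input.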

2. LLMs and Model Adaptation

WisPerMed utilizes open-source, autoregressive LLMs, predominantly:

  • BioMistral-7B-DARE (BioM): A 7B-parameter Mistral variant, adapted via continued pre-training on NIH PMC Open Access for biomedical reasoning.
  • Llama 3 70B-Instruct (Llama3): A 70B-parameter Meta release, instruction-tuned for broad generalization.

Additional models include WizardLM-2-8x22B for few-shot clinical tasks, Mistral-7B-Instruct-v0.2, Llama-3-8B-I, OpenBioLLM-70B, and Phi-3-mini-128K-Instruct for long-context records (Damm et al., 2024).

Fine-tuning approach:

  • Instruction Tuning: Models are adapted via a single epoch on task-specific pairs:
    • Biomedical lay summary: PLOS + eLife article-summary pairs, with model-specific input length (abstract-only for BioM, full-article for Llama3).
    • Discharge summary: MIMIC-IV DS context with separate training for BHC/DI sections, using domain-aligned system/user/assistant templates.
  • Quantized Low-Rank Adaptation (QLoRA/LoRA): 4-bit quantized, low-rank adapters (q_proj, k_proj, etc.) are trained with LoRA rank 16, α = 16, and zero dropout. This configuration enables efficient adaptation of large models on a single NVIDIA H100 80 GB GPU.
  • Priming with Synthetic Data: Some variants use additional pre-instruction tuning (“priming”) on synthetic clinical notes (Asclepius dataset, ~158k samples), generated by ChatGPT-3.5-Turbo, to imbue clinical language style prior to domain fine-tuning (Damm et al., 2024).
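The adaptation settings above can be captured in a configuration sketch. The key names and the full target-module list are assumptions chosen for illustration (the papers report only "q_proj, k_proj, etc."), not the framework's actual configuration schema.

```python
# Illustrative QLoRA configuration mirroring the reported hyperparameters.
# Key names and the exact target-module list are assumptions, not the
# framework's own code.
qlora_config = {
    "load_in_4bit": True,    # 4-bit base-model quantization
    "lora_rank": 16,         # low-rank adapter dimension r
    "lora_alpha": 16,        # scaling factor alpha = 16
    "lora_dropout": 0.0,     # zero dropout, as reported
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}
```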

Few-shot prompting:

  • Prompts are structured as conversation history with up to 10 demonstration examples for discharge instructions and up to 7 for BHC; demonstration selection is based on maximizing pre-calculated readability and factuality metrics (Pakull et al., 2024, Damm et al., 2024).
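The conversation-history structure can be sketched as follows; the function and variable names are illustrative, and the message format follows common chat-template conventions rather than the papers' exact code.

```python
def build_fewshot_prompt(system_msg, demonstrations, query, max_shots=10):
    """Assemble a chat message list with up to `max_shots` demonstrations.

    `demonstrations` is a list of (source, target) pairs, assumed to be
    pre-selected by readability/factuality ranking.
    """
    messages = [{"role": "system", "content": system_msg}]
    for src, tgt in demonstrations[:max_shots]:
        # Each demonstration appears as a prior user/assistant exchange.
        messages.append({"role": "user", "content": src})
        messages.append({"role": "assistant", "content": tgt})
    messages.append({"role": "user", "content": query})
    return messages
```

For discharge instructions `max_shots` would be 10, and 7 for BHC, per the settings above.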

3. Prompt Engineering and Demonstration Selection

WisPerMed employs systematic prompt engineering, exploring the impact of different styles and contexts:

  • Prompt Variants (biomedical summarization):
    • Initial Prompt: Basic “Abstract → Lay summary.”
    • Persona Prompt: Uses a conversational persona (e.g., “Layla, your science communicator”).
    • Intro Prompt: Includes article introduction alongside abstract.
    • Guide Prompt: Supplements the abstract with explicit writing guidelines.
  • Performance effects: Persona prompts slightly improve ROUGE-1 relevance scores; guide prompts offer minor gains. Inclusion of introduction can decrease both relevance and readability in the absence of explicit guidance (Pakull et al., 2024).

Demonstration selection for few-shot prompting is conducted by ranking training/validation examples according to average readability (minimizing FKGL, DCRS, CLI) and factuality (maximizing AlignScore, SummaC) scores, then selecting the top-2 (for eLife) or top-3 (for PLOS) examples (Pakull et al., 2024). In the clinical text generation task, prompt templates explicitly enumerate structural elements (e.g., greeting, reason for hospitalization) and provide multiple annotated examples per context (Damm et al., 2024).
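The ranking step can be sketched in a few lines. The exact way the papers combine the averaged readability and factuality scores is not fully specified, so the difference-based combination below is a simplifying assumption; metric values are taken to be pre-computed per example.

```python
def select_demonstrations(examples, k):
    """Rank candidate demonstrations and keep the top k.

    Readability metrics (FKGL, DCRS, CLI) are minimized, factuality metrics
    (AlignScore, SummaC) maximized; subtracting one mean from the other is
    an illustrative combination, not the papers' exact formula.
    """
    def score(ex):
        readability = (ex["fkgl"] + ex["dcrs"] + ex["cli"]) / 3.0
        factuality = (ex["alignscore"] + ex["summac"]) / 2.0
        return factuality - readability  # higher is better
    return sorted(examples, key=score, reverse=True)[:k]
```

With `k = 2` for eLife and `k = 3` for PLOS, this reproduces the top-2/top-3 selection described above.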

4. Dynamic Expert Selection (DES)

Dynamic Expert Selection (DES) is central to WisPerMed’s output quality. It operates as a downstream selection mechanism, reranking candidates from multiple models, prompt variants, or decoding runs to select the highest-quality output according to aggregate metric criteria (Pakull et al., 2024, Damm et al., 2024).

DES methodology:

  • For each candidate output, compute readability (FKGL, DCRS, CLI) and factuality (AlignScore, SummaC, MEDCON, METEOR) metrics.
  • Invert readability metrics (since lower is better), then apply min–max normalization across all candidates for a given input.
  • Compute the mean of normalized readability and factuality metrics for each candidate.
  • Aggregate these means by dataset- or task-specific weights (e.g., eLife: $w_R = 0.675$, $w_F = 0.325$; PLOS: $w_R = 0.25$, $w_F = 0.75$).
  • Selection score: $S_j = w_R R_j + w_F F_j$; select the candidate maximizing $S_j$.
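The four steps above can be sketched directly; the metric keys are illustrative, and real values would come from the scoring modules. The tie handling when all candidates share a metric value is an assumption.

```python
def des_select(candidates, w_r, w_f):
    """Dynamic Expert Selection sketch over per-candidate metric dicts.

    Inverts readability metrics (lower raw score is better), min-max
    normalizes each metric across candidates, averages within each group,
    and returns the index maximizing S_j = w_R * R_j + w_F * F_j.
    """
    readability_keys = ["fkgl", "dcrs", "cli"]
    factuality_keys = ["alignscore", "summac"]

    def normalize(values):
        lo, hi = min(values), max(values)
        # Assumption: a constant column normalizes to a neutral 0.5.
        return [0.5 if hi == lo else (v - lo) / (hi - lo) for v in values]

    norm = {}
    for key in readability_keys + factuality_keys:
        col = normalize([c[key] for c in candidates])
        if key in readability_keys:
            col = [1.0 - v for v in col]  # invert: lower complexity is better
        norm[key] = col

    best_idx, best_score = 0, float("-inf")
    for j in range(len(candidates)):
        r_j = sum(norm[k][j] for k in readability_keys) / len(readability_keys)
        f_j = sum(norm[k][j] for k in factuality_keys) / len(factuality_keys)
        s_j = w_r * r_j + w_f * f_j
        if s_j > best_score:
            best_idx, best_score = j, s_j
    return best_idx
```

For eLife this would be called with `w_r=0.675, w_f=0.325`, and for PLOS with `w_r=0.25, w_f=0.75`.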

In discharge summary generation, further DES variants are introduced, including:

  • Simple weighting of key factuality/readability metrics.
  • Penalization of readability for overly complex outputs.
  • Length-constrained selection to fit target range (e.g., 100–180 words for DI) (Damm et al., 2024).
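A minimal sketch of the length-constrained variant, assuming pre-computed selection scores per candidate; the fallback to the full pool when no candidate fits the range is an assumption, not stated in the papers.

```python
def length_filter(candidates, scores, min_words=100, max_words=180):
    """Length-constrained DES variant (sketch).

    Restricts selection to candidates whose word count lies in the target
    range (the defaults are the DI range above), falling back to the full
    candidate pool if none qualify, then returns the index of the highest
    pre-computed score.
    """
    in_range = [i for i, c in enumerate(candidates)
                if min_words <= len(c.split()) <= max_words]
    pool = in_range if in_range else range(len(candidates))
    return max(pool, key=lambda i: scores[i])
```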

A summary of candidate metric scoring and selection is provided in the following table:

Metric Type      | Examples                   | DES Role
-----------------|----------------------------|----------------------------------
Readability      | FKGL, DCRS, CLI            | Reduced complexity, higher weights
Factuality       | AlignScore, SummaC, MEDCON | Increased alignment/consistency
Relevance/Recall | ROUGE-{1,2,L}, BLEU        | Post-hoc evaluation only

5. Training, Inference, and Evaluation Protocols

Training parameters:

  • Optimizer: AdamW 8-bit, learning rate $2 \times 10^{-4}$, weight decay $0.01$, linear decay with 5 warmup steps.
  • Sequence length: up to 4096 tokens (biomedical summarization), up to 128,000 tokens (clinical records, Phi-3-mini).
  • Epochs: 1 (BioLaySumm), 2–3 (MIMIC, clinical), batch size 1–4.
  • Hardware: single NVIDIA H100 80 GB GPU; Phi-3-mini-128K trained on 3× H100s.

Inference settings:

  • Greedy decoding with maximum output length (1024 tokens for lay summaries).
  • Stochastic decoding (do_sample = True, temperature = 0.6, top_p = 0.9) for final discharge summary outputs; repetition_penalty and no_repeat_ngram_size are configured for output diversity when required by DES (Pakull et al., 2024, Damm et al., 2024).
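The two decoding regimes can be captured as configuration sketches; the parameter names follow common Hugging Face `generate()` conventions, which is an assumption about the implementation, and values not reported in the papers are left out.

```python
# Greedy decoding for lay summaries, as reported.
greedy_decoding = {
    "do_sample": False,
    "max_new_tokens": 1024,  # maximum lay-summary output length
}

# Stochastic decoding for final discharge summary outputs, as reported.
# repetition_penalty / no_repeat_ngram_size are added only when DES needs
# diverse candidates; their concrete values were not reported.
sampled_decoding = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}
```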

Evaluation protocols:

  • Biomedical Summarization: ROUGE-{1,2,L}, BERTScore (cosine similarity in BERT space), readability (FKGL, DCRS, CLI), LENS (text simplification), factuality (AlignScore, SummaC).
  • Discharge Summaries: BLEU-4, ROUGE, BERTScore, METEOR, AlignScore, MEDCON; overall score is mean across eight metrics.

Key Results:

  • Fine-tuned BioM achieves top relevance/factuality (ROUGE-1 = 0.470, BERTScore = 0.865, SummaC = 0.705), with DES post-processing further boosting factuality (SummaC = 0.722).
  • Few-shot learning increases factuality (SummaC from 0.458 to 0.604) but remains behind fine-tuned models on ROUGE/BERTScore metrics.
  • In the “Discharge Me!” task, the best DES variant achieves an overall score of 0.332, outperforming all single-model outputs (Pakull et al., 2024, Damm et al., 2024).

6. Practical Considerations and Limitations

Strengths:

  • Lightweight QLoRA-based fine-tuning on domain-specific summary pairs yields substantial gains in output relevance and factuality.
  • Post-hoc DES enables robust selection across models/prompt strategies, consistently raising output quality by leveraging metric diversity.
  • Modular pipeline design and full environmental reporting (energy usage 1552.10 kWh, CO₂ 591.35 kg) support reproducibility and scalability (Damm et al., 2024).

Limitations:

  • Possible risk of memorization for small or repetitive datasets, especially in models like BioM.
  • Metric-driven DES, while effective for measured alignment/readability, may not fully capture user-perceived quality.
  • Fine-tuning is often constrained to a single epoch due to data size or compute considerations.

7. Extensions and Future Directions

Ongoing and proposed advances to the WisPerMed framework target further improvements in factual and accessible biomedical NLP:

  • Enhancement of demonstration selection via retrieval-augmented prompting.
  • Reinforcement learning from human feedback (RLHF) to directly optimize readability and factuality metrics.
  • Integration of external knowledge sources for improved explanation and context.
  • Joint optimization of summary length, consistency, and user engagement metrics.
  • Broader evaluation of DES weighting strategies and application to additional sections or styles of clinical documentation.

These avenues are presented as practical strategies to extend WisPerMed’s robust, modular, and efficiency-aware approach to automated, domain-adapted biomedical text generation (Pakull et al., 2024, Damm et al., 2024).
