Controllable Biomedical Summarization
- Controllable biomedical summarization is a method that uses control signals to tailor text outputs based on audience expertise and readability requirements.
- It leverages techniques like prompt engineering, architectural modifications, and reinforcement learning to adjust style and content for diverse user groups.
- Evaluation employs metrics for relevance, readability, and factuality, driving improvements in patient education, interdisciplinary research, and clinical communication.
Controllable biomedical summarization is the class of natural language generation technologies and methodologies that enable the systematic production of biomedical text summaries tailored to specific user characteristics—most commonly, the reader’s domain knowledge, literacy, or information-seeking intent. Unlike generic summarization approaches that output static or “one-size-fits-all” synopses, controllable systems accept explicit or implicit control variables that specify stylistic, lexical, or content-level attributes of summary output, optimizing for target populations such as laypersons, students, or domain experts. This paradigm is particularly crucial in the biomedical domain, where terminology, conceptual complexity, and expected background knowledge can vary by orders of magnitude across audience segments (Salvi et al., 3 Dec 2025, Luo et al., 2022).
1. Problem Definition and Motivation
Biomedical literature routinely incorporates specialized lexicons, dense methodological detail, and domain-specific reporting conventions. Traditional automatic summarization systems, whether extractive or generic abstractive, underperform in conveying the same research to heterogeneous audiences, either diluting technical content for experts or leaving core findings inaccessible to the wider public. Controllable biomedical summarization directly addresses this by allowing the output summary S to be generated conditionally not just on the source document D, but also on a control signal c denoting the intended audience or readability target:

S ∼ P(S | D, c)
This framework encompasses both persona inference (e.g., Layperson, Medical Expert) and readability-axis control (e.g., specifying a Flesch–Kincaid Grade Level) (Salvi et al., 3 Dec 2025, Luo et al., 2022).
2. Datasets for Controllable Summarization
Development and benchmarking of controllable summarization require datasets pairing the same biomedical source with summaries of systematically varying complexity and audience alignment. The PERCS dataset (Salvi et al., 3 Dec 2025) provides 500 biomedical abstracts, each paired with four expert-reviewed summaries aligned to Layperson, Pre-medical Student, Non-medical Researcher, and Medical Expert personas. Each summary variant exhibits statistically significant gradations in mean word count, readability (Dale–Chall, Coleman–Liau Index), and lexical diversity, as shown below:
| Persona | Mean Words | Readability (DCRS↑) | Lexical Diversity↑ |
|---|---|---|---|
| Layperson | 270.9 | Lowest | Lowest |
| Pre-medical Student | 243.5 | ... | ... |
| Researcher | 246.4 | ... | ... |
| Medical Expert | 166.4 | Highest | Highest |
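Statistics like those in the table above are straightforward to reproduce. The sketch below implements the published Coleman–Liau formula and a plain type-token ratio for lexical diversity; the regex tokenization and sentence splitting are simplifying assumptions, not the PERCS preprocessing pipeline:

```python
import re

def coleman_liau(text: str) -> float:
    """Coleman–Liau Index: 0.0588*L - 0.296*S - 15.8, where L is letters
    per 100 words and S is sentences per 100 words."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))  # naive splitter
    L = letters / len(words) * 100
    S = sentences / len(words) * 100
    return 0.0588 * L - 0.296 * S - 15.8

def type_token_ratio(text: str) -> float:
    """Lexical diversity as the ratio of unique to total word tokens."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    return len(set(words)) / len(words) if words else 0.0
```

On short texts the type-token ratio is length-sensitive, which is why corpus studies often report length-normalized variants; the plain ratio is enough to illustrate the persona gradient.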
Earlier large-scale corpora (Luo et al., 2022) such as the PLOS-based paired technical abstract and author summary corpus (28,124 papers) enabled binary technical/plain control but lacked fine granularity of audience modeling. Shared-task datasets from BioLaySumm (Goldsack et al., 2023, Ji et al., 2024) further standardized evaluation for lay/technical summarization and readability conditioning.
3. Techniques for Controllability
Prompt Engineering and Control Tokens
In persona-guided paradigms such as PERCS, control is introduced via meticulously developed prompt templates encoding requirements for vocabulary, maximum allowed jargon, tone, and length per persona, e.g., “avoid any jargon exceeding a 10th-grade vocabulary” for lay output. No bespoke loss functions or special tokens are necessary; model instruction adherence is evaluated post-generation (Salvi et al., 3 Dec 2025). Earlier work leveraged binary control tokens (e.g., “[Abstract]” vs “[Summary]”) prepended to the input to direct encoder–decoder architectures to produce the desired style (Luo et al., 2022, Goldsack et al., 2023).
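As a concrete illustration of the prompt-template approach, the sketch below encodes per-persona constraints and assembles an instruction. The persona names follow PERCS, but the constraint values and the `build_prompt` helper are illustrative assumptions, not the published template wording:

```python
# Illustrative persona constraint table (persona names follow PERCS;
# the constraint values below are assumptions, not the real templates).
PERSONA_SPECS = {
    "Layperson": dict(vocab="10th-grade vocabulary with no unexplained jargon",
                      tone="plain and reassuring", max_words=300),
    "Pre-medical Student": dict(vocab="introductory biology terms, defined on first use",
                                tone="instructional", max_words=280),
    "Non-medical Researcher": dict(vocab="general scientific terminology",
                                   tone="precise but field-agnostic", max_words=260),
    "Medical Expert": dict(vocab="full clinical terminology",
                           tone="concise and technical", max_words=200),
}

def build_prompt(persona: str, abstract: str) -> str:
    """Assemble a persona-conditioned instruction; control is carried
    entirely by the prompt text, with no special tokens or loss terms."""
    spec = PERSONA_SPECS[persona]
    return (f"Summarize the biomedical abstract below for a {persona}. "
            f"Use {spec['vocab']}; keep the tone {spec['tone']}; "
            f"stay under {spec['max_words']} words.\n\nAbstract:\n{abstract}")
```

Because the control lives entirely in the instruction, the same frozen LLM serves all four personas; adherence is then checked post-generation, as described above.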
Architectural Modifications
Abstractive approaches traditionally employ encoder–decoder frameworks (BART, Longformer-Encoder-Decoder), optionally with style-specific decoders (multi-head setup) or global-attention input tokens as control variables (Luo et al., 2022). Extractive models select subsets of sentences based on expert/lay target labels but display limited capacity for lexical/syntactic adaptation and high n-gram overlap with source (Luo et al., 2022).
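The control-token mechanism can be sketched as a preprocessing step. The function below is a schematic assumption of how a binary style token is prepended and flagged for global attention in an LED-style encoder; real implementations operate on token IDs and attention tensors rather than strings:

```python
from typing import List, Tuple

def prepare_controlled_input(tokens: List[str], plain: bool) -> Tuple[List[str], List[int]]:
    """Prepend a binary style control token ([Summary] = plain author
    summary, [Abstract] = technical abstract) and build a global-attention
    mask so the control token attends to the entire document."""
    control = "[Summary]" if plain else "[Abstract]"
    seq = [control] + tokens
    global_attention_mask = [1] + [0] * len(tokens)  # only position 0 is global
    return seq, global_attention_mask
```

The decoder then conditions on the control token through cross-attention, which is what steers the output style without any architectural change to the decoder itself.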
Retrieval-Augmented and RL-Based Methods
The RAG-RLRC-LaySum system couples retrieval-augmented generation (explanatory Wikipedia passages retrieved via BM25/ColBERT/BGE-v2 and concatenated to the input) with reinforcement learning for Readability Control (RLRC) using Proximal Policy Optimization (Ji et al., 2024). The reward incorporates a differentiable Gaussian penalty centered on a target Flesch–Kincaid Grade Level, along with optional relevance (ROUGE/BERTScore) and length terms. This approach achieves improved readability and informativeness while preserving factual accuracy.
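The readability term of such a reward can be sketched as follows. The FKGL formula is standard, but the vowel-group syllable counter and the σ value are simplifying assumptions; a production PPO loop would use a proper readability library:

```python
import math
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch–Kincaid Grade Level:
    0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

def readability_reward(summary: str, target_grade: float = 8.0,
                       sigma: float = 2.0) -> float:
    """Gaussian penalty centered on the target grade level; the reward
    peaks at 1.0 when the summary hits the target FKGL exactly."""
    return math.exp(-((fkgl(summary) - target_grade) ** 2) / (2 * sigma ** 2))
```

The Gaussian shape matters: unlike a one-sided penalty, it discourages both over-simplification and residual technicality relative to the target grade.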
4. Evaluation Metrics and Protocols
Systematic evaluation of controllable summarization spans multiple axes:
- Relevance: ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence F1), SARI (precision of n-gram additions, deletions, keeps), BERTScore (embedding-based token similarity) (Salvi et al., 3 Dec 2025, Luo et al., 2022).
- Readability: Flesch–Kincaid Grade Level (FKGL), Dale–Chall Readability Score, Coleman–Liau Index, and the LENS learnable simplification metric (Salvi et al., 3 Dec 2025, Ji et al., 2024).
- Factual Consistency: SummaC (NLI-based) (Salvi et al., 3 Dec 2025, Ji et al., 2024), BARTScore (src→hyp) (Goldsack et al., 2023).
- Readability-Aware Metrics: Novel masked noun-phrase complexity scores (NPTC and RNPTC) using BERT fill-in likelihoods correlate more accurately with true technical/lay gradations than traditional formulas (Luo et al., 2022).
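Of the relevance metrics above, ROUGE-L is simple enough to compute from scratch; a minimal sketch of the LCS-based F1 (whitespace tokenization is an assumption — official scorers add stemming and normalization):

```python
def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L: F1 over the longest common subsequence of tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic O(len(ref)*len(hyp)) LCS dynamic programme.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because LCS rewards in-order overlap rather than contiguous n-grams, ROUGE-L is more forgiving of the paraphrasing that lay summaries necessarily introduce.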
Manual reviews further annotate comprehensiveness, factuality, persona alignment, and overall usefulness, using Likert rubrics and inter-rater reliability (Krippendorff’s α) (Salvi et al., 3 Dec 2025).
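For nominal annotation labels with no missing ratings, Krippendorff’s α can be computed from the coincidence matrix; a compact sketch, assuming every rater labels every unit:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data, no missing values.
    `ratings` is a list of units, each a list of the labels that the
    raters assigned to that unit."""
    coincidence = Counter()
    for unit in ratings:
        m = len(unit)
        if m < 2:          # units rated once carry no pairable information
            continue
        for a, b in permutations(unit, 2):   # ordered within-unit pairs
            coincidence[(a, b)] += 1 / (m - 1)
    n_c = Counter()
    for (a, _b), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_obs = sum(w for (a, b), w in coincidence.items() if a != b)
    d_exp = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 - d_obs / d_exp
```

Handling missing ratings and ordinal (Likert) distance functions adds bookkeeping but follows the same observed-versus-expected-disagreement structure.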
5. Model Performance and Error Analysis
Empirical studies converge on several findings:
- Prompting: Few-shot persona- or style-specific prompts yield the best trade-off between detail and clarity for diverse reader types (Salvi et al., 3 Dec 2025, Goldsack et al., 2023).
- Model Behaviors: LLMs such as GPT-4o and LLaMA-3 produce summaries maintaining audience-aligned complexity; Gemini models risk overloading lay outputs with extraneous background (Salvi et al., 3 Dec 2025).
- RL-Based Readability Control: RAG-RLRC provides ~1–3% improvements in informativeness and ~1–3% reductions in grade-level metrics for lay summaries relative to generative and LLM baselines on PLOS/eLife (Ji et al., 2024). It maintains factuality as measured by AlignScore/SummaC.
- Limitations: Control strength is weaker than in human-authored pairs; generated plain-language summaries remain more complex and closer in style to their technical counterparts than reference summaries (Luo et al., 2022).
- Error Types: Persona misalignment, omission, hallucination, and contradictory definitions are systematically annotated in high-quality datasets (PERCS) (Salvi et al., 3 Dec 2025).
6. Open Challenges and Future Directions
Primary challenges include insufficient stylistic divergence (limited readability gap), incomplete factual alignment under strong rewriting, and the coarseness of current binary or categorical control schemes. Future research directions include:
- Fine-grained and Multi-dimensional Control: Extending beyond binary (plain/technical) or coarse persona categories to continuous control variables over grade level, jargon intensity, or information depth (Luo et al., 2022, Salvi et al., 3 Dec 2025).
- Joint Optimization: Multi-objective training that simultaneously targets relevance, readability, and factuality losses (Goldsack et al., 2023).
- Retrieval and Knowledge Integration: Augmenting generation with knowledge graphs, external evidence retrieval, and explainable passage grounding (Ji et al., 2024, Salvi et al., 3 Dec 2025).
- Adaptive and Multimodal Summarization: Personalization to reader preferences, multi-agent iterative summarization, and extension to multimodal (figure/table/text) content (Salvi et al., 3 Dec 2025).
7. Impact and Research Landscape
Controllable biomedical summarization underpins effective dissemination of research findings in medicine, biomedical sciences, and healthcare policy. By enabling targeted summarization for personas ranging from lay audiences to domain experts, this field supports patient empowerment, interdisciplinary collaboration, and rapid knowledge transfer. Benchmarks such as PERCS (Salvi et al., 3 Dec 2025), BioLaySumm (Goldsack et al., 2023), and PLOS author summary corpora (Luo et al., 2022) are foundational for measuring progress and catalyzing methodological innovation, anchoring further exploration in retrieval-based, RL-driven, and persona-adaptive summarization approaches.