
Zero-Shot LLM Summarisation

Updated 7 December 2025
  • Zero-shot LLM summarisation is a process that uses instruction-tuned language models with prompt engineering to generate abstractive or extractive summaries without domain-specific fine-tuning.
  • It employs methods like controlled prompt phrasing, chunking strategies, and iterative refinement to manage summary length, style, and factual accuracy across diverse content.
  • The approach shines in cross-domain and cross-lingual applications while facing challenges such as hallucination and position bias that invite further research.

Zero-shot LLM summarisation refers to the application of LLMs to generate summaries for text (or multimodal) inputs without the use of any summarisation-specific fine-tuning or supervision on in-domain summary data. Instead, the summariser operates entirely via prompting and leverages knowledge acquired during pretraining and, in many cases, via instruction tuning. This paradigm has become central due to its strong out-of-the-box performance, cross-domain and cross-lingual generalization, interpretability, and reduced dependence on expensive annotation.

1. Foundations and Formal Definitions

Zero-shot LLM summarisation is defined as the process where, given a model $\mathcal{M}$ (often an instruction-tuned LLM, e.g., GPT-4, Llama-2, Mistral-7B) and a document $D$, a summary $S = \mathcal{M}(D)$ is generated solely by prompting ("Summarize the following text...") with no fine-tuning on summarisation corpora (Pu et al., 2023, Zhang et al., 2023). The setting may be abstractive (the model generates novel phrasing) or extractive (selecting segments verbatim). For specialized cases such as cross-lingual summarisation, the model performs both abstraction and translation, still in a zero-shot manner (Wang et al., 2023, Li et al., 2024). In temporal and multi-modal domains, summarisation may involve complex event, frame, or scene structures, and zero-shot LLMs are applied via textual or multimodal prompts (Wu et al., 20 Oct 2025, Kruse et al., 30 Jan 2025, Jia et al., 2022).

Formally, let $S^\star = \arg\max_{S} P_{\mathcal{M}}(S \mid \text{“Summarize: ”} \,\Vert\, D)$ (Pu et al., 2023), where $P_{\mathcal{M}}$ is the distribution defined by the LLM, and summarisation is accomplished in a single inference pass.
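
The single-pass formulation above can be sketched as a prompt-construction step plus one model call. This is a minimal illustration, not any paper's exact implementation; the `model` callable stands in for an instruction-tuned LLM API, and the instruction wording is an assumption.

```python
from typing import Callable

def zero_shot_summarise(document: str,
                        model: Callable[[str], str],
                        instruction: str = "Summarize the following text in three sentences.") -> str:
    """Single-pass zero-shot summarisation: one prompt, one inference call,
    no summarisation-specific fine-tuning."""
    prompt = f"{instruction}\n\n{document}"
    return model(prompt)

# A stub model standing in for a real LLM client (e.g., GPT-4, Mistral-7B-Instruct).
stub_model = lambda prompt: prompt.splitlines()[-1][:60]
summary = zero_shot_summarise("LLMs summarise text via prompting alone.", stub_model)
```

Because the summariser is just a prompt wrapper, swapping models or instructions requires no retraining, which is the core appeal of the zero-shot setting.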

2. Prompt Design, Length Control, and Chunking

Prompt engineering is central to zero-shot LLM summarisation. Empirical studies consistently show that both the phrasing and the complexity of prompts have significant effects on length, abstraction, factuality, and hallucination rates (Manuvinakurike et al., 2023, Retkowski et al., 2024, Jaaouine et al., 30 Nov 2025). Prompt templates may impose constraints on summary length ("in two sentences" or "50 words") or guide style ("extractive/abstractive," "as an expert," etc.) (Retkowski et al., 2024, Aly et al., 7 Jul 2025). Precise length control in zero-shot settings remains challenging. Retkowski & Waibel introduce a four-step, model-agnostic pipeline—length-approximation, target adjustment, sample filtering, and automated revision—to achieve over 90% length compliance without tuning (Retkowski et al., 2024).
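Two steps of such a length-control pipeline, sample filtering and target adjustment, can be sketched as below. The tolerance and the proportional correction rule are illustrative assumptions, not the exact formulas of Retkowski et al. (2024).

```python
def within_length(summary: str, target_words: int, tol: float = 0.1) -> bool:
    """Sample filtering: accept only summaries within ±tol of the word target."""
    n = len(summary.split())
    return abs(n - target_words) <= tol * target_words

def adjust_target(requested: int, observed_mean: int) -> int:
    """Target adjustment: if the model systematically overshoots the requested
    length, scale the next request down proportionally (and vice versa)."""
    return max(1, round(requested * requested / max(observed_mean, 1)))
```

For example, if summaries requested at 50 words come back averaging 100 words, the adjusted request drops to 25 words, so the model's overshoot lands near the true target.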

For long-text summarisation (e.g., scientific papers, clinical notes), prompt-window constraints necessitate chunking strategies. Sentences are partitioned to fit the LLM's context window; each chunk is summarised independently via zero-shot prompts, then higher-level summarisation is recursively applied to partial summaries (Aly et al., 7 Jul 2025, Kruse et al., 30 Jan 2025).
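The chunk-then-recurse scheme can be sketched as follows. A character budget stands in for the token-based context window, and the `summarise` callable stands in for a zero-shot prompted LLM call; the recursion assumes each call compresses its input.

```python
from typing import Callable, List

def chunk_sentences(sentences: List[str], max_chars: int) -> List[List[str]]:
    """Greedily pack sentences into chunks that fit the context budget."""
    chunks, current, size = [], [], 0
    for s in sentences:
        if current and size + len(s) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        chunks.append(current)
    return chunks

def hierarchical_summarise(sentences: List[str],
                           summarise: Callable[[str], str],
                           max_chars: int) -> str:
    """Summarise each chunk independently, then recursively summarise the
    partial summaries until the text fits in one prompt window.
    Assumes `summarise` shortens its input, otherwise recursion may not end."""
    text = " ".join(sentences)
    if len(text) <= max_chars:
        return summarise(text)
    partials = [summarise(" ".join(c)) for c in chunk_sentences(sentences, max_chars)]
    return hierarchical_summarise(partials, summarise, max_chars)
```

Each level of recursion trades some detail for coverage, which is why chunk boundaries (sentence-aligned here) matter for factual continuity across chunks.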

3. Robustness, Hallucination, and Factuality

Ensuring factual consistency is a core concern. Zero-shot LLMs often outperform fine-tuned or human baselines in overall factuality and extrinsic hallucination rates on general domains, but brittleness and domain-transfer failures still occur (Pu et al., 2023, Ramprasad et al., 2024). Hallucination remains an open problem in specialized settings (e.g., scientific, medical, or legal summarisation) (Jaaouine et al., 30 Nov 2025, Kruse et al., 30 Jan 2025). Mitigation strategies include:

  • Prompt repetition: Simple repetition of salient or random sentences in the prompt (Context Repetition/Random Addition) can reduce hallucination by anchoring LLM attention to factual spans, improving lexical and semantic alignment by up to +0.14 in mean ROUGE/BERTScore with $p < 0.001$ (Jaaouine et al., 30 Nov 2025).
  • Extraction prompting: Encouraging extractive or “extract and compress” summarisation reduces unsupported generation (Ramprasad et al., 2024).
  • Meta-generation: Iterative reflection stages (e.g., improve-before-translate; self-critique) as in the SITR pipeline reduce both hallucination and translation error in cross-lingual settings (Li et al., 2024).
  • Relevance paraphrasing probes: Minimal paraphrases of key source sentences can cause marked performance drops—up to 50% loss in ROUGE-2—indicating zero-shot LLMs are not semantically robust and often rely on surface heuristics or position (Askari et al., 2024).
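
The first mitigation, prompt repetition, amounts to prepending repeated source sentences to the summarisation prompt. The sketch below uses a lead-sentence heuristic as a stand-in for salience selection; the prompt wording and the naive period-based sentence split are illustrative assumptions, not the exact setup of Jaaouine et al.

```python
def context_repetition_prompt(document: str, n_repeat: int = 2) -> str:
    """Build a Context Repetition prompt: restate a few source sentences
    (here simply the leading ones) before the full document, so the model's
    attention is anchored on factual spans."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    salient = ". ".join(sentences[:n_repeat])
    return (f"Key context (repeated): {salient}.\n\n"
            f"Summarize the following text faithfully:\n{document}")
```

The Random Addition variant would sample `n_repeat` sentences at random instead of taking the lead ones; both rely on repetition, not on any model changes.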

Factuality metrics (QAFactEval, QuestEval, SummaC, CUI-F1, etc.) only partially correlate with human judgments, especially in low-resource or specialty domains (Spearman $\rho$ often below $0.3$). Expert annotation remains indispensable for evaluation in these settings (Ramprasad et al., 2024).

4. Position Bias, Generalization, and Style

Abstractive zero-shot LLM summarizers exhibit strong position bias, a generalization of classic “lead bias” (preference for early input content). This bias is quantified as the Wasserstein distance between positional distributions of gold and model-generated summary sentences (Chhabra et al., 2024). In realistic tasks (e.g., Reddit, XSum), even SOTA LLMs assign disproportionate importance to leads unless prompted or fine-tuned otherwise. This can result in information loss, fairness problems, or non-faithful outputs. Instruction-tuned models (GPT-3.5-Turbo, Llama-2) achieve the lowest positional bias except for extreme summarisation, where all models tend toward head selection (Chhabra et al., 2024). Prompt rephrasing (“role-play,” explicit instructions) offers marginal reductions but cannot fully remove bias without fine-tuning.
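
The Wasserstein formulation can be made concrete for equal-weight empirical distributions of relative sentence positions in $[0,1]$: the 1-D $W_1$ distance is the integral of the absolute difference between the two empirical CDFs. This is a generic sketch of that computation, not Chhabra et al.'s released code.

```python
from typing import List

def wasserstein_1d(xs: List[float], ys: List[float]) -> float:
    """W1 distance between two empirical distributions on [0, 1], e.g. the
    relative positions of gold vs. model-selected summary sentences.
    Integrates |F(t) - G(t)| over the step intervals of the pooled support."""
    cdf = lambda sample, t: sum(v <= t for v in sample) / len(sample)
    points = sorted(set(xs) | set(ys))
    dist = 0.0
    for a, b in zip(points, points[1:]):
        dist += abs(cdf(xs, a) - cdf(ys, a)) * (b - a)
    return dist
```

A summariser that only picks lead sentences against gold summaries drawn uniformly from the document yields a large distance; identical positional distributions yield zero.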

In terms of style, zero-shot LLMs are highly extractive by default, particularly when summarising news. Human evaluators display no uniform preference for abstractive versus extractive summaries, but stylistic trade-offs are critical for certain applications (Zhang et al., 2023). Configuration of prompt length, abstraction, and persona is thus essential.

5. Domain- and Modality-Specific Extensions

Zero-shot LLM summarisation has been successfully extended beyond news and general-domain texts:

  • Scientific and biomedical: Two-stage role-based prompting (“author Q&A” followed by “editorial lay summary”) substantially raises lay readability, especially with large (≥ 46 B parameter) models. LLM judges reliably recapitulate human preferences, and the “pipeline” generalizes to new academic domains (Goldsack et al., 9 Jan 2025).
  • Video and multimodal: Pseudo-label, rubric-based prompting enables zero-shot video summarization, outperforming all prior prompt-based and unsupervised baselines on benchmarks like SumMe (F1 = 57.6) and TVSum (F1 = 63.1) (Wu et al., 20 Oct 2025).
  • Cross-lingual: Multi-step meta-generated pipelines (summarise → improve → translate → refine) enable GPT-3.5/GPT-4 to deliver state-of-the-art zero-shot cross-lingual summaries on low-resource languages, with sum-ROUGE increases up to 100% over naive approaches (Li et al., 2024, Wang et al., 2023).
  • Multi-lingual extractive: The NLSSum framework generates, weights, and fuses multilingual extractive labels, closing the gap to supervised models on cross-lingual benchmarks (Jia et al., 2022).
  • Clinical: Retrieval-augmented generation (RAG) and chunking improve event coverage and temporal fidelity in long clinical note summarisation, although explicit temporal representations remain lacking (Kruse et al., 30 Jan 2025).
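
The cross-lingual summarise → improve → translate → refine pipeline is, structurally, a sequential composition of prompted LLM calls. The sketch below shows that composition with stub stages; real stages would each be a distinct zero-shot prompt to the model, and the stage names follow the SITR ordering from the text rather than any published code.

```python
from functools import reduce
from typing import Callable, List

def staged_pipeline(stages: List[Callable[[str], str]], document: str) -> str:
    """Run a meta-generation pipeline: each stage is a prompted LLM call
    that consumes the previous stage's output."""
    return reduce(lambda text, stage: stage(text), stages, document)

# Stub stages standing in for the four prompted calls (summarise, improve,
# translate, refine); each tags its output so the ordering is visible.
sitr_stages = [
    lambda t: f"[summary] {t}",
    lambda t: f"[improved] {t}",
    lambda t: f"[translated] {t}",
    lambda t: f"[refined] {t}",
]
```

Keeping the stages as independent callables mirrors why such pipelines transfer across language pairs: only the translate-stage prompt changes.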

6. Model Properties, Evaluation, and Best Practices

Instruction tuning matters more than model scale in determining zero-shot quality. Even 350M-parameter instruction-tuned LLMs produce more faithful, relevant, and coherent summaries than untuned 175B LLMs (Zhang et al., 2023). Nevertheless, API and open-source LLMs show variability by domain, dataset, and prompt (Aly et al., 7 Jul 2025). In most cases, instruct-based models (Mistral-7B-Instruct, Llama-13B-Chat, Mixtral) provide optimal efficiency-quality trade-offs.

Evaluation is multifaceted. Standard metrics (ROUGE, BERTScore, METEOR, etc.) capture surface quality; semantic and factual metrics require NLI or QA-based approaches, and their reliability falls rapidly in low-resource or specialty domains. Human head-to-head comparisons and LLM-based preference evaluations (e.g., PoLL panel) remain essential for practical assessment (Goldsack et al., 9 Jan 2025). For domain transfer, small-scale expert annotation is recommended (Ramprasad et al., 2024).
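
For orientation, the surface-overlap family of metrics reduces to clipped n-gram matching. This is a minimal ROUGE-N sketch (F1 over n-gram overlap) without the stemming, stopword, or multi-reference handling of the official implementation.

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Minimal ROUGE-N: F1 over clipped n-gram overlap between candidate
    and reference. Lowercased whitespace tokenisation, no stemming."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(zip(*[tokens[i:] for i in range(n)]))
    c, r = ngrams(candidate), ngrams(reference)
    overlap = sum((c & r).values())  # clipped counts via multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

The sketch also makes the metric's limitation visible: a factually wrong summary that reuses the reference's vocabulary can still score highly, which is why NLI- or QA-based factuality checks and human judgments remain necessary.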

Best practices include:

  • Prefer instruction-tuned models; tuning matters more than raw scale (Zhang et al., 2023).
  • Specify length, style, and persona constraints explicitly in the prompt, with automated filtering or revision when precise length compliance matters (Retkowski et al., 2024).
  • Apply chunked, hierarchical prompting for inputs that exceed the context window (Aly et al., 7 Jul 2025).
  • Favor extractive or “extract and compress” prompting when faithfulness is critical (Ramprasad et al., 2024).
  • Complement automatic metrics with human or LLM-panel preference judgments, and with small-scale expert annotation for domain transfer (Goldsack et al., 9 Jan 2025, Ramprasad et al., 2024).

7. Open Challenges and Research Directions

Despite the empirical superiority of zero-shot LLM summarisation in many domains, several critical challenges remain:

  • Robustness: Minimal, meaning-preserving surface perturbations (relevance paraphrasing) can yield substantial performance drops, reflecting brittle reliance on heuristic pattern-matching rather than semantic reasoning (Askari et al., 2024).
  • Hallucination control: Simple prompt or context repetition offers only partial mitigation, especially for deeply specialized or compositional content (Jaaouine et al., 30 Nov 2025).
  • Temporal and logical consistency: LLMs are not yet capable of enforcing fine-grained event chronology in long clinical or process narratives without explicit temporal modeling (Kruse et al., 30 Jan 2025).
  • Evaluation: No single metric suffices; reference quality, metric–faithfulness correlation, and domain mismatch must all be addressed (Zhang et al., 2023, Ramprasad et al., 2024).
  • Abstraction vs. extraction: The default extractiveness of LLM outputs may be a limitation in domains requiring paraphrastic compression or fully novel compositionality (Zhang et al., 2023, Goldsack et al., 9 Jan 2025).
  • Cross-lingual and low-resource: Meta-generation pipelines are highly effective but require careful prompt curation and may not transfer optimally to ultra-low-resource languages or non-standard domains (Li et al., 2024).

Further research is centered on developing semantic-invariance–driven tuning objectives, adversarial testing via meaning-preserving perturbations, explicit penalty and reward for temporal/logical coherence, and human-centered and utility-based evaluation approaches.


Key references: (Wu et al., 20 Oct 2025, Jaaouine et al., 30 Nov 2025, Retkowski et al., 2024, Goldsack et al., 9 Jan 2025, Kruse et al., 30 Jan 2025, Aly et al., 7 Jul 2025, Askari et al., 2024, Ramprasad et al., 2024, Chhabra et al., 2024, Manuvinakurike et al., 2023, Pu et al., 2023, Wang et al., 2023, Zhang et al., 2023, Jia et al., 2022, Li et al., 2024).
