
Few-Shot QA Research

Updated 17 January 2026
  • Few-shot question answering builds QA systems from only 16–128 labeled examples per domain, which forces a rethink of both pretraining objectives and fine-tuning protocols.
  • Innovative methods, such as recurring span selection with a QASS head, prompt-based tuning of text-to-text models, and synthetic data augmentation, drive significant F1 improvements.
  • Practical applications span extractive, generative, multi-hop, multilingual, and multimodal tasks, enabling robust adaptation even under severe data scarcity.

Few-shot Question Answering (QA) refers to building systems capable of answering natural language queries given only a handful of labeled question–answer examples, often in the range of 16–128 per domain. Unlike standard QA paradigms that leverage massive annotated corpora (≥10⁵ examples), few-shot scenarios necessitate rethinking both pre-training objectives and fine-tuning protocols due to severe data scarcity. Recent work demonstrates that naïvely fine-tuning models like BERT or RoBERTa, originally built for masked language modeling (MLM), yields markedly poor performance under few-shot conditions, especially with weak question–context alignment. Advances in this field center on pretraining schemes that directly encourage span selection for QA, prompt-based tuning on text-to-text architectures, synthetic data augmentation, contrastive learning strategies, and retrieval-augmented in-context exemplars. Few-shot QA now spans extractive, generative, multi-hop, knowledge-base, multimodal, and multilingual tasks.

1. Pretraining Objectives: Recurring Span Selection and Inductive Alignment

Traditional masked-LM pretraining (BERT/SpanBERT) exploits local context patterns but fails to impart a direct mapping from question representations to answer-span selection, especially when fine-tuning on small datasets. Ram et al. (2021) introduce the recurring span selection pretraining objective in "Few-Shot Question Answering by Pretraining Span Selection". Spans (n-grams, entities, noun phrases) that repeat ≥2× per passage are grouped into clusters; for each cluster, all but one occurrence are masked with a special [QUESTION] token, and the model must select the unmasked "answer" occurrence from the candidate spans. The architecture employs a Question-Aware Span Selection (QASS) head, whose start and end vectors S·x_q and E·x_q are dynamically parameterized by the [QUESTION] representation x_q.
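The data-construction step above can be sketched in a few lines of pure Python. This is an illustrative toy (whitespace tokenization, no entity or noun-phrase detection, helper names are ours), not the paper's implementation:

```python
from collections import defaultdict

QUESTION_TOKEN = "[QUESTION]"

def recurring_spans(tokens, max_n=3, min_count=2):
    """Group n-gram spans that recur at least `min_count` times in a passage."""
    clusters = defaultdict(list)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            clusters[tuple(tokens[i:i + n])].append((i, i + n))
    return {span: occ for span, occ in clusters.items() if len(occ) >= min_count}

def mask_all_but_one(tokens, occurrences, keep_index=0):
    """Replace every occurrence of a recurring span except one with a single
    [QUESTION] token; the kept occurrence is the pretraining 'answer'."""
    out = list(tokens)
    answer = occurrences[keep_index]
    for j, (s, e) in enumerate(occurrences):
        if j != keep_index:
            out[s:e] = [QUESTION_TOKEN] + [None] * (e - s - 1)
    return [t for t in out if t is not None], answer
```

During pretraining, the QASS head would then score every passage span against the [QUESTION] position's representation to recover the kept occurrence.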

This pretraining formulation injects a strong question–answer alignment bias. The resulting model, Splinter (base size), achieves dramatic gains: on SQuAD, 72.7 F1 with just 128 examples (vs. 43.0 for RoBERTa-base and 55.8 for SpanBERT-base). The QASS head alone yields an improvement, but the full recurring-span pretraining is necessary for the largest gains in the 16–512 example regime. The alignment between pretraining and fine-tuning tasks is evidenced by low representation drift (cosine similarity ≈0.89 pre- vs. post-fine-tuning), and ablations show consistently superior sample efficiency. Recurring-span frameworks generalize robustly to diverse QA benchmarks (e.g., biology, textbooks, trivia), with >20 F1-point improvements over vanilla baselines in the few-shot regime.
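The F1 numbers quoted throughout this survey are token-overlap F1 in the SQuAD style. A compact sketch of that metric (our own minimal re-implementation, with SQuAD-style normalization approximated by lowercasing, punctuation stripping, and article removal):

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, drop English articles (SQuAD-style)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in {"a", "an", "the"})

def span_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)
```

Benchmark scores average this per-example F1 over the dataset (taking the max over multiple gold answers when available).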

2. Prompt-tuning and Text-to-Text Generation Paradigms

Recent advances leverage the alignment between pre-training and fine-tuning tasks in encoder–decoder architectures (T5, BART), utilizing prompt-based input and output formatting. The FewshotQA framework ("FewshotQA: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models" (Chada et al., 2021)) frames inputs as "Question: … Answer: <mask> Context: …" and aligns the generation objective directly with pre-training. Empirical results show that prompt-based tuning of BART-large reaches 72.3 F1 on SQuAD with only 32 examples; FewshotBART provides up to +34.2 F1 improvement with 16-shot training over standard span-selection BART across 8 datasets.
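A minimal sketch of this input/output formatting. The exact target framing is our assumption based on the description above (the helper name and the trailing punctuation are illustrative); `<mask>` stands in for BART's infilling placeholder:

```python
MASK = "<mask>"  # BART-style infilling placeholder

def fewshotqa_format(question, context, answer=None):
    """Build a FewshotQA-style (source, target) pair: the source places a
    masked answer slot between the question and the context, and the target
    repeats the frame with the slot filled in (None at inference time)."""
    source = f"Question: {question} Answer: {MASK}. Context: {context}"
    target = None if answer is None else f"Question: {question} Answer: {answer}."
    return source, target
```

Because the source mirrors the denoising pattern BART saw during pre-training, the fine-tuning signal from even 16–32 examples lands on an already-familiar task format.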

Ablation studies find that any generation-based objective aligned with the pre-training task drastically outperforms pure span selection in the ultra-low-data regime, with the Q→A format providing the best gains. Scaling model size further boosts performance (~10–15 F1 improvement for BART-large over BART-base). Multilingual extension with mBART-50 yields up to +40 F1 on TyDiQA (2–64 shots) over XLM-RoBERTa-large. Prompt tuning with "soft prompts" (trainable input embeddings) is shown to match or exceed full model tuning given good initialization and suitable pre-training ("Few-shot Unified Question Answering: Tuning Models or Prompts?" (Bansal et al., 2023)). Format-based prompt sharing and multi-task pre-training further amplify transfer.

3. Synthetic Data Generation and Data Augmentation

In settings where annotation is costly or domains are specialized, synthetic question–answer generation using LLMs can effectively bootstrap few-shot performance. Synthetic data is produced by prompting an LM to generate diverse questions given a context and candidate answer spans. Prompting-based pipelines employ encoder–decoder models (T5-large) with learned prompt tokens ("Prompting-based Synthetic Data Generation for Few-Shot Question Answering" (Schmidt et al., 2024)). A two-step process is used: (1) candidate answer sampling via entity recognition or rule-based heuristics, and (2) LM-based conditional question generation, filtered by consistency and linguistic rules. Training exclusively on synthetic pairs enables state-of-the-art zero- and few-shot F1 scores (85.5 F1 zero-shot on SQuAD; ≥88 F1 with only 128 examples), with 128–256 synthetic pairs nearly matching human-annotated quality. Ablations show only a marginal gap (~1–1.3 F1) between gold and NER-sampled answers.
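The two-step pipeline can be sketched as follows. Everything here is a stand-in: a crude regex replaces the NER answer sampler, and the question generator and QA filter are passed in as callables (in the paper these would be prompt-tuned LMs); the round-trip check keeps a pair only if a QA model recovers the sampled answer:

```python
import re

def sample_candidate_answers(context):
    """Step 1: candidate answer sampling. A real pipeline would use NER;
    this rule-based stand-in picks capitalized spans and numbers."""
    return re.findall(r"\b(?:[A-Z][a-z]+(?: [A-Z][a-z]+)*|\d+)\b", context)

def synthesize_pairs(context, generate_question, extract_answer):
    """Step 2: conditional question generation plus a round-trip
    consistency filter over the generated (question, answer) pairs."""
    pairs = []
    for answer in sample_candidate_answers(context):
        question = generate_question(context, answer)
        if extract_answer(question, context) == answer:  # consistency check
            pairs.append((question, answer))
    return pairs
```

The filter is what keeps purely synthetic training data close to gold quality: inconsistent generations are simply discarded.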

Gotta (Chen et al., 2023) amplifies few-shot learning by integrating large-scale cloze-style (fill-in-the-blank entity masking) augmentation: for every original QA sample, entity spans are masked to produce cloze questions in an identical prompt format. A joint prompt-based generative loss over QA and cloze examples improves semantic alignment and yields +2–34 F1 gains on entity-rich benchmarks, outperforming random span masking. Multi-task decoders underperform compared with unified prompt-tuned frameworks.
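The cloze-construction step is simple to illustrate. A minimal sketch, assuming entities are already identified (the function name and `<mask>` placeholder are ours, not Gotta's API):

```python
def cloze_augment(context, entities, blank="<mask>"):
    """Turn one passage into cloze-style QA pairs by masking each entity
    span in turn: the masked passage acts as the 'question' and the
    masked entity is the answer, mirroring fill-in-the-blank augmentation."""
    pairs = []
    for ent in entities:
        if ent in context:
            pairs.append((context.replace(ent, blank, 1), ent))
    return pairs
```

Each original QA example thus spawns several extra training pairs in the same prompt format, which is what lets the joint generative loss treat QA and cloze uniformly.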

4. Knowledge-Augmented and Contrastive Few-shot Learning

KECP ("Knowledge Enhanced Contrastive Prompting for Few-shot Extractive Question Answering" (Wang et al., 2022)) recasts extractive QA as a masked-language-modeling generation task, with rich injection of external knowledge-base embeddings (WikiData5M, via ConvE). The Knowledge-aware Prompt Encoder augments passage and prompt tokens with KB vectors, activating cross-attention only for selected soft-masked tokens. Contrastive learning is applied over hard negative spans, enforcing robust separation between the true answer and semantically similar distractors.

In few-shot (16–128 shot) SQuAD 2.0, KECP attains 75.45% F1 (vs. 53.05% for Splinter and 60.48% for P-tuning V2). Ablations removing KB augmentation, the contrastive loss, or cross-attentive prompt injection reveal that each component is essential; sample-efficiency curves confirm dominance in the 16–256 shot regime. Decoding is constrained to valid passage substrings via prefix-tree search. KB signals are injected sparingly to avoid overfitting, and the model extends to generative, multi-hop, and multiple-choice QA via modified masked templates.
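The prefix-tree constraint can be sketched independently of the model: index every passage substring in a trie, then at each decoding step allow only tokens that extend the current prefix into a valid span. A toy version (whitespace tokens, helper names ours):

```python
def build_span_trie(tokens, max_len=8):
    """Index every passage substring (up to max_len tokens) in a prefix
    tree, so generation can be restricted to valid passage spans."""
    trie = {}
    for i in range(len(tokens)):
        node = trie
        for tok in tokens[i:i + max_len]:
            node = node.setdefault(tok, {})
    return trie

def allowed_next(trie, prefix):
    """Tokens that may legally extend `prefix` into a passage substring;
    empty set means the prefix is not a span of the passage."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()
    return set(node)
```

In a real decoder, `allowed_next` would mask the vocabulary logits at each step, guaranteeing the generated answer is copyable from the passage.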

5. Few-shot QA in Knowledge Bases, Multi-hop, and Multimodal Domains

Knowledge-base QA introduces further complexity: answering often requires inducing discrete programs with compositional reasoning. Meta-reinforcement learning approaches ("Few-Shot Complex Knowledge Base Question Answering via Meta Reinforcement Learning" (Hua et al., 2020)) adapt a global policy to new questions using a handful of nearest-neighbor support exemplars. The RL programmer (attention-based seq2seq) rapidly tailors its parameters via inner optimization; meta-training learns an initialization favorable for fast adaptation across question types, handling distributional bias. State-of-the-art macro-F1/micro-F1 (66.25/77.71) is achieved with only five support questions and sparse meta-training.

Pipeline methods for multi-hop KBQA ("Few-shot Multi-hop Question Answering over Knowledge Base" (Fan et al., 2021)) restrict search with hand-crafted templates, synthesize thousands of artificial question–schema pairs, and apply BERT-based contextual entity linking. With just 10% of the annotated examples plus synthetic pairs, the system attains 58.54% F1 (vs. 56.70–62.55% for full-data systems). Extensions via template-based reasoning and beam search generalize to reading comprehension, open-domain, and multimodal QA. In multimodal settings ("Electrocardiogram-LLM for Few-Shot Question Answering with Meta Learning" (Tang et al., 2024)), meta-learning architectures combine signal encoders (ECG) with frozen LLMs via trainable fusion modules; episodic MAML adaptation succeeds across 5-way 5-shot verification, choice, and query tasks.
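The episodic meta-learning loop underlying both the meta-RL programmer and the ECG-LLM work follows the same MAML-style template: adapt on a support set, then update the initialization from query-set performance. A first-order sketch on a deliberately tiny model (a one-parameter linear regressor; all names and hyperparameters here are illustrative, not from either paper):

```python
def grad_mse(w, data):
    """d/dw of mean squared error for the linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def fomaml_step(w, tasks, inner_lr=0.05, outer_lr=0.05, inner_steps=1):
    """One first-order MAML meta-update: adapt on each task's support set,
    then move the shared initialization using query-set gradients evaluated
    at the adapted parameters (the episodic few-shot adaptation scheme)."""
    meta_grad = 0.0
    for support, query in tasks:
        w_task = w
        for _ in range(inner_steps):
            w_task -= inner_lr * grad_mse(w_task, support)
        meta_grad += grad_mse(w_task, query)
    return w - outer_lr * meta_grad / len(tasks)
```

Meta-training many such steps yields an initialization from which a handful of support examples (or, in the meta-RL case, nearest-neighbor support questions) suffice for rapid per-task adaptation.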

6. In-context Learning, Retrieval-Augmented Generation, and Dynamic Demonstration Selection

The paradigm shift toward in-context learning with large LLMs enables high performance with no direct parameter updates. MFORT-QA ("MFORT-QA: Multi-hop Few-shot Open Rich Table Question Answering" (Guan et al., 2024)) combines few-shot exemplar selection via dense IR retrieval (Sentence-BERT) with chain-of-thought prompting and retrieval-augmented generation (RAG). Contexts (tables and hyperlinks) are selected by cosine similarity, exemplars are chosen from the nearest training triples, and multi-hop questions are decomposed step by step via chain-of-thought prompting. Empirical results on OTT-QA demonstrate a 4× improvement in exact match over zero-shot baselines. Best practices include limiting the number of retrieved contexts, controlling exemplar diversity, and structuring prompts in Q–T–A format.
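The exemplar-selection step can be sketched in isolation. A bag-of-words vector stands in for the Sentence-BERT embedding (an assumption for the sake of a self-contained example; a real system would call a dense encoder):

```python
import math
from collections import Counter

def bow_vector(text):
    """Bag-of-words stand-in for a dense sentence embedding."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def select_exemplars(query, pool, k=2):
    """Pick the k training examples most similar to the query to place
    in the prompt: the dense-retrieval exemplar-selection step."""
    q = bow_vector(query)
    return sorted(pool, key=lambda ex: cosine(q, bow_vector(ex[0])),
                  reverse=True)[:k]
```

The selected (question, answer) pairs are then concatenated ahead of the test question in the prompt, so the LLM conditions on the most relevant demonstrations.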

Dynamic retrieval of few-shot demonstrations (DFSL, "Dynamic Few-Shot Learning for Knowledge Graph Question Answering" (D'Abramo et al., 2024)) further refines in-context sample selection. Top-k relevant exemplars—encoded by SBERT—are fetched from a growing memory store for each query. Multi-query beam search and answer set selection mitigate subject–object swap errors. Across benchmarks (QALD-9 Plus, QALD-10, LC-QUAD 2.0), DFSL delivers up to +30 F1 over static few-shot, often matching fine-tuned SOTA without re-training. Strengths include out-of-domain generalization and model-agnostic composition.
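The "growing memory store" idea can be sketched as a small class: demonstrations are appended as they become available, and each query fetches its top-k neighbors. Token overlap stands in for SBERT similarity, and the class and method names are ours, not DFSL's API:

```python
class DemonstrationMemory:
    """Growing store of (question, answer) demonstrations; each query
    retrieves its k most similar entries for in-context prompting."""

    def __init__(self, k=3):
        self.k = k
        self.store = []

    def add(self, question, answer):
        """Append a newly available demonstration to the memory."""
        self.store.append((question, answer))

    def retrieve(self, query):
        """Top-k demonstrations by token overlap with the query
        (a stand-in for dense SBERT similarity)."""
        q = set(query.lower().split())
        score = lambda ex: len(q & set(ex[0].lower().split()))
        return sorted(self.store, key=score, reverse=True)[:self.k]
```

Because the memory grows over time, the same frozen LLM sees increasingly relevant demonstrations without any re-training, which is what gives the approach its out-of-domain flexibility.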

7. Multilingual and Zero-Shot Few-Shot QA

Multilingual few-shot QA leverages LLMs and synthetic data pipelines to overcome annotation bottlenecks for underrepresented languages. FsModQA ("Few-Shot Multilingual Open-Domain QA from 5 Examples" (Jiang et al., 2025)) synthesizes 1.7M multilingual QA pairs with only five human examples per language, using cross-lingual prompting and entailment filtering. Joint retriever-reader training with contrastive and cross-attention losses on mT5-L yields state-of-the-art F1 (e.g., 38.2 on XOR-Full QA across 8 languages, 25.0 on MKQA-26) with minimal supervision, outperforming direct fine-tuning and prior methods. Zero-shot adaptation via bilingual prompt translation achieves within 1–2 F1 of in-language few-shot.

References

  • Ram et al., "Few-Shot Question Answering by Pretraining Span Selection" (2021)
  • Chada et al., "FewshotQA: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models" (2021)
  • Bansal et al., "Few-shot Unified Question Answering: Tuning Models or Prompts?" (2023)
  • Sutanto et al., "LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering" (2024)
  • Chen et al., "Gotta: Generative Few-shot Question Answering by Prompt-based Cloze Data Augmentation" (2023)
  • Wang et al., "KECP: Knowledge Enhanced Contrastive Prompting for Few-shot Extractive Question Answering" (2022)
  • Sawhney et al., "Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability" (2024)
  • Fan et al., "Few-shot Multi-hop Question Answering over Knowledge Base" (2021)
  • Guan et al., "MFORT-QA: Multi-hop Few-shot Open Rich Table Question Answering" (2024)
  • Schmidt et al., "Prompting-based Synthetic Data Generation for Few-Shot Question Answering" (2024)
  • Jiang et al., "Few-Shot Multilingual Open-Domain QA from 5 Examples" (2025)
  • D'Abramo et al., "Dynamic Few-Shot Learning for Knowledge Graph Question Answering" (2024)
  • Guo et al., "Learning Compositional Representation for Few-shot Visual Question Answering" (2021)
  • Hua et al., "Few-Shot Complex Knowledge Base Question Answering via Meta Reinforcement Learning" (2020)
  • Tang et al., "Electrocardiogram-LLM for Few-Shot Question Answering with Meta Learning" (2024)
