
Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning

Published 13 Oct 2023 in cs.CL | (2310.08923v1)

Abstract: LLMs possess the capability to engage in In-Context Learning (ICL) by leveraging a few demonstrations pertaining to a new downstream task as conditions. However, this particular learning paradigm suffers from high instability stemming from substantial variances induced by factors such as the input distribution of selected examples, their ordering, and prompt formats. In this work, we demonstrate that even when all these factors are held constant, the random selection of examples still results in high variance. Consequently, we aim to explore the informative ability of data examples by quantifying the Information Gain (IG) obtained in prediction after observing a given example candidate. Then we propose to sample those with maximum IG. Additionally, we identify the presence of template bias, which can lead to unfair evaluations of IG during the sampling process. To mitigate this bias, we introduce a Calibration Before Sampling strategy. The experimental results illustrate that our proposed method can yield an average relative improvement of 14.3% across six classification tasks using three LLMs.


Summary

  • The paper shows that random prompt selection leads to substantial performance variance in in-context learning, even under fixed input conditions.
  • The proposed method uses an Information Gain metric, adjusted via Calibration Before Sampling, to effectively identify highly informative examples.
  • Empirical results demonstrate an average relative improvement of 14.3% across various LLMs and tasks, validating the method’s robustness.

Maximum Information Gain Sampling for Informative Few-Shot Prompt Selection in In-Context Learning

Introduction

In-context learning (ICL) with LLMs has emerged as an effective paradigm for few-shot adaptation without parameter updates. Critical factors such as input distribution, demonstration ordering, and prompt format have been identified as primary sources of variance in ICL performance. However, this paper, "Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning" (2310.08923), demonstrates that even under fixed input distributions and prompt formats, the random selection of in-context examples yields substantial performance instability. The authors propose a fundamentally information-theoretic approach—quantifying the informative ability of candidate examples via Information Gain (IG)—and establish a rigorous selection pipeline that maximizes IG, corrects for template bias, and offers strong empirical gains.

Motivation and Analysis of Variance in ICL

The extensive variance in ICL, even with controlled factors, exposes a gap in understanding what constitutes a 'good' demonstration. The authors empirically show that random selection under fixed prompt configurations leads to wide fluctuations in performance (Figure 1), indicating that distinct data samples within the same class do not contribute equally to downstream prediction.

Figure 1: Four-shot ICL performance variance on SST-2; identical prompt format and class order still yield large accuracy fluctuations, highlighting unequal informativeness among candidate demonstrations.

Quantifying this variance motivates the shift from traditional random or semantically-driven example selection towards principled measures of informativeness grounded in information theory.

Methodology: Information Gain and Template Bias Calibration

The core of the approach is the deployment of Information Gain (IG) as the selection metric for few-shot prompts. Each candidate from the unlabeled pool is evaluated as a potential one-shot or few-shot demonstration. IG is formally operationalized as the reduction in the conditional entropy of the label distribution $Y$ after observing candidate $x_{ob}$, i.e., $\mathrm{IG}(Y, x_{ob}) = H(Y) - H(Y \mid x_{ob})$. Since $H(Y)$ is constant for a fixed task, maximizing IG reduces to minimizing $H(Y \mid x_{ob})$.
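
To make the scoring concrete, here is a minimal sketch of per-candidate conditional-entropy estimation and maximum-IG selection, assuming hypothetical helpers `template(query=...)` (formats a zero-shot prompt) and `label_probs(prompt, labels)` (returns the LLM's normalized probabilities over the label verbalizers); neither comes from the paper's released code.

```python
import numpy as np

def candidate_entropy(candidate_text, labels, template, label_probs):
    """Estimate H(Y | x_ob) for one unlabeled candidate: build a zero-shot
    prompt with the candidate as the query and take the entropy of the
    model's distribution over the label verbalizers.

    `template` and `label_probs` are assumed helpers, not the paper's API."""
    prompt = template(query=candidate_text)  # zero-shot prompt for this candidate
    p = np.clip(np.asarray(label_probs(prompt, labels), dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_max_ig(candidates, labels, template, label_probs):
    """H(Y) is constant per task, so the maximum-IG candidate is simply the
    one with the smallest estimated conditional entropy."""
    return min(candidates,
               key=lambda c: candidate_entropy(c, labels, template, label_probs))
```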

However, a critical insight is the presence of template bias in LLMs: even content-free prompts can elicit highly skewed predictions (sometimes over 90% for one label; Figure 2). This phenomenon distorts the actual informativeness attributed to candidate examples.

Figure 2: Template bias in SST-2 tasks; uncalibrated models strongly favor certain output classes for content-free templates, necessitating calibration prior to IG computation.

To neutralize template bias, the authors introduce Calibration Before Sampling (CBS): conditional label probabilities are recalibrated using statistics computed from content-free inputs rendered in the same task template. This ensures that IG estimates reflect true informativeness rather than artifacts of prompt format or spurious label priors.
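
As a rough illustration of the calibration step, the sketch below divides a candidate's predicted label distribution by the distribution obtained for a content-free input rendered in the same template, then renormalizes. This mirrors contextual calibration; the content-free string "N/A" and the helper names are assumptions, and the exact CBS formulation in the paper may differ.

```python
import numpy as np

def template_bias(labels, template, label_probs, content_free="N/A"):
    """Estimate the template's label prior by feeding a content-free input
    (placeholder string; the paper may use other strings) through the same
    zero-shot template."""
    return np.asarray(label_probs(template(query=content_free), labels), dtype=float)

def calibrate(p_pred, p_cf):
    """Counteract template bias: divide the predicted label distribution by
    the content-free distribution and renormalize, so that a content-free
    input would map to a uniform distribution over labels."""
    p = np.asarray(p_pred, dtype=float) / np.clip(p_cf, 1e-12, None)
    return p / p.sum()
```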

The complete selection pipeline is illustrated in Figure 3.

Figure 3: Overview of the CBS MaxIG selection pipeline: zero-shot prompt construction, IG computation, template bias estimation and calibration, followed by demonstration sampling and annotation.
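
Putting the two pieces together, an end-to-end sketch of the selection loop suggested by Figure 3 might look as follows; it reuses the assumed helpers from the sketches above, and the returned top-k candidates would then be annotated and used as demonstrations.

```python
import numpy as np

# Reuses template_bias() and calibrate() from the calibration sketch above.
def cbs_max_ig(candidates, labels, template, label_probs, k=1):
    """Rank unlabeled candidates by calibrated conditional entropy
    (lowest entropy = highest IG) and return the top-k for annotation."""
    p_cf = template_bias(labels, template, label_probs)  # one-time bias estimate
    scored = []
    for c in candidates:
        p = np.asarray(label_probs(template(query=c), labels), dtype=float)
        p_cal = np.clip(calibrate(p, p_cf), 1e-12, 1.0)
        entropy = float(-(p_cal * np.log(p_cal)).sum())
        scored.append((entropy, c))
    scored.sort(key=lambda t: t[0])  # ascending entropy = descending IG
    return [c for _, c in scored[:k]]
```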

Experimental Results

Experiments involve three LLMs (GPT-2 XL, GPT-J, GPT-3 davinci) and six classification tasks, with comprehensive baselines (random selection and max entropy methods). The CBS MaxIG approach achieves, on average, a 14.3% relative improvement in one-shot accuracy across tasks, outperforming all baselines.

Notably, CBS MaxIG demonstrates robustness not only for one-shot but also for four-shot settings, systematically outperforming random selection and max entropy in balanced and unbalanced class scenarios on SST-2 (Figure 4).

Figure 4: Four-shot performance for multiple selection methods and class permutations on SST-2; CBS MaxIG almost uniformly surpasses alternative selection strategies.

Further, the method is orthogonal to post-calibration and order probing: integration with these methods yields additive benefits, highlighting CBS MaxIG’s complementarity (Figure 5).

Figure 5: Ablation and integration with ordering and post-calibration strategies in four-shot learning, showing enhanced performance when combined with CBS MaxIG.

Individual top-IG examples consistently outperform those randomly chosen (Figure 6), demonstrating that informativeness via IG is a reliable selection principle.

Figure 6: One-shot accuracy for examples with the highest IG; selected samples consistently generalize better than randomly chosen demonstrations.

Analysis and Implications

Ablation studies highlight the necessity of CBS: IG estimates without calibration are distorted. Moreover, applying CBS directly to max entropy selection (CBS MaxEntropy) leads to a marked performance drop, further validating that IG, rather than uncertainty sampling, is the operative selection criterion for ICL. This stands in contrast to active learning settings, where model parameters are updated iteratively and high-uncertainty points are preferred.
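
In notation (ours, not taken verbatim from the paper), the contrast between the two criteria over a candidate pool $\mathcal{D}$ can be written as:

$$x^{*}_{\text{MaxIG}} = \arg\min_{x \in \mathcal{D}} H(Y \mid x), \qquad x^{*}_{\text{MaxEntropy}} = \arg\max_{x \in \mathcal{D}} H(Y \mid x).$$

The former prefers candidates on which the calibrated model is most confident, while the latter prefers those on which it is most uncertain; per the paper's results, the uncertainty-seeking criterion suits iterative parameter updates rather than frozen-parameter ICL.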

Substituting gold labels with random labels for high-IG demonstrations leads to much larger accuracy drops than for random demonstrations, indicating CBS MaxIG selects genuinely informative and label-sensitive examples. This reinforces the idea that informativeness is not a mere artifact of surface features but is deeply tied to correct semantic and label alignment.

Future Directions

Methodological extensions include the adaptation of IG-based sampling to open-ended generation tasks, where tractable definitions of information gain must contend with variable-length and high-entropy outputs; investigation into diversity-aware informative selection; and efficient strategies for reusing IG computations across LLMs. Additionally, the present approach is task- and model-specific, necessitating new computation for each deployment.

Conclusion

This work establishes that information-theoretic sampling—maximizing information gain after careful template bias calibration—is a robust and effective approach for task-level demonstration selection in in-context learning. The approach consistently yields large gains for multiple task types and LLMs, is synergistic with ordering and calibration methods, and challenges existing conventions from active learning. These findings have significant implications for future research into data-efficient, data-centric ICL strategies, especially as LLMs become more ubiquitous and applications scale to broader domains.
