
Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning

Published 13 Oct 2023 in cs.CL | (2310.08923v1)

Abstract: LLMs possess the capability to engage in In-Context Learning (ICL) by leveraging a few demonstrations pertaining to a new downstream task as conditions. However, this particular learning paradigm suffers from high instability stemming from substantial variances induced by factors such as the input distribution of selected examples, their ordering, and prompt formats. In this work, we demonstrate that even when all these factors are held constant, the random selection of examples still results in high variance. Consequently, we aim to explore the informative ability of data examples by quantifying the Information Gain (IG) obtained in prediction after observing a given example candidate. Then we propose to sample those with maximum IG. Additionally, we identify the presence of template bias, which can lead to unfair evaluations of IG during the sampling process. To mitigate this bias, we introduce a Calibration Before Sampling strategy. The experimental results illustrate that our proposed method can yield an average relative improvement of 14.3% across six classification tasks using three LLMs.


Summary

  • The paper shows that random prompt selection leads to substantial performance variance in in-context learning, even under fixed input conditions.
  • The proposed method uses an Information Gain metric, adjusted via Calibration Before Sampling, to effectively identify highly informative examples.
  • Empirical results demonstrate an average relative improvement of 14.3% across various LLMs and tasks, validating the method’s robustness.

Maximum Information Gain Sampling for Informative Few-Shot Prompt Selection in In-Context Learning

Introduction

In-context learning (ICL) with LLMs has emerged as an effective paradigm for few-shot adaptation without parameter updates. Critical factors such as input distribution, demonstration ordering, and prompt format have been identified as primary sources of variance in ICL performance. However, this paper, "Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning" (2310.08923), demonstrates that even under fixed input distributions and prompt formats, the random selection of in-context examples yields substantial performance instability. The authors propose a fundamentally information-theoretic approach—quantifying the informative ability of candidate examples via Information Gain (IG)—and establish a rigorous selection pipeline that maximizes IG, corrects for template bias, and offers strong empirical gains.

Motivation and Analysis of Variance in ICL

The extensive variance in ICL, even with controlled factors, exposes a gap in understanding what constitutes a 'good' demonstration. The authors empirically show that random selection under fixed prompt configurations leads to wide fluctuations in performance (Figure 1), indicating that distinct data samples within the same class do not contribute equally to downstream prediction.

Figure 1: Four-shot ICL performance variance on SST-2; identical prompt format and class order still yield large accuracy fluctuations, highlighting unequal informativeness among candidate demonstrations.

Quantifying this variance motivates the shift from traditional random or semantically-driven example selection towards principled measures of informativeness grounded in information theory.

Methodology: Information Gain and Template Bias Calibration

The core of the approach is the deployment of Information Gain (IG) as the selection metric for few-shot prompts. Each candidate from the unlabeled pool is evaluated as a potential one-shot or few-shot demonstration. IG is formally operationalized as the reduction in the conditional entropy of the label distribution $Y$ after observing candidate $x_{ob}$, i.e., $\mathrm{IG}(Y, x_{ob}) = H(Y) - H(Y \mid x_{ob})$. Since $H(Y)$ is constant for a fixed task, maximizing IG reduces to minimizing $H(Y \mid x_{ob})$.
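
To make the scoring concrete, here is a minimal sketch of per-candidate conditional-entropy estimation and maximum-IG selection, assuming hypothetical helpers `template(query=...)` (formats a zero-shot prompt) and `label_probs(prompt, labels)` (returns the LLM's normalized probabilities over the label verbalizers); neither comes from the paper's released code.

```python
import numpy as np

def candidate_entropy(candidate_text, labels, template, label_probs):
    """Estimate H(Y | x_ob) for one unlabeled candidate: build a zero-shot
    prompt with the candidate as the query and take the entropy of the
    model's distribution over the label verbalizers.

    `template` and `label_probs` are assumed helpers, not the paper's API."""
    prompt = template(query=candidate_text)  # zero-shot prompt for this candidate
    p = np.clip(np.asarray(label_probs(prompt, labels), dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_max_ig(candidates, labels, template, label_probs):
    """H(Y) is constant per task, so the maximum-IG candidate is simply the
    one with the smallest estimated conditional entropy."""
    return min(candidates,
               key=lambda c: candidate_entropy(c, labels, template, label_probs))
```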

However, a critical insight is the presence of template bias in LLMs: even content-free prompts can elicit highly skewed predictions (sometimes over 90% for one label; Figure 2). This phenomenon distorts the actual informativeness attributed to candidate examples.

Figure 2: Template bias in SST-2 tasks; uncalibrated models strongly favor certain output classes for content-free templates, necessitating calibration prior to IG computation.

To neutralize template bias, the authors introduce Calibration Before Sampling (CBS): conditional label probabilities are recalibrated using statistics computed from content-free inputs rendered in the same task template. This ensures that IG estimates reflect true informativeness rather than artifacts of prompt format or spurious label priors.
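
As a rough illustration of the calibration step, the sketch below divides a candidate's predicted label distribution by the distribution obtained for a content-free input rendered in the same template, then renormalizes. This mirrors contextual calibration; the content-free string "N/A" and the helper names are assumptions, and the exact CBS formulation in the paper may differ.

```python
import numpy as np

def template_bias(labels, template, label_probs, content_free="N/A"):
    """Estimate the template's label prior by feeding a content-free input
    (placeholder string; the paper may use other strings) through the same
    zero-shot template."""
    return np.asarray(label_probs(template(query=content_free), labels), dtype=float)

def calibrate(p_pred, p_cf):
    """Counteract template bias: divide the predicted label distribution by
    the content-free distribution and renormalize, so that a content-free
    input would map to a uniform distribution over labels."""
    p = np.asarray(p_pred, dtype=float) / np.clip(p_cf, 1e-12, None)
    return p / p.sum()
```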

The complete selection pipeline is illustrated in Figure 3.

Figure 3: Overview of the CBS MaxIG selection pipeline: zero-shot prompt construction, IG computation, template bias estimation and calibration, followed by demonstration sampling and annotation.
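
Putting the two pieces together, an end-to-end sketch of the selection loop suggested by Figure 3 might look as follows; it reuses the assumed helpers from the sketches above, and the returned top-k candidates would then be annotated and used as demonstrations.

```python
import numpy as np

# Reuses template_bias() and calibrate() from the calibration sketch above.
def cbs_max_ig(candidates, labels, template, label_probs, k=1):
    """Rank unlabeled candidates by calibrated conditional entropy
    (lowest entropy = highest IG) and return the top-k for annotation."""
    p_cf = template_bias(labels, template, label_probs)  # one-time bias estimate
    scored = []
    for c in candidates:
        p = np.asarray(label_probs(template(query=c), labels), dtype=float)
        p_cal = np.clip(calibrate(p, p_cf), 1e-12, 1.0)
        entropy = float(-(p_cal * np.log(p_cal)).sum())
        scored.append((entropy, c))
    scored.sort(key=lambda t: t[0])  # ascending entropy = descending IG
    return [c for _, c in scored[:k]]
```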

Experimental Results

Experiments involve three LLMs (GPT-2 XL, GPT-J, GPT-3 davinci) and six classification tasks, with comprehensive baselines (random selection and max entropy methods). The CBS MaxIG approach achieves, on average, a 14.3% relative improvement in one-shot accuracy across tasks, outperforming all baselines.

Notably, CBS MaxIG demonstrates robustness not only for one-shot but also for four-shot settings, systematically outperforming random selection and max entropy in balanced and unbalanced class scenarios on SST-2 (Figure 4).

Figure 4: Four-shot performance for multiple selection methods and class permutations on SST-2; CBS MaxIG almost uniformly surpasses alternative selection strategies.

Further, the method is orthogonal to post-calibration and order probing: integration with these methods yields additive benefits, highlighting CBS MaxIG’s complementarity (Figure 5).

Figure 5: Ablation and integration with ordering and post-calibration strategies in four-shot learning, showing enhanced performance when combined with CBS MaxIG.

Individual top-IG examples consistently outperform those randomly chosen (Figure 6), demonstrating that informativeness via IG is a reliable selection principle.

Figure 6: One-shot accuracy for examples with the highest IG; selected samples consistently generalize better than randomly chosen demonstrations.

Analysis and Implications

Ablation studies highlight the necessity of CBS: IG estimates without calibration are distorted. Moreover, applying CBS directly to max entropy selection (CBS MaxEntropy) leads to a marked performance drop, further validating that IG, rather than uncertainty sampling, is the operative selection criterion for ICL. This stands in contrast to active learning settings, where model parameters are updated iteratively and high-uncertainty points are preferred.
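
In notation (ours, not taken verbatim from the paper), the contrast between the two criteria over a candidate pool $\mathcal{D}$ can be written as:

$$x^{*}_{\text{MaxIG}} = \arg\min_{x \in \mathcal{D}} H(Y \mid x), \qquad x^{*}_{\text{MaxEntropy}} = \arg\max_{x \in \mathcal{D}} H(Y \mid x).$$

The former prefers candidates on which the calibrated model is most confident, while the latter prefers those on which it is most uncertain; per the paper's results, the uncertainty-seeking criterion suits iterative parameter updates rather than frozen-parameter ICL.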

Substituting gold labels with random labels for high-IG demonstrations leads to much larger accuracy drops than for random demonstrations, indicating CBS MaxIG selects genuinely informative and label-sensitive examples. This reinforces the idea that informativeness is not a mere artifact of surface features but is deeply tied to correct semantic and label alignment.

Future Directions

Methodological extensions include the adaptation of IG-based sampling to open-ended generation tasks, where tractable definitions of information gain must contend with variable-length and high-entropy outputs; investigation into diversity-aware informative selection; and efficient strategies for reusing IG computations across LLMs. Additionally, the present approach is task- and model-specific, necessitating new computation for each deployment.

Conclusion

This work establishes that information-theoretic sampling—maximizing information gain after careful template bias calibration—is a robust and effective approach for task-level demonstration selection in in-context learning. The approach consistently yields large gains for multiple task types and LLMs, is synergistic with ordering and calibration methods, and challenges existing conventions from active learning. These findings have significant implications for future research into data-efficient, data-centric ICL strategies, especially as LLMs become more ubiquitous and applications scale to broader domains.
