"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters
Abstract: LLMs have recently emerged as an effective tool to assist individuals in writing various types of content, including professional documents such as recommendation letters. Though this application brings convenience, it also introduces unprecedented fairness concerns. Model-generated reference letters may be used directly by users in professional scenarios; if underlying biases exist in these letters, using them without scrutiny could lead to direct societal harms, such as lowering application success rates for female applicants. In light of this pressing issue, it is imperative to comprehensively study fairness issues and associated harms in this real-world use case. In this paper, we critically examine gender biases in LLM-generated reference letters. Drawing inspiration from social science findings, we design evaluation methods that surface biases along two dimensions: (1) biases in language style and (2) biases in lexical content. We further investigate the extent of bias propagation by analyzing hallucination bias, a term we define as the exacerbation of bias in model-hallucinated content. Through benchmark evaluations of two popular LLMs, ChatGPT and Alpaca, we reveal significant gender biases in LLM-generated recommendation letters. Our findings not only warn against using LLMs for this application without scrutiny, but also underscore the importance of thoroughly studying hidden biases and harms in LLM-generated professional documents.