Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses
Abstract: LLM-based applications are graduating from research prototypes to products serving millions of users, influencing how people write and consume information. A prominent example is the emergence of Answer Engines: LLM-based generative search engines that are supplanting traditional search engines. Answer engines not only retrieve sources relevant to a user query but also synthesize answer summaries that cite those sources. To understand these systems' limitations, we first conducted a study with 21 participants, evaluating interactions with answer engines versus traditional search engines and identifying 16 answer engine limitations. From these insights, we propose 16 answer engine design recommendations, linked to 8 metrics. An automated evaluation implementing our metrics on three popular engines (You.com, Perplexity.ai, BingChat) quantifies common limitations (e.g., frequent hallucination, inaccurate citation) and unique features (e.g., variation in answer confidence), with results mirroring user study insights. We release our Answer Engine Evaluation benchmark (AEE) to facilitate transparent evaluation of LLM-based applications.
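To make the citation-accuracy idea concrete, here is a minimal sketch of the kind of metric an automated evaluation might compute: the fraction of answer statements whose cited source actually supports them. All names here (`CitedStatement`, `citation_accuracy`) are hypothetical, not the AEE benchmark's API, and the token-overlap support check is a crude stand-in; a real pipeline would use an NLI/entailment model to judge support.

```python
from dataclasses import dataclass

@dataclass
class CitedStatement:
    statement: str    # a sentence from the generated answer
    source_text: str  # text of the source the statement cites

def token_overlap_support(statement: str, source_text: str,
                          threshold: float = 0.5) -> bool:
    """Crude support check: fraction of statement tokens found in the source.
    A real evaluator would replace this with an entailment model."""
    stmt_tokens = set(statement.lower().split())
    src_tokens = set(source_text.lower().split())
    if not stmt_tokens:
        return False
    return len(stmt_tokens & src_tokens) / len(stmt_tokens) >= threshold

def citation_accuracy(items: list[CitedStatement]) -> float:
    """Fraction of answer statements whose cited source supports them."""
    if not items:
        return 0.0
    supported = sum(token_overlap_support(i.statement, i.source_text)
                    for i in items)
    return supported / len(items)
```

With this framing, "inaccurate citation" in the abstract corresponds to a low `citation_accuracy`: statements that cite a source which does not in fact back them up.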