Semantic Search Evaluation
Published 28 Oct 2024 in cs.IR and cs.CL | arXiv:2410.21549v1
Abstract: We propose a novel method for evaluating the performance of a content search system by measuring the semantic match between a query and the results the system returns. We introduce a metric called "on-topic rate" to measure the percentage of results that are relevant to the query. To compute it, we design a pipeline that defines a golden query set, retrieves the top K results for each query, and sends calls to GPT-3.5 with formulated prompts. Our semantic evaluation pipeline helps identify common failure patterns and set goals against the metric for relevance improvements.
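The abstract outlines the pipeline only at a high level, so the following is a minimal sketch of how such an on-topic-rate evaluation might look. The golden query set, the `search_top_k()` retrieval hook, and the judging prompt are hypothetical stand-ins: the paper's actual prompts and query set are not reproduced here, and only the general shape (retrieve top K per query, ask GPT-3.5 to judge relevance, average the judgments) follows the abstract.

```python
# Sketch of an "on-topic rate" evaluation pipeline, assuming a hypothetical
# search_top_k() hook into the search system and a simple YES/NO judging
# prompt. Uses the OpenAI Python SDK (v1) chat completions API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical golden query set; the paper's own set is not public here.
GOLDEN_QUERIES = ["how to reset a router", "symptoms of vitamin D deficiency"]
K = 10


def search_top_k(query: str, k: int) -> list[str]:
    """Hypothetical hook into the content search system under evaluation."""
    raise NotImplementedError


def is_on_topic(query: str, result: str) -> bool:
    """Ask the LLM judge whether a single result is relevant to the query."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Query: {query}\nResult: {result}\n"
                "Is this result relevant to the query? Answer YES or NO."
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")


def on_topic_rate(queries: list[str], k: int) -> float:
    """Fraction of all retrieved results judged relevant to their query."""
    judgments = [
        is_on_topic(q, r)
        for q in queries
        for r in search_top_k(q, k)
    ]
    return sum(judgments) / len(judgments)


if __name__ == "__main__":
    print(f"On-topic rate @ {K}: {on_topic_rate(GOLDEN_QUERIES, K):.2%}")
```

Per-query on-topic rates (rather than one pooled average) would additionally surface the failure patterns the authors mention, since low-scoring queries can be inspected individually.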