A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks

Published 13 Oct 2025 in cs.HC | (2510.11897v1)

Abstract: Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by LLMs to ensure they are faithful and don't provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loop of an internal and external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops; a closer loop creates higher quality conversations with a decrease in quantity and diversity. Further, we present guidance for how to best utilize two different population groups when performing annotation tasks, particularly when the task is complex.