Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging
Abstract: The rapid advancement of LLMs has given rise to a plethora of applications across a myriad of real-world tasks, mainly centered on aligning with human intent. However, the complexities inherent in human intent necessitate a dependence on labor-intensive and time-consuming human evaluation. To alleviate this constraint, we delve into the paradigm of employing open-source LLMs as evaluators, aligning with the prevailing trend of utilizing GPT-4. Particularly, we present a step-by-step evaluation framework: \textbf{Fennec}, capable of \textbf{F}ine-grained \textbf{E}valuatio\textbf{N} and correctio\textbf{N} \textbf{E}xtended through bran\textbf{C}hing and bridging. Specifically, the branching operation dissects the evaluation task into various dimensions and granularities, thereby alleviating the challenges associated with evaluation. Concurrently, the bridging operation amalgamates diverse training datasets, augmenting the variety of evaluation tasks. In experimental trials, our 7B model consistently outperforms open-source larger-scale evaluation models across various widely adopted benchmarks in terms of both \textit{Agreement} and \textit{Consistency}, closely approaching the capabilities of GPT-4. We employ the fine-grained correction capabilities induced by the evaluation model to refine multiple model responses, and the results show that the refinement elevates the quality of responses, leading to an improvement of 1-2 points on the MT-Bench. Our code is available at Github\footnote{\url{https://github.com/dropreg/Fennec}}.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.
- Purr: Efficiently editing language model hallucinations by denoising language model corruptions. arXiv preprint arXiv:2305.14908.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Hello dolly: Democratizing the magic of chatgpt with open models.
- Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
- Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.
- Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
- Koala: A dialogue model for academic research. Blog post.
- Mistral 7b. arXiv preprint arXiv:2310.06825.
- Tigerscore: Towards building explainable metric for all text generation tasks. arXiv preprint arXiv:2310.00752.
- Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491.
- Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470.
- Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
- X-eval: Generalizable multi-aspect text evaluation via augmented instruction tuning with auxiliary evaluation aspects. arXiv preprint arXiv:2311.08788.
- A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018.
- OpenAI. 2023. Gpt-4 technical report.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
- Lin CY ROUGE. 2004. A package for automatic evaluation of summaries. In Proceedings of Workshop on Text Summarization of ACL, Spain, volume 5.
- Branch-solve-merge improves large language model evaluation and generation. arXiv preprint arXiv:2310.15123.
- Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.
- Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
- Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization. arXiv preprint arXiv:2306.05087.
- Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
- Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
- Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
- Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
- Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.