
Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?

Published 20 Oct 2024 in cs.CL (arXiv:2410.15512v2)

Abstract: Question answering (QA), giving correct answers to questions, is a popular task, but we test reverse question answering (RQA): for an input answer, give a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and checking reasoning consistency. We run 16 LLMs on QA and RQA with trivia questions/answers, revealing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to provide valid multi-hop questions. By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.

Citations (1)

Summary

  • The paper demonstrates that LLMs show significant discrepancies in reverse question answering, especially when handling numerical answers with performance gaps over 0.80.
  • The study uses a dataset of 3443 trivia pairs to reveal that many RQA errors stem from flawed question formulation rather than knowledge deficits.
  • The paper identifies strong error correlations with question difficulty and answer frequency, highlighting areas for improvement in abductive reasoning and self-verification.

Insights on Reverse Question Answering: An Analysis

The paper "Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?" examines Reverse Question Answering (RQA): given an input answer, can an LLM generate a valid question with that answer? The study distinguishes RQA from traditional Question Answering (QA) by emphasizing reasoning strategies and consistency across answer types, shedding light on the abductive and deductive reasoning capabilities of LLMs.

Key Contributions

The authors evaluate 16 LLMs across four answer domains: two numerical (Numbers and Number+Text) and two textual (Easy Facts and Hard Facts). A dataset of 3443 trivia question-answer pairs is used to test both QA and RQA under a carefully designed evaluation setup. The central aim is to uncover performance disparities between the two tasks and to identify the factors that shape models' ability to reason and self-verify.
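The paired QA/RQA setup can be pictured as a simple evaluation loop over (question, answer) pairs. The snippet below is a minimal sketch, not the paper's code: a toy lookup-table stands in for a real LLM call, and all names (`qa`, `rqa`, `toy_model`) are hypothetical.

```python
# Sketch of a joint QA/RQA evaluation loop. A toy lookup-table "model"
# replaces a real LLM; the paper's actual prompts and scoring are richer.

TRIVIA = {"What is the smallest prime number?": "2"}

def qa(model, question):
    """QA direction: question in, answer out."""
    return model("Answer this question: " + question)

def rqa(model, answer):
    """RQA direction: answer in, question out."""
    return model("Write a trivia question whose answer is: " + answer)

def toy_model(prompt):
    # Stand-in for an LLM: answer from the table, or emit a question.
    prefix = "Answer this question: "
    if prompt.startswith(prefix):
        return TRIVIA.get(prompt[len(prefix):], "unknown")
    answer = prompt.rsplit(": ", 1)[1]
    # Invert the table when possible, else fall back to a template question.
    for q, a in TRIVIA.items():
        if a == answer:
            return q
    return f"Which value equals {answer}?"

# Score both directions over the gold pairs.
for question, answer in TRIVIA.items():
    qa_correct = qa(toy_model, question) == answer
    rqa_question = rqa(toy_model, answer)
    print(qa_correct, rqa_question)
```

In the real study, QA accuracy checks the model's output against the gold answer, while RQA outputs must be separately validated as answerable questions whose answer matches the input.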

Findings

  1. Performance Disparities:
    • Numerical Answers: LLMs exhibit significantly lower accuracy in RQA compared to QA when the answers are numerical, with performance discrepancies exceeding 0.80 for some models. This clearly highlights a substantial limitation in abductive reasoning capabilities for numerical data.
    • Textual Answers: Conversely, models perform slightly better in RQA than QA for textual answers, suggesting a domain-specific advantage in generating questions from known entities.
  2. Logical Consistency:
    • When chaining RQA and QA, models frequently answer their own invalid RQA-generated questions correctly in QA. This indicates that many RQA failures stem not from knowledge gaps alone but from flawed question formulation.
  3. Error Correlation:
    • The study finds that RQA errors correlate with question difficulty and inversely correlate with answer frequency in the Dolma corpus. In particular, RQA mistakes in the Number+Text category arise more often for rare entities, while overly complex multi-hop questions drive errors in the Numbers domain.
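The consistency analysis in Finding 2 amounts to a round trip: generate a question for a gold answer, feed it back in the QA direction, and compare. The sketch below is an illustrative reconstruction under assumed names (`consistency_check` and the toy lambda models), not the paper's evaluation code.

```python
# RQA -> QA round-trip consistency check. Toy lambdas stand in for LLM
# calls: RQA emits a template question, QA parses the answer back out.

def consistency_check(model_qa, model_rqa, gold_answer):
    generated_q = model_rqa(gold_answer)   # RQA: answer -> question
    round_trip_a = model_qa(generated_q)   # QA: question -> answer
    return {
        "question": generated_q,
        "round_trip_answer": round_trip_a,
        "self_consistent": round_trip_a.strip().lower()
                           == gold_answer.strip().lower(),
    }

rqa_model = lambda ans: f"Which number is written as {ans}?"
qa_model = lambda q: q.rsplit(" ", 1)[1].rstrip("?")

result = consistency_check(qa_model, rqa_model, "42")
print(result["self_consistent"])  # True: the round trip recovers "42"
```

The paper's observation is the asymmetric case this check exposes: the generated question can be invalid as an RQA output even when the model answers it correctly in the QA direction.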

Implications and Future Work

The implications of these findings are twofold. Theoretically, the paper challenges assumptions about LLM reasoning capabilities, highlighting a pronounced weakness in abductive reasoning over numerical answers. Practically, it recommends model tuning and training-data adjustments that curb the tendency to generate overly complex or invalid questions.

For future research, the analysis suggests creating more abductive reasoning benchmarks and improving self-verification mechanisms to augment LLM robustness in generating valid, answer-verifiable questions. This dual focus on reasoning and consistency promises to enhance the utility of LLMs in educational, brainstorming, and exam generation contexts.

In conclusion, this paper provides a critical examination of the RQA task, offering insights into the domain-specific abilities and limitations of current LLMs. By pinpointing these weaknesses, it sets the stage for future advances in the logical coherence and reasoning processes of LLMs.
