Open questions in designing AGI benchmarks

Determine whether the construction of Artificial General Intelligence (AGI) evaluation benchmarks should take human values as the starting point for test design or adopt alternative, non-human-centric perspectives, in order to guide the development of benchmarks that meaningfully assess AGI capabilities.

Background

The paper’s grand challenges section emphasizes that while many tasks can serve to evaluate LLMs, identifying which tasks genuinely measure AGI capabilities remains unresolved. The authors stress that understanding the differences between human and AGI capacities is crucial for creating AGI benchmarks, and they note a prevailing trend toward conceptualizing AGI as a superhuman agent that draws on cross-disciplinary knowledge.

Within this context, the authors explicitly pose a concrete uncertainty: whether AGI benchmark construction should start from human values or consider alternative perspectives. They then state that developing suitable AGI benchmarks entails many open questions, underscoring the need for further investigation into benchmark design choices that can reliably capture AGI capabilities.

References

For instance, does it make sense to use human values as a starting point for test construction, or should alternative perspectives be considered? Developing suitable AGI benchmarks presents many open questions demanding further exploration.

A Survey on Evaluation of Large Language Models (Chang et al., 2023, arXiv:2307.03109), Section 7 “Grand Challenges and Opportunities for Future Research,” Subsection “Designing AGI Benchmarks”