Designing AGI Benchmarks

Identify tasks that can genuinely measure Artificial General Intelligence (AGI) capabilities, and determine whether human values should serve as the starting point for constructing AGI benchmark tests or whether alternative perspectives are more appropriate, so as to guide the development of suitable AGI benchmarks.

Background

The survey argues that while many tasks can be used to evaluate LLMs, it remains an open question which of them genuinely assess AGI capabilities. The authors note the trend of conceptualizing AGI as a superhuman entity and suggest drawing on cross-disciplinary knowledge (education, psychology, social sciences) to design benchmarks.

However, they explicitly point out unresolved issues and pose a concrete question: should human values be the starting point for test construction, or are alternative perspectives more appropriate? They emphasize that developing suitable AGI benchmarks involves many open questions requiring further exploration.
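The survey does not prescribe any concrete benchmark format, but the open question can be made tangible with a small sketch. The Python code below is a hypothetical illustration only; every name, field, and example task is an assumption introduced here, not something from the survey. It shows how the choice of evaluation perspective (human values versus an alternative such as raw task capability) could be encoded as an explicit axis of a benchmark item, alongside the cross-disciplinary dimensions the authors mention.

```python
# Hypothetical sketch only: the survey proposes no schema; every name and
# field below is an assumption used to make the design question concrete.
from dataclasses import dataclass, field
from enum import Enum


class Perspective(Enum):
    """Possible starting points for test construction (the survey's open question)."""
    HUMAN_VALUES = "human_values"        # responses graded against human value judgments
    CAPABILITY_ONLY = "capability_only"  # responses graded on task success alone
    # Alternative perspectives would be added here as they are worked out.


@dataclass
class BenchmarkItem:
    """One AGI benchmark task, tagged with its source discipline and grading perspective."""
    task_id: str
    prompt: str
    discipline: str           # e.g. "education", "psychology", "social sciences"
    perspective: Perspective  # which starting point governs how responses are scored
    rubric: dict = field(default_factory=dict)  # perspective-specific scoring criteria


# Two fabricated items showing how the chosen perspective changes what a
# rubric must capture, even for tasks drawn from the same disciplines:
fairness_item = BenchmarkItem(
    task_id="edu-001",
    prompt="Allocate limited tutoring hours across a class of students.",
    discipline="education",
    perspective=Perspective.HUMAN_VALUES,
    rubric={"criteria": ["fairness", "transparency of reasoning"]},
)

planning_item = BenchmarkItem(
    task_id="psy-001",
    prompt="Plan a week-long study schedule that maximizes retention.",
    discipline="psychology",
    perspective=Perspective.CAPABILITY_ONLY,
    rubric={"criteria": ["plan feasibility", "predicted retention"]},
)
```

Making the perspective an explicit field, rather than baking one worldview into every rubric, keeps the survey's question visible in the benchmark itself: the same task could be scored under competing perspectives and the outcomes compared.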

References

As we discussed earlier, while all tasks can potentially serve as evaluation tools for LLMs, the question remains as to which can truly measure AGI capabilities. Nonetheless, there remains a plethora of unresolved issues. For instance, does it make sense to use human values as a starting point for test construction, or should alternative perspectives be considered? Developing suitable AGI benchmarks presents many open questions demanding further exploration.

A Survey on Evaluation of Large Language Models (2307.03109 - Chang et al., 2023) in Section 7.1 (Grand Challenges: Designing AGI Benchmarks)