Performance of Unsupervised Elicitation and Easy-to-Hard Generalization on Superhuman Tasks

Determine whether unsupervised elicitation techniques and easy-to-hard generalization methods for steering large language models retain the performance they exhibit on standard datasets when applied to real tasks beyond human capability.

Background

LLMs are commonly post-trained using human-provided labels or feedback, which can be unreliable for tasks that exceed human capability. To mitigate this, two families of approaches have emerged: easy-to-hard generalization (training on easier, labeled tasks with the hope of generalizing to harder ones) and unsupervised elicitation (steering models without ground-truth labels, e.g., via internal consistency or probing directions).
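To make the unsupervised-elicitation family concrete, the sketch below implements a contrast-consistency objective over synthetic "hidden states" (in the spirit of CCS-style probing). Everything here is an illustrative assumption, not the paper's method: the planted truth direction, the synthetic contrast pairs `x_pos`/`x_neg`, and the random-search optimizer (a stand-in for gradient-based probe training). The held-out labels are used only for final evaluation, never during search.

```python
# Minimal sketch of unsupervised elicitation via internal consistency.
# All data, names, and the optimizer are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations for contrast pairs: x_pos for "statement is true",
# x_neg for "statement is false", with a planted latent truth direction.
d, n = 16, 200
truth_dir = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)      # held out; never used for search
sign = 2 * labels - 1
base = rng.normal(size=(n, d))
x_pos = base + np.outer(sign, truth_dir)
x_neg = base - np.outer(sign, truth_dir)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unsupervised_loss(w):
    """Consistency + confidence: p(true) + p(false) should sum to 1,
    and neither should sit on the fence at 0.5. No labels are used."""
    p, q = sigmoid(x_pos @ w), sigmoid(x_neg @ w)
    consistency = np.mean((p + q - 1.0) ** 2)
    confidence = np.mean(np.minimum(p, q) ** 2)
    return consistency + confidence

# Crude optimizer: random search over unit probe directions.
candidates = rng.normal(size=(2000, d))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best = min(candidates, key=unsupervised_loss)

# Evaluate against the held-out labels; the probe's sign is arbitrary,
# so take the better of the two orientations.
preds = (sigmoid(x_pos @ best) > 0.5).astype(int)
acc = max(np.mean(preds == labels), np.mean(preds != labels))
print(f"unsupervised probe accuracy: {acc:.2f}")
```

On this clean synthetic data, the direction minimizing the label-free objective recovers the planted truth direction well; the section below is precisely about why such clean settings may flatter the method.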

Although these approaches have shown promising results across a variety of benchmarks, the authors caution that such evaluations may be overoptimistic because the benchmarks often lack real-world challenges such as salient non-truth features, class imbalance, and tasks that models cannot decisively answer. The explicit question raised is whether these methods will perform as well on real tasks beyond human capability as they do on curated datasets.
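The first of these challenges, a salient non-truth feature, can be illustrated with a hedged variant of the same consistency-probing sketch: if a distractor direction (e.g., sentiment or style) also flips between contrast pairs, and does so more strongly than the truth direction, a label-free objective can latch onto it. The setup, names, and objective below are again illustrative assumptions, not the paper's experiments.

```python
# Sketch of the "salient non-truth feature" failure mode: a distractor
# direction flips between contrast pairs more strongly than truth does,
# so the unsupervised objective tracks the distractor instead.
# All data and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, n = 16, 200
truth_dir = rng.normal(size=d)
distractor = 3.0 * rng.normal(size=d)    # more salient than the truth feature
labels = rng.integers(0, 2, size=n)      # ground truth (held out)
style = rng.integers(0, 2, size=n)       # truth-independent distractor label
t_sign, s_sign = 2 * labels - 1, 2 * style - 1
base = rng.normal(size=(n, d))
x_pos = base + np.outer(t_sign, truth_dir) + np.outer(s_sign, distractor)
x_neg = base - np.outer(t_sign, truth_dir) - np.outer(s_sign, distractor)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unsupervised_loss(w):
    # Same label-free consistency + confidence objective as before.
    p, q = sigmoid(x_pos @ w), sigmoid(x_neg @ w)
    return np.mean((p + q - 1.0) ** 2) + np.mean(np.minimum(p, q) ** 2)

candidates = rng.normal(size=(2000, d))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best = min(candidates, key=unsupervised_loss)

preds = (sigmoid(x_pos @ best) > 0.5).astype(int)
truth_acc = max(np.mean(preds == labels), np.mean(preds != labels))
style_acc = max(np.mean(preds == style), np.mean(preds != style))
print(f"accuracy vs truth: {truth_acc:.2f}, vs distractor: {style_acc:.2f}")
```

Here the selected probe classifies the distractor feature nearly perfectly while recovering the actual labels at roughly chance level, which is the kind of overoptimism-on-clean-benchmarks concern the paragraph above describes.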

References

Though each of these approaches has been found to perform well on a variety of datasets, it is unclear whether they will perform as well when applied to real tasks which are beyond human capabilities.

Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation  (2602.20400 - Canavan et al., 23 Feb 2026) in Section 1 (Introduction)