Existence of a globally tunable introspection faculty in instruction-tuned LLMs

Determine whether there exists a single, globally tunable introspective faculty in instruction-tuned large language models such that steering along one universal representation direction consistently improves the coupling between logit-based numeric self-reports and linear probe–defined internal emotive states across multiple concepts (e.g., wellbeing, interest, focus, and impulsivity), rather than only yielding local, pair-specific improvements.
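The hypothesized test can be sketched in a toy form. The following is a minimal synthetic simulation, not the paper's actual pipeline: `probes`, `v`, `self_report`, and `coupling` are all hypothetical stand-ins, and the "hidden states" are random vectors rather than LLM activations. It only illustrates what a positive result would look like, i.e., steering along one shared direction `v` raising the report–probe correlation for every concept at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200                       # hidden size, number of prompts (synthetic)
concepts = ["wellbeing", "interest", "focus", "impulsivity"]

# Hypothetical stand-ins: one linear probe direction per concept, plus a
# single candidate "introspection" direction v shared by all concepts.
probes = {c: rng.normal(size=d) / np.sqrt(d) for c in concepts}
v = rng.normal(size=d)
v /= np.linalg.norm(v)

def self_report(H, c):
    # Toy logit-style numeric self-report: by construction, its alignment
    # with the probe readout grows as activations project onto v.
    gain = 1.0 + H @ v
    return gain * (H @ probes[c]) + 0.1 * rng.normal(size=len(H))

def coupling(alpha):
    # Steer every hidden state by alpha along v, then measure, per concept,
    # the correlation between self-reports and probe-defined states.
    H = rng.normal(size=(n, d)) + alpha * v
    return {c: np.corrcoef(self_report(H, c), H @ probes[c])[0, 1]
            for c in concepts}

base, steered = coupling(0.0), coupling(2.0)
for c in concepts:
    print(f"{c:12s} baseline={base[c]:+.2f} steered={steered[c]:+.2f}")
```

In this toy world a global knob exists by construction, so steering improves coupling for all four concepts simultaneously; the open problem is whether any direction in a real instruction-tuned model behaves this way.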

Background

Cross-concept steering experiments showed that activation steering can improve introspective fidelity, but only in specific concept pairs (e.g., focus→wellbeing and impulsivity→interest), with no evidence for a single intervention that improves introspection across all concepts.
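The pair-specific pattern above can also be mimicked with synthetic data. This sketch is hypothetical throughout: the steering directions, noise model, and the `helps` set (seeded with the two pairs the paper reports, focus→wellbeing and impulsivity→interest) are assumptions built in by hand, purely to show the shape of a cross-concept improvement grid in which only isolated cells benefit.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 400                       # hidden size, number of prompts (synthetic)
concepts = ["wellbeing", "interest", "focus", "impulsivity"]

def unit(x):
    return x / np.linalg.norm(x)

# Hypothetical stand-ins: a probe direction and a steering direction per
# concept; by construction, only the reported pairs actually help.
probes = {c: rng.normal(size=d) / np.sqrt(d) for c in concepts}
steer = {c: unit(rng.normal(size=d)) for c in concepts}
helps = {("focus", "wellbeing"), ("impulsivity", "interest")}

def coupling(steer_c, report_c, alpha=2.0):
    # Correlation between a synthetic self-report and the probe readout,
    # with reports reliable only for the designated helpful pairs.
    H = rng.normal(size=(n, d))
    if steer_c is not None:
        H = H + alpha * steer[steer_c]
    state = H @ probes[report_c]
    noise = 0.3 if (steer_c, report_c) in helps else 2.0
    report = state + noise * rng.normal(size=n)
    return np.corrcoef(report, state)[0, 1]

# Improvement over the unsteered baseline for each (steering, reporting) pair:
for s in concepts:
    deltas = [coupling(s, r) - coupling(None, r) for r in concepts]
    print(s, [f"{x:+.2f}" for x in deltas])
```

Only the two seeded cells show a large gain; every other cell hovers near zero, which is the "local, pair-specific" signature the experiments observed.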

The authors therefore highlight that while local, pair-specific improvements are demonstrable, the existence of a single, general “introspection knob” remains unresolved and merits explicit investigation as an open problem.

The current evidence therefore points toward local, pair-specific improvement rather than a single globally tunable faculty; nevertheless, future work should treat the search for such a global faculty as an open problem.

References

Quantitative Introspection in Language Models: Tracking Internal States Across Conversation (2603.18893 - Martorell, 19 Mar 2026), Section 6.3 ("Introspection is concept-specific and only locally improvable in our experiments")