- The paper presents propositional interpretability as a key framework for understanding AI’s internal mechanisms by linking attitudes like belief and desire to structured propositions.
- It details methodologies such as causal tracing, probing classifiers, sparse auto-encoders, and chain-of-thought techniques to log and interpret AI thought processes.
- The study highlights ethical and safety implications by proposing thought logging systems to capture dynamic propositional attitudes in AI systems.
The paper "Propositional Interpretability in Artificial Intelligence" analyzes the program of mechanistic interpretability in AI systems, advocating for the importance of propositional interpretability. Propositional interpretability involves interpreting a system's mechanisms and behavior in terms of propositional attitudes (such as belief, desire, or subjective probability) toward propositions. The author argues that understanding an AI system's goals (desires) and models of the world (beliefs) is crucial for AI safety and ethics. The central challenge is identified as thought logging: creating systems that log all relevant propositional attitudes in an AI system over time.
The paper breaks down varieties of interpretability, distinguishing between explainability (for ordinary humans) and interpretability (for theorists). Interpretability is further divided into behavioral and mechanistic interpretability, with a focus on the latter. Mechanistic interpretability is subdivided into algorithmic and representational interpretability, with emphasis on representational interpretability, which includes conceptual and propositional interpretability.
Propositions are defined as structured entities composed of concepts, while propositional attitudes are attitudes toward propositions. Examples include believing that 2+2=4 or desiring that Australia win the Ashes. The author also discusses credence (subjective probability) and utility as numerical counterparts of belief and desire. Propositional attitudes are categorized as dispositional (inactive but activatable) and occurrent (active at a given time). The paper highlights the centrality of propositional attitudes in understanding human behavior and argues for their importance in understanding AI systems. It suggests a conceptual engineering project to refine categories of propositional attitudes and potentially move toward generalized propositional attitudes that go beyond traditional categories.
The paper draws parallels between interpretability projects in AI and human interpretation, referencing the philosophical program of "radical interpretation" by Donald Davidson and David Lewis. It introduces the concept of computational interpretation, which involves solving for an AI system's propositional attitudes given computational facts about the system. A concrete challenge is proposed: constructing a thought logging system that logs an AI system's propositional attitudes over time. The paper suggests a format for entries in a thought log, inspired by Lewis, and discusses extensions such as reason logging and mechanism logging.
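The Lewis-inspired entry format for a thought log can be sketched as a simple data structure. This is a hypothetical illustration, not the paper's own specification; the field names (`time_step`, `attitude`, `proposition`, `strength`) are assumptions chosen to reflect the attitude/content/time structure the paper describes, with `strength` standing in for numerical attitudes such as credence or utility.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThoughtLogEntry:
    """One entry in a thought log: an attitude toward a proposition at a time.

    Field names are illustrative, not the paper's own format.
    """
    time_step: int                    # when the attitude was active
    attitude: str                     # e.g. "belief", "desire", "credence"
    proposition: str                  # the propositional content
    strength: Optional[float] = None  # e.g. a credence (subjective probability)

# A toy log: a credence and a desire recorded at successive time steps.
log = [
    ThoughtLogEntry(0, "credence", "it will rain tomorrow", 0.7),
    ThoughtLogEntry(1, "desire", "the user's question is answered"),
]

for entry in log:
    print(entry)
```

Extensions such as reason logging or mechanism logging would attach further fields (e.g. the entries or mechanisms that produced a given attitude) to each record.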
The possibility of thought logging is linked to psychosemantic theories, which aim to provide physical conditions for having propositional attitudes. The paper distinguishes between the semantic and metasemantic branches of psychosemantics, with the latter offering theories of the conditions under which subjects have a given propositional attitude. It discusses different psychosemantic theories, including informational, causal, teleological, and inferential theories, emphasizing the roles of information and use in determining mental content.
The author assesses current methods for propositional interpretability, including causal tracing, probing with classifiers, sparse auto-encoders, and chain-of-thought methods.
- Causal tracing: This method involves localizing "facts" or "knowledge" in a neural network by corrupting input activations and restoring "clean" activations to determine which layers matter causally. It relies on use rather than information as a criterion for what is represented. Limitations include questions about robustness, about open-endedness, and about applicability to attitudes beyond belief.
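The corrupt-then-restore logic of causal tracing can be sketched on a toy network. This is a minimal illustration, not the published causal-tracing procedure: the three-layer tanh "network," the Gaussian corruption, and the recovery metric are all simplifying assumptions. The idea is the same, though: patch each layer's clean activation into the corrupted run and see how much of the clean output it recovers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-layer network: each layer applies a weight matrix and tanh.
weights = [rng.normal(size=(8, 8)) * 0.5 for _ in range(3)]

def run(x, patch_layer=None, patch_value=None):
    """Forward pass; optionally overwrite one layer's activation."""
    acts = []
    h = x
    for i, W in enumerate(weights):
        h = np.tanh(W @ h)
        if i == patch_layer:
            h = patch_value  # restore a "clean" activation at this layer
        acts.append(h)
    return h, acts

clean_x = rng.normal(size=8)
corrupt_x = clean_x + rng.normal(size=8)  # corrupted input

clean_out, clean_acts = run(clean_x)
corrupt_out, _ = run(corrupt_x)

# Restore each layer's clean activation in the corrupted run; layers
# whose restoration recovers the clean output are causally important.
for i in range(3):
    patched_out, _ = run(corrupt_x, patch_layer=i, patch_value=clean_acts[i])
    recovery = 1 - np.linalg.norm(patched_out - clean_out) / np.linalg.norm(corrupt_out - clean_out)
    print(f"layer {i}: recovery {recovery:.2f}")
```

In a real language model the "clean" and "corrupted" runs would differ in a specific fact-bearing token, and the patched activations would come from the model's own hidden states.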
- Probing with classifiers: This method involves training classifiers to decode activity patterns in neural networks to determine whether a given set of units represents a feature. It can be used to decode propositional content but is limited by its supervised nature and reliance on ground truth. Recent work addresses these limitations with compositional probing methods that exploit the compositional structure of propositions.
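A probing classifier can be sketched in a few lines: train a classifier to read a feature off hidden activations, and take high held-out accuracy as evidence the feature is linearly decodable. This is a toy illustration with synthetic "activations" (a planted linear direction plus noise) rather than real model states; the supervised labels are exactly the "ground truth" dependence the passage above flags as a limitation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic "hidden activations": 200 samples of 16 units, with a binary
# feature planted along a fixed direction (plus Gaussian noise).
direction = rng.normal(size=16)
labels = rng.integers(0, 2, size=200)
acts = rng.normal(size=(200, 16)) + np.outer(labels * 2 - 1, direction)

# The probe is just a linear classifier trained to decode the feature.
probe = LogisticRegression().fit(acts[:150], labels[:150])
accuracy = probe.score(acts[150:], labels[150:])
print(f"probe accuracy: {accuracy:.2f}")
```

Compositional probing methods go further by probing for the constituents of a proposition (e.g. subject, predicate) and their binding, rather than one monolithic label per proposition.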
- Sparse auto-encoders: These are used to generate features that may be active or represented in LLMs. A sparse auto-encoder is a two-layer neural network that encodes activation vectors as sparse vectors, with the hypothesis that many units correspond to interpretable features or concepts. This method supports feature logging and concept logging, potentially leading to thought logging.
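The two-layer structure described above can be sketched as follows. This is a schematic forward pass, not a trained model: the weights are random, the dimensions are arbitrary, and in practice sparsity emerges only from training with the L1 penalty shown in the loss.

```python
import numpy as np

rng = np.random.default_rng(2)

d_model, d_hidden = 16, 64   # hidden layer is overcomplete (more features than units)

W_enc = rng.normal(size=(d_hidden, d_model)) * 0.1
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_model, d_hidden)) * 0.1

def encode(x):
    # ReLU keeps feature activations nonnegative; training with an L1
    # penalty drives most of them to zero, yielding sparse codes.
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    return W_dec @ f

def sae_loss(x, l1_coeff=0.01):
    f = encode(x)
    x_hat = decode(f)
    reconstruction = np.sum((x - x_hat) ** 2)  # reconstruct the activation
    sparsity = l1_coeff * np.sum(np.abs(f))    # penalize dense codes
    return reconstruction + sparsity, f

x = rng.normal(size=d_model)          # stand-in for an LLM activation vector
loss, features = sae_loss(x)
active = np.flatnonzero(features)     # indices of "firing" features
print(f"{active.size} of {d_hidden} features active")
```

After training, each hidden unit is hypothesized to correspond to an interpretable feature or concept, so logging which units fire on a given input amounts to feature logging, a step toward concept and thought logging.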
- Chain-of-thought methods: These methods involve training LLMs to "think out loud" by asserting intermediate conclusions. While chain-of-thought outputs are pre-interpreted and propositional, they are often unfaithful and incomplete reflections of internal processes.
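Because chain-of-thought output is already propositional text, converting it into candidate thought-log entries is largely a parsing task. The sketch below is a toy illustration: the transcript is invented (a real one would come from a model), and the `Step N:` format and `cot_to_log` helper are assumptions for the example.

```python
# An invented chain-of-thought transcript; real ones are model outputs
# and need not follow any fixed "Step N:" format.
transcript = """Step 1: 17 is odd.
Step 2: odd + odd = even.
Step 3: therefore 17 + 17 is even."""

def cot_to_log(text):
    """Turn each asserted intermediate conclusion into a log entry."""
    entries = []
    for line in text.splitlines():
        step, _, claim = line.partition(": ")
        entries.append({"step": step, "asserted": claim})
    return entries

for entry in cot_to_log(transcript):
    print(entry)
```

The catch, as noted above, is faithfulness: such a log records what the model *asserts*, which may diverge from the internal processes that actually produced its answer.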
The paper addresses objections and challenges to propositional interpretability, including the argument that AI systems cannot have propositional attitudes. It suggests adopting a project of nonmentalistic interpretability to evade these objections and discusses the potential for mentalistic interpretability to determine whether AI systems have genuine mental states. The author also addresses concerns about the explanatory framework of propositional attitudes, the differences between AI and human psychology, and the difficulties posed by externalism. The paper concludes by considering ethical implications of thought logging and emphasizing that propositional interpretability is a useful tool in AI safety.