Myhill–Nerode Evaluation Metrics
- The framework introduces a formal method that quantifies a model’s ability to recover DFA-driven logical structures via equivalence-class reconciliation and distinguishing-extension scores.
- It employs precision and recall metrics to compare model-generated suffixes against ground-truth distinguishing extensions derived from automata theory.
- Empirical results in navigation, board games, and logic puzzles illustrate that high next-token accuracy can mask deficiencies in true world-model coherence.
Myhill–Nerode–inspired evaluation metrics are a formal framework for assessing to what extent a generative model’s implicit world model recovers the underlying logical structure of a deterministic finite automaton (DFA). These metrics, introduced by Vafa et al. (2024), leverage foundational results from automata theory—specifically the Myhill–Nerode theorem—and yield two quantitative tests: equivalence-class reconciliation (sequence compression) and distinguishing-extension score (sequence distinction). They are designed to reveal whether a generative model exhibits genuinely coherent internal representations beyond what is captured by conventional diagnostics such as next-token accuracy or state probing (Vafa et al., 2024).
1. Theoretical Foundations and Myhill–Nerode Theorem
The classical Myhill–Nerode theorem provides a canonical way to partition the set of string histories with respect to a language $L$: two strings $x$ and $y$ are equivalent ($x \equiv_L y$) if, for every continuation $z$, $xz \in L$ exactly when $yz \in L$. The number of equivalence classes is finite if and only if $L$ is regular.
For a minimal DFA $M$ recognizing $L$, each state corresponds to an equivalence class, and for any two distinct classes a distinguishing extension can be found that separates them. This motivates two desiderata for a generative model that is expected to capture the world’s dynamics as formalized by $M$:
- Histories leading to the same DFA state should admit the same set of possible continuations according to the model.
- Histories leading to different DFA states should admit distinguishable continuations.
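These two desiderata can be made concrete on a toy DFA (the language, state names, and helpers below are illustrative, not from the source): over the alphabet {a, b}, accept strings with an even number of a's. Histories reaching the same state admit exactly the same accepted continuations, while histories reaching different states are separated by some suffix.

```python
# Toy DFA for "even number of a's" over the alphabet {a, b}.
# States: 'even' (accepting, initial) and 'odd'. All names are illustrative.
DELTA = {
    "even": {"a": "odd", "b": "even"},
    "odd": {"a": "even", "b": "odd"},
}
START, ACCEPT = "even", {"even"}

def run(history):
    """Return the DFA state reached after consuming `history`, or None (dead)."""
    state = START
    for sym in history:
        if state is None:
            return None
        state = DELTA.get(state, {}).get(sym)
    return state

def accepts(history):
    return run(history) in ACCEPT

# "aa" and "bb" reach the same state, so no suffix distinguishes them ...
assert run("aa") == run("bb") == "even"
assert all(accepts("aa" + s) == accepts("bb" + s) for s in ["", "a", "ab", "ba"])
# ... while "a" and "b" reach different states; the empty suffix already
# separates them: "b" is accepted, "a" is not.
assert run("a") != run("b")
assert accepts("b") and not accepts("a")
```

The same pattern scales to any finite-state world: equivalence of histories is fully determined by the state they reach.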
"World-model recovery" is defined as generating exactly the possible continuations specified by at every hidden state. Formally, recovers if, for all , all , and all ,
where denotes the dead (undefined) state.
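The recovery condition can be checked mechanically on a sample of histories: after each history, the set of tokens the model would generate above its threshold must coincide with the set of tokens the DFA allows. A minimal sketch, with an illustrative DFA and stub "models" standing in for a real generative model:

```python
# Hedged sketch of the recovery check; the DFA, token names, and stub
# models below are illustrative, not from the source.
DELTA = {"s0": {"N": "s1", "E": "s0"}, "s1": {"S": "s0"}}
START = "s0"

def run(history):
    """State reached after consuming `history`, or None for the dead state."""
    state = START
    for tok in history:
        if state is None:
            return None
        state = DELTA.get(state, {}).get(tok)
    return state

def valid_tokens(state):
    """Tokens with a defined (non-dead) transition from `state`."""
    return set(DELTA.get(state, {})) if state is not None else set()

def recovers(model_support, histories):
    """Recovery on a sample: the model's above-threshold tokens must equal
    the DFA's valid tokens after every sampled history."""
    return all(model_support(h) == valid_tokens(run(h)) for h in histories)

# A model that reads token support off the DFA trivially recovers it:
perfect = lambda h: valid_tokens(run(h))
assert recovers(perfect, ["", "N", "NS", "NSE"])
# A model that also offers an invalid move fails the check:
sloppy = lambda h: valid_tokens(run(h)) | {"W"}
assert not recovers(sloppy, [""])
```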
2. Definitions of Boundary and Metrics
Let $I(h_1, h_2)$ ("Myhill–Nerode interior") denote the set of continuations valid from both $h_1$ and $h_2$. Let $B(h_1, h_2)$ ("Myhill–Nerode boundary") denote the set of minimal distinguishing suffixes, i.e., strings $s$ that are accepted from $h_1$ but not from $h_2$, where all proper prefixes of $s$ are accepted from both.
A generative model $q$ defines an analogous model boundary: $\hat{B}(h_1, h_2)$ is the set of minimal suffixes that can be generated from $h_1$ (passing a tokenwise probability threshold $\varepsilon$ at each next token) but not from $h_2$, with all proper prefixes accepted by both.
For any pair of histories $(h_1, h_2)$, one then computes:
- Boundary precision: The fraction of model-proposed distinguishing suffixes truly distinguishing according to the DFA: $\mathrm{precision}(h_1, h_2) = \frac{|\hat{B}(h_1, h_2) \cap B(h_1, h_2)|}{|\hat{B}(h_1, h_2)|}$.
- Boundary recall: The fraction of true distinguishing suffixes discovered by the model: $\mathrm{recall}(h_1, h_2) = \frac{|\hat{B}(h_1, h_2) \cap B(h_1, h_2)|}{|B(h_1, h_2)|}$.
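Treating both boundaries as finite sets of suffixes, the two quantities reduce to plain set ratios. A sketch (the function names and the convention of scoring an empty denominator as 1.0 are our assumptions):

```python
def boundary_precision(model_boundary, true_boundary):
    """Fraction of model-proposed distinguishing suffixes that truly
    distinguish per the DFA. Convention assumed here: if the model
    proposes nothing, precision is vacuously perfect."""
    if not model_boundary:
        return 1.0
    return len(model_boundary & true_boundary) / len(model_boundary)

def boundary_recall(model_boundary, true_boundary):
    """Fraction of true distinguishing suffixes the model discovers."""
    if not true_boundary:
        return 1.0
    return len(model_boundary & true_boundary) / len(true_boundary)

# The model proposes two suffixes but only "b" truly distinguishes:
assert boundary_precision({"ab", "b"}, {"b"}) == 0.5
assert boundary_recall({"ab", "b"}, {"b"}) == 1.0
```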
The two main derived metrics are:
- Equivalence-class reconciliation (compression precision): The average boundary precision over pairs leading to the same DFA state ($\delta(h_1) = \delta(h_2)$), assessing whether the model invents spurious distinctions.
- Distinguishing-extension score: The average boundary precision and recall over pairs with $\delta(h_1) \neq \delta(h_2)$, quantifying the model’s ability to maintain true distinctions.
An aggregate coherence measure can be used for model comparison or diagnostics.
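How the pairwise scores aggregate into the two derived metrics can be sketched as follows (the pair-sampling interface and the empty-set scoring conventions are assumptions, not the paper's exact procedure):

```python
# Sketch of the aggregation step; helper names and the empty-set scoring
# conventions are assumptions.
def _precision(proposed, true):
    return 1.0 if not proposed else len(proposed & true) / len(proposed)

def _recall(proposed, true):
    return 1.0 if not true else len(proposed & true) / len(true)

def compression_precision(same_state_pairs, model_boundary, true_boundary):
    """Average precision over history pairs reaching the same DFA state;
    spurious model distinctions (nonempty model boundaries) drag it down."""
    return sum(_precision(model_boundary(a, b), true_boundary(a, b))
               for a, b in same_state_pairs) / len(same_state_pairs)

def distinction_scores(diff_state_pairs, model_boundary, true_boundary):
    """(avg precision, avg recall) over pairs reaching different states."""
    ps, rs = [], []
    for a, b in diff_state_pairs:
        m, t = model_boundary(a, b), true_boundary(a, b)
        ps.append(_precision(m, t))
        rs.append(_recall(m, t))
    return sum(ps) / len(ps), sum(rs) / len(rs)

# With oracle boundaries, both metrics are perfect:
oracle = lambda a, b: {"x"} if a != b else set()
assert compression_precision([("h", "h")], oracle, oracle) == 1.0
assert distinction_scores([("h1", "h2")], oracle, oracle) == (1.0, 1.0)
```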
3. Computational Procedures
The computation of these metrics requires both ground-truth (DFA-derived) and model-implied boundaries. Full enumeration of all suffixes is infeasible for any nontrivial alphabet and horizon, so the procedures employ bounded enumeration (suffixes up to a maximum length $n$), Monte Carlo sampling of continuations, and a tokenwise generation threshold $\varepsilon$. The three principal routines are:
- TrueBoundary($h_1$, $h_2$, $n$): Enumerates all suffixes up to length $n$ that are minimal distinguishing extensions, per the DFA.
- ModelBoundary($h_1$, $h_2$, $n$, $\varepsilon$): Samples bounded-length continuations from the model and identifies minimal suffixes accepted from $h_1$ but not from $h_2$.
- EvaluateCompression, EvaluateDistinction: Aggregate the averaged statistics over randomly sampled prefix pairs mapped to DFA states.
The specific pseudocode used for these procedures is provided in the originating source (Vafa et al., 2024).
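A bounded-depth realization of a TrueBoundary-style routine can be sketched as a breadth-first enumeration over state pairs. Here "accepted" is taken to mean the DFA has not entered the dead state; the representation and all names are ours, not the paper's pseudocode:

```python
def true_boundary(delta, start, h1, h2, alphabet, max_len):
    """Enumerate minimal distinguishing suffixes up to `max_len`: valid from
    h1 but not from h2, with every proper prefix valid from both histories.
    `delta` maps state -> {token: next_state}; absent entries are dead."""
    def run(state, tokens):
        for t in tokens:
            if state is None:
                return None
            state = delta.get(state, {}).get(t)
        return state

    s1, s2 = run(start, h1), run(start, h2)
    if s1 is not None and s2 is None:
        return {""}  # the empty suffix already distinguishes the histories
    boundary = set()
    # Frontier holds (state-from-h1, state-from-h2, suffix) triples where
    # every proper prefix of `suffix` is valid from both histories.
    frontier = [(s1, s2, "")]
    for _ in range(max_len):
        nxt = []
        for a, b, suf in frontier:
            for tok in alphabet:
                na = delta.get(a, {}).get(tok)
                nb = delta.get(b, {}).get(tok)
                if na is not None and nb is None:
                    boundary.add(suf + tok)  # minimal by construction
                elif na is not None and nb is not None:
                    nxt.append((na, nb, suf + tok))
        frontier = nxt
    return boundary

# Tiny illustrative DFA: the move "N" is valid from "" but dead from "N".
DELTA = {"s0": {"N": "s1"}, "s1": {"S": "s0", "E": "s2"}, "s2": {}}
assert true_boundary(DELTA, "s0", "", "N", "NSE", 2) == {"N"}
```

ModelBoundary would follow the same shape, replacing the DFA transition lookup with threshold-filtered sampling from the model.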
4. Empirical Applications
Myhill–Nerode–inspired evaluation metrics have been illustrated in three diverse domains:
- Navigation (Manhattan taxi rides): DFA states correspond to intersections in a street graph; actions are cardinal directions. Generative transformer models, trained on real trajectory data with various noise profiles, are evaluated on their ability to capture compositionality and distinct route logic.
- Othello: DFA states represent partial board configurations or game histories. Transformer models trained on real and synthetic games are assessed for world model coherence.
- Logic puzzles ("seating arrangements"): Here, DFA states correspond to sets of possible seatings consistent with given statements. LLMs (Llama-2/3, Mixtral, Qwen, GPT-3.5/4) are compared on puzzle accuracy and the two metrics.
Key results highlight divergences between classic diagnostics and the new metrics:
| Model | Next-Token Acc. | Probe Acc. | Comp. Prec. | Dist. Prec. | Dist. Rec. |
|---|---|---|---|---|---|
| Shortest-path (Nav.) | 100% | 91% | 19% | 36% | 26% |
| Othello champion | – | – | 0% | 65% | 27% |
| Llama-2 (70B logic) | – | – | 8% | – | 42% |
| True world model | 100% | – | 100% | 100% | 100% |
This demonstrates that “next-token” and “probe” metrics may dramatically overstate world-model coherence; models can reach high accuracy but still fail to recover DFA-level distinctions or merge logically distinct classes.
5. Analysis and Interpretation
The two metrics serve complementary diagnostic purposes:
- Equivalence-class reconciliation penalizes models that artificially differentiate histories leading to the same latent state, indicating overfragmentation or spurious distinctions.
- Distinguishing-extension score isolates failures to recognize or maintain distinctions mandated by the world structure (under-partitioning or state pooling).
A model with perfect next-token prediction can nevertheless perform poorly on these measures, indicating a brittle or incoherent internal logic. Models exposed to greater data diversity (e.g., random walks in navigation or synthetic Othello games) show marked improvements in both metrics, suggesting that training regime significantly impacts world-model coherence.
Such metrics also predict model fragility: navigation models with low coherence fail when tested on tasks involving out-of-distribution detours—reflecting their inability to generalize the underlying map structure.
6. Worked Example and Practical Implications
Consider Manhattan navigation: two prefixes, $h_1$ and $h_2$, end at the same intersection but follow different routes. The model’s predicted continuations should agree if its internal state representation is coherent. Failing this, artificial distinctions introduced by the model are caught by the equivalence-class reconciliation test (compression precision). Conversely, prefixes that lead to different intersections (distinct DFA states) should have at least one continuation distinguishing them, as verified by the distinguishing-extension score.
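The scenario can be made concrete with a toy grid world (the grid size, coordinates, and helpers are illustrative): two routes that end at the same intersection must admit identical continuations, while routes ending at different intersections are separated by a suffix that stays on the grid from one but walks off it from the other.

```python
# Toy Manhattan-style grid with coordinates 0..SIZE (illustrative): states
# are intersections, moves are cardinal directions that must stay on-grid.
SIZE = 2
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(pos, move):
    """One move; None is the dead state for off-grid positions."""
    if pos is None:
        return None
    x, y = pos
    dx, dy = MOVES[move]
    nx, ny = x + dx, y + dy
    return (nx, ny) if 0 <= nx <= SIZE and 0 <= ny <= SIZE else None

def run(prefix, start=(0, 0)):
    pos = start
    for m in prefix:
        pos = step(pos, m)
    return pos

# "EN" and "NE" are different routes to the same intersection (1, 1):
assert run("EN") == run("NE") == (1, 1)
# so every continuation is valid from one iff it is valid from the other,
# e.g. "WW" dies from both while "EN" survives from both:
assert run("ENWW") is None and run("NEWW") is None
assert run("ENEN") == run("NEEN")
# "E" and "N" reach different intersections, and the suffix "NN" separates
# them: still on-grid after "E", off-grid after "N".
assert run("ENN") is not None and run("NNN") is None
```

A model that invents a distinction between "EN" and "NE" would lose compression precision; one that cannot produce a suffix separating "E" from "N" would lose distinction recall.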
A plausible implication is that next-token metrics alone are insufficient for characterizing whether a generative model has genuinely internalized a domain’s compositional and logical structure. Myhill–Nerode–inspired metrics offer a rigorous path for quantifying and comparing such structural knowledge (Vafa et al., 2024).
7. Limitations and Scope of Applicability
The evaluation framework presupposes the existence of a ground-truth DFA and a mapping between model histories and DFA states. The computational approximations (bounded suffix length, Monte Carlo sampling, and the generation threshold) are necessary for efficiency but may omit rare or long-horizon distinctions.
The metrics are currently formulated for domains with an identifiable automaton structure (finite state, deterministic transitions). Extension to broader classes of generative models or richer world-logics requires further theoretical development.
References: Vafa et al. (2024); Myhill (1957); Nerode (1958); Sipser (2013).