Myhill–Nerode Evaluation Metrics

Updated 9 February 2026
  • The framework introduces a formal method that quantifies a model’s ability to recover DFA-driven logical structures via equivalence-class reconciliation and distinguishing-extension scores.
  • It employs precision and recall metrics to compare model-generated suffixes against ground-truth distinguishing extensions derived from automata theory.
  • Empirical results in navigation, board games, and logic puzzles illustrate that high next-token accuracy can mask deficiencies in true world-model coherence.

Myhill–Nerode–inspired evaluation metrics are a formal framework for assessing to what extent a generative model’s implicit world model recovers the underlying logical structure of a deterministic finite automaton (DFA). These metrics, introduced by Vafa et al. (2024), leverage foundational results from automata theory—specifically the Myhill–Nerode theorem—and yield two quantitative tests: equivalence-class reconciliation (sequence compression) and distinguishing-extension score (sequence distinction). They are designed to reveal whether a generative model exhibits genuinely coherent internal representations beyond what is captured by conventional diagnostics such as next-token accuracy or state probing (Vafa et al., 2024).

1. Theoretical Foundations and Myhill–Nerode Theorem

The classical Myhill–Nerode theorem provides a canonical way to partition the set of string histories $\Sigma^*$ with respect to a regular language $L \subseteq \Sigma^*$: two strings $s_1, s_2$ are equivalent ($s_1 \equiv_L s_2$) if, for every continuation $x \in \Sigma^*$, $s_1x \in L$ exactly when $s_2x \in L$. The number of equivalence classes is finite if and only if $L$ is regular.

For a minimal DFA $W = (Q, \Sigma, \delta, q_0, F)$ recognizing $L$, each state $q \in Q$ corresponds to an equivalence class, and specific distinguishing extensions $x \in \Sigma^*$ can be found to separate any two distinct classes $q_1, q_2$. This motivates two desiderata for a generative model $m$ that is expected to capture the world's dynamics as formalized by $W$:

  • Histories leading to the same DFA state should admit indistinguishable possible continuations according to mm.
  • Histories leading to different DFA states should admit distinguishable continuations.

"World-model recovery" is defined as $m$ generating exactly the possible continuations specified by $W$ at every hidden state. Formally, $m$ recovers $W$ if, for all $q \in Q$, all $s \in S(q)$, and all $a \in \Sigma$,

$$m(a \mid s) > 0 \iff \delta(q, a) \neq \bot,$$

where $\bot$ denotes the dead (undefined) state and $S(q)$ denotes the set of histories that map to state $q$.
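A minimal sketch of this recovery criterion, assuming a toy two-state DFA and a hand-built model distribution (the DFA, the model, and all names here are illustrative, not taken from the paper):

```python
# Sketch: checking m(a|s) > 0  <=>  delta(q, a) != DEAD on a toy DFA.

DEAD = None  # the dead/undefined state, written ⊥ in the text

# Toy DFA over alphabet {a, b}: state 0 allows both symbols; state 1 allows only 'b'.
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): DEAD, (1, "b"): 0,
}

def run(history, q0=0):
    """Follow the DFA along a history string; return DEAD if any step is undefined."""
    q = q0
    for a in history:
        q = delta.get((q, a), DEAD)
        if q is DEAD:
            return DEAD
    return q

def recovers(m, histories, alphabet="ab"):
    """True iff m(a|s) > 0 exactly when delta(state(s), a) is defined, for all s, a."""
    for s in histories:
        q = run(s)
        if q is DEAD:
            continue
        for a in alphabet:
            if (m(a, s) > 0) != (delta[(q, a)] is not DEAD):
                return False
    return True

# A model that reads the DFA off exactly: uniform over the legal next tokens.
def true_model(a, s):
    q = run(s)
    legal = [x for x in "ab" if delta[(q, x)] is not DEAD]
    return 1 / len(legal) if a in legal else 0.0

print(recovers(true_model, ["", "a", "ab", "b"]))  # True
```

A model assigning positive probability to 'a' after history "a" (a transition $W$ forbids) would fail this check, even if its probabilities elsewhere were perfectly calibrated.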

2. Definitions of Boundary and Metrics

Let $MNI^W(q_1, q_2)$ ("Myhill–Nerode interior") denote the set of continuations valid from both $q_1$ and $q_2$. Let $MNB^W(q_1, q_2)$ ("Myhill–Nerode boundary") denote the minimal distinguishing suffixes, i.e., strings $x$ that are accepted from $q_1$ but not $q_2$, all of whose proper prefixes are accepted from both.

A generative model defines an analogous model boundary: $MNB^m(s_1, s_2)$ is the set of minimal suffixes $x$ that can be generated from $s_1$ (passing a tokenwise threshold $\varepsilon$ at each next token) but not from $s_2$, with all proper prefixes generable from both.

For any pair $(q_1, q_2)$, one then computes:

  • Boundary precision: The fraction of model-proposed distinguishing suffixes that truly distinguish according to $W$:

$$\text{Precision}(q_1,q_2) = \frac{|MNB^m(s_1,s_2) \cap (L^W(q_1)\setminus L^W(q_2))|}{|MNB^m(s_1,s_2)|}$$

  • Boundary recall: The fraction of true distinguishing suffixes discovered by the model:

$$\text{Recall}(q_1,q_2) = \frac{|MNB^W(q_1,q_2) \cap (L^m(s_1)\setminus L^m(s_2))|}{|MNB^W(q_1,q_2)|}$$

Here $L^W(q)$ denotes the set of suffixes accepted from state $q$ under $W$, and $L^m(s)$ the set of continuations the model can generate from history $s$.

The two main derived metrics are:

  • Equivalence-class reconciliation (compression precision): The average precision for pairs $s_1, s_2$ leading to the same DFA state ($q_1 = q_2$), assessing whether the model invents spurious distinctions.
  • Distinguishing-extension score: The precision and recall for pairs $s_1 \in S(q_1)$, $s_2 \in S(q_2)$ with $q_1 \neq q_2$, quantifying the model's ability to maintain true distinctions.

An aggregate coherence measure $C(m,W) = \tfrac{1}{2}(\text{CompPrec} + \text{DistRecall})$ can be used for model comparison or diagnostics.
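As a sketch (with illustrative toy suffix sets, not data from the paper), the boundary precision, boundary recall, and aggregate coherence above reduce to simple set operations:

```python
# Sketch: boundary precision/recall and aggregate coherence as set operations.

def boundary_precision(model_boundary, true_distinguishers):
    """|MNB^m ∩ (L^W(q1) \\ L^W(q2))| / |MNB^m|"""
    if not model_boundary:
        return 0.0
    return len(model_boundary & true_distinguishers) / len(model_boundary)

def boundary_recall(true_boundary, model_distinguishers):
    """|MNB^W ∩ (L^m(s1) \\ L^m(s2))| / |MNB^W|"""
    if not true_boundary:
        return 0.0
    return len(true_boundary & model_distinguishers) / len(true_boundary)

# Hypothetical suffix sets for one state pair (q1, q2):
model_boundary = {"a", "ab", "ba"}       # suffixes the model proposes as distinguishing
true_distinguishers = {"a", "ba", "bb"}  # suffixes accepted from q1 but not q2 per the DFA
prec = boundary_precision(model_boundary, true_distinguishers)  # 2/3

true_boundary = {"a", "bb"}              # minimal distinguishing suffixes per the DFA
model_distinguishers = {"a", "ab"}       # suffixes the model accepts from s1 but not s2
rec = boundary_recall(true_boundary, model_distinguishers)      # 1/2

# Aggregate coherence C(m, W) = (CompPrec + DistRecall) / 2,
# using these single-pair scores as stand-ins for the averaged statistics.
coherence = 0.5 * (prec + rec)
print(round(prec, 3), round(rec, 3), round(coherence, 3))
```

In the full metrics, these per-pair scores are averaged over many sampled history pairs per state pair, as described in the computational procedures below.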

3. Computational Procedures

The computation of these metrics requires both ground-truth (DFA-derived) and model-implied boundaries. Full enumeration of all suffixes is infeasible for any nontrivial $\Sigma$, so the procedures employ bounded enumeration (maximum suffix length $k \approx 5$) and Monte Carlo sampling (typically $M \approx 30$) with a generation threshold $\varepsilon \approx 0.01$. The three principal routines are:

  • TrueBoundary($q_1$, $q_2$, $k$): Enumerates all suffixes up to length $k$ that are minimal distinguishing extensions, per the DFA.
  • ModelBoundary($s_1$, $s_2$, $M$, $k$): Samples $M$ bounded-length continuations from $m$ conditioned on $s_1$ and identifies minimal suffixes $y$ accepted from $s_1$ but not from $s_2$.
  • EvaluateCompression, EvaluateDistinction: Aggregate averaged statistics over randomly sampled prefixes mapping to DFA states.

The specific pseudocode used for these procedures is provided in the originating source (Vafa et al., 2024).
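Since that pseudocode is not reproduced here, the following is a hypothetical Python sketch of a TrueBoundary-style routine, assuming "accepted" means the suffix never drives the DFA into the dead state (matching the generation semantics above); the toy DFA and all names are illustrative:

```python
# Sketch of bounded enumeration for a TrueBoundary-style routine.
from itertools import product

DEAD = None

def step(delta, q, x):
    """Run the DFA from state q over suffix x; DEAD if a transition is undefined."""
    for a in x:
        if q is DEAD:
            return DEAD
        q = delta.get((q, a), DEAD)
    return q

def accepted(delta, q, x):
    # A suffix is "accepted" if it stays inside the automaton (never dead).
    return step(delta, q, x) is not DEAD

def true_boundary(delta, alphabet, q1, q2, k):
    """Minimal suffixes x with |x| <= k accepted from q1 but not q2,
    where every proper prefix of x is accepted from both states."""
    out = set()
    for n in range(1, k + 1):
        for tup in product(alphabet, repeat=n):
            x = "".join(tup)
            if not (accepted(delta, q1, x) and not accepted(delta, q2, x)):
                continue
            # Minimality: all proper prefixes must be accepted from both states.
            if all(accepted(delta, q1, x[:i]) and accepted(delta, q2, x[:i])
                   for i in range(n)):
                out.add(x)
    return out

# Toy DFA: from state 0 both 'a' and 'b' are legal; from state 1 only 'b'.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "b"): 0}
print(sorted(true_boundary(delta, "ab", 0, 1, 2)))  # ['a']
```

Note that "ab" is excluded despite separating the states, because its proper prefix "a" already fails from state 1; the minimality condition keeps only the shortest witnesses. A ModelBoundary analogue would replace the exhaustive `product` loop with $M$ sampled continuations thresholded at $\varepsilon$.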

4. Empirical Applications

Myhill–Nerode–inspired evaluation metrics have been illustrated in three diverse domains:

  • Navigation (Manhattan taxi rides): DFA states correspond to intersections in a street graph; actions are cardinal directions. Generative transformer models, trained on real trajectory data with various noise profiles, are evaluated on their ability to capture compositionality and distinct route logic.
  • Othello: DFA states represent partial board configurations or game histories. Transformer models trained on real and synthetic games are assessed for world model coherence.
  • Logic puzzles ("seating arrangements"): Here, DFA states correspond to sets of possible seatings consistent with given statements. LLMs (Llama-2/3, Mixtral, Qwen, GPT-3.5/4) are compared on puzzle accuracy and the two metrics.

Key results highlight divergences between classic diagnostics and the new metrics:

| Model | Next-Token | Probe | Comp. Prec. | Dist. Prec. | Dist. Rec. |
|---|---|---|---|---|---|
| Shortest-path (Nav.) | 100% | 91% | 19% | 36% | 26% |
| Othello champion | 0% | 65% | 27% | — | — |
| Llama-2 (70B, logic) | 8% | 42% | — | — | — |
| True world model | 100% | 100% | 100% | 100% | — |

This demonstrates that “next-token” and “probe” metrics may dramatically overstate world-model coherence; models can reach high accuracy but still fail to recover DFA-level distinctions or merge logically distinct classes.

5. Analysis and Interpretation

The two metrics serve complementary diagnostic purposes:

  • Equivalence-class reconciliation penalizes models that artificially differentiate histories leading to the same latent state, indicating overfragmentation or spurious distinctions.
  • Distinguishing-extension score isolates failures to recognize or maintain distinctions mandated by the world structure (under-partitioning or state pooling).

A model with perfect next-token prediction can nevertheless perform poorly on these measures, indicating a brittle or incoherent internal logic. Models exposed to greater data diversity (e.g., random walks in navigation or synthetic Othello games) show marked improvements in both metrics, suggesting that training regime significantly impacts world-model coherence.

Such metrics also predict model fragility: navigation models with low coherence fail when tested on tasks involving out-of-distribution detours—reflecting their inability to generalize the underlying map structure.

6. Worked Example and Practical Implications

Consider Manhattan navigation: two prefixes, s1s_1 and s2s_2, end at the same intersection but follow different routes. The model’s predicted continuations should agree if its internal state representation is coherent. Failing this, artificial distinctions introduced by the model are caught by the equivalence-class reconciliation test (compression precision). Conversely, prefixes that lead to different intersections (distinct DFA states) should have at least one continuation distinguishing them, as verified by the distinguishing-extension score.

A plausible implication is that next-token metrics alone are insufficient for characterizing whether a generative model has genuinely internalized a domain’s compositional and logical structure. Myhill–Nerode–inspired metrics offer a rigorous path for quantifying and comparing such structural knowledge (Vafa et al., 2024).

7. Limitations and Scope of Applicability

The evaluation framework presupposes the existence of a ground-truth DFA and a mapping between model histories and DFA states. The computational approximations (bounded suffix length, Monte Carlo sampling, and a generation threshold) are necessary for efficiency but may omit rare or long-horizon distinctions.

The metrics are currently formulated for domains with an identifiable automaton structure (finite state, deterministic transitions). Extension to broader classes of generative models or richer world-logics requires further theoretical development.


References: Vafa et al. (2024); Myhill (1957); Nerode (1958); Sipser (2013).
