Minimum FSM Identification

Updated 24 January 2026

Minimum FSM identification is the process of deducing the smallest finite-state machine that accurately represents observed input–output or symbolic traces while satisfying temporal constraints.
It encompasses both deterministic and probabilistic models, with NP-hard complexity driving research into exact methods like integer programming and SAT-based approaches.
Techniques such as clique-cover reformulation and state-merging heuristics are employed to balance rigorous minimality with practical scalability in various applications.

Minimum FSM identification is the task of inferring, from finite observed data or behavioral scenarios, a minimal finite-state machine (FSM) that is consistent with the empirical input–output or symbolic traces and, where given, with prescribed temporal properties. The problem arises in formal verification, time-series analysis, software synthesis, and statistical learning, where a parsimonious representation is needed for interpretability, synthesis, or generalization. It covers both deterministic and probabilistic FSMs, with further constraints (completeness, unifilarity, logical specifications) depending on domain requirements. Minimum FSM identification is $\mathrm{NP}$-hard in its key variants, which makes scalability and approximation guarantees central research challenges (Paulson et al., 2014, Ulyantsev et al., 2016).

1. Formal Problem Definitions

Two canonical formulations define the minimal FSM identification problem on finite data sets:

Probabilistic Finite-State Machine (PFSM) on Symbol Sequences:

Given a finite alphabet $\Sigma$ and an observed sample $y = y_1, \ldots, y_N \in \Sigma^N$, fix a history length $L$ and let $W = \{w_1, \ldots, w_n\}$ be the set of distinct length-$L$ substrings of $y$. The empirical next-symbol distribution for substring $w$ is $f_{w|y}(a) = \frac{\#(wa \, \mathrm{in}\, y)}{\#(w \,\mathrm{in}\, y)}$ for $a \in \Sigma$. A PFSM is a tuple $M = (Q, \Sigma, \delta, p)$ where $Q = \{q_1, \ldots, q_m\}$ is the state set, $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation (required to be unifilar: each state has at most one successor per symbol), $p : \delta \to [0,1]$ assigns transition probabilities, and the probabilities of the transitions leaving each state $q$ sum to $1$. The task is to assign each $w_i \in W$ to a state $q_j$ such that all $w_i$ assigned to $q_j$ have statistically indistinguishable next-symbol distributions and the induced transitions are unifilar, while minimizing $|Q|$ (Paulson et al., 2014).
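The empirical next-symbol distribution can be computed with a single pass over the sample. The following is a minimal sketch, not code from the cited paper; the function name `next_symbol_distributions` is hypothetical. It assumes one small deviation from the formula as written: an occurrence of $w$ at the very end of $y$, which has no successor, is left out of the denominator so that each distribution sums to $1$.

```python
from collections import Counter


def next_symbol_distributions(y, L, alphabet):
    """Return {w: {a: f_{w|y}(a)}} for each length-L substring w of y.

    Assumption: only occurrences of w that are followed by a symbol
    are counted, so every returned distribution sums to 1.
    """
    context_counts = Counter()  # occurrences of w that have a successor
    pair_counts = Counter()     # occurrences of w followed by symbol a
    for i in range(len(y) - L):
        w, a = y[i:i + L], y[i + L]
        context_counts[w] += 1
        pair_counts[(w, a)] += 1
    return {
        w: {a: pair_counts[(w, a)] / n for a in alphabet}
        for w, n in context_counts.items()
    }


# Hypothetical binary sample with history length 1.
dists = next_symbol_distributions("0101101", 1, "01")
```

In this toy sample every `0` is followed by `1`, so the history `0` gets a deterministic next-symbol distribution, while the history `1` is followed by `0` twice and by `1` once.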

Deterministic FSM Identification from Scenarios and Temporal Properties:

Given a set $S = \{s_1, \ldots, s_n\}$ of input–output scenarios (each a sequence $(e_1, A_1), \ldots, (e_k, A_k)$ of events $e_i \in E$ paired with output action sequences $A_i \in Z^*$), a conjunction of LTL formulae $\Phi$ over atomic propositions (e.g., wasEvent$(e)$, wasAction$(a)$), and a target number of states $N$, find a Mealy-style FSM $M = (Q, q_0, E, Z, \delta, \lambda)$ with $|Q| = N$ that exactly reproduces the traces in $S$ and whose induced Kripke structure $K_M$ satisfies $K_M \models_A \Phi$. The smallest $N$ for which such an $M$ exists is sought (Ulyantsev et al., 2016).
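To make the scenario-consistency requirement concrete, here is a minimal sketch of a check that a candidate Mealy machine reproduces one scenario. The dictionary encoding of $\delta$ and $\lambda$, the function name `reproduces`, and the toggle machine are illustrative assumptions, not taken from the cited work. The LTL condition $K_M \models_A \Phi$ is not checked here; that would require a model checker.

```python
def reproduces(delta, lam, q0, scenario):
    """Check that a Mealy machine replays one scenario exactly.

    delta: {(state, event): next_state}          (transition function)
    lam:   {(state, event): tuple_of_actions}    (output function)
    scenario: list of (event, tuple_of_actions) pairs
    """
    q = q0
    for event, actions in scenario:
        if (q, event) not in delta:
            return False  # the machine has no transition for this event
        if lam[(q, event)] != tuple(actions):
            return False  # the produced output differs from the scenario
        q = delta[(q, event)]
    return True


# Hypothetical two-state toggle: pressing alternates "on" and "off".
delta = {(0, "press"): 1, (1, "press"): 0}
lam = {(0, "press"): ("on",), (1, "press"): ("off",)}

ok = reproduces(delta, lam, 0, [("press", ("on",)), ("press", ("off",))])
bad = reproduces(delta, lam, 0, [("press", ("off",))])
```

Here `ok` is `True` and `bad` is `False`, since the first press from the initial state must emit `on`.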

2. NP-Hardness and Complexity Landscape

The minimum-state identification problem is $\mathrm{NP}$-hard under both the probabilistic and the deterministic paradigm. For PFSMs, hardness follows by reduction from Minimum Clique Cover: statistical equivalence between substrings defines a graph, groups of mutually equivalent substrings are cliques in it, and minimizing the number of states amounts to finding a minimum clique cover. Requiring unifilarity (deterministic transitions) does not reduce the complexity; a polynomial-time “data gadget” construction encodes arbitrary graphs into substrings and transitions, showing that the optimal PFSM learning problem remains $\mathrm{NP}$-hard even with this constraint. Similarly, minimum FSM identification from scenarios and LTL is $\mathrm{NP}$-hard via reductions involving DFA minimization from incomplete data samples and bounded LTL synthesis subproblems (Paulson et al., 2014, Ulyantsev et al., 2016).

Consequently, no polynomial-time exact algorithm exists unless P=NP, which motivates both efficient exact algorithms for moderate instance sizes and practical approximation heuristics for larger ones.

3. Exact Algorithms and Integer Programming

For PFSMs, the exact minimal-state assignment is formulated as a binary integer program (called MSDpFSA) with variables $x_{ij}$ (substring $i$ is assigned to state $j$), $\mu_{i\ell}$ (the empirical distributions of substrings $i$ and $\ell$ pass the statistical equivalence test), $z_{i\ell}^\sigma$ (extension via symbol $\sigma$), $y_{jk}^\sigma$ (transition usage), and $p_j$ (state usage indicator). The objective is $\min \sum_{j=1}^n p_j$, with constraints enforcing (i) full assignment, (ii) statistical equivalence, (iii) transition correctness, (iv) unifilarity, and (v) state indicator linkage. Solving this IP by branch-and-bound produces provably minimal-state PFSMs, though performance deteriorates rapidly as the alphabet and data size grow.
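As a rough illustration of constraint groups (i), (ii), and (v), the assignment core of such a program might be written as follows. This is a simplified sketch and not the full MSDpFSA formulation: it assumes $\mu_{i\ell}$ is a 0/1 value equal to $1$ exactly when substrings $i$ and $\ell$ pass the equivalence test, and it omits the transition and unifilarity constraints involving $z_{i\ell}^\sigma$ and $y_{jk}^\sigma$.

$$
\begin{aligned}
\min\ & \sum_{j=1}^{n} p_j \\
\text{s.t.}\ & \sum_{j=1}^{n} x_{ij} = 1 && \forall i && \text{(each substring gets one state)} \\
& x_{ij} + x_{\ell j} \le 1 + \mu_{i\ell} && \forall i < \ell,\ \forall j && \text{(only equivalent substrings share a state)} \\
& x_{ij} \le p_j && \forall i, j && \text{(a state in use is counted)} \\
& x_{ij},\ p_j \in \{0, 1\} && \forall i, j
\end{aligned}
$$

Under these assumptions, any feasible solution partitions the substrings into groups of pairwise-equivalent members, which is why the clique-cover view applies.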

A more efficient “clique-cover reformulation” enumerates maximal cliques (via Bron–Kerbosch) in the equivalence graph, then solves a reduced IP to cover all substrings with cliques. A subsequent polynomial-time unifilar refinement ensures correct transitions by splitting a clique whenever two of its members diverge on the same symbol. This method remains exact and runs in fractions of a second for binary alphabets and samples up to $N = 10^4$ (Paulson et al., 2014).
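The clique-cover step can be sketched on a toy equivalence graph. The code below is an illustration under stated assumptions, not the authors' implementation: it uses the basic Bron–Kerbosch recursion without pivoting, replaces the reduced IP with a brute-force search over subsets of maximal cliques (feasible only for very small graphs), and omits the unifilar refinement. The names `maximal_cliques` and `min_clique_cover` are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting).

    adj: {vertex: set of neighbouring vertices}
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended further
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def min_clique_cover(adj):
    """Smallest set of maximal cliques covering every vertex.

    Brute force stands in for the reduced IP; exponential in the
    number of maximal cliques.
    """
    cliques = maximal_cliques(adj)
    vertices = set(adj)
    for k in range(1, len(cliques) + 1):
        for chosen in combinations(cliques, k):
            if set().union(*chosen) == vertices:
                return list(chosen)
    return []


# Hypothetical equivalence graph: substrings 0, 1, 2 are mutually
# equivalent, and substrings 3, 4 are equivalent to each other.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
cover = min_clique_cover(adj)
```

Because any subset of a clique is itself a clique, a cover by maximal cliques can be turned into a partition of the same size by assigning each substring to one of the chosen cliques that contains it.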

For FSM identification from input–output scenarios and temporal constraints, four exact methods have been developed:

The iterative SAT-based approach encodes the mapping from scenario-tree nodes to states ($x_{v,i}$), transition usage ($y_{i_1,i_2,e}$), and output occurrence ($z_{i,a,e}$). Counterexamples found by model checking against $\Phi$ are prohibited through incrementally added clauses, and BFS-based symmetry-breaking predicates are included. This method leverages incremental SAT solvers and is the most scalable for moderate $N$.
The QSAT-based approach incorporates bounded model checking into a QBF, with universal path variables ensuring satisfaction across traces up to depth $k$.
The exponential SAT and backtracking approaches offer complementary routes, with backtracking well suited to small $|S|$ and resource-constrained settings. All four approaches guarantee minimality when they terminate within computational limits (Ulyantsev et al., 2016).
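The overall search pattern, trying $N = 1, 2, \ldots$ until a consistent machine is found, can be sketched with plain backtracking. This is a simplified illustration, not the tool's algorithm: it handles scenarios only (no LTL formulae, no counterexample loop), uses no symmetry breaking, and the names `find_machine` and `minimal_machine` as well as the `max_states` cap are assumptions made for the example.

```python
def find_machine(scenarios, n_states):
    """Backtracking search for a Mealy machine with n_states states
    (state 0 is initial) that reproduces every scenario.

    scenarios: list of scenarios, each a list of (event, actions_tuple).
    Returns (delta, lam) dictionaries or None if no machine exists.
    """
    delta, lam = {}, {}

    def run(si, pos, q):
        if si == len(scenarios):
            return True                      # all scenarios replayed
        if pos == len(scenarios[si]):
            return run(si + 1, 0, 0)         # restart from initial state
        event, actions = scenarios[si][pos]
        key = (q, event)
        if key in delta:                     # transition already fixed
            return lam[key] == actions and run(si, pos + 1, delta[key])
        lam[key] = actions                   # output is forced by the data
        for target in range(n_states):       # choice point: target state
            delta[key] = target
            if run(si, pos + 1, target):
                return True
        del delta[key], lam[key]             # undo and backtrack
        return False

    return (dict(delta), dict(lam)) if run(0, 0, 0) else None


def minimal_machine(scenarios, max_states=8):
    """Try N = 1, 2, ... so the first machine found has the fewest states."""
    for n in range(1, max_states + 1):
        found = find_machine(scenarios, n)
        if found is not None:
            return n, found
    return None


# Hypothetical scenario: the same event yields alternating outputs,
# which a single state cannot produce.
scenarios = [[("a", ("x",)), ("a", ("y",)), ("a", ("x",))]]
n, (delta, lam) = minimal_machine(scenarios)
```

For this scenario the search fails with one state and succeeds with two. Increasing $N$ one step at a time is what yields the minimality guarantee; the exact methods above differ mainly in how they decide feasibility for a fixed $N$.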

4. Polynomial-Time Approximations and Heuristics

For PFSMs, the classical polynomial-time heuristic is CSSR (Causal-State Splitting and Reconstruction). It grows states from sliding-window histories of length up to $L$, adding a history to an existing state when it passes a statistical equivalence test (e.g., KS or $\chi^2$) at threshold $\alpha$ and creating a new state otherwise. A reconstruction step then enforces unifilarity by splitting states with nondeterministic outgoing transitions. CSSR has linear complexity in the sample size $N$ but exponential dependence on $|\Sigma|$ and the window length $L$. Although CSSR is asymptotically consistent, for finite $N$ it frequently overestimates the number of states due to limited sample discrimination, and no worst-case approximation ratio is known. Other heuristics (Bayesian merging, subtree merging) share these limitations (Paulson et al., 2014).
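The "join an existing state or open a new one" step can be sketched as a greedy clustering of histories. This is a deliberately simplified stand-in and not CSSR itself: it compares next-symbol distributions by total variation distance against a fixed tolerance `tol` instead of running a KS or $\chi^2$ test at level $\alpha$, it handles a single history length, and it has no reconstruction step for unifilarity. All names and numbers are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two distributions on the same symbols."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


def greedy_cluster(dists, tol=0.1):
    """Greedily group histories with similar next-symbol distributions.

    dists: {history: {symbol: probability}}
    Each history joins the first state whose representative (the first
    member's distribution) is within tol; otherwise it opens a new state.
    """
    states = []  # list of (representative distribution, member histories)
    for w in sorted(dists):
        for rep, members in states:
            if total_variation(dists[w], rep) <= tol:
                members.append(w)
                break
        else:
            states.append((dists[w], [w]))
    return [members for _, members in states]


# Hypothetical empirical distributions for four length-2 histories.
dists = {
    "00": {"0": 0.50, "1": 0.50},
    "01": {"0": 0.52, "1": 0.48},
    "10": {"0": 0.90, "1": 0.10},
    "11": {"0": 0.10, "1": 0.90},
}
clusters = greedy_cluster(dists)
```

The result depends on the processing order and on the tolerance, which illustrates why such heuristics can return more states than the minimum and why they come without a worst-case guarantee.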

For scenario-based deterministic FSM identification, state-merging heuristics such as Blue-Fringe do not guarantee minimality, offering rapid synthesis for positive scenarios but lacking proofs of optimality (Ulyantsev et al., 2016).

5. Empirical Performance

Empirical comparisons highlight distinct algorithmic advantages and trade-offs. For PFSMs on $N \approx 10$, $|\Sigma| = 2$, direct IP solvers require on the order of $10^2$ seconds, whereas the clique-cover approach scales near-linearly in $N$ and resolves samples up to $N=10^4$ in $0.01$–$0.2$ seconds for binary alphabets; for $|\Sigma|=3$ and $N=100$ runtimes remain under minutes. CSSR delivers the fastest runtime for $|\Sigma|\geq3$ but often yields larger, nonminimal models (Paulson et al., 2014).

In scenario-based FSM synthesis, an open-source Java tool implements the four exact methods. Benchmarking on case studies and randomly generated models shows that the iterative SAT-based method (using CryptoMiniSat) solves case-study instances with up to $N=18$ states and all random complete and incomplete instances up to $N=12$, and that it scales better than exponential SAT, QSAT, and backtracking. Backtracking is viable for small $|S|$, but QSAT solvers may lag in practice. Table 1 summarizes the empirical solved-instance bounds:

Method	Case studies: max states	Random complete instances: up to (states)	Random incomplete instances: up to (states)
Iterative SAT	18	12	12
Exponential SAT	9	12 (90%)	12 (70%)
Backtracking	9	9 (80%)	9 (60%)
QSAT	3–4	$<30\%$ for $N>6$	similar

(Ulyantsev et al., 2016)

6. Guidelines and Implications for Method Selection

Selecting an algorithm for minimum FSM identification depends critically on desired guarantees, problem size, and specification features:

For certified minimality and moderate state spaces ($N \leq 12$), exact algorithms, especially the iterative SAT-based method, are preferred.
For very small $|S|$ and limited resources, backtracking is computationally tractable.
Where only positive scenarios and speed are required, state-merging heuristics such as Blue-Fringe suffice, but without minimality certification.
In PFSM learning, the clique-cover reformulation provides an exact solution for moderate alphabets and a practical benchmark for heuristic comparison; the direct IP method serves for small instances requiring formal certificates.
Going forward, analyzing approximation ratios of heuristics such as CSSR or developing new greedy and LP-rounding algorithms with provable bounds on state counts are natural research directions (Paulson et al., 2014, Ulyantsev et al., 2016).

A plausible implication is that empirical model parsimony can only be guaranteed for modest-scale problems; for larger samples or alphabets, approximation and benchmarking strategies are required.

7. Key Insights and Outlook

Minimum-state PFSM identification from finite data and minimum deterministic FSM identification from scenarios and temporal constraints are both $\mathrm{NP}$-hard. No polynomial-time exact learner exists unless P=NP.
Integer programming (for PFSMs) and SAT-based encodings (for scenario-based FSMs) give precise, certificate-enabled solutions for small to moderate instance sizes.
Clique-cover reformulation yields an efficient, exact approach for symbolic data-driven PFSM learning, balancing scalability and optimality.
Polynomial-time heuristics (CSSR, state-merging) offer practical synthesis but must be interpreted as approximations: they may overestimate the number of states and carry no worst-case guarantees.
The development of approximate, scalable algorithms with quantifiable trade-offs—e.g., constant-factor or logarithmic guaranteed approximations—remains an open domain for future work.
Existing exact methods (clique-cover for PFSM, Iterative SAT for scenario FSMs) serve as robust back-ends for benchmarking and calibrating heuristic model synthesis procedures.

This synthesis summarizes precise formulations, complexity proofs, algorithmic constructs, empirical evaluations, and key methodological insights foundational for minimum FSM identification across probabilistic and deterministic paradigms (Paulson et al., 2014, Ulyantsev et al., 2016).


References (2)

Paulson et al. (2014). Minimum Probabilistic Finite State Learning Problem on Finite Data Sets: Complexity, Solution and Approximations.

Ulyantsev et al. (2016). Exact Finite-State Machine Identification from Scenarios and Temporal Properties.
