Finite-State Controllers in POMDPs

Updated 18 February 2026

Finite-state controllers (FSCs) are compact policy representations that map observations and internal memory to actions, crucial for decision-making in partially observable settings.
Advanced synthesis methods—such as bounded policy iteration, inductive synthesis via CEGAR, and learning algorithms—exploit sparsity and hierarchical structures to improve scalability and performance.
Empirical studies demonstrate that well-designed FSCs yield near-linear scaling and strong performance in robotics, AI planning, and safe offline reinforcement learning applications.

A finite-state controller (FSC) is a compact policy representation that encodes a mapping from observation and internal memory to action in a sequential decision process. FSCs are central to planning and control under partial information, especially in partially observable Markov decision processes (POMDPs) and their generalizations. Their adoption is motivated by the undecidability and intractability of computing history-dependent or fully belief-dependent policies in such domains. FSCs provide a finite-memory approximation, supporting both theoretical guarantees and scalable synthesis or learning in a variety of stochastic and deterministic settings.

1. Formal Definition and Classes of Finite-State Controllers

Let a POMDP be described by a tuple $(S, A, Z, T, O, R, \gamma)$, where $S$ is a finite state space, $A$ a finite action set, $Z$ a finite observation set, $T:S \times A \to \mathrm{Dist}(S)$ the transition kernel, $O:S \times A \times Z \to [0,1]$ the observation kernel, $R: S \times A \to \mathbb{R}$ the reward, and $\gamma \in (0,1]$ the discount factor.

A general stochastic finite-state controller for a POMDP is the tuple

$(Q,\, \pi(\cdot, \cdot),\, \eta(\cdot, \cdot, \cdot, \cdot),\, q_0)$

where

$Q = \{q_1, \ldots, q_{|Q|}\}$ is the (finite) set of controller nodes (“memory” states),
$\pi: Q \times A \to [0,1]$ specifies the probability $\pi(q, a)$ of picking action $a$ at node $q$, with $\sum_{a \in A} \pi(q, a) = 1$ for all $q \in Q$,
$\eta: Q \times A \times Z \times Q \to [0,1]$ gives the probability $\eta(q, a, z, q')$ of moving from $q$ to $q'$ after taking $a$ and observing $z$, with $\sum_{q' \in Q} \eta(q, a, z, q') = 1$,
$q_0 \in Q$ is the initial node.
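As a minimal sketch of this tuple as a data structure (the class name `StochasticFSC`, the nested-list layout, and the integer indexing of nodes, actions, and observations are assumptions made for illustration, not part of any cited tool):

```python
import random
from dataclasses import dataclass


@dataclass
class StochasticFSC:
    """Stochastic FSC (Q, pi, eta, q0); nodes, actions, observations are 0-based ints."""
    pi: list     # pi[q][a]         = probability of picking action a at node q
    eta: list    # eta[q][a][z][q2] = probability of moving q -> q2 after (a, z)
    q0: int = 0  # initial node

    def validate(self, tol=1e-9):
        """Check that every action and memory-update distribution sums to 1."""
        for row in self.pi:
            assert abs(sum(row) - 1.0) < tol
        for per_action in self.eta:
            for per_obs in per_action:
                for dist in per_obs:
                    assert abs(sum(dist) - 1.0) < tol

    def act(self, q, rng):
        """Sample an action at node q according to pi(q, .)."""
        return rng.choices(range(len(self.pi[q])), weights=self.pi[q])[0]

    def update(self, q, a, z, rng):
        """Sample the next node according to eta(q, a, z, .)."""
        dist = self.eta[q][a][z]
        return rng.choices(range(len(dist)), weights=dist)[0]


# Hypothetical 2-node controller: node 0 plays action 0, node 1 plays action 1,
# and the observation index z alone selects the next node.
fsc = StochasticFSC(
    pi=[[1.0, 0.0], [0.0, 1.0]],
    eta=[[[[1.0, 0.0], [0.0, 1.0]] for _a in range(2)] for _q in range(2)],
)
fsc.validate()
rng = random.Random(0)
a = fsc.act(fsc.q0, rng)          # action chosen at the initial node
q = fsc.update(fsc.q0, a, 1, rng)  # next node after observing z = 1
```

Because every distribution in this example puts all its mass on one entry, the controller is also a deterministic FSC in the sense discussed next.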

Variants are common:

Deterministic FSCs: $\pi$ and $\eta$ take values in $\{0, 1\}$, specifying unique action-selection and memory-update mappings.
Mealy machine models (memory-update depends only on current node and observation, independent of action).
Memoryless policies are a special case with $|Q| = 1$.

For Decentralized POMDPs (Dec-POMDPs), each agent can have its own FSC, optionally stochastic and with distinct local observations and actions (You et al., 2021).

2. Value Evaluation and Policy Improvement in POMDPs

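Given an FSC, its policy-induced value function is tractable to evaluate. For each $q \in Q$, let $V_q \in \mathbb{R}^{|S|}$ denote the value vector at node $q$. The Bellman evaluation equations are:

$$V_q(s) = \sum_{a \in A} \pi(q, a) \Big[ R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a) \sum_{z \in Z} O(s', a, z) \sum_{q' \in Q} \eta(q, a, z, q')\, V_{q'}(s') \Big]$$

where $T(s' \mid s, a)$ is the probability of $s'$ under $T(s, a)$ and $O(s', a, z)$ is the probability of observing $z$ in the successor state. The value of a belief $b$ is $V(b) = \max_{q \in Q} \sum_{s \in S} b(s)\, V_q(s)$ (Hansen, 2012).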
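The evaluation step can be sketched as a fixed-point iteration on the node value vectors. This is a minimal pure-Python illustration, assuming $\gamma < 1$ so that the iteration converges, and assuming the hypothetical array layouts `T[s][a][s2]`, `O[s2][a][z]`, `R[s][a]`, `pi[q][a]`, and `eta[q][a][z][q2]`; a practical implementation would more likely solve the equivalent linear system directly.

```python
def evaluate_fsc(T, O, R, gamma, pi, eta, tol=1e-10, max_iter=100_000):
    """Iterate
        V[q][s] = sum_a pi[q][a] * (R[s][a] + gamma * sum_{s2, z, q2}
                  T[s][a][s2] * O[s2][a][z] * eta[q][a][z][q2] * V[q2][s2])
    until the largest change falls below tol (assumes gamma < 1)."""
    nQ, nS, nA, nZ = len(pi), len(T), len(R[0]), len(O[0][0])
    V = [[0.0] * nS for _ in range(nQ)]
    for _ in range(max_iter):
        V_new = [[0.0] * nS for _ in range(nQ)]
        for q in range(nQ):
            for s in range(nS):
                total = 0.0
                for a in range(nA):
                    if pi[q][a] == 0.0:
                        continue  # action never taken at this node
                    future = 0.0
                    for s2 in range(nS):
                        for z in range(nZ):
                            w = T[s][a][s2] * O[s2][a][z]
                            if w == 0.0:
                                continue
                            future += w * sum(
                                eta[q][a][z][q2] * V[q2][s2] for q2 in range(nQ)
                            )
                    total += pi[q][a] * (R[s][a] + gamma * future)
                V_new[q][s] = total
        diff = max(abs(V_new[q][s] - V[q][s]) for q in range(nQ) for s in range(nS))
        V = V_new
        if diff < tol:
            break
    return V


# Toy model (hypothetical): two absorbing states, one action, one observation,
# one controller node. State 0 pays reward 1 per step, state 1 pays 0.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[1.0]], [[1.0]]]
R = [[1.0], [0.0]]
V = evaluate_fsc(T, O, R, gamma=0.5, pi=[[1.0]], eta=[[[[1.0]]]])
```

With discount 0.5, the value at state 0 is the geometric series $1 + 0.5 + 0.25 + \ldots = 2$, which the iteration approaches.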

Policy improvement is typically implemented by a linear program (LP) that, for each node $q$, finds $\pi(q, \cdot)$ and $\eta(q, \cdot, \cdot, \cdot)$ maximizing the uniform lift $\epsilon$ s.t. the improved $V_q(s) + \epsilon$ remains dominated by the new one-step backup under the current controller parameters. If $\epsilon > 0$ this update strictly improves the policy on the tangent belief set (Hansen, 2012).
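One common way to write this node-improvement LP, following the standard bounded-policy-iteration formulation, is shown below. The variables $c_a$ and $c_{a,z,q'}$ are notation introduced here for illustration: they stand for $\pi(q, a)$ and the product $\pi(q, a)\,\eta(q, a, z, q')$, which keeps the program linear, while $T(s' \mid s, a)$ and $O(s', a, z)$ denote the transition and observation probabilities and $V_{q'}$ the current node value vectors.

$$
\begin{aligned}
\max_{\epsilon,\, c}\quad & \epsilon \\
\text{s.t.}\quad & V_q(s) + \epsilon \le \sum_{a \in A} \Big[ c_a R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a) \sum_{z \in Z} O(s', a, z) \sum_{q' \in Q} c_{a,z,q'}\, V_{q'}(s') \Big] \quad \forall s \in S, \\
& \sum_{a \in A} c_a = 1, \qquad \sum_{q' \in Q} c_{a,z,q'} = c_a \quad \forall a \in A,\ z \in Z, \\
& c_a \ge 0, \qquad c_{a,z,q'} \ge 0 .
\end{aligned}
$$

The updated node parameters are then recovered as $\pi(q, a) = c_a$ and, wherever $c_a > 0$, $\eta(q, a, z, q') = c_{a,z,q'} / c_a$.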

3. Synthesis and Learning Algorithms for FSCs

3.1 Bounded Policy Iteration and Sparsity

The bounded policy iteration (BPI) algorithm improves FSCs via node-wise LPs. Empirically, most node parameter vectors are extremely sparse: in typical benchmarks, the number of nonzero elements per node stays small even as $|Q|$ increases. The sparse BPI algorithm exploits this by solving a rapidly converging sequence of reduced LPs limited to the nonzero parameters. This substantially improves per-iteration scalability in the reported benchmarks, restoring near-linear scaling in $|Q|$ without loss of policy improvement (Hansen, 2012).

3.2 Inductive Synthesis, CEGAR, and Parameter Synthesis Methods

Recent frameworks model the entire $k$-FSC design space as either (a) a symbolic family with parameters for action and transition functions, or (b) a parametric Markov chain (pMC) with respect to these parameters (Andriushchenko et al., 2022, Junges et al., 2017). Inductive synthesis leverages counterexample-guided abstraction refinement (CEGAR) and oracles (e.g., abstraction MDPs for tight value bounds, counterexamples to block suboptimal candidates). This approach enables efficient search for small, correct-by-construction FSCs (often just 1–5 nodes) in both discounted and indefinite-horizon settings, and naturally incorporates multi-objective constraints (Andriushchenko et al., 2022).

Parameter synthesis allows computation of not just single optimal FSCs but entire permissive regions of parameters that yield correct policies, leveraging existing toolchains (Storm, PRISM, PARAM). Formally, for a desired threshold $\lambda$, one searches for parameter assignments $\theta$ such that $f(\theta) \bowtie \lambda$, with $\bowtie \in \{\le, \ge\}$, for a rational (reachability, expected reward) function $f$ arising from the pMC (Junges et al., 2017).
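A toy illustration of the idea, under stated assumptions: a hypothetical one-parameter chain in which the controller picks a "retry" action with probability $p$ (goal with probability 1/2, otherwise back to the start) and a "gamble" action with probability $1 - p$ (goal with probability 1/5, otherwise a trap). Solving $x = p(\tfrac12 + \tfrac12 x) + (1 - p)\tfrac15$ gives the rational reachability function $f(p) = (0.3p + 0.2)/(1 - 0.5p)$. The sketch below only samples $f$ on a grid with exact arithmetic; it does not certify a region the way the dedicated tools do, and none of its names come from those tools.

```python
from fractions import Fraction


def reach_prob(p):
    """Goal-reachability probability f(p) = (3/10 p + 1/5) / (1 - 1/2 p) of the toy chain."""
    return (Fraction(3, 10) * p + Fraction(1, 5)) / (1 - Fraction(1, 2) * p)


def feasible_grid_points(threshold, steps=20):
    """Return the grid values p = i/steps in [0, 1] with f(p) >= threshold."""
    grid = [Fraction(i, steps) for i in range(steps + 1)]
    return [p for p in grid if reach_prob(p) >= threshold]


# Which sampled parameter values reach the goal with probability at least 1/2?
good = feasible_grid_points(Fraction(1, 2))
```

Solving $f(p) \ge \tfrac12$ by hand gives $p \ge 6/11$, so the accepted grid points start at $11/20$; since $f$ is increasing in $p$ on $[0, 1]$, the feasible region in this toy case is the interval $[6/11, 1]$.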

3.3 Hybrid and Symbiotic Synthesis Frameworks

A further development is the “search and explore” symbiotic framework, integrating belief-based and inductive methods in parallel. Belief-based search synthesizes FSCs from finite fragments of the belief MDP, while inductive search handles large families via abstraction, yielding faster synthesis, higher controller value, and lower memory usage. Alternating these modalities, each phase benefits from tighter via-cuts or action restrictions informed by the other (Andriushchenko et al., 2023).

3.4 Learning Variable-Size FSCs and Data-Driven Settings

Expectation-maximization algorithms with nonparametric priors (e.g., stick-breaking) allow learning variable-size FSCs from data in Dec-POMDPs, adjusting the effective memory size by posteriors. This is achieved using variational Bayesian EM over an implicit infinite family, automatically concentrating the posterior on a small, performant controller (Liu et al., 2015). For safe offline RL, finite-memory SPI with FSCs is feasible: a data-driven history MDP is estimated from behavior policy traces, policy improvement is then constrained on state-action pairs with insufficient data, and the solution is mapped directly back into an improved FSC with performance guarantees (Simão et al., 2023).

4. Structure, Hierarchies, and Expressiveness

FSCs can be flat or hierarchical. In generalized planning, hierarchical FSCs allow one controller to call subcontrollers (including recursively), which increases expressiveness and modularity. Parameterized, recursive hierarchical FSCs can solve whole families of problems (e.g., traversals of trees of depth $n$) with fewer controller states than any flat FSC requires. Compilation techniques map controller synthesis, whether modular or hierarchical, into classical planning problems, with correctness guarantees for the resulting controller (Segovia-Aguas et al., 2019).
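The role of recursion can be illustrated with a small interpreter for a controller that calls itself on subtrees. This is a hedged sketch rather than the formalism of the cited work: the tuple encoding of trees, the two-state controller, and the function name `run_recursive_controller` are assumptions made for illustration. The number of controller states is fixed, and the tree depth is absorbed by the call stack.

```python
def run_recursive_controller(root):
    """Interpret a 2-state controller that calls itself on child subtrees.

    Trees are nested tuples (left, label, right) or None. Each call-stack
    frame is (controller_state, tree_node); the result is an in-order visit.
    """
    visited = []
    stack = [(0, root)]
    while stack:
        state, node = stack.pop()
        if node is None:              # observation "empty subtree": return to caller
            continue
        left, label, right = node
        if state == 0:                # call self on the left subtree, resume in state 1
            stack.append((1, node))
            stack.append((0, left))
        elif state == 1:              # action "visit", then call self on the right subtree
            visited.append(label)
            stack.append((0, right))
    return visited


leaf = lambda x: (None, x, None)
order = run_recursive_controller((leaf(1), 2, leaf(3)))
deep = run_recursive_controller((((leaf(1), 2, None), 3, None), 4, leaf(5)))
```

The same two controller states handle both the depth-2 and the deeper tree; only the stack grows with depth.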

5. Complexity, Scalability, and Empirical Performance

Synthesizing or optimizing FSCs is computationally hard: deterministic $k$-FSC search is NP-complete, and the randomized version (with real-valued probabilities) is complete for the existential theory of the reals (ETR) (Andriushchenko et al., 2022, Junges et al., 2017). The raw parameter space scales as $O(|Q|^2 |A| |Z|)$, but practical methods exploit:

Sparsity: limiting search or LP size to the parameter vectors’ support (Hansen, 2012).
Memory-model reductions: limiting the number of nodes relevant per observation (Andriushchenko et al., 2022).
Symbolic and symmetry breaking: pruning isomorphic or unreachable configurations (Andriushchenko et al., 2022).
Heuristics: modular controller seeding, action restriction by reference policies, value-driven memory injection (Andriushchenko et al., 2023).
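To make the size of the raw design space concrete, the following sketch counts the entries of the action and memory-update tables for a $k$-node controller, and the number of behaviorally distinct deterministic controllers (one action per node and one successor node per observation). The function names are hypothetical, and the counts follow directly from the definition of an FSC rather than from any cited paper.

```python
def num_parameters(k, n_actions, n_obs):
    """Entries of pi (k * |A|) plus entries of eta (k * |A| * |Z| * k)."""
    return k * n_actions + k * n_actions * n_obs * k


def num_deterministic_fscs(k, n_actions, n_obs):
    """Each node picks one action and, per observation, one successor node."""
    return (n_actions * k ** n_obs) ** k


# Growth of both quantities for a small model with 2 actions and 2 observations.
sizes = {k: (num_parameters(k, 2, 2), num_deterministic_fscs(k, 2, 2)) for k in (1, 2, 3)}
```

Even for this tiny model the number of deterministic candidates grows from 2 (one node) to 64 (two nodes) to 5832 (three nodes), which is why the pruning strategies listed above matter in practice.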

Empirical results demonstrate that small FSCs—sometimes as compact as 1–5 states—can match or outperform state-of-the-art solvers on canonical benchmarks. Hierarchical and variable-size FSCs yield further advantages in expressiveness and scalability. The sparse BPI and symbiotic methods, in particular, achieve high performance with orders-of-magnitude reduction in synthesis time and controller size (Hansen, 2012, Andriushchenko et al., 2023, Andriushchenko et al., 2022).

6. Correctness Guarantees and Specification Languages

Correctness of FSCs is studied with respect to satisfaction probabilities and reward thresholds. Formal specification languages often include:

Reachability: $\Pr(\lozenge\, G) \ge \lambda$ constraints (probability to reach a goal),
Expected cost/reward: $\mathbb{E}[\mathrm{cost}] \le \kappa$ or $\mathbb{E}[\mathrm{reward}] \ge \kappa$,
Termination and goal satisfaction: total probability of terminating (LTER), probability of goal-termination (LGT) (Treszkai et al., 2019).

Synthesis algorithms such as Pandor for stochastic domains provide soundness and completeness theorems for the returned FSCs, backed by α/λ (value/contribution) bound propagation and analysis of history likelihoods and looping contributions. For FSC plus parameter synthesis (pMC-based), solution regions can be certified by SMT solving or parameter lifting (Junges et al., 2017, Treszkai et al., 2019).

In deterministic or hybrid settings, certified approximations and discrete optimization (Bellman inequalities, small-gain theorems) can yield controllers with provable bounds on performance, convergence rates, or stability (0903.5535).

7. Applications and Extensions

FSCs are used in robotics, AI planning, network protocols, verification, and safe offline RL. They support extensions to decentralized control, hierarchical and modular planning, infinite- and indefinite-horizon tasks, multi-objective specifications, and settings with partially known or data-sampled models.

Notably, FSCs enable:

Nash equilibrium search and best-response synthesis in Dec-POMDPs by reducing the best-response computation to a (potentially high-dimensional) POMDP and compressing the resulting value function into bounded-size controllers (You et al., 2021),
Safe policy improvement constrained by data and finite memory in offline RL, with high-probability performance bounds (Simão et al., 2023),
Efficient and scalable planning in deterministic POMDPs with FSC-based solvers, producing compact policies applicable in large real-world robotic planning scenarios (Schutz et al., 1 May 2025).

The general trend is toward exploiting structure—sparsity, hierarchy, modularity, and parameterization—for synthesis and learning of practical FSCs with theoretical guarantees in large-scale domains.