
Multi-Agent Finite-Horizon MDPs

Updated 9 February 2026
  • Multi-agent finite-horizon MDPs are mathematical models for coordinated, sequential decision making under uncertainty over a bounded time horizon.
  • The framework employs structured role assignments via DAGs, transition independence, and sampling-based uncertainty quantification to optimize cumulative rewards.
  • Empirical applications, such as LLM-based compliance workflows, demonstrate improved accuracy and efficiency compared to single-agent baselines.

A multi-agent system formalized as a finite-horizon Markov Decision Process (MDP) is a rigorous mathematical construct for modeling coordinated or distributed sequential decision making under uncertainty when multiple agents, potentially with distinct roles, interact with their environment and each other over a bounded time or decision horizon. This formalism underpins principled design of AI workflows, resource allocation, control systems, planning algorithms, and mechanism design in multi-stage, multi-agent regimes, supporting both centralized and decentralized paradigms.

1. Formal Definition and Structure

A finite-horizon multi-agent MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, R, H)$, where:

  • $\mathcal{S}$ is the global state space, which may decompose as $\mathcal{S} = \prod_{i=1}^N \mathcal{S}_i$ for $N$ agents, or include additional global (environment or context) state factors.
  • $\mathcal{A} = \prod_{i=1}^N \mathcal{A}_i$ is the set of joint actions, formed from the action sets of each agent.
  • $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$ is the transition kernel, specifying the probability of moving from state $s$ to $s'$ given joint action $a$.
  • $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the (possibly factored) per-stage reward (or cost) function.
  • $H$ is the finite planning horizon.
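As a concrete illustration, a two-agent instance of this tuple can be written down directly. Everything below (local state sets, kernel, reward) is a toy assumption for exposition, not drawn from the cited papers:

```python
from itertools import product

# Toy two-agent finite-horizon MDP tuple (S, A, T, R, H); all values are
# illustrative assumptions.
states = list(product(range(2), range(2)))            # S = S_1 x S_2
actions = list(product(["stay", "move"], repeat=2))   # A = A_1 x A_2
H = 3                                                 # finite horizon

def T(s, a, s_next):
    """Transition-independent kernel: agent i flips its local state iff it moves."""
    prob = 1.0
    for i in range(2):
        flipped = (s_next[i] != s[i])
        moved = (a[i] == "move")
        prob *= 1.0 if flipped == moved else 0.0
    return prob

def R(s, a):
    """Additive reward: +1 per agent in local state 1, -0.1 per 'move'."""
    return sum(s) - 0.1 * sum(a_i == "move" for a_i in a)
```

Because the toy kernel is deterministic per agent, it factors exactly as $\mathcal{T} = \prod_i \mathcal{T}_i$, matching the product decomposition above.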

In the constrained process map framework for LLM-based multi-agent compliance workflows, the set of agents and transitions is further structured as a directed acyclic graph (DAG), $\mathcal{G} = (\mathcal{S}, E)$, where each node corresponds to a specific agent "role" or decision stage (e.g., Worker, Triage, Risk, Legal), and edges encode allowable escalation or completion transitions (Joshi et al., 2 Feb 2026).

A complete state can embed not only the agent and environment configuration but also the current input, action trajectory, and uncertainty embeddings.

2. Transition Models and Temporal Dynamics

The transition function in multi-agent finite-horizon MDPs may be fully joint, allowing arbitrary coupling among agents, or may exploit structural factorization for tractability. Common structural assumptions include:

  • Transition independence: Each agent's transition dynamics are independent of others, enabling factorization and local policy computation (Sahabandu et al., 2021).
  • Coordination graphs: Dependencies are confined to small agent subsets inducing sparse interactions (pairwise, local neighborhoods, etc.), resulting in factored transitions and rewards (Choudhury et al., 2021).
  • DAG and process-map transitions: In regulated workflows, agents correspond to roles in a DAG, and allowed transitions are prescribed by workflow logic, ensuring acyclicity and a strict finite horizon (Joshi et al., 2 Feb 2026).

In dynamic settings, the multi-agent transition rule is enforced by the DAG or factorization constraints:
$$P(s_{t+1} = s' \mid s_t, a_t) = \begin{cases} 1 & \text{if } (s_t \rightarrow s') \in E \text{ and the policy signals escalation,} \\ 1 & \text{if } s' \in S_{\mathrm{term}} \text{ and the policy signals a terminal label,} \\ 0 & \text{otherwise.} \end{cases}$$

The random trajectory always terminates in at most $H$ steps, with $H$ set to the maximum permissible path length in the DAG or an explicit horizon.
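The DAG-constrained transition rule can be sketched in a few lines. The role names follow the running example in the text, while the edge set, signal encoding, and terminal labels are illustrative assumptions:

```python
# Illustrative DAG process map: roles as nodes, allowed escalations as edges.
EDGES = {
    "Worker": ["Triage"],
    "Triage": ["Risk", "Legal"],
    "Risk": [],            # terminal roles emit a label rather than escalate
    "Legal": [],
}
TERMINAL = {"auto_label", "human_review"}   # absorbing terminal states

def step(state, signal):
    """One DAG transition: escalate along an allowed edge, or emit a terminal label."""
    if signal in TERMINAL:
        return signal                       # absorbing: episode ends
    if signal in EDGES.get(state, []):
        return signal                       # escalation along (state -> signal) in E
    raise ValueError(f"transition {state} -> {signal} not in E")

# Acyclicity bounds every trajectory: longest path plus one terminal label.
traj = ["Worker"]
for sig in ["Triage", "Risk", "auto_label"]:
    traj.append(step(traj[-1], sig))
```

Any transition outside the edge set raises, which mirrors the zero-probability branch of the rule above.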

3. Policy Spaces, Objective Functions, and Epistemic Uncertainty

Policies can range from fully centralized, mapping the joint global state to joint actions, to decentralized, where each agent acts based on local state and possibly beliefs about others (subject to communication and observability constraints).

The central objective is to synthesize a joint (possibly decentralized) policy $\pi^*$ that maximizes the expected cumulative reward over all horizon steps:
$$V^*(s_0) = \max_{\pi}\, \mathbb{E}_\pi\left[\sum_{t=0}^{H-1} R(s_t, a_t)\right].$$
Reward can encode metrics such as classification accuracy, escalation and latency penalties, and resource use (Joshi et al., 2 Feb 2026).
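On small tabular instances this finite-horizon objective is solved exactly by backward induction from $V_H = 0$. The two-state MDP below is a toy assumption used only to make the recursion concrete:

```python
# Backward induction for V*(s) over horizon H on a toy tabular MDP.
# States, actions, kernel, and rewards are illustrative assumptions.
S = ["low", "high"]
A = ["wait", "act"]
H = 2
T = {("low", "act"): {"high": 1.0}, ("low", "wait"): {"low": 1.0},
     ("high", "act"): {"high": 1.0}, ("high", "wait"): {"high": 1.0}}
R = {("low", "act"): 0.0, ("low", "wait"): 0.0,
     ("high", "act"): 1.0, ("high", "wait"): 1.0}

V = {s: 0.0 for s in S}                     # V_H = 0: no reward past the horizon
for t in reversed(range(H)):                # t = H-1, ..., 0
    V = {s: max(R[(s, a)] + sum(p * V[sp] for sp, p in T[(s, a)].items())
                for a in A)
         for s in S}                        # Bellman backup for stage t
```

From "low" the optimal policy acts once to reach "high" and then collects reward, giving $V^*(\text{low}) = 1$ and $V^*(\text{high}) = 2$ at $H = 2$.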

A crucial feature in AI workflows is agent-level epistemic uncertainty. Each agent computes label posteriors from $n$ Monte Carlo samples:
$$\hat{p}_{s_t}(y \mid x_t) \approx \frac{1}{n}\sum_{k=1}^n \mathbb{1}\{a_t^{(k)} = y\},$$
with per-agent epistemic uncertainty quantified via MC variance or entropy. These statistics are employed directly in the policy (e.g., thresholding to decide between labeling or escalation), and are propagated in the MDP state for downstream agents.
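A minimal sketch of this MC posterior estimate with an entropy-thresholded label-or-escalate rule follows; the threshold value and the specific decision rule are assumptions for illustration, not the cited method:

```python
import math
from collections import Counter

def mc_posterior(samples):
    """Estimate p(y|x) from n Monte Carlo label samples, as in the formula above."""
    n = len(samples)
    return {y: c / n for y, c in Counter(samples).items()}

def entropy(posterior):
    """Shannon entropy of the estimated posterior (nats)."""
    return -sum(p * math.log(p) for p in posterior.values() if p > 0)

def decide(samples, tau=0.3):
    """Emit the MAP label if entropy is below tau, else escalate (tau assumed)."""
    post = mc_posterior(samples)
    if entropy(post) <= tau:
        return max(post, key=post.get)      # confident: label automatically
    return "escalate"                       # uncertain: hand off downstream
```

Unanimous samples give zero entropy and an automatic label; an even split gives entropy $\ln 2 \approx 0.69$ and triggers escalation.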

4. Representative Applications and Case Studies

This modeling framework underlies a broad spectrum of applications:

| Domain | Formulation Highlights | Reference |
| --- | --- | --- |
| LLM-based compliance workflows | Agents as DAG nodes, MC epistemic quantification | (Joshi et al., 2 Feb 2026) |
| Multi-agent sequential planning | Coordinated planning via MCTS + Max-Plus | (Choudhury et al., 2021) |
| Resource allocation / fair division | Nash social welfare rewards, water-filling | (Hassanzadeh et al., 2023) |
| Decentralized MDPs | Factored MDP, local agent observations | (Fu et al., 2022) |

A salient instantiation is the self-harm detection workflow. Here, input samples are processed by a sequence of agents (Worker → Triage → {Risk, Legal}), each employing MC sampling and uncertainty quantification. The episode terminates in an automated label or is escalated for human review. Key numerical findings include up to 19% accuracy improvement and an 85× reduction in human review relative to a single-agent baseline, without increased latency in low-$n$ configurations (Joshi et al., 2 Feb 2026).

5. Tractability, Approximation, and Scalability

The joint state-action space in multi-agent finite-horizon MDPs scales exponentially in the number of agents. Structural assumptions and approximation techniques address this:

  • Factorizations: Transition independence and sparse coordination graphs enable local policy computations and reduce complexity to polynomial in the number of agents for fixed horizon and local state/action size (Choudhury et al., 2021, Sahabandu et al., 2021).
  • Anytime and local-search schemes: Iterative local policy improvement with provable finite-horizon performance bounds under (approximate) independence and submodular rewards (Sahabandu et al., 2021).
  • Monte Carlo Tree Search (MCTS): Combines tree search with Max-Plus message passing for value approximation within horizon truncation (Choudhury et al., 2021).
  • Decentralized policy iteration with function approximation: Approximate linear programming schemes utilizing basis functions, with explicit error propagation and scaling in the agent number and feature set size (Mandal et al., 2023).
  • Truncated local process tolerance: Fast-decay and locality allow near-optimal decentralized policies from finite-depth local subproblems (Qu et al., 2019).

In structured workflows, the DAG and absorbing-terminal-state design guarantee bounded episode length and bypass the need for stationary policies.
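The coordination-graph factorization above can be illustrated by variable elimination on a three-agent chain, where maximizing a sum of pairwise payoffs costs $O(N \cdot |\mathcal{A}|^2)$ instead of $|\mathcal{A}|^N$. The chain topology and payoff tables below are toy assumptions:

```python
# Variable elimination on a coordination graph: chain 1 - 2 - 3 with
# pairwise payoffs Q12, Q23 (illustrative values).
A = [0, 1]
Q12 = {(a1, a2): float(a1 == a2) for a1 in A for a2 in A}   # bonus for agreeing
Q23 = {(a2, a3): float(a2 != a3) for a2 in A for a3 in A}   # bonus for differing

# Eliminate agent 3: best-response value of agent 3 for each choice of a2.
m3 = {a2: max(Q23[(a2, a3)] for a3 in A) for a2 in A}
# Eliminate agent 2: fold the message m3 into the 1-2 payoff.
m2 = {a1: max(Q12[(a1, a2)] + m3[a2] for a2 in A) for a1 in A}
best_value = max(m2.values())   # optimal joint payoff, found without |A|^3 search
```

The messages $m_3$ and $m_2$ play the role of the Max-Plus messages referenced above: each elimination touches only one edge's payoff table, so cost grows with the number of edges, not the joint action space.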

6. Constraints, Fairness, and Mechanism Integration

Finite-horizon multi-agent MDPs naturally support integration of domain-specific constraints and secondary objectives:

  • State constraints: Via occupancy-distribution tracking and LP-based projection, state constraints (e.g., safety/capacity) yield CMDP formulations where feasible randomized policies are efficiently synthesized with performance guarantees (Chamie et al., 2015).
  • Equity and group objectives: Nash social welfare, min-max, and $\alpha$-fairness objectives are realizable via modified reward structures and convex optimization, with RL-based algorithms exploiting occupancy-measure representations for regret and PAC bounds (Hassanzadeh et al., 2023, Ju et al., 2023).
  • Incentivized truthful reporting: In strategic agent settings, Markov-perfect equilibrium is enforced using sequential Groves or VCG mechanisms on top of the finite-horizon MDP, supporting coordination among self-interested agents with private state (Cavallo et al., 2012).

Close approximations to optimal fairness or constrained solutions are possible with explicit bounds on suboptimality and efficient computation.
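As a toy illustration of a Nash-social-welfare objective, the snippet below compares allocations under linear utilities and a shared unit budget; these assumptions (and the equal-split optimum they imply, a standard fact for this special case) are for exposition only and do not reproduce the cited water-filling algorithms:

```python
import math

def nsw(alloc):
    """Nash social welfare as the geometric mean of per-agent utilities."""
    return math.prod(alloc) ** (1.0 / len(alloc))

budget, n = 1.0, 4
equal = [budget / n] * n            # equal split of the budget
skewed = [0.7, 0.1, 0.1, 0.1]       # same budget, concentrated on one agent

# With linear utilities, the equal split maximizes NSW over the budget simplex,
# so any skewed allocation scores lower.
gap = nsw(equal) - nsw(skewed)
```

The multiplicative form is what gives NSW its equity pressure: starving any single agent drives the whole product toward zero, unlike a plain sum of utilities.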

7. Empirical Results, Impact, and Outlook

Empirical studies across regulated AI safety workflows, resource allocation, swarm control, and multi-stage planning consistently demonstrate the effectiveness of formalizing multi-agent processes as finite-horizon MDPs:

| Metric | Baseline (Single Agent) | Multi-Agent MDP (Selected) | Reference |
| --- | --- | --- | --- |
| Accuracy | 69.8% ± 1.8% | 88.0% ± 2.3% | (Joshi et al., 2 Feb 2026) |
| Human review load | 17.2 ± 3.6 | 0.2 ± 0.6 (n = 1 MC) | (Joshi et al., 2 Feb 2026) |
| Processing time (s) | 17.7 ± 2.8 | 12.3 ± 3.9 (n = 1 MC) | (Joshi et al., 2 Feb 2026) |

Improvements in accuracy, efficiency, human-in-the-loop burden, and policy computability are typical. A plausible implication is continued adoption in AI system auditing, safety-critical decision workflows, distributed control, and collaborative autonomy.

The formalism accommodates further research in uncertainty quantification, explainability, scalable coordination, and robust mechanism design, forming a foundation for next-generation AI governance and trustworthy distributed intelligent systems.
