
Joint Experience Best Response (JBR) in MARL

Updated 9 February 2026
  • JBR is a sample-efficient technique that computes best responses in multi-agent reinforcement learning using a single shared dataset.
  • It reformulates the traditional best-response problem by leveraging offline Q-iteration to estimate policies under the PSRO framework while addressing distribution shift.
  • Empirical results in games like poker and continuous control demonstrate that JBR significantly reduces sample cost and narrows the exploitability gap relative to standard PSRO.

Joint Experience Best Response (JBR) is a sample-efficient approach to best-response computation in multi-agent reinforcement learning (MARL), and a structural reduction technique in dynamic games, that leverages joint data collection and offline inference to overcome the scalability and sample-efficiency limitations of standard best-response methods. Within Policy Space Response Oracles (PSRO) and dynamic game solvers, JBR enables simultaneous policy improvement for all agents from a single shared experience dataset, while offering principled remedies for the distribution shift intrinsic to offline policy learning. These methods address foundational tractability and robustness challenges in large-scale multi-agent environments and strategic learning.

1. Definition and Formalization

In MARL and strategic learning, best-response (BR) computation is essential for iterative algorithms such as PSRO. Traditionally, each agent computes its BR to the current meta-strategy profile, requiring independent and often expensive environment interactions. JBR reformulates best-response computation by collecting a single dataset of environment trajectories under the joint meta-strategy profile $\sigma = (\sigma_1, \ldots, \sigma_n)$, denoted:

$$D^\sigma = \{ (s_t, a_{1,t}, \ldots, a_{n,t}, r_{1,t}, \ldots, r_{n,t}, s_{t+1}) \}_{t=0}^{T}$$

For each agent $i$, the best-response problem is cast as an offline RL task using $D^\sigma$. Empirical reward and transition models $\widehat{r}_i^{\sigma_{-i}}, \widehat{P}^{\sigma_{-i}}$ are estimated as:

$$\widehat{P}^{\sigma_{-i}}(s'|s,a_i) = \sum_{a_{-i}} \sigma_{-i}(a_{-i}|s) \cdot P(s'|s,a_i,a_{-i})$$

$$\widehat{r}_i^{\sigma_{-i}}(s,a_i) = \sum_{a_{-i}} \sigma_{-i}(a_{-i}|s) \cdot r_i(s,a_i,a_{-i})$$

Q-iteration is then performed to estimate the offline Q-function and derive the new best-response policy:

$$Q_i^{k+1}(s,a_i) = \widehat{r}_i^{\sigma_{-i}}(s,a_i) + \gamma \sum_{s'} \widehat{P}^{\sigma_{-i}}(s'|s,a_i)\, \max_{a'_i} Q_i^k(s',a'_i)$$

As $k \to \infty$, $Q_i^k$ converges to $Q_i^*$, which yields $\pi_i^{BR}(s) \in \arg\max_{a_i} Q_i^*(s,a_i)$ (Bighashdel et al., 6 Feb 2026).

This conversion yields a joint, sample-amortized solution: each agent reuses $D^\sigma$ instead of sampling independently, making JBR a "drop-in modification" to PSRO that significantly reduces environment interaction.
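As a concrete illustration, the empirical-model estimation and Q-iteration above can be written as a short tabular routine. This is a minimal sketch under my own assumptions (integer-indexed states and actions, dataset entries already marginalized over opponents' actions since they follow the fixed $\sigma_{-i}$), not the paper's implementation:

```python
import numpy as np

def offline_q_iteration(dataset, n_states, n_actions, gamma=0.95, iters=500):
    """Estimate \\hat{P}^{sigma_-i}, \\hat{r}_i^{sigma_-i} from the shared
    dataset and run the Q-iteration update from the text (tabular sketch)."""
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    # Each entry is (s, a_i, r_i, s_next); opponents' actions are implicit
    # because they were sampled from the fixed meta-strategy sigma_{-i}.
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
    visits = counts.sum(axis=2)                          # coverage N_D(s, a_i)
    P_hat = counts / np.maximum(visits, 1)[:, :, None]   # empirical transitions
    r_hat = rew_sum / np.maximum(visits, 1)              # empirical rewards

    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):  # Q^{k+1} = r_hat + gamma * P_hat @ max_a' Q^k
        Q = r_hat + gamma * np.einsum("sat,t->sa", P_hat, Q.max(axis=1))
    return Q, visits
```

The greedy policy `Q.argmax(axis=1)` then plays the role of $\pi_i^{BR}$; the `visits` array is what a coverage-based remedy (Section 3) would threshold.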

2. Algorithmic Structure: JBR in PSRO

The incorporation of JBR into the PSRO framework modifies the standard iterative workflow:

  1. Initialize restricted policy sets $X_i \subset \Pi_i$ for all $i$.
  2. Estimate the restricted game $\hat{G} = (N, (X_i), (\hat{u}_i))$ and solve for the meta-strategy profile $\sigma = \mathrm{MSS}(\hat{G})$.
  3. While not converged:
     a. Collect $D^\sigma$ by rolling out $\sigma$.
     b. For each agent $i$: estimate $\widehat{P}^{\sigma_{-i}}, \widehat{r}_i^{\sigma_{-i}}$ from $D^\sigma$; compute $\pi_i^{BR}$ via offline Q-iteration; update $X_i \leftarrow X_i \cup \{\pi_i^{BR}\}$.
     c. Re-estimate $\hat{G}$ and update $\sigma \leftarrow \mathrm{MSS}(\hat{G})$.
  4. Output the final meta-strategy $\sigma$.
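The loop above can be sketched structurally as follows. The helper callables (`rollout`, `fit_model`, `q_iterate`, `meta_solver`) are placeholders of my own, not any published API; the sketch only shows how one shared dataset per iteration feeds every agent's BR step:

```python
def jbr_psro(init_policies, rollout, fit_model, q_iterate, meta_solver,
             n_iters=10):
    """Structural sketch of PSRO with JBR, following the numbered steps."""
    X = [list(p) for p in init_policies]   # restricted policy sets X_i
    sigma = meta_solver(X)                 # meta-strategy over restricted game
    for _ in range(n_iters):
        D = rollout(sigma)                 # one shared dataset D^sigma
        for i in range(len(X)):            # every agent reuses the same D
            model_i = fit_model(D, i)      # \hat{P}^{sigma_-i}, \hat{r}_i
            X[i].append(q_iterate(model_i))  # offline BR -> new policy
        sigma = meta_solver(X)             # re-solve the restricted game
    return sigma
```

The only change relative to standard PSRO is that the BR oracle consumes `D` rather than making its own environment calls, which is what makes JBR a drop-in modification.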

Unlike standard PSRO that executes per-agent online BR computation, JBR executes all BR updates in parallel, entirely offline, with a single shared dataset (Bighashdel et al., 6 Feb 2026).

3. Remedies for Distribution-Shift Bias in Offline JBR

Because the dataset $D^\sigma$ is generated under the meta-strategy $\sigma$ rather than under each agent's evolving BR policy, the resulting offline RL tasks may suffer from distribution shift and limited state-action coverage. Three mechanisms are introduced:

  • Conservative JBR (safe policy improvement): for each $(s, a_i)$ with coverage below a fixed threshold $N_\wedge$ (i.e., $N_D(s,a_i) < N_\wedge$), constrain the new policy $\pi_i$ to revert to the baseline $\sigma_i$, optimizing only in well-visited regions. This ensures "safe improvement": $\pi_i$ is never worse than $\sigma_i$, up to model error.
  • Exploration-augmented JBR: at data collection, each agent $j$ perturbs its behavior with probability $\delta$, using a mixture of $\sigma_j$ and an exploration policy $\nu_j$ (e.g., uniform random or the current BR candidate):

$$\tilde{\sigma}_j^{(\delta,\nu_j)}(a_j|s) = (1-\delta)\,\sigma_j(a_j|s) + \delta\,\nu_j(a_j|s)$$

Two variants are proposed: random exploration ($\nu_j = \mathrm{Unif}(A_j)$, "JBR-PSRO-$\delta$R") and targeted exploration ($\nu_j = \pi_j^{BR,\mathrm{cur}}$, "JBR-PSRO-$\delta$T"). Theoretical guarantees are provided for finite two-player zero-sum games: if each agent computes an $\epsilon$-best response, the final meta-strategy is an $(\epsilon + 2R\delta)$-Nash equilibrium, where $R$ is the reward range.

  • Hybrid BR: alternates between JBR (possibly exploration-augmented) and standard independent best response (IBR) with a fixed period $k$. This periodicity allows intermittent exact BR computation to correct offline errors while maintaining sample efficiency.

These mechanisms address coverage deficiencies and partially restore the convergence properties and robustness of standard PSRO at lower cost (Bighashdel et al., 6 Feb 2026).
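The first two remedies are simple enough to state in code. This is a minimal tabular sketch under my own conventions (policies as state-by-action probability arrays; function names are mine):

```python
import numpy as np

def exploration_mix(sigma_j, nu_j, delta):
    """Exploration-augmented behavior policy:
    sigma_tilde = (1 - delta) * sigma_j + delta * nu_j."""
    return (1.0 - delta) * sigma_j + delta * nu_j

def conservative_policy(pi_br, sigma_i, visits, n_min):
    """Safe policy improvement: keep the offline BR only in states where
    the greedy action's coverage N_D(s, a) meets the threshold n_min;
    elsewhere revert to the baseline sigma_i."""
    out = sigma_i.copy()
    for s in range(pi_br.shape[0]):
        if visits[s, pi_br[s].argmax()] >= n_min:
            out[s] = pi_br[s]
    return out
```

Mixing with uniform `nu_j` gives the $\delta$R variant; passing the current BR candidate as `nu_j` gives $\delta$T.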

4. Empirical Performance and Comparative Metrics

Empirical evaluation benchmarks JBR and its variants on poker (Kuhn, Leduc) and continuous multi-agent control (Simple Tag, Adversary, Push), focusing on sample efficiency (episodes used) and exploitability (NashConv):

| Algorithm | NashConv (Leduc) | BR Episodes Used | Relative Sample Cost |
|---|---|---|---|
| PSRO (standard) | ~0.01 | ~2.0×10⁶ | 100% |
| Naïve JBR | ~0.15 | ~1.0×10⁶ | 50% |
| Conservative JBR (SPI) | ~0.10–0.12 | ~1.0×10⁶ | 50% |
| JBR-PSRO-δR (δ = 0.1) | ~0.07 | ~1.0×10⁶ | 50% |
| JBR-PSRO-δT (δ = 0.5) | ~0.015 | ~1.0×10⁶ | 50% |
| Hybrid HBR-PSRO(10)-δT | ~0.01 | ~1.2×10⁶ | 60% |

In continuous-control environments, JBR-PSRO-$\delta$T attains NashConv values comparable to standard PSRO and outperforms baselines such as independent learning (IL/DDPG) and CTDE (MADDPG), while naïve JBR underperforms due to insufficient state-action coverage. Increasing $\delta$ in JBR-PSRO-$\delta$T improves accuracy up to a plateau, beyond which performance may degrade (Bighashdel et al., 6 Feb 2026).

5. Extension: Best-Response Map Embedding in Dynamic Games

Beyond the PSRO context, best-response operators are used for structural reduction in finite-horizon dynamic games. Rather than jointly solving KKT conditions for all players, the equilibrium computation is restructured as follows (Rabbani et al., 5 Feb 2026):

Let $Z_i = (X_i, U_i)$ denote the trajectory of player $i$. For fixed $Z_{-i}$, the best-response map is:

$$\mathrm{BR}_i(Z_{-i}) \in \arg\min_{Z_i} J_i(Z_i, Z_{-i})$$

subject to the dynamics $x_{i,k+1} = f_i(x_{i,k}, u_{i,k})$ and constraints. The reduced problem imposes an explicit feasibility constraint $Z_2 = \mathcal{B}_2(Z_1)$, where $\mathcal{B}_2$ is either an exact or a surrogate best-response operator. Optimization proceeds over $Z_1$ and $Z_2$, enforcing the KKT conditions for player 1 and feasibility for player 2; this avoids nested differentiation and yields numerically efficient solutions.

When an approximate surrogate $\widehat{\mathcal{B}}_2$ is learned from offline data (e.g., using MLPs), approximate equilibrium consistency is achieved up to the best-response error. This removes nested optimal-control solves and accommodates possibly asymmetric-information settings (Rabbani et al., 5 Feb 2026).
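As a toy numerical illustration of the reduction (a quadratic example of my own, not from the paper): once player 2's variable is pinned to the best-response map, the joint equilibrium search collapses to a single-variable problem for player 1.

```python
import numpy as np

# Toy costs: J1(z1, z2) = (z1 - 2)^2 + 0.3*z1*z2,
#            J2(z1, z2) = (z2 - 1)^2 + 0.5*z1*z2  (scalars stand in for Z_i).

def br2(z1, b=1.0, d=0.5):
    """Exact best-response map B_2: argmin_{z2} (z2 - b)^2 + d*z1*z2,
    available in closed form for this quadratic cost."""
    return b - 0.5 * d * z1

def reduced_objective(z1, a=2.0, c=0.3):
    """Player 1's cost evaluated on the feasibility constraint z2 = B_2(z1)."""
    z2 = br2(z1)
    return (z1 - a) ** 2 + c * z1 * z2

# A one-dimensional search over z1 replaces the joint KKT system.
grid = np.linspace(-5.0, 5.0, 10001)
z1_star = grid[np.argmin(reduced_objective(grid))]
z2_star = br2(z1_star)
```

In the actual dynamic-game setting, `br2` would be an optimal-control solve (or a learned MLP surrogate $\widehat{\mathcal{B}}_2$) over full trajectories, but the structural point is the same: embedding the BR map removes one player's optimization from the outer problem.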

6. Practical Implications and Limitations

JBR represents a substantial sample-efficiency improvement: in symmetric $n$-player games, best-response sample usage is reduced by approximately a factor of $n$, with up to a 50% total reduction in environments such as Leduc Poker. Exploration-augmented JBR-$\delta$T further narrows the exploitability gap with standard PSRO, and hybrid BR schedules allow tuning the trade-off between exactness and cost.

However, the scaling of PSRO's meta-strategy update and payoff matrix with the policy population remains an open challenge. The efficacy of JBR depends on sufficiently broad exploration during data collection; too little exploration (low $\delta$) or very high-dimensional environments may impair offline RL performance and yield suboptimal BRs. Prospective research directions include adaptive tuning of $\delta$ and $k$, scalable meta-solvers, and the use of learned priors. For dynamic games, generalization to multi-player and heterogeneous-dynamics settings remains a significant future direction (Bighashdel et al., 6 Feb 2026; Rabbani et al., 5 Feb 2026).

