
Joint Experience Best Response (JBR) in MARL

Updated 9 February 2026
  • JBR is a sample-efficient technique that computes best responses in multi-agent reinforcement learning using a single shared dataset.
  • It reformulates the traditional best-response problem by leveraging offline Q-iteration to estimate policies under the PSRO framework while addressing distribution shift.
  • Empirical results in games like poker and continuous control demonstrate that JBR significantly reduces sample cost and narrows the exploitability gap relative to standard PSRO.

Joint Experience Best Response (JBR) is a sample-efficient approach to best-response computation in multi-agent reinforcement learning (MARL), and a structural reduction technique in dynamic games, that leverages joint data collection and offline inference to overcome the scalability and sample-efficiency limitations of standard best-response methods. Within Policy Space Response Oracles (PSRO) and dynamic game solvers, JBR enables simultaneous policy improvement for all agents from a single shared experience dataset, while offering principled remedies for the distribution shift intrinsic to offline policy learning. These methods address foundational tractability and robustness challenges in large-scale multi-agent environments and strategic learning.

1. Definition and Formalization

In MARL and strategic learning, best-response (BR) computation is essential for iterative algorithms such as PSRO. Traditionally, each agent computes its BR to the current meta-strategy profile, requiring independent and often expensive environment interactions. JBR reformulates best-response computation by collecting a single dataset of environment trajectories under the joint meta-strategy profile $\sigma = (\sigma_1, \ldots, \sigma_n)$, denoted:

$$D^\sigma = \{ (s_t, a_{1,t}, \ldots, a_{n,t}, r_{1,t}, \ldots, r_{n,t}, s_{t+1}) \}_{t=0}^{T}$$

For each agent $i$, the best-response problem is cast as an offline RL task using $D^\sigma$. Empirical reward and transition models $\widehat{r}_i^{\sigma_{-i}}, \widehat{P}^{\sigma_{-i}}$ are estimated as:

$$\widehat{P}^{\sigma_{-i}}(s'|s,a_i) = \sum_{a_{-i}} \sigma_{-i}(a_{-i}|s) \cdot P(s'|s,a_i,a_{-i})$$

$$\widehat{r}_i^{\sigma_{-i}}(s,a_i) = \sum_{a_{-i}} \sigma_{-i}(a_{-i}|s) \cdot r_i(s,a_i,a_{-i})$$

Q-iteration is then performed to estimate the offline Q-function and derive the new best-response policy:

$$Q_i^{k+1}(s,a_i) = \widehat{r}_i^{\sigma_{-i}}(s,a_i) + \gamma \sum_{s'} \widehat{P}^{\sigma_{-i}}(s'|s,a_i)\, \max_{a'_i} Q_i^k(s',a'_i)$$

As $k \to \infty$, $Q_i^k$ converges to $Q_i^*$, which yields $\pi_i^{BR}(s) \in \arg\max_{a_i} Q_i^*(s,a_i)$ (Bighashdel et al., 6 Feb 2026).

This conversion yields a joint, sample-amortized solution: each agent reuses $D^\sigma$ instead of sampling independently, making JBR a "drop-in modification" to PSRO that significantly reduces environment interaction.
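As a concrete illustration, the empirical-model estimation and Q-iteration above can be written as a short tabular routine. This is a minimal sketch under my own assumptions (integer-indexed states and actions, dataset entries already marginalized over opponents' actions since they follow the fixed $\sigma_{-i}$), not the paper's implementation:

```python
import numpy as np

def offline_q_iteration(dataset, n_states, n_actions, gamma=0.95, iters=500):
    """Estimate \\hat{P}^{sigma_-i}, \\hat{r}_i^{sigma_-i} from the shared
    dataset and run the Q-iteration update from the text (tabular sketch)."""
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    # Each entry is (s, a_i, r_i, s_next); opponents' actions are implicit
    # because they were sampled from the fixed meta-strategy sigma_{-i}.
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
    visits = counts.sum(axis=2)                          # coverage N_D(s, a_i)
    P_hat = counts / np.maximum(visits, 1)[:, :, None]   # empirical transitions
    r_hat = rew_sum / np.maximum(visits, 1)              # empirical rewards

    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):  # Q^{k+1} = r_hat + gamma * P_hat @ max_a' Q^k
        Q = r_hat + gamma * np.einsum("sat,t->sa", P_hat, Q.max(axis=1))
    return Q, visits
```

The greedy policy `Q.argmax(axis=1)` then plays the role of $\pi_i^{BR}$; the `visits` array is what a coverage-based remedy (Section 3) would threshold.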

2. Algorithmic Structure: JBR in PSRO

The incorporation of JBR into the PSRO framework modifies the standard iterative workflow:

  1. Initialize restricted policy sets $X_i \subset \Pi_i$ for all $i$.
  2. Estimate the restricted game $\hat{G} = (N, (X_i), (\hat{u}_i))$ and solve for the meta-strategy profile $\sigma = \mathrm{MSS}(\hat{G})$.
  3. While not converged:
     a. Collect $D^\sigma$ by rolling out $\sigma$.
     b. For each agent $i$: estimate $\widehat{P}^{\sigma_{-i}}, \widehat{r}_i^{\sigma_{-i}}$ from $D^\sigma$; compute $\pi_i^{BR}$ via offline Q-iteration; update $X_i \leftarrow X_i \cup \{\pi_i^{BR}\}$.
     c. Re-estimate $\hat{G}$ and update $\sigma \leftarrow \mathrm{MSS}(\hat{G})$.
  4. Output the final meta-strategy $\sigma$.
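The loop above can be sketched structurally as follows. The helper callables (`rollout`, `fit_model`, `q_iterate`, `meta_solver`) are placeholders of my own, not any published API; the sketch only shows how one shared dataset per iteration feeds every agent's BR step:

```python
def jbr_psro(init_policies, rollout, fit_model, q_iterate, meta_solver,
             n_iters=10):
    """Structural sketch of PSRO with JBR, following the numbered steps."""
    X = [list(p) for p in init_policies]   # restricted policy sets X_i
    sigma = meta_solver(X)                 # meta-strategy over restricted game
    for _ in range(n_iters):
        D = rollout(sigma)                 # one shared dataset D^sigma
        for i in range(len(X)):            # every agent reuses the same D
            model_i = fit_model(D, i)      # \hat{P}^{sigma_-i}, \hat{r}_i
            X[i].append(q_iterate(model_i))  # offline BR -> new policy
        sigma = meta_solver(X)             # re-solve the restricted game
    return sigma
```

The only change relative to standard PSRO is that the BR oracle consumes `D` rather than making its own environment calls, which is what makes JBR a drop-in modification.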

Unlike standard PSRO that executes per-agent online BR computation, JBR executes all BR updates in parallel, entirely offline, with a single shared dataset (Bighashdel et al., 6 Feb 2026).

3. Remedies for Distribution-Shift Bias in Offline JBR

Because the dataset $D^\sigma$ is generated under the meta-strategy $\sigma$ rather than under each agent's evolving BR policy, the resulting offline RL tasks may suffer from distribution shift and limited state-action coverage. Three mechanisms are introduced:

  • Conservative JBR (safe policy improvement): for each $(s, a_i)$ with coverage below a fixed threshold $N_\wedge$ (i.e., $N_D(s,a_i) < N_\wedge$), constrain the new policy $\pi_i$ to revert to the baseline $\sigma_i$, optimizing only in well-visited regions. This ensures "safe improvement": $\pi_i$ is never worse than $\sigma_i$, up to model error.
  • Exploration-augmented JBR: at data collection, each agent $j$ perturbs its behavior with probability $\delta$, using a mixture of $\sigma_j$ and an exploration policy $\nu_j$ (e.g., uniform random or the current BR candidate):

$$\tilde{\sigma}_j^{(\delta,\nu_j)}(a_j|s) = (1-\delta)\,\sigma_j(a_j|s) + \delta\,\nu_j(a_j|s)$$

Two variants are proposed: random exploration ($\nu_j = \mathrm{Unif}(A_j)$, "JBR-PSRO-$\delta$R") and targeted exploration ($\nu_j = \pi_j^{BR,\mathrm{cur}}$, "JBR-PSRO-$\delta$T"). Theoretical guarantees are provided for finite two-player zero-sum games: if each agent computes an $\epsilon$-best response, the final meta-strategy is an $(\epsilon + 2R\delta)$-Nash equilibrium, where $R$ is the reward range.

  • Hybrid BR: alternates between JBR (possibly exploration-augmented) and standard independent best response (IBR) with a fixed period $k$. This periodicity allows intermittent exact BR computation to correct offline errors while maintaining sample efficiency.

These mechanisms address coverage deficiencies and partially restore the convergence properties and robustness of standard PSRO at lower cost (Bighashdel et al., 6 Feb 2026).
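The first two remedies are simple enough to state in code. This is a minimal tabular sketch under my own conventions (policies as state-by-action probability arrays; function names are mine):

```python
import numpy as np

def exploration_mix(sigma_j, nu_j, delta):
    """Exploration-augmented behavior policy:
    sigma_tilde = (1 - delta) * sigma_j + delta * nu_j."""
    return (1.0 - delta) * sigma_j + delta * nu_j

def conservative_policy(pi_br, sigma_i, visits, n_min):
    """Safe policy improvement: keep the offline BR only in states where
    the greedy action's coverage N_D(s, a) meets the threshold n_min;
    elsewhere revert to the baseline sigma_i."""
    out = sigma_i.copy()
    for s in range(pi_br.shape[0]):
        if visits[s, pi_br[s].argmax()] >= n_min:
            out[s] = pi_br[s]
    return out
```

Mixing with uniform `nu_j` gives the $\delta$R variant; passing the current BR candidate as `nu_j` gives $\delta$T.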

4. Empirical Performance and Comparative Metrics

Empirical evaluation benchmarks JBR and its variants on poker (Kuhn, Leduc) and continuous multi-agent control (Simple Tag, Adversary, Push), focusing on sample efficiency (episodes used) and exploitability (NashConv):

| Algorithm | NashConv (Leduc) | BR Episodes Used | Relative Sample Cost |
|---|---|---|---|
| PSRO (standard) | ~0.01 | ~2.0×10⁶ | 100% |
| Naïve JBR | ~0.15 | ~1.0×10⁶ | 50% |
| Conservative JBR (SPI) | ~0.10–0.12 | ~1.0×10⁶ | 50% |
| JBR-PSRO-δR (δ = 0.1) | ~0.07 | ~1.0×10⁶ | 50% |
| JBR-PSRO-δT (δ = 0.5) | ~0.015 | ~1.0×10⁶ | 50% |
| Hybrid HBR-PSRO(10)-δT | ~0.01 | ~1.2×10⁶ | 60% |

In continuous-control environments, JBR-PSRO-$\delta$T attains NashConv values comparable to standard PSRO and outperforms baselines such as independent learning (IL/DDPG) and CTDE (MADDPG), while naïve JBR underperforms due to insufficient state-action coverage. Increasing $\delta$ in JBR-PSRO-$\delta$T improves accuracy up to a plateau, beyond which performance may degrade (Bighashdel et al., 6 Feb 2026).

5. Extension: Best-Response Map Embedding in Dynamic Games

Beyond the PSRO context, best-response operators are used for structural reduction in finite-horizon dynamic games. Rather than jointly solving KKT conditions for all players, the equilibrium computation is restructured as follows (Rabbani et al., 5 Feb 2026):

Let $Z_i = (X_i, U_i)$ denote the trajectory of player $i$. For fixed $Z_{-i}$, the best-response map is:

$$\mathrm{BR}_i(Z_{-i}) \in \arg\min_{Z_i} J_i(Z_i, Z_{-i})$$

subject to the dynamics $x_{i,k+1} = f_i(x_{i,k}, u_{i,k})$ and constraints. The reduced problem imposes an explicit feasibility constraint $Z_2 = \mathcal{B}_2(Z_1)$, where $\mathcal{B}_2$ is either an exact or a surrogate best-response operator. Optimization proceeds over $Z_1$ and $Z_2$, enforcing the KKT conditions for player 1 and feasibility for player 2; this avoids nested differentiation and yields numerically efficient solutions.

When an approximate surrogate $\widehat{\mathcal{B}}_2$ is learned from offline data (e.g., using MLPs), approximate equilibrium consistency is achieved up to the best-response error. This removes nested optimal-control solves and accommodates possibly asymmetric-information settings (Rabbani et al., 5 Feb 2026).
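As a toy numerical illustration of the reduction (a quadratic example of my own, not from the paper): once player 2's variable is pinned to the best-response map, the joint equilibrium search collapses to a single-variable problem for player 1.

```python
import numpy as np

# Toy costs: J1(z1, z2) = (z1 - 2)^2 + 0.3*z1*z2,
#            J2(z1, z2) = (z2 - 1)^2 + 0.5*z1*z2  (scalars stand in for Z_i).

def br2(z1, b=1.0, d=0.5):
    """Exact best-response map B_2: argmin_{z2} (z2 - b)^2 + d*z1*z2,
    available in closed form for this quadratic cost."""
    return b - 0.5 * d * z1

def reduced_objective(z1, a=2.0, c=0.3):
    """Player 1's cost evaluated on the feasibility constraint z2 = B_2(z1)."""
    z2 = br2(z1)
    return (z1 - a) ** 2 + c * z1 * z2

# A one-dimensional search over z1 replaces the joint KKT system.
grid = np.linspace(-5.0, 5.0, 10001)
z1_star = grid[np.argmin(reduced_objective(grid))]
z2_star = br2(z1_star)
```

In the actual dynamic-game setting, `br2` would be an optimal-control solve (or a learned MLP surrogate $\widehat{\mathcal{B}}_2$) over full trajectories, but the structural point is the same: embedding the BR map removes one player's optimization from the outer problem.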

6. Practical Implications and Limitations

JBR represents a substantial sample-efficiency improvement: in symmetric $n$-player games, best-response sample usage is reduced by approximately a factor of $n$, with up to a 50% total reduction in environments such as Leduc Poker. Exploration-augmented JBR-$\delta$T further narrows the exploitability gap with standard PSRO, and hybrid BR schedules allow tuning the trade-off between exactness and cost.

However, the scaling of PSRO's meta-strategy update and payoff matrix with the policy population remains an open challenge. The efficacy of JBR depends on sufficiently broad exploration during data collection; too little exploration (low $\delta$) or very high-dimensional environments may impair offline RL performance and yield suboptimal BRs. Prospective research directions include adaptive tuning of $\delta$ and $k$, scalable meta-solvers, and the use of learned priors. For dynamic games, generalization to multi-player and heterogeneous-dynamics settings remains a significant future direction (Bighashdel et al., 6 Feb 2026; Rabbani et al., 5 Feb 2026).

