Multi-Agent Multi-Armed Bandit
- Multi-Agent Multi-Armed Bandit models are frameworks where multiple agents select from stochastic arms to collaboratively minimize regret under decentralized information constraints.
- They employ robust communication protocols, including consensus and gossip strategies, to optimize exploration and share limited reward data across networks.
- Advanced techniques in MA-MAB address fairness, privacy, dynamic rewards, and adversarial settings, maintaining near-optimal regret bounds in distributed learning.
Multi-Agent Multi-Armed Bandit (MA-MAB) models generalize the classical stochastic bandit framework to settings where multiple autonomous agents simultaneously interact with shared or related sets of arms, each providing stochastic rewards. The core objective is to devise agent policies and communication protocols that minimize cumulative regret across the network, subject to decentralized information constraints, adversarial disruptions, fairness criteria, privacy desiderata, collisions, dynamic environments, or reward structure complexities.
1. Core Models and Formal Definitions
The canonical MA-MAB setting involves N agents indexed by i = 1, …, N, each facing K arms k = 1, …, K with unknown reward distributions (mean μ_k), or agent-arm-specific means μ_{i,k} (Hossain et al., 2020). Interactions are orchestrated over discrete rounds t = 1, …, T, with agents selecting arms and accumulating rewards. Collaboration between agents is facilitated via various communication topologies: complete graphs (Landgren et al., 2020), stochastic or dynamic graphs (Pankayaraj et al., 2020), asynchronous pairwise gossip (Sankararaman et al., 2019), or more sophisticated blockchain-based substrates (Xu et al., 2024).
Regret is a central metric. In the fully centralized benchmark, regret after T rounds scales as O(Σ_{k: Δ_k > 0} log T / Δ_k), where Δ_k is the suboptimality gap of arm k, but the distributed MA-MAB context admits subtleties, e.g., agent-wise regret, group regret, fairness regret (Manupriya et al., 21 Feb 2025), and Nash-social-welfare regret (Hossain et al., 2020, Xu et al., 17 Jun 2025).
Specialized extensions include collision models (agents pulling the same arm receive zero reward) (Zhou et al., 8 Oct 2025), privacy-constrained sharing (Shao et al., 21 Feb 2025), stochastic arm capacities (Xie et al., 2024), dynamic observations with linear observation cost (Madhushani et al., 2020), and mean-field population equilibrium (Wang et al., 2021).
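To make the core model concrete, here is a minimal, hypothetical simulation sketch (not any cited algorithm): N independent agents each run the classical UCB1 index policy on the same set of Bernoulli arms, with no communication, and group pseudo-regret is accumulated per pull. The arm means, horizon, and agent count are illustrative.

```python
import math
import random

class UCBAgent:
    """A single agent running the classical UCB1 index policy."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # pulls of each arm so far
        self.sums = [0.0] * n_arms   # cumulative reward from each arm

    def select(self, t):
        # Pull each arm once before trusting confidence bounds.
        for k, c in enumerate(self.counts):
            if c == 0:
                return k
        ucb = [self.sums[k] / self.counts[k]
               + math.sqrt(2 * math.log(t) / self.counts[k])
               for k in range(len(self.counts))]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def run(n_agents=3, means=(0.2, 0.5, 0.8), horizon=2000, seed=0):
    """Return group pseudo-regret for independent (no-communication) agents."""
    rng = random.Random(seed)
    agents = [UCBAgent(len(means)) for _ in range(n_agents)]
    best, regret = max(means), 0.0
    for t in range(1, horizon + 1):
        for agent in agents:
            arm = agent.select(t)
            reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli arm
            agent.update(arm, reward)
            regret += best - means[arm]  # pseudo-regret accrues per pull
    return regret
```

Without communication every agent pays the full exploration cost itself, so group regret grows roughly as N · log T; the cooperative protocols below aim to share that cost across the network.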
2. Distributed Learning and Communication Protocols
Decentralized distributed MA-MAB algorithms rely on agents exchanging limited or selective information, supporting scalable learning while controlling communication complexity.
- Consensus-based Estimation: Agents maintain local, running-consensus estimates for each arm—e.g., pull count and cumulative reward—updated via the graph Laplacian or stochastic matrices (Landgren et al., 2020, Cheng et al., 2023). This enables local computation of UCB indices, approximating the centralized optimal estimator with error characterized by network “explore–exploit indices” determined by spectral graph properties (Landgren et al., 2020).
- Gossip Protocols: Sparse, asynchronous pairwise communication (“gossiping”) attains near-centralized regret while incurring only logarithmic-in-horizon communication per agent (Sankararaman et al., 2019). Each agent shares arm-IDs (not samples) with a random peer, and arm knowledge spreads by rumor with exponential tails.
- Selective and Dynamic Communication: Agents dynamically select communication partners using UCB-based “exploration-promise” metrics (Pankayaraj et al., 2019), prioritizing communication with peers most likely to contribute useful exploration data, and limiting the total neighbor degree to a fixed budget for scalability.
- Privacy and Partial Sharing: MA-MAB variants support agents withholding information for privacy (LSI-MAMAB), with the Balanced-ETC algorithm yielding asymptotically optimal regret scaling and robust incentive mechanisms ensuring agents benefit from partial cooperation (Shao et al., 21 Feb 2025).
- Blockchain-based Robust Learning: Fully decentralized MA-MABs employ pools of validators, digital signatures, and secure multi-party computation integrated with UCB bandit strategies, achieving provable logarithmic regret even under arbitrary Byzantine attacks and participant privacy constraints (Xu et al., 2024).
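A hedged sketch of the consensus-based estimation idea above (a generic running-consensus step, not the exact coop-UCB2 update): each agent holds per-arm [pull-count, reward-sum] statistics, and one communication round mixes neighbors' statistics through a doubly stochastic matrix W matching the graph. Function names and shapes are illustrative.

```python
import numpy as np

def consensus_step(stats, W):
    """One running-consensus update: each agent mixes its neighbors' statistics.

    stats: array of shape (n_agents, n_arms, 2), holding each agent's local
           estimate of [pull-count, reward-sum] per arm.
    W:     doubly stochastic mixing matrix matching the communication graph.
    """
    # New estimate for agent i is the W-weighted average of neighbors' stats.
    return np.einsum('ij,jkl->ikl', W, stats)

def ucb_indices(stats, t, eps=1e-9):
    """UCB indices computed locally from (possibly mixed) statistics."""
    counts = stats[..., 0]
    means = stats[..., 1] / np.maximum(counts, eps)
    return means + np.sqrt(2 * np.log(t) / np.maximum(counts, eps))
```

In use, each agent adds (1, reward) for its pulled arm to its own row of `stats` each round, then the network applies `consensus_step`; because W is doubly stochastic, the network-wide totals are preserved while local estimates converge toward the network average at a rate governed by the spectral gap of W.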
3. Regret Guarantees: Upper and Lower Bounds
The performance of MA-MAB algorithms is typically assessed by regret rates achievable under varying assumptions.
- Instance-dependent bounds: For connected or complete graphs, stochastic rewards, and sufficient communication, distributed MA-MAB achieves O(log T) regret per agent, matching the classic UCB single-agent lower bound up to network-dependent constants (Xu et al., 2023, Landgren et al., 2020, Pankayaraj et al., 2020, Sankararaman et al., 2019). Empirical results confirm rapid convergence in both synthetic and real domains (Sankararaman et al., 2019, Pankayaraj et al., 2020).
- Minimax/gap-independent bounds: The minimax lower bound scales as the square root of the horizon (Xu et al., 2023), with UCB-like algorithms (centralized or decentralized with sufficient consensus) attaining matching square-root social-welfare regret, though fairness regret may scale strictly worse (Manupriya et al., 21 Feb 2025).
- Adversarial rewards: Even with adversarially selected reward sequences, decentralized MA-MAB suffers regret at least of order the square root of the horizon over connected graphs, and strictly worse rates once the graph becomes disconnected (Xu et al., 2023). These bounds close prior gaps between minimax upper and lower regret rates.
- Robustness to Malicious Agents: In the presence of malicious agents, the collaboration advantage vanishes unless the protocol can learn and block untrustworthy recommenders; robust algorithms preserve collaborative regret reduction whenever sufficiently many honest agents remain, but revert gracefully to the single-agent baseline otherwise (Vial et al., 2020).
- Collision Models: Distributed elimination/UCB algorithms incorporating adaptive forced-collision communication ensure logarithmic group and per-agent regret with extremely low (doubly-logarithmic) communication cost, outperforming leader-follower and classic distributed protocols (Zhou et al., 8 Oct 2025).
4. Extensions: Fairness, Resource Sharing, Dynamic/Nonstationary Rewards
- Fairness Notions: MA-MAB generalizes classical bandit welfare objectives to Nash social welfare (geometric mean utility across agents) (Hossain et al., 2020, Xu et al., 17 Jun 2025) and minimum-reward-guarantee constraints (Manupriya et al., 21 Feb 2025), requiring careful exploration and allocation strategies (greedy probing, UCB allocation) to balance efficiency and individual guarantees.
- Resource and Capacity Constraints: Bandits with stochastic sharable arm capacities (arrival processes per arm) require agents to estimate optimal pulling profiles without communication; distributed greedy and consensus algorithms achieve the optimal assignment with low regret in Explore–Then–Commit variants (Xie et al., 2024).
- Observation Costs: Agents incurring an observation cost for neighbor queries must balance sampling and observation regret; dynamic protocols restricting neighbor queries to exploration phases attain order-optimal total regret, optimizing the trade-off over network degree and cost parameters (Madhushani et al., 2020).
- Nonstationary, Piecewise-Stationary Rewards: Multi-agent UCB schemes augmented with change-point detectors and network-wide majority-vote restart rules—e.g., RBO-Coop-UCB—address dynamic environments, attaining near-optimal total group regret while sharply reducing detection delay and false alarms (Cheng et al., 2023).
- Mean-Field Population Dynamics: In large agent populations, mean-field models approximate agent interaction by population averages; continuous-reward mean-field bandits achieve unique, globally exponentially stable equilibria under rigorous ODE contraction, with regret rates sublinear in the horizon (Wang et al., 2021).
5. Practical Algorithms and Empirical Evaluation
A range of practical decentralized MA-MAB algorithms have been analyzed and empirically validated:
| Algorithm | Regret Bound | Communication Complexity |
|---|---|---|
| Gossip-based MA-UCB | near-centralized, logarithmic in the horizon (Sankararaman et al., 2019) | sparse pairwise gossip per agent |
| Consensus-based coop-UCB2 | logarithmic, with network-dependent constants (Landgren et al., 2020) | one consensus update per round |
| Decentralized UCB Comm. | logarithmic (Pankayaraj et al., 2019) | bounded neighbor set per agent per round |
| Collision-elimination | logarithmic group and per-agent regret (Zhou et al., 8 Oct 2025) | doubly-logarithmic bits |
| Blockchain Robust UCB | logarithmic under Byzantine attacks (Xu et al., 2024) | digital-signature overhead |
| Fair NSW UCB | sublinear NSW regret (Hossain et al., 2020) | centralized, but generalizes to distributed settings |
Empirical evaluations on synthetic, MovieLens, ridesharing, and marketing datasets, as well as large-scale simulations, consistently demonstrate substantial reduction in average per-agent regret vs. fully independent operation. Network topology indices accurately predict both group and nodal performance.
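The qualitative finding above can be reproduced with a toy experiment (not any cited algorithm): the same UCB1 agents are run either fully independently or against a single pooled statistics table, the latter serving as an idealized full-communication benchmark. All parameters are illustrative.

```python
import math
import random

def simulate(n_agents, means, horizon, share, seed=0):
    """Group pseudo-regret of UCB1 agents, with or without pooled statistics.

    If share is True, all agents read and write one common statistics table
    (idealized instant communication); otherwise each agent learns alone.
    """
    rng = random.Random(seed)
    n_tables = 1 if share else n_agents
    counts = [[0] * len(means) for _ in range(n_tables)]
    sums = [[0.0] * len(means) for _ in range(n_tables)]
    best, regret = max(means), 0.0
    for t in range(1, horizon + 1):
        for i in range(n_agents):
            c = counts[0 if share else i]
            s = sums[0 if share else i]
            if 0 in c:                      # pull each arm once first
                arm = c.index(0)
            else:
                arm = max(range(len(means)),
                          key=lambda k: s[k] / c[k]
                          + math.sqrt(2 * math.log(t) / c[k]))
            r = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
            c[arm] += 1
            s[arm] += r
            regret += best - means[arm]
    return regret
```

With pooled statistics the exploration cost is paid once by the group instead of once per agent, so group regret drops by roughly a factor of the number of agents, mirroring the per-agent regret reductions reported empirically.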
6. Open Problems and Advanced Topics
Active research directions involve:
- Privacy-preserving bandit learning under limited/shared information (Shao et al., 21 Feb 2025), incentive design, and secure computation.
- Robust learning under arbitrary attacks or Byzantine agents; blocking subroutines that identify and ignore untrustworthy recommenders are provably near-optimal (Vial et al., 2020).
- Efficiency vs. fairness trade-offs: optimizing social welfare under Nash or minimum-guarantee constraints may incur strictly higher fairness regret than the social-welfare rate, and current lower bounds do not yet match the known upper bounds (Manupriya et al., 21 Feb 2025, Hossain et al., 2020).
- Advanced bandit structures: stochastic arm capacities (Xie et al., 2024), dynamic environments (Cheng et al., 2023), mean-field games with continuous rewards (Wang et al., 2021).
- Scaling to massive agent populations and arms; fast consensus and low-overhead communication for large graphs (Landgren et al., 2020, Sankararaman et al., 2019, Pankayaraj et al., 2019).
Future work will address theoretical tightness for advanced fairness regimes, optimal exploration under time-varying and adversarial reward models, and decentralized learning under joint privacy and security requirements.
7. Summary and Significance
Multi-agent multi-armed bandit (MA-MAB) research establishes a rigorous foundation and practical suite of algorithms for collaborative, fair, and robust online learning in distributed networks. The field's advances—from gossip-based logarithmic regret, to blockchain-robust protocols, to nuanced fairness and resource-sharing objectives—span a wide array of application domains in networked systems, social platforms, scientific collaboration, resource allocation, and federated learning. Lower and upper bounds are now tightly characterized for most canonical regimes, with ongoing research addressing the frontier of complex multi-agent interactions.