Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning
Abstract: In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the heightened challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links, a task that becomes increasingly complex as the number of agents grows, we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at https://github.com/LXXXXR/ExpoComm.
Summary
- The paper introduces ExpoComm, a scalable MARL communication protocol leveraging fixed exponential graph topologies for $O(N \log N)$ or $O(N)$ communication cost, unlike quadratic learned methods.
- ExpoComm employs memory-based message processing (Attention or RNN) and auxiliary tasks (state prediction or contrastive learning) to effectively utilize rapid information spread across agents.
- Experiments show ExpoComm outperforms baselines on large-scale MARL tasks, achieves high performance with low $O(N)$ communication overhead, and exhibits robust zero-shot transferability to larger agent populations.
Cooperative multi-agent reinforcement learning (MARL) faces significant challenges in scaling communication protocols, particularly as the number of agents ($N$) increases. Mitigating partial observability via communication is crucial, but existing methods often rely on learning pairwise communication links, incurring high computational costs (often quadratic in $N$) and struggling to identify relevant connections in large systems. The paper "Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning" (arXiv:2502.19717) introduces ExpoComm, a communication protocol designed explicitly for scalability by shifting focus from learned pairwise links to the design of an efficient global communication topology.
ExpoComm Framework
ExpoComm leverages the structural properties of exponential graphs to facilitate rapid and efficient information dissemination across large agent populations. Instead of learning connectivity, it employs a pre-defined, rule-based topology that guarantees desirable scaling properties.
Exponential Graph Topology
The core idea is to utilize a communication graph with both a small diameter (ensuring fast information spread) and small size (ensuring low communication overhead). Exponential graphs satisfy these criteria. The paper considers two variants:
- Static Exponential Graph: Agents are arranged conceptually on a ring. Each agent $i$ connects to neighbors at distances $2^k$ in both clockwise and counter-clockwise directions, for $k = 0, 1, \ldots, \lfloor \log_2(N-1) \rfloor$. The neighbors of agent $i$ are $\{ (i \pm 2^k) \bmod N \mid k = 0, \ldots, \lfloor \log_2(N-1) \rfloor \}$. This graph has a diameter of $\lceil \log_2(N-1) \rceil$ and a size (total number of directed edges) of $O(N \log N)$; each agent has a degree of $O(\log N)$.
- One-peer Exponential Graph: To further reduce communication overhead, this variant introduces sparsity dynamically. At timestep $t$, agent $i$ connects to only one peer, agent $(i + 2^k) \bmod N$, where $k = t \bmod (\lceil \log_2(N-1) \rceil + 1)$. Over a cycle of $\lceil \log_2(N-1) \rceil + 1$ steps, each agent communicates along every link of the static exponential graph. This reduces the instantaneous communication degree of each agent to 1, yielding a graph size of $O(N)$ per timestep; the effective diameter, considering information propagation over multiple steps, remains $O(\log N)$.
The logarithmic diameter ensures that information originating from any agent can reach any other agent within a number of steps that grows very slowly with $N$, while the (near-)linear size keeps the communication cost tractable for large systems.
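These scaling properties are easy to check numerically. The standalone sketch below (not from the paper's codebase) builds the static exponential neighbor set, verifies by breadth-first search that the diameter stays within $\lceil \log_2(N-1) \rceil$, and confirms that one cycle of the one-peer schedule traverses the "$+2^k$" links of the static graph:

```python
import math
from collections import deque

def static_neighbors(i, n):
    """Neighbors of agent i in the static exponential graph on n agents."""
    ks = range(int(math.log2(n - 1)) + 1)  # k = 0, ..., floor(log2(n-1))
    return {(i + (1 << k)) % n for k in ks} | {(i - (1 << k)) % n for k in ks}

def diameter(n):
    """Longest shortest path over all sources, via one BFS per source agent."""
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in static_neighbors(u, n):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

n = 100
assert diameter(n) <= math.ceil(math.log2(n - 1))  # logarithmic diameter

# One cycle of the one-peer schedule covers the "+2^k" links of the static
# graph (checked here from agent 0's perspective).
cycle_len = int(math.log2(n - 1)) + 1
one_peer_links = {(1 << (t % cycle_len)) % n for t in range(cycle_len)}
assert one_peer_links <= static_neighbors(0, n)
```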
Neural Architecture and Message Processing
To effectively utilize the multi-hop information propagation inherent in the small-diameter topology, ExpoComm employs memory-based message processors. Each agent $i$ maintains a message state $m_i^t$. The message update rule depends on the chosen topology:
- Static Graph: An agent updates its message based on its previous message and the messages received from all $O(\log N)$ neighbors. This is amenable to permutation-invariant aggregation mechanisms such as attention:
$m_i^t = \mathrm{Attention}\left(m_i^{t-1}, \{ m_j^{t-1} \mid j \in \mathcal{N}_i \}\right)$
where $\mathcal{N}_i$ denotes the neighbors of agent $i$ in the static exponential graph.
- One-peer Graph: An agent receives only one message per timestep from its designated peer $j$. A recurrent neural network (e.g., a GRU) is well-suited to integrating this sequential information over time:
$m_i^t = \mathrm{GRU}(m_i^{t-1}, m_j^{t-1})$
where $j = (i + 2^k) \bmod N$ with $k = t \bmod (\lceil \log_2(N-1) \rceil + 1)$.
These processors allow agents to implicitly accumulate information propagated through the exponential graph over the $O(\log N)$ steps needed to bridge the graph diameter.
A cleaned-up sketch of the agent-side logic follows; for simplicity, the static-graph attention module is approximated here by mean-pooling over neighbor messages followed by an MLP:

```python
import math

import torch
import torch.nn as nn


class ExpoCommAgent:
    def __init__(self, agent_id, num_agents, use_static_graph=True, message_dim=64):
        self.agent_id = agent_id
        self.num_agents = num_agents
        self.use_static_graph = use_static_graph
        self.message_state = torch.zeros(message_dim)  # m_i^{t-1}
        if use_static_graph:
            # Precompute static neighbors from the exponential-graph rule.
            self.neighbors = self._compute_static_neighbors()
            # Permutation-invariant aggregation over neighbor messages; a mean
            # followed by an MLP stands in for a full attention module here.
            self.message_processor = nn.Sequential(
                nn.Linear(2 * message_dim, message_dim),
                nn.ReLU(),
                nn.Linear(message_dim, message_dim),
            )
        else:
            # One incoming message per step: integrate sequentially with a GRU.
            self.message_processor = nn.GRUCell(message_dim, message_dim)

    def _compute_static_neighbors(self):
        neighbors = set()
        for k in range(int(math.log2(self.num_agents - 1)) + 1):
            dist = 1 << k  # 2^k
            neighbors.add((self.agent_id + dist) % self.num_agents)
            neighbors.add((self.agent_id - dist) % self.num_agents)
        return sorted(neighbors - {self.agent_id})  # exclude self-loops

    def _get_one_peer_neighbor(self, timestep):
        cycle_len = int(math.log2(self.num_agents - 1)) + 1
        k = timestep % cycle_len
        return (self.agent_id + (1 << k)) % self.num_agents

    def update_message(self, received_messages, timestep):
        """received_messages: dict mapping agent_id -> message tensor m_j^{t-1}."""
        if self.use_static_graph:
            neighbor_messages = [
                received_messages[j] for j in self.neighbors if j in received_messages
            ]
            if neighbor_messages:
                aggregated = torch.stack(neighbor_messages).mean(dim=0)
            else:
                # No messages received yet (e.g., first timestep of an episode).
                aggregated = torch.zeros_like(self.message_state)
            combined = torch.cat((self.message_state, aggregated), dim=0)
            new_message = self.message_processor(combined)
        else:
            peer_id = self._get_one_peer_neighbor(timestep)
            peer_message = received_messages.get(
                peer_id, torch.zeros_like(self.message_state)
            )
            # GRU: input = peer message, hidden state = previous own message.
            new_message = self.message_processor(
                peer_message.unsqueeze(0), self.message_state.unsqueeze(0)
            ).squeeze(0)
        self.message_state = new_message  # m_i^t
        return self.message_state

    def get_action(self, observation):
        # Action selection conditions on the local observation and the current
        # message state, e.g. policy_network(cat(observation, message_state)).
        raise NotImplementedError
```
Message Grounding via Auxiliary Tasks
To ensure the propagated messages $m_i^t$ contain globally relevant information useful for decision-making, ExpoComm incorporates auxiliary tasks during the centralized training phase (within a CTDE framework such as QMIX or VDN). Two alternatives are proposed:
- Global State Prediction: If the global state $s^t$ is accessible during training, a decoder network attempts to reconstruct $s^t$ from each agent's message $m_i^t$. The loss is the mean squared error (MSE):
$\mathcal{L}_{\text{aux}}^{\text{pred}} = \frac{1}{N} \sum_i \left\| \mathrm{Decoder}(m_i^t) - s^t \right\|^2$
- Contrastive Learning: When the global state is unavailable or too high-dimensional, a contrastive (InfoNCE-style) objective is used. Messages from different agents at the same timestep ($m_i^t, m_j^t$ for $i \neq j$) are treated as positive pairs and encouraged to be similar, while messages from sufficiently distant timesteps ($m_i^t, m_{j'}^{t'}$ with $|t - t'| > \tau$) are negative pairs, pushed apart. This encourages messages to encode a shared representation of the current global context.
$\mathcal{L}_{\text{aux}}^{\text{cont}} = -\frac{1}{N} \sum_i \log \frac{\sum_{j \neq i} \exp(\mathrm{sim}(m_i^t, m_j^t)/\theta)}{\sum_{j \neq i} \exp(\mathrm{sim}(m_i^t, m_j^t)/\theta) + \sum_{j'} \sum_{t' : |t - t'| > \tau} \exp(\mathrm{sim}(m_i^t, m_{j'}^{t'})/\theta)}$
where $\mathrm{sim}$ is a similarity function (e.g., cosine similarity) and $\theta$ is a temperature parameter.
The total loss combines the standard MARL task loss (e.g., the TD error from QMIX) with the auxiliary loss, weighted by a coefficient $\alpha$: $\mathcal{L} = \mathcal{L}_{\text{TD}} + \alpha \cdot \mathcal{L}_{\text{aux}}$.
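As a concrete (hypothetical) PyTorch sketch of the two grounding objectives, assuming per-timestep messages are stacked into an $(N, d)$ tensor and `decoder` is any module mapping message dimension to state dimension:

```python
import torch
import torch.nn.functional as F

def state_prediction_loss(messages, global_state, decoder):
    """MSE between each agent's decoded message and the shared global state.

    messages: (N, d_msg); global_state: (d_state,); decoder: d_msg -> d_state.
    """
    preds = decoder(messages)                           # (N, d_state)
    target = global_state.unsqueeze(0).expand_as(preds)
    return F.mse_loss(preds, target)

def contrastive_loss(messages_t, messages_far, temperature=0.1):
    """InfoNCE-style grounding: same-timestep messages are positives;
    messages from distant timesteps (|t - t'| > tau) are negatives.

    messages_t: (N, d) at timestep t; messages_far: (M, d) from far timesteps.
    """
    z_t = F.normalize(messages_t, dim=-1)               # cosine similarity via
    z_far = F.normalize(messages_far, dim=-1)           # normalized dot products
    pos = torch.exp(z_t @ z_t.T / temperature)          # (N, N) same-timestep
    pos = pos - torch.diag(torch.diag(pos))             # drop i == j terms
    neg = torch.exp(z_t @ z_far.T / temperature)        # (N, M) cross-timestep
    pos_sum = pos.sum(dim=1)
    return -torch.log(pos_sum / (pos_sum + neg.sum(dim=1))).mean()

# Combined objective (alpha is a tunable coefficient):
# loss = td_loss + alpha * aux_loss
```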
Scalability and Theoretical Advantages
ExpoComm's design directly addresses the scalability limitations of prior communication protocols:
- Communication Cost: The number of messages transmitted per timestep scales as $O(N \log N)$ for the static graph and $O(N)$ for the one-peer graph. This contrasts favorably with fully connected graphs ($O(N^2)$) and attention-based methods such as CommFormer that compute pairwise attention scores ($O(N^2)$ complexity).
- Information Propagation Speed: The logarithmic diameter $O(\log N)$ guarantees that global information can theoretically disseminate across the entire network much faster than in topologies with larger diameters, such as chains ($O(N)$) or 2D grids ($O(\sqrt{N})$), or potentially sparse learned graphs.
- Computational Cost: Using fixed topologies avoids the optimization overhead of learning communication links. While attention mechanisms in the static variant still incur computational cost, they operate only over $O(\log N)$ neighbors per agent, compared to $O(N)$ in fully connected attention. The GRU-based one-peer variant is computationally very efficient. The overall training complexity, especially GPU memory usage, is significantly lower than methods involving dense pairwise interactions for large $N$.
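To make the asymptotics concrete, here is a back-of-the-envelope count of directed messages sent per timestep under each topology (the static-graph degree is an upper bound, since $\pm 2^k$ offsets can coincide for small $N$):

```python
import math

def messages_per_step(n):
    """Approximate directed messages sent per timestep for n agents."""
    num_hops = int(math.log2(n - 1)) + 1  # number of distinct 2^k offsets
    return {
        "fully_connected": n * (n - 1),          # O(N^2): everyone to everyone
        "static_exponential": n * 2 * num_hops,  # O(N log N): degree <= 2*num_hops
        "one_peer": n,                           # O(N): one outgoing message each
    }

for n in (20, 100, 1000):
    print(n, messages_per_step(n))
```

For $N = 100$ this gives roughly 9,900 messages per step fully connected versus about 1,400 for the static exponential graph and exactly 100 for the one-peer variant.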
Experimental Validation
The effectiveness of ExpoComm was evaluated on large-scale cooperative MARL benchmarks, specifically MAgent (AdversarialPursuit, Battle) and the Infrastructure Management Planning (IMP) suite (Uncorrelated, Correlated, OWF), with agent populations ranging from $N = 20$ to $N = 100$.
Performance Comparison
ExpoComm variants were compared against several baselines:
- No communication (IDQN/QMIX)
- Distance-based communication (DGN+TarMAC)
- Random graph communication (ER graph + TarMAC)
- Learned communication (CommFormer)
Key findings include:
- ExpoComm consistently outperformed all baselines across the tested scenarios and agent numbers, often by a significant margin, especially at larger scales ($N \geq 60$).
- The results held under different communication budgets, specifically comparing ExpoComm with $K = \lceil \log_2(N-1) \rceil$ neighbors (static graph or multi-peer dynamic) and $K = 1$ neighbor (one-peer graph).
- Notably, the ExpoComm variant using the one-peer graph ($O(N)$ communication cost) frequently achieved performance comparable to or even exceeding the static graph variant ($O(N \log N)$ cost) and significantly outperformed baselines, demonstrating high communication efficiency. For instance, in MAgent Battle ($N = 100$), ExpoComm ($K = 1$) achieved nearly twice the win rate of the next best baseline (CommFormer).
Zero-Shot Transferability
A critical result highlighted is ExpoComm's robust zero-shot transfer capability. Models trained with a specific number of agents (e.g., $N = 60$) were directly evaluated on scenarios with a larger number of agents (e.g., $N = 100$) without any retraining.
- ExpoComm demonstrated significantly better generalization performance compared to baselines in these transfer tasks. While baseline performance often degraded sharply, ExpoComm maintained a high level of effectiveness.
- This robustness is attributed to the global, rule-based nature of the exponential topology, which does not depend on learning pairwise interactions specific to the training scale $N$; the structure inherently adapts to different $N$.
Ablation Studies
Ablations confirmed the necessity of key components:
- Removing the memory-based message processors (RNN/Attention) led to a substantial performance drop, confirming their importance for integrating information over multiple hops.
- Disabling the auxiliary message grounding tasks also degraded performance, showing their role in making the communicated messages meaningful for the task.
Implementation Considerations
ExpoComm offers practical advantages for implementation and deployment:
- Simplicity: The rule-based nature of exponential graphs makes them straightforward to construct for any number of agents $N$.
- Flexibility: The choice between the static ($O(N \log N)$ cost, parallel processing via attention) and one-peer ($O(N)$ cost, sequential processing via RNN) variants allows practitioners to trade off communication cost and computational parallelism based on application constraints.
- Reduced Overheads: Avoids the complex search or learning procedures for communication links required by other methods, simplifying the training pipeline and reducing computational requirements, particularly GPU memory for large $N$.
Conclusion
ExpoComm presents a scalable and effective approach to communication in large-scale MARL systems by leveraging the graph-theoretic properties of exponential topologies. Its design provides a principled way to achieve rapid global information dissemination with low, (near-)linearly scaling communication overhead. The strong empirical results, particularly the performance of the highly efficient one-peer variant and the robust zero-shot transferability, demonstrate its potential as a practical communication backbone for real-world multi-agent applications involving dozens or hundreds of agents.