Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning
Abstract: In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the heightened challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links, a task that becomes increasingly complex as the number of agents grows, we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at https://github.com/LXXXXR/ExpoComm.
Summary
- The paper introduces ExpoComm, a scalable MARL communication protocol leveraging fixed exponential graph topologies for $O(N \log N)$ or $O(N)$ communication cost, unlike quadratic learned methods.
- ExpoComm employs memory-based message processing (Attention or RNN) and auxiliary tasks (state prediction or contrastive learning) to effectively utilize rapid information spread across agents.
- Experiments show ExpoComm outperforms baselines on large-scale MARL tasks, achieves high performance with low $O(N)$ communication overhead, and exhibits robust zero-shot transferability to larger agent populations.
Cooperative multi-agent reinforcement learning (MARL) faces significant challenges in scaling communication protocols, particularly as the number of agents ($N$) increases. Mitigating partial observability via communication is crucial, but existing methods often rely on learning pairwise communication links, incurring high computational costs (often quadratic in $N$) and struggling to identify relevant connections in large systems. The paper "Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning" (arXiv:2502.19717) introduces ExpoComm, a communication protocol designed explicitly for scalability by shifting focus from learned pairwise links to the design of an efficient global communication topology.
ExpoComm Framework
ExpoComm leverages the structural properties of exponential graphs to facilitate rapid and efficient information dissemination across large agent populations. Instead of learning connectivity, it employs a pre-defined, rule-based topology that guarantees desirable scaling properties.
Exponential Graph Topology
The core idea is to utilize a communication graph with both a small diameter (ensuring fast information spread) and small size (ensuring low communication overhead). Exponential graphs satisfy these criteria. The paper considers two variants:
- Static Exponential Graph: Agents are arranged conceptually on a ring. Each agent $i$ connects to neighbors at distances $2^k$ in both clockwise and counter-clockwise directions, for $k = 0, 1, \ldots, \lfloor \log_2(N-1) \rfloor$. The neighbors of agent $i$ are $\{ (i \pm 2^k) \bmod N \mid k = 0, \ldots, \lfloor \log_2(N-1) \rfloor \}$. This graph has a diameter of $\lceil \log_2(N-1) \rceil$ and a size (total number of directed edges) of $O(N \log N)$; each agent has a degree of $O(\log N)$.
- One-peer Exponential Graph: To further reduce communication overhead, this variant introduces sparsity dynamically. At timestep $t$, agent $i$ connects to only one peer, agent $(i + 2^k) \bmod N$, where $k = t \bmod (\lceil \log_2(N-1) \rceil + 1)$. Over a cycle of $\lceil \log_2(N-1) \rceil + 1$ steps, each agent communicates along every link of the static exponential graph. This reduces the instantaneous communication degree of each agent to 1, yielding a graph size of $O(N)$ per timestep; the effective diameter, considering information propagation over multiple steps, remains $O(\log N)$.
The logarithmic diameter ensures that information originating from any agent can reach any other agent within a number of steps that grows very slowly with $N$, while the (near-)linear size keeps the communication cost tractable for large systems.
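These scaling properties are easy to check numerically. The standalone sketch below (not from the paper's codebase) builds the static exponential neighbor set, verifies by breadth-first search that the diameter stays within $\lceil \log_2(N-1) \rceil$, and confirms that one cycle of the one-peer schedule traverses the "$+2^k$" links of the static graph:

```python
import math
from collections import deque

def static_neighbors(i, n):
    """Neighbors of agent i in the static exponential graph on n agents."""
    ks = range(int(math.log2(n - 1)) + 1)  # k = 0, ..., floor(log2(n-1))
    return {(i + (1 << k)) % n for k in ks} | {(i - (1 << k)) % n for k in ks}

def diameter(n):
    """Longest shortest path over all sources, via one BFS per source agent."""
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in static_neighbors(u, n):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

n = 100
assert diameter(n) <= math.ceil(math.log2(n - 1))  # logarithmic diameter

# One cycle of the one-peer schedule covers the "+2^k" links of the static
# graph (checked here from agent 0's perspective).
cycle_len = int(math.log2(n - 1)) + 1
one_peer_links = {(1 << (t % cycle_len)) % n for t in range(cycle_len)}
assert one_peer_links <= static_neighbors(0, n)
```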
Neural Architecture and Message Processing
To effectively utilize the multi-hop information propagation inherent in the small-diameter topology, ExpoComm employs memory-based message processors. Each agent $i$ maintains a message state $m_i^t$. The message update rule depends on the chosen topology:
- Static Graph: An agent updates its message based on its previous message and the messages received from all $O(\log N)$ neighbors. This is amenable to permutation-invariant aggregation mechanisms such as attention:
$m_i^t = \mathrm{Attention}\left(m_i^{t-1}, \{ m_j^{t-1} \mid j \in \mathcal{N}_i \}\right)$
where $\mathcal{N}_i$ denotes the neighbors of agent $i$ in the static exponential graph.
- One-peer Graph: An agent receives only one message per timestep from its designated peer $j$. A recurrent neural network (e.g., a GRU) is well-suited to integrating this sequential information over time:
$m_i^t = \mathrm{GRU}(m_i^{t-1}, m_j^{t-1})$
where $j = (i + 2^k) \bmod N$ with $k = t \bmod (\lceil \log_2(N-1) \rceil + 1)$.
These processors allow agents to implicitly accumulate information propagated through the exponential graph over the $O(\log N)$ steps needed to bridge the graph diameter.
A cleaned-up sketch of the agent-side logic follows; for simplicity, the static-graph attention module is approximated here by mean-pooling over neighbor messages followed by an MLP:

```python
import math

import torch
import torch.nn as nn


class ExpoCommAgent:
    def __init__(self, agent_id, num_agents, use_static_graph=True, message_dim=64):
        self.agent_id = agent_id
        self.num_agents = num_agents
        self.use_static_graph = use_static_graph
        self.message_state = torch.zeros(message_dim)  # m_i^{t-1}
        if use_static_graph:
            # Precompute static neighbors from the exponential-graph rule.
            self.neighbors = self._compute_static_neighbors()
            # Permutation-invariant aggregation over neighbor messages; a mean
            # followed by an MLP stands in for a full attention module here.
            self.message_processor = nn.Sequential(
                nn.Linear(2 * message_dim, message_dim),
                nn.ReLU(),
                nn.Linear(message_dim, message_dim),
            )
        else:
            # One incoming message per step: integrate sequentially with a GRU.
            self.message_processor = nn.GRUCell(message_dim, message_dim)

    def _compute_static_neighbors(self):
        neighbors = set()
        for k in range(int(math.log2(self.num_agents - 1)) + 1):
            dist = 1 << k  # 2^k
            neighbors.add((self.agent_id + dist) % self.num_agents)
            neighbors.add((self.agent_id - dist) % self.num_agents)
        return sorted(neighbors - {self.agent_id})  # exclude self-loops

    def _get_one_peer_neighbor(self, timestep):
        cycle_len = int(math.log2(self.num_agents - 1)) + 1
        k = timestep % cycle_len
        return (self.agent_id + (1 << k)) % self.num_agents

    def update_message(self, received_messages, timestep):
        """received_messages: dict mapping agent_id -> message tensor m_j^{t-1}."""
        if self.use_static_graph:
            neighbor_messages = [
                received_messages[j] for j in self.neighbors if j in received_messages
            ]
            if neighbor_messages:
                aggregated = torch.stack(neighbor_messages).mean(dim=0)
            else:
                # No messages received yet (e.g., first timestep of an episode).
                aggregated = torch.zeros_like(self.message_state)
            combined = torch.cat((self.message_state, aggregated), dim=0)
            new_message = self.message_processor(combined)
        else:
            peer_id = self._get_one_peer_neighbor(timestep)
            peer_message = received_messages.get(
                peer_id, torch.zeros_like(self.message_state)
            )
            # GRU: input = peer message, hidden state = previous own message.
            new_message = self.message_processor(
                peer_message.unsqueeze(0), self.message_state.unsqueeze(0)
            ).squeeze(0)
        self.message_state = new_message  # m_i^t
        return self.message_state

    def get_action(self, observation):
        # Action selection conditions on the local observation and the current
        # message state, e.g. policy_network(cat(observation, message_state)).
        raise NotImplementedError
```
Message Grounding via Auxiliary Tasks
To ensure the propagated messages $m_i^t$ contain globally relevant information useful for decision-making, ExpoComm incorporates auxiliary tasks during the centralized training phase (within a CTDE framework such as QMIX or VDN). Two alternatives are proposed:
- Global State Prediction: If the global state $s^t$ is accessible during training, a decoder network attempts to reconstruct $s^t$ from each agent's message $m_i^t$. The loss is the mean squared error (MSE):
$\mathcal{L}_{\text{aux}}^{\text{pred}} = \frac{1}{N} \sum_i \left\| \mathrm{Decoder}(m_i^t) - s^t \right\|^2$
- Contrastive Learning: When the global state is unavailable or too high-dimensional, a contrastive (InfoNCE-style) objective is used. Messages from different agents at the same timestep ($m_i^t, m_j^t$ for $i \neq j$) are treated as positive pairs and encouraged to be similar, while messages from sufficiently distant timesteps ($m_i^t, m_{j'}^{t'}$ with $|t - t'| > \tau$) are negative pairs, pushed apart. This encourages messages to encode a shared representation of the current global context.
$\mathcal{L}_{\text{aux}}^{\text{cont}} = -\frac{1}{N} \sum_i \log \frac{\sum_{j \neq i} \exp(\mathrm{sim}(m_i^t, m_j^t)/\theta)}{\sum_{j \neq i} \exp(\mathrm{sim}(m_i^t, m_j^t)/\theta) + \sum_{j'} \sum_{t' : |t - t'| > \tau} \exp(\mathrm{sim}(m_i^t, m_{j'}^{t'})/\theta)}$
where $\mathrm{sim}$ is a similarity function (e.g., cosine similarity) and $\theta$ is a temperature parameter.
The total loss combines the standard MARL task loss (e.g., the TD error from QMIX) with the auxiliary loss, weighted by a coefficient $\alpha$: $\mathcal{L} = \mathcal{L}_{\text{TD}} + \alpha \cdot \mathcal{L}_{\text{aux}}$.
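As a concrete (hypothetical) PyTorch sketch of the two grounding objectives, assuming per-timestep messages are stacked into an $(N, d)$ tensor and `decoder` is any module mapping message dimension to state dimension:

```python
import torch
import torch.nn.functional as F

def state_prediction_loss(messages, global_state, decoder):
    """MSE between each agent's decoded message and the shared global state.

    messages: (N, d_msg); global_state: (d_state,); decoder: d_msg -> d_state.
    """
    preds = decoder(messages)                           # (N, d_state)
    target = global_state.unsqueeze(0).expand_as(preds)
    return F.mse_loss(preds, target)

def contrastive_loss(messages_t, messages_far, temperature=0.1):
    """InfoNCE-style grounding: same-timestep messages are positives;
    messages from distant timesteps (|t - t'| > tau) are negatives.

    messages_t: (N, d) at timestep t; messages_far: (M, d) from far timesteps.
    """
    z_t = F.normalize(messages_t, dim=-1)               # cosine similarity via
    z_far = F.normalize(messages_far, dim=-1)           # normalized dot products
    pos = torch.exp(z_t @ z_t.T / temperature)          # (N, N) same-timestep
    pos = pos - torch.diag(torch.diag(pos))             # drop i == j terms
    neg = torch.exp(z_t @ z_far.T / temperature)        # (N, M) cross-timestep
    pos_sum = pos.sum(dim=1)
    return -torch.log(pos_sum / (pos_sum + neg.sum(dim=1))).mean()

# Combined objective (alpha is a tunable coefficient):
# loss = td_loss + alpha * aux_loss
```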
Scalability and Theoretical Advantages
ExpoComm's design directly addresses the scalability limitations of prior communication protocols:
- Communication Cost: The number of messages transmitted per timestep scales as $O(N \log N)$ for the static graph and $O(N)$ for the one-peer graph. This contrasts favorably with fully connected graphs ($O(N^2)$) and attention-based methods such as CommFormer that compute pairwise attention scores ($O(N^2)$ complexity).
- Information Propagation Speed: The logarithmic diameter $O(\log N)$ guarantees that global information can theoretically disseminate across the entire network much faster than in topologies with larger diameters, such as chains ($O(N)$) or 2D grids ($O(\sqrt{N})$), or potentially sparse learned graphs.
- Computational Cost: Using fixed topologies avoids the optimization overhead of learning communication links. While attention mechanisms in the static variant still incur computational cost, they operate only over $O(\log N)$ neighbors per agent, compared to $O(N)$ in fully connected attention. The GRU-based one-peer variant is computationally very efficient. The overall training complexity, especially GPU memory usage, is significantly lower than methods involving dense pairwise interactions for large $N$.
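To make the asymptotics concrete, here is a back-of-the-envelope count of directed messages sent per timestep under each topology (the static-graph degree is an upper bound, since $\pm 2^k$ offsets can coincide for small $N$):

```python
import math

def messages_per_step(n):
    """Approximate directed messages sent per timestep for n agents."""
    num_hops = int(math.log2(n - 1)) + 1  # number of distinct 2^k offsets
    return {
        "fully_connected": n * (n - 1),          # O(N^2): everyone to everyone
        "static_exponential": n * 2 * num_hops,  # O(N log N): degree <= 2*num_hops
        "one_peer": n,                           # O(N): one outgoing message each
    }

for n in (20, 100, 1000):
    print(n, messages_per_step(n))
```

For $N = 100$ this gives roughly 9,900 messages per step fully connected versus about 1,400 for the static exponential graph and exactly 100 for the one-peer variant.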
Experimental Validation
The effectiveness of ExpoComm was evaluated on large-scale cooperative MARL benchmarks, specifically MAgent (AdversarialPursuit, Battle) and the Infrastructure Management Planning (IMP) suite (Uncorrelated, Correlated, OWF), with agent populations ranging from $N = 20$ to $N = 100$.
Performance Comparison
ExpoComm variants were compared against several baselines:
- No communication (IDQN/QMIX)
- Distance-based communication (DGN+TarMAC)
- Random graph communication (ER graph + TarMAC)
- Learned communication (CommFormer)
Key findings include:
- ExpoComm consistently outperformed all baselines across the tested scenarios and agent numbers, often by a significant margin, especially at larger scales ($N \geq 60$).
- The results held under different communication budgets, specifically comparing ExpoComm with $K = \lceil \log_2(N-1) \rceil$ neighbors (static graph or multi-peer dynamic) and $K = 1$ neighbor (one-peer graph).
- Notably, the ExpoComm variant using the one-peer graph ($O(N)$ communication cost) frequently achieved performance comparable to or even exceeding the static graph variant ($O(N \log N)$ cost) and significantly outperformed baselines, demonstrating high communication efficiency. For instance, in MAgent Battle ($N = 100$), ExpoComm ($K = 1$) achieved nearly twice the win rate of the next best baseline (CommFormer).
Zero-Shot Transferability
A critical result highlighted is ExpoComm's robust zero-shot transfer capability. Models trained with a specific number of agents (e.g., $N = 60$) were directly evaluated on scenarios with a larger number of agents (e.g., $N = 100$) without any retraining.
- ExpoComm demonstrated significantly better generalization performance compared to baselines in these transfer tasks. While baseline performance often degraded sharply, ExpoComm maintained a high level of effectiveness.
- This robustness is attributed to the global, rule-based nature of the exponential topology, which does not depend on learning pairwise interactions specific to the training scale $N$; the structure inherently adapts to different $N$.
Ablation Studies
Ablations confirmed the necessity of key components:
- Removing the memory-based message processors (RNN/Attention) led to a substantial performance drop, confirming their importance for integrating information over multiple hops.
- Disabling the auxiliary message grounding tasks also degraded performance, showing their role in making the communicated messages meaningful for the task.
Implementation Considerations
ExpoComm offers practical advantages for implementation and deployment:
- Simplicity: The rule-based nature of exponential graphs makes them straightforward to construct for any number of agents $N$.
- Flexibility: The choice between the static ($O(N \log N)$ cost, parallel processing via attention) and one-peer ($O(N)$ cost, sequential processing via RNN) variants allows practitioners to trade off communication cost and computational parallelism based on application constraints.
- Reduced Overheads: Avoids the complex search or learning procedures for communication links required by other methods, simplifying the training pipeline and reducing computational requirements, particularly GPU memory for large $N$.
Conclusion
ExpoComm presents a scalable and effective approach to communication in large-scale MARL systems by leveraging the graph-theoretic properties of exponential topologies. Its design provides a principled way to achieve rapid global information dissemination with low, (near-)linearly scaling communication overhead. The strong empirical results, particularly the performance of the highly efficient one-peer variant and the robust zero-shot transferability, demonstrate its potential as a practical communication backbone for real-world multi-agent applications involving dozens or hundreds of agents.