Role-Conditioned Policy Learning in MARL
- Role-conditioned policy learning is a framework in multi-agent reinforcement learning that integrates role embeddings to tailor agent strategies and improve coordination.
- It combines self-role inputs with opponent-role predictions to enable anticipatory decision-making and adaptive behavior in both cooperative and competitive settings.
- Experimental results in scenarios like Overcooked, Harvest, and Touch-Mark demonstrate that role-conditioned policies achieve superior performance and strategic diversity compared to baseline methods.
Role-conditioned policy learning is a framework in multi-agent reinforcement learning (MARL) whereby agents explicitly condition their policies on role-related embeddings or representations. This approach is motivated by the observation that role diversity—whether externally assigned, inferred, or emergent—facilitates robust coordination, strategizing, and adaptability, especially in complex environments involving cooperation, competition, or both. State-of-the-art role-conditioned policy methods condition the agent’s behavior not only on its own role but also on predictions or inferences about the roles of other agents, thereby equipping policies with anticipatory and adaptive capabilities in diverse multi-agent interactions (Long et al., 2024; Koley et al., 2023).
1. Formal Foundations of Role-Conditioned Policies
In role-conditioned MARL, agent behavior is defined within a multi-agent Markov game
$$\mathcal{M} = \langle N, \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, P, \{r_i\}_{i=1}^{N}, \gamma, \mathcal{Z}, \Phi \rangle,$$
where $N$ is the number of agents, $\mathcal{Z}$ is the role embedding space, and $\Phi$ is a role-based reward shaping operator. At the beginning of each episode, each agent $i$ is assigned a role embedding $z_i \in \mathcal{Z}$ (either sampled or inferred). The agent’s policy is explicitly conditioned on its role, $\pi_i(a_i \mid o_i, z_i)$, producing the joint policy
$$\boldsymbol{\pi}(\mathbf{a} \mid \mathbf{o}, \mathbf{z}) = \prod_{i=1}^{N} \pi_i(a_i \mid o_i, z_i).$$
This explicit conditioning enables a single policy to encode multiple role-specialized strategies. In some actor-critic frameworks with emergent roles, each agent maintains a role encoder that outputs role embeddings sampled as $z_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ (a Gaussian parameterized by the encoder) and incorporates predictions $\hat{z}_{-i}$ of the other agents’ roles, determined via opponent-role prediction modules (Koley et al., 2023).
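As an illustration, a role-conditioned policy can be sketched as a single network that maps the concatenation of an observation and a role embedding to an action distribution. The weights, dimensions, and one-hot role embeddings below are hypothetical toy values, not those used in RP or RAC:

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, ROLE_DIM, N_ACTIONS = 8, 4, 5  # toy sizes (assumed)

# Hypothetical weights for a single linear policy layer; a real
# implementation would learn these end-to-end.
W = rng.normal(scale=0.1, size=(OBS_DIM + ROLE_DIM, N_ACTIONS))

def role_conditioned_policy(obs, role_z):
    """pi(a | o, z): concatenate observation and role embedding,
    then map to a softmax action distribution."""
    x = np.concatenate([obs, role_z])
    logits = x @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# One shared network encodes several role-specialized strategies:
# the same observation yields different action distributions per role.
obs = rng.normal(size=OBS_DIM)
role_a = np.array([1.0, 0.0, 0.0, 0.0])
role_b = np.array([0.0, 1.0, 0.0, 0.0])
pi_a = role_conditioned_policy(obs, role_a)
pi_b = role_conditioned_policy(obs, role_b)
```

Because the role embedding enters as an ordinary input, a single set of weights serves every role, which is what makes shared-policy role conditioning parameter-efficient.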
2. Architectural Implementations and Learning Objectives
Role-conditioned approaches integrate role signals at several levels:
- Policy Network Inputs: Observations are fused with role embeddings, typically via concatenation after encoding. For example, a policy may process the concatenation of an LSTM-encoded observation vector (size 128) and a 64-D role embedding (discrete or continuous), propagating this through fully-connected layers to produce both a policy head and a value head (Long et al., 2024). In opponent-aware variants, the policy receives both the self role and opponent-role predictions as input: $\pi_i(a_i \mid o_i, z_i, \hat{z}_{-i})$.
- Role Predictor Modules: Predictor networks (denoted here $f$ in RP and $q$ in RAC) estimate the role embeddings of other agents from the agent’s own observation-action history and role. The RP predictor $f$ outputs log-probabilities over the possible joint roles of the other agents; the policy may use this prediction to further inform the agent’s actions.
- Role Encoder and Diversity Objectives: To ensure meaningful and diverse roles, loss terms are introduced based on mutual information, diversity regularization, and opponent-role prediction accuracy:
- Mutual information ensures that roles are identifiable and inferable from trajectories.
- Diversity loss encourages specialization among teammates.
- Role prediction loss promotes accurate anticipation of other agents’ roles (Koley et al., 2023).
- Reinforcement Learning Optimization: Role-conditioned policies are optimized via standard RL objectives (e.g., PPO for RP, soft actor-critic for RAC), extended to take the expectation over sampled or inferred roles and to use rewards shaped by the role-based operator $\Phi$. For RP, the cumulative objective is
$$J(\theta) = \mathbb{E}_{\mathbf{z}}\, \mathbb{E}_{\boldsymbol{\pi}_\theta} \Big[ \textstyle\sum_{t=0}^{T} \gamma^{t}\, \Phi(r_t, \mathbf{z}) \Big].$$
Additional entropy regularization is often included.
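A minimal sketch of the opponent-aware architecture described above, using stand-in toy dimensions (not the paper's 128-D LSTM encoding and 64-D role embedding) and randomly initialized hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng(1)
ENC_DIM, ROLE_DIM, N_ROLES, N_ACTIONS = 16, 4, 3, 5  # toy sizes (assumed)

# Hypothetical parameters for the predictor, policy head, and value head.
W_pred = rng.normal(scale=0.1, size=(ENC_DIM + ROLE_DIM, N_ROLES))
W_pi   = rng.normal(scale=0.1, size=(ENC_DIM + ROLE_DIM + N_ROLES, N_ACTIONS))
W_v    = rng.normal(scale=0.1, size=ENC_DIM + ROLE_DIM + N_ROLES)

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def predict_opponent_roles(enc, self_role):
    """Role predictor: log-probabilities over the opponent's possible
    roles, from the agent's own encoded history and role."""
    return log_softmax(np.concatenate([enc, self_role]) @ W_pred)

def act(enc, self_role):
    """Policy head conditioned on (encoding, self role, predicted roles)."""
    opp_logp = predict_opponent_roles(enc, self_role)
    x = np.concatenate([enc, self_role, np.exp(opp_logp)])
    logits = x @ W_pi
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    value = float(x @ W_v)
    return probs, value, opp_logp

enc = rng.normal(size=ENC_DIM)         # stands in for an LSTM observation encoding
z   = np.array([1.0, 0.0, 0.0, 0.0])   # self role embedding
probs, value, opp_logp = act(enc, z)
```

Feeding the predictor's output distribution (rather than a hard role guess) into the policy keeps the whole pipeline differentiable, which is what allows the role-prediction loss and the RL loss to be trained jointly.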
3. Sampling and Representing Roles
Sampling and representation of roles are central to coverage and interpretability. In Role Play (RP), a discrete set of SVO-inspired angles spans a range from “masochistic” to “martyr,” providing interpretable, canonical roles uniformly sampled for each agent per episode. This discrete structure is both expressive (covering a spectrum of behavioral tendencies) and tractable for prediction and policy conditioning. In continuous role spaces (as in RAC), role encoders and predictors parameterize Gaussian distributions, supporting richer and potentially more adaptive role representations (Long et al., 2024; Koley et al., 2023).
In all cases, sampling is conducted per episode, with no explicit clustering required for the canonical SVO roles due to their coverage and interpretability.
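Both sampling schemes are straightforward to sketch. The concrete angle grid and Gaussian parameters below are illustrative assumptions, not the values used in RP or RAC:

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete, SVO-style roles (as in RP): the angle set here is a
# hypothetical grid, not the paper's. One role is drawn uniformly
# per agent per episode.
svo_angles = np.linspace(-90.0, 135.0, 10)       # illustrative SVO angles
discrete_roles = rng.choice(svo_angles, size=4)  # 4 agents, fresh each episode

# Continuous roles (as in RAC): the encoder outputs a Gaussian,
# sampled with the reparameterization trick so gradients can flow
# through mu and sigma.
mu, log_sigma = np.zeros(8), np.full(8, -1.0)    # hypothetical encoder outputs
eps = rng.standard_normal(8)
continuous_role = mu + np.exp(log_sigma) * eps
```

The discrete scheme trades expressiveness for identifiability (prediction reduces to classification over a small set), while the Gaussian scheme supports gradient-based adaptation of the role space itself.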
4. Theoretical Guarantees and Algorithmic Frameworks
Crucial to the role-conditioned paradigm are formal results on optimality and training stability. In RP, Theorem 4.1 establishes that if a shared policy is $\epsilon$-close to a role-specific policy, then over a $T$-step MDP the expected cumulative reward deviates from that of the optimal role-matched policy by an amount controlled by $\epsilon$ and the horizon $T$. This provides a formal foundation for both approximate role imputation and shared-policy representations (Long et al., 2024).
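As a toy numerical illustration of the flavor of such a bound (not Theorem 4.1 itself): in a stateless T-step problem with per-step rewards in [0, 1], replacing a role policy with an ε-close policy (in total variation) shifts the cumulative expected reward by at most a term linear in ε and T:

```python
import numpy as np

rng = np.random.default_rng(3)

T, eps = 50, 0.05
r = rng.uniform(0.0, 1.0, size=4)          # per-action rewards, r_max = 1

pi_role = np.array([0.4, 0.3, 0.2, 0.1])   # the "role policy"
# Perturb to a nearby policy whose total-variation distance from
# pi_role is exactly eps.
pi_near = pi_role + np.array([eps, -eps, 0.0, 0.0])

tv = 0.5 * float(np.abs(pi_role - pi_near).sum())
# Cumulative expected-reward gap over T independent steps;
# it is bounded by 2 * eps * T * r_max.
gap = T * abs(float(pi_role @ r) - float(pi_near @ r))
```

This is only a sanity check of the linear-in-ε-and-T intuition in the simplest possible setting; the theorem itself concerns full MDPs with role-conditioned shared policies.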
End-to-end training algorithms in both RP and RAC decompose into episodic data collection with fresh role assignments (or sampled roles), rollouts with per-timestep role prediction and reward shaping, and alternating updates to the role predictors and the policy using their respective loss functions. Table 1 summarizes the respective modules:
| Approach | Self Role Input | Opponent Role Prediction | Policy Head |
|---|---|---|---|
| RP | Discrete or continuous embedding | Yes (joint-role predictor) | $\pi_i(a_i \mid o_i, z_i, \hat{z}_{-i})$ |
| RAC | Gaussian (continuous) | Yes (per-opponent predictor) | $\pi_i(a_i \mid o_i, z_i, \hat{z}_{-i})$ |
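The alternating episodic loop described above can be sketched as follows; every function is a stub standing in for a real environment and learner, and all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_roles(n_agents):
    """Fresh role embeddings at the start of each episode."""
    return rng.standard_normal((n_agents, 4))

def rollout(roles, horizon=10):
    """Stand-in rollout: record per-step data; a real implementation
    would also log per-timestep role predictions and shaped rewards."""
    return [{"obs": rng.standard_normal(3),
             "reward": float(rng.uniform()),
             "roles": roles} for _ in range(horizon)]

def update_role_predictor(batch):
    return len(batch)   # stub: would minimize the role-prediction loss

def update_policy(batch):
    return len(batch)   # stub: would run a PPO / soft actor-critic step

n_agents, n_episodes = 3, 5
for episode in range(n_episodes):
    roles = sample_roles(n_agents)   # 1. fresh role assignments
    batch = rollout(roles)           # 2. rollout with per-step prediction
    update_role_predictor(batch)     # 3. alternate predictor update...
    update_policy(batch)             # 4. ...with the policy update
```

Alternating the predictor and policy updates (rather than optimizing a single joint loss) keeps each module's target stationary within an update step, which is one common way to stabilize this kind of coupled training.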
5. Experimental Results and Benchmarks
Role-conditioned policies are evaluated on both cooperative and mixed-motive environments:
- RP Experiments (Long et al., 2024):
- Overcooked (cooperative): RP achieves high zero-shot coordination (average score 27.1±9.2) compared to baselines (AnyPlay, TrajeDi, BRDiv ≈5–10, HSP 18.3±7.4).
- Harvest and CleanUp (mixed-motive): RP outperforms other methods on both collective and individual metrics (Harvest: 23.57±5.56; CleanUp: 38.09±7.71).
- Qualitative adaptability: RP flexibly adapts to partner strategies—adjusting between prosocial and competitive behaviors as roles change.
- RAC Experiments (Koley et al., 2023):
- Touch-Mark (team competition): RAC demonstrates increased success rates in pursuit-evasion compared to MAAC and "Team" variants.
- Market: RAC achieves higher mean resources delivered and stronger strategic specialization with role and opponent-role awareness.
- Ablation studies: Removal of mutual information, diversity, or opponent-role losses negatively affects performance, indicating all are integral for role learning and exploitation.
Baselines in both studies include non-role-conditioned and pool-based policy methods, with consistent findings that explicit role conditioning and opponent-role predictions yield quantitatively superior coordination and strategic diversity.
6. Comparative Perspectives and Limitations
Earlier role-emergence methods such as ROMA, RODE, and ROGC are primarily founded on Q-learning with value-factorization, well-suited for shared-reward, cooperative MARL, but less so for mixed or competitive settings. Actor-critic role-conditioned methods (RP, RAC) support both decentralized execution and per-agent specialization, extending applicability to team-competition and mixed-reward tasks (Koley et al., 2023).
Opponent modeling methods (ROMMEO, TDOM-AC, PR2) focus on explicit policy or behavior modeling of individual opponents, chiefly in two-agent (1v1) settings; however, they do not leverage role abstraction or decomposition at the team level.
A plausible implication is that role-conditioned, opponent-aware approaches can scale to more complex domains, given their ability to induce diversity, identifiability, and anticipatory adaptation in decentralized agents.
Limitations include increased hyperparameterization (e.g., annealing rates, role embedding dimension, update frequency) and open questions on scalability to large agent populations, hierarchical or latent role structures, and efficient inference under partial observability (Koley et al., 2023).
References:
- "Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions" (Long et al., 2024)
- "Opponent-aware Role-based Learning in Team Competitive Markov Games" (Koley et al., 2023)