Parametrized Sharing in Hybrid Multi-Agent RL
- The article demonstrates how PMHRL improves sample efficiency and scalability through tailored parameter sharing across agents.
- It details network architectures that decompose policies into shared trunks and agent-specific heads to balance coordination and specialization.
- Empirical results reveal significant improvements in energy efficiency and convergence speed compared to non-sharing baselines.
A Parametrized Sharing Scheme for Multi-Agent Hybrid Deep Reinforcement Learning (PMHRL) is a design pattern and architectural framework that enables sample-efficient, scalable, and robust multi-agent reinforcement learning in domains with agents exhibiting heterogeneous or hybrid discrete-continuous action spaces. PMHRL leverages parameter sharing at multiple architectural levels—across all agents, within functional subgroups, or between sub-modules specialized for different decision types—while preserving sufficient representational capacity for agent specialization, behavior diversity, and coordination requirements in hybrid-action scenarios. This article synthesizes the landscape of PMHRL as seen in recent deep RL research, focusing on network decomposition, sharing modalities, optimization strategies, and empirical findings.
1. Foundations of Parametrized Sharing in Multi-Agent Hybrid RL
At its core, PMHRL generalizes classical parameter sharing in multi-agent reinforcement learning—wherein all agents share a single policy or value network—to architectures that balance shared and agent-specific capacity, as dictated by behavioral heterogeneity and action-space structure. A canonical PMHRL system organizes the agent population and their associated networks along three axes:
- Functionality: Handling both discrete (e.g., mode selection, activation/deactivation) and continuous (e.g., beamforming, control) actions via dedicated sub-networks (e.g., DQN for discrete, PPO/DDPG for continuous variables).
- Parametric granularity: Allowing global, group-wise, or role-based sharing of backbone networks (encoders, state abstraction layers) while optionally retaining private "heads" (action selectors, critics) per agent.
- Sharing modality: Bridging sub-modules by concatenating, projecting, or otherwise communicating the outputs or latent representations of one module (e.g., the discrete head's choice) as context for the next (e.g., continuous PPO actor’s policy) (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025, Fu et al., 2019).
PMHRL thereby simultaneously addresses sample efficiency, communication overhead, agent diversity, and action-space expressivity in hybrid or highly parameterized environments.
2. Network Architecture and Parametrized Sharing Mechanisms
The network design for PMHRL decomposes each agent's architecture into:
- Shared Trunk/Encoder: A set of layers shared across agents or agent groups for common state abstraction, such as a variational autoencoder (VAE) for semantic state compression in the CHIMERA framework (Shen et al., 22 Jul 2025).
- Agent/Module-Specific Heads: Action-selection heads specialized for each agent (fully private), subgroup (partially shared), or function (e.g., discrete vs. continuous control).
- Hybrid Branching: Discrete sub-networks (e.g., DQN) select modes or high-level actions, whose outputs are concatenated or re-encoded and fed to continuous sub-networks (e.g., PPO/actor-critic) that produce fine-grained continuous actions. For example, in MF-RIS resource allocation, the PPO actor for an MF-RIS agent receives as input both the current physical state and the DQN-selected activation vector for each element (Kuo et al., 2 Jan 2026).
- Cross-Module Sharing: The decision or representation from one sub-module (e.g., previous discrete action) is injected as input into the other, establishing a parametric dependence between decision layers (see Fig. 1 in (Kuo et al., 2 Jan 2026) and Section 2 of (Shen et al., 22 Jul 2025)).
The scheme enables branching by action-type (discrete/continuous), by agent role (e.g., base station vs. RIS), or both, leading to flexible hybrid policies naturally aligned with the structure of mixed-action environments (Fu et al., 2019, Shen et al., 22 Jul 2025).
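A minimal sketch of this decomposition, assuming a small fully connected trunk and hypothetical dimensions (not the exact CHIMERA or MF-RIS networks): the trunk is shared, a discrete Q-head picks a mode, and the continuous head receives the one-hot discrete choice as cross-module context.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """He-style init for a dense layer (weights, bias)."""
    return rng.normal(0, np.sqrt(2 / n_in), (n_in, n_out)), np.zeros(n_out)

class HybridSharedPolicy:
    """Sketch of a PMHRL-style network: a trunk shared across agents,
    a discrete Q-head, and a continuous head conditioned on the one-hot
    discrete choice (cross-module parametrized sharing)."""

    def __init__(self, obs_dim, n_discrete, cont_dim, hidden=32):
        self.W1, self.b1 = init_layer(obs_dim, hidden)            # shared trunk
        self.Wq, self.bq = init_layer(hidden, n_discrete)         # discrete head
        self.Wc, self.bc = init_layer(hidden + n_discrete, cont_dim)

    def forward(self, obs):
        z = np.maximum(obs @ self.W1 + self.b1, 0.0)              # trunk features
        q = z @ self.Wq + self.bq                                 # Q-values over modes
        d = q.argmax(axis=-1)                                     # greedy discrete action
        onehot = np.eye(q.shape[-1])[d]                           # discrete context
        mu = np.tanh(np.concatenate([z, onehot], -1) @ self.Wc + self.bc)
        return d, mu, q

policy = HybridSharedPolicy(obs_dim=10, n_discrete=4, cont_dim=3)
d, mu, q = policy.forward(rng.normal(size=(5, 10)))
```

In a full implementation the trunk parameters would be optimized from all agents' gradients while the heads are trained per agent or per group.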
3. Mathematical Formulation and Learning Algorithms
The optimization objectives and algorithms in PMHRL reflect the architectural decomposition:
- Hybrid Action Spaces: Each agent $i$ selects a discrete-continuous hybrid action $a_i = (d_i, x_{d_i})$, e.g., a mode selection $d_i$ together with its associated continuous parameter $x_{d_i}$ (Fu et al., 2019).
- Parameter-Sharing Objective: The global objective in cooperative settings is to maximize the (possibly penalized) expected return subject to constraints such as energy efficiency or communication quality (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025):

$$\max_{\theta_s,\ \{\theta_i\}_{i=1}^{N}}\ \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^{t} r_t\Big],$$

where $\theta_s$ denotes shared parameters (e.g., VAE weights) and $\theta_i$ are agent-specific policy/critic/DQN weights.
- Learning Algorithm:
  - For discrete modules (DQN): Each agent solves a Bellman TD objective
  $$L(\theta_Q) = \mathbb{E}\big[(y_t - Q(s_t, d_t; \theta_Q))^2\big],$$
  with target $y_t = r_t + \gamma \max_{d'} Q(s_{t+1}, d'; \theta_Q^{-})$ (Kuo et al., 2 Jan 2026).
  - For continuous modules (PPO/DDPG): Each agent optimizes a clipped surrogate loss
  $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],$$
  with $\rho_t(\theta) = \pi_\theta(x_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(x_t \mid s_t)$ being the policy probability ratio and $\hat{A}_t$ the generalized advantage estimator (Kuo et al., 2 Jan 2026).
  - Cross-module parameterized sharing: The key step is to concatenate the current discrete action as context for the continuous actor and vice versa (Shen et al., 22 Jul 2025, Kuo et al., 2 Jan 2026).
- Training Loop: An episode interleaves DQN-based discrete action selection, parameterized sharing to construct continuous module inputs, PPO/DDPG-based continuous action selection, environment stepping, and replay buffer updates. Updates to network parameters occur using respective submodule-specific objectives, with shared buffers for the trunk or VAE backbones (Shen et al., 22 Jul 2025, Kuo et al., 2 Jan 2026).
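The training loop described above can be sketched end-to-end; here the networks are stubbed as fixed random linear maps, the environment is a dummy, and all dimensions are assumptions, so only the interleaving of discrete selection, context construction, continuous selection, and replay storage is illustrated.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

# Hypothetical toy setting: 2 agents, 8-dim observations, 3 discrete
# modes, 2 continuous parameters.
N_AGENTS, OBS, N_MODES, CONT = 2, 8, 3, 2
Wq = rng.normal(size=(OBS, N_MODES))            # stub discrete Q-network
Wpi = rng.normal(size=(OBS + N_MODES, CONT))    # stub continuous actor
replay = deque(maxlen=10_000)                   # shared replay buffer

def env_step(discrete, cont):
    """Dummy environment: random next observations and scalar reward."""
    return rng.normal(size=(N_AGENTS, OBS)), float(rng.random())

obs = rng.normal(size=(N_AGENTS, OBS))
for t in range(50):
    # 1) DQN-based discrete action selection (epsilon-greedy).
    q = obs @ Wq
    d = np.where(rng.random(N_AGENTS) < 0.1,
                 rng.integers(N_MODES, size=N_AGENTS),
                 q.argmax(axis=-1))
    # 2) Parameterized sharing: the discrete choice becomes input
    #    context for the continuous module.
    ctx = np.concatenate([obs, np.eye(N_MODES)[d]], axis=-1)
    # 3) Continuous action selection (tanh-squashed stub actor).
    x = np.tanh(ctx @ Wpi)
    # 4) Environment stepping and replay-buffer update.
    next_obs, r = env_step(d, x)
    replay.append((obs, d, x, r, next_obs))
    obs = next_obs
    # 5) Submodule-specific gradient updates (DQN TD loss, PPO clipped
    #    loss) would be applied here on minibatches from `replay`.
```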
4. Extensions: Partial, Adaptive, and Selective Parameter Sharing
PMHRL subsumes a broad spectrum of sharing schemes, including:
- Full parameter sharing: All agents share one set of policy parameters; successful for homogeneous agents and environments (Kaushik et al., 2018, Terry et al., 2020).
- Selective/Group-wise Sharing: Agents are partitioned into groups based on behavioral, physical, or goal similarity, with one policy per group, as in the Selective Parameter Sharing (SePS) method (Christianos et al., 2021). Partitioning is typically performed by encoding each agent's transition data (agent $i$'s observation, action, reward, next observation) with a VAE to obtain agent embeddings, then grouping the embeddings via k-means clustering (Christianos et al., 2021).
- Partial Parameter Sharing: Shared "backbones" (e.g., encoders or low-level layers) and agent-private "heads" (output/decision layers) to capture agent specialization while leveraging shared information, as in the FP3O pipeline (Feng et al., 2023).
- Adaptive Masking and Hypernetworks: Hybrid partial-sharing schemes such as Kaleidoscope employ learnable masks on a master parameter vector to dynamically control which weights are shared vs. private, with diversity regularization to promote agent–policy diversity (Li et al., 2024). HyperMARL utilizes hypernetworks conditioned on agent embeddings to generate per-agent weights, interpolating smoothly between full-sharing and independence (Tessera et al., 2024).
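The SePS-style grouping step can be sketched with a plain k-means over stand-in agent embeddings; a hypothetical two-cluster latent space replaces the learned VAE encoder here, and the deterministic spread-out initialization is a simplification of a proper k-means++ init.

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=50):
    """Plain k-means (the grouping step of SePS-style selective
    sharing); X holds one embedding row per agent."""
    # Deterministic spread-out init (k-means++ would be used in practice).
    centers = X[:: max(1, len(X) // k)][:k].copy()
    for _ in range(iters):
        # Assign each agent to its nearest center.
        labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(-1)
        # Recompute centers from current assignments.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical stand-in for VAE agent embeddings: two well-separated
# behavioral clusters in a 4-dim latent space.
emb = np.concatenate([rng.normal(0, 0.1, (5, 4)),
                      rng.normal(3, 0.1, (5, 4))])
groups = kmeans(emb, k=2)
# Agents with the same label would then share one policy network.
```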
These extensions allow PMHRL to interpolate continuously between the extremes of full sharing (maximum sample efficiency, minimum diversity) and full independence (maximum specialization, minimum shared knowledge), with partial schemes often yielding better performance and stability in heterogeneous, high-dimensional, or nonstationary domains (Shen et al., 22 Jul 2025, Li et al., 2024, Tessera et al., 2024, Christianos et al., 2021, Feng et al., 2023).
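A Kaleidoscope-style mask over a master parameter vector can also be sketched; in this illustration the masks are fixed rather than learned, the dimensions are hypothetical, and the diversity regularizer is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

# Partial sharing via masks: every agent owns a mask over a master
# parameter vector; masked-in entries come from the shared master,
# masked-out entries from private per-agent parameters.
DIM, N_AGENTS = 6, 3
master = rng.normal(size=DIM)                 # shared weights
private = rng.normal(size=(N_AGENTS, DIM))    # per-agent weights
mask = rng.random((N_AGENTS, DIM)) < 0.7      # True -> use shared entry

def effective_params(i):
    """Effective weight vector of agent i under its sharing mask."""
    return np.where(mask[i], master, private[i])

params = np.stack([effective_params(i) for i in range(N_AGENTS)])
# A diversity regularizer would penalize masks that make the agents'
# effective parameter vectors too similar.
```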
5. Practical Implementations and Key Empirical Results
Empirical studies consistently demonstrate the advantages of PMHRL over non-hybrid or non-sharing baselines:
- In hybrid wireless systems (e.g., MF-RIS-aided NOMA networks), PMHRL outperforms pure PPO, pure DQN, and ablations without parameterized sharing, achieving both faster convergence (∼30%–50% wall-clock speedup) and higher energy efficiency (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025).
- CHIMERA framework (SAGIN scenario): Parametrized sharing between discrete DQN and continuous DDPG modules, plus shared VAE compression, roughly doubled the energy efficiency relative to ablated variants (Shen et al., 22 Jul 2025).
- Multi-agent driving and traffic behaviors: PS-DDPG demonstrated that a single conditional actor-critic network can simultaneously learn and generalize over multiple behaviors (lane-keeping, overtaking) simply by conditioning the input on behavior ID (Kaushik et al., 2018).
- Benchmarks in high-dimensional tasks: Reward-scaled periodic sharing (RS-PPS) and partial personalized sharing (PP-PPS) delivered 10%–50% improvements in convergence and were able to solve tasks where fully independent or global sharing failed (Zhang et al., 2024).
- Role of cross-module context: Explicitly sharing discrete module outputs to condition continuous action selection (and vice versa) is critical. Without such information flow, hybrid architectures fail to coordinate decisions, resulting in suboptimal or unstable learning (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025).
A summary table of empirical improvements comparing PMHRL with baseline alternatives is given below.
| Domain / Scenario | Sharing Architecture | Performance Improvement |
|---|---|---|
| MF-RIS-aided NOMA (Kuo et al., 2 Jan 2026) | DQN+PPO, param. sharing | +30% EE vs. independent/homogeneous baselines |
| SAGIN MF-RIS (CHIMERA) (Shen et al., 22 Jul 2025) | PPO/DDPG+DQN, VAE trunk, param. sharing | +20–30% EE vs. PPO/DQN, +2× EE vs. no sharing |
| Multi-agent driving (Kaushik et al., 2018) | Shared actor-critic, conditioned on ID | Faster convergence, generalized multi-behavior |
| SMAC, LBF, RWARE (Zhang et al., 2024, Christianos et al., 2021) | Periodic/Selective/Partial sharing | 10–50% faster convergence/higher return; success on previously unsolved tasks |
*EE: Energy Efficiency. All results reported as stated in the respective sources.
6. Theoretical Guarantees and Convergence Properties
Several variants of PMHRL and its subcomponents are supported by theoretical analyses:
- Universal representational capacity: With agent indication (inputting agent ID or behavior ID), parameter-shared policies are provably able to represent the set of all optimal individual policies, even with heterogeneous agents and observation/action spaces, provided learned networks are sufficiently expressive and equipped with padding and masking (Terry et al., 2020).
- Monotonic improvement: The FP3O algorithm, which is compatible with all sharing granularities, admits a shared lower bound guaranteeing monotonic improvement on the joint return, regardless of the degree of parameter sharing (full, group-wise, or independent) (Feng et al., 2023).
- Hybrid action space expressivity: With hierarchical or groupwise parameter sharing, PMHRL achieves both centralization (during learning) and full decentralization (during execution), ensuring scalability and feasible agent specialization in practical settings (Fu et al., 2019, Feng et al., 2023).
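The agent-indication result (a single shared parameter set recovering distinct per-agent behaviors once the agent ID is part of the input) can be illustrated with a toy linear policy; the dimensions and random weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# One shared weight matrix serves all agents; the observation is
# concatenated with a one-hot agent ID, so the same parameters can
# map the same observation to different actions per agent.
N_AGENTS, OBS, ACT = 3, 4, 2
W = rng.normal(size=(OBS + N_AGENTS, ACT))

def act(obs, agent_id):
    ids = np.eye(N_AGENTS)[agent_id]          # agent indication
    return np.concatenate([obs, ids]) @ W

obs = np.ones(OBS)
outs = [act(obs, i) for i in range(N_AGENTS)]
# Same observation, same weights, different IDs -> different actions.
```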
7. Practical Recommendations and Future Directions
PMHRL provides a versatile design space for researchers and practitioners. Recommendations include:
- Hybridization of action types: Model discrete and continuous variables in separate, coordinated network modules, with explicit parameter sharing to enable context-dependent action selection (Kuo et al., 2 Jan 2026, Fu et al., 2019).
- Inclusion of agent/role indicators: Append behavior, agent, or role IDs to shared network inputs to recover full flexibility and representational power in heterogeneous environments (Kaushik et al., 2018, Terry et al., 2020).
- Partial parameter sharing as default: Where the agent population is more diverse, use clustering or mask-based schemes (e.g., Kaleidoscope, SePS) to learn an appropriate sharing structure (Li et al., 2024, Christianos et al., 2021).
- Cross-module context passing: For hybrid action decomposition, share the output of one network (DQN/PPO) as input context for the other to enable coordinated hybrid-action policies (Shen et al., 22 Jul 2025).
- Application-specific tuning: Adjust the proportion of shared vs. agent-specific layers, diversity regularization strengths, and optimizer settings to match the heterogeneity, sample complexity, and stability requirements of the target environment (Zhang et al., 2024, Feng et al., 2023, Tessera et al., 2024).
A plausible implication is that future PMHRL advances will integrate masking, agent embeddings, and explicit communication- and action-parameter compression into unified, context-adaptive schemes, enhancing coordination and learning efficiency in complex multi-agent, hybrid-action domains.