Decentralized Multi-Task Learning
- Decentralized Multi-Task Representation Learning is a framework where distributed agents collaboratively learn common low-dimensional features and task-specific parameters without a central server.
- It employs dynamic communication graphs and clustering methods to efficiently manage heterogeneous data and minimize communication overhead across agents.
- The approach leverages decentralized optimization algorithms like SGD and PGD with provable convergence properties, resulting in faster training and improved scalability.
Decentralized Multi-Task Representation Learning (Dec-MTRL) refers to a family of algorithms and theoretical frameworks where multiple agents (nodes, devices, or clients) collaboratively learn representations that facilitate solving several distinct tasks, without reliance on a centralized server. These agents are connected by a sparse or dynamic communication graph and possess heterogeneous data distributions and objectives. Dec-MTRL’s core motivation is to extract common, typically low-dimensional, feature representations that benefit all tasks, while efficiently handling pronounced heterogeneity, minimizing communication, and accelerating convergence in distributed environments.
1. Formal Problem Definition and Representation Models
Dec-MTRL encompasses supervised, reinforcement learning, and online regression paradigms, unified by the goal of recovering shared representations (feature matrix/subspace or shared policy/network backbone) and task-specific parameters across a decentralized network.
General Model Structure
- Each agent $i$ holds a local dataset for its task.
- The shared representation is parameterized by $U$ (or its local copy $U_i$).
- Each agent has a private head or task-specific layers ($\theta_i$).
- The objective is to jointly minimize the sum of local task losses over the shared representation and the private heads, i.e. $\min_{U, \{\theta_i\}} \sum_{i=1}^{T} f_i(U, \theta_i)$.
- In multi-task linear regression, task parameters are assumed low-rank: $w_i = U b_i$,
where $U \in \mathbb{R}^{d \times r}$ is the shared latent representation and $b_i \in \mathbb{R}^{r}$ encodes task-specific coefficients (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
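A minimal synthetic instance of the low-rank task-parameter model can be sketched as follows; the dimensions and the symbol names `U`, `B`, `W` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 20, 3, 5  # ambient dimension, latent rank, number of tasks (assumed values)

# Shared representation: an orthonormal d x r matrix U
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
# Task-specific coefficients b_i, stacked as columns of B
B = rng.standard_normal((r, T))
# Task parameters w_i = U @ b_i; stacking them gives W = U @ B with rank <= r
W = U @ B
assert np.linalg.matrix_rank(W) <= r
```

Because every task parameter lies in the column span of $U$, recovering that $r$-dimensional subspace is what all agents share, while each $b_i$ stays local.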
Hybrid Architectures
- PF-MTL: Personalized Federated Multi-Task Learning, with a shared backbone and private heads.
- ColNet: Model split into a backbone ($U$) and task-specific layers ($\theta_i$), with explicit task grouping and leader-based cross-task aggregation (Feng et al., 17 Jan 2025).
- Reinforcement learning variants seek a shared policy vector maximizing entropy-regularized value across tasks/environments (Zeng et al., 2020).
- Online learning: Agents adapt local parameters as the sum of a common latent component and node-specific terms, where the common component spans the shared subspace (Chen et al., 2017).
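A minimal sketch of the backbone/head split common to these architectures (the nonlinearity, dimensions, and variable names are assumptions, not any paper's exact architecture):

```python
import numpy as np

def forward(x, U, theta_i):
    """Split model: a shared backbone U maps the input to a low-dimensional
    feature, and a private head theta_i maps the feature to a prediction."""
    z = np.tanh(U.T @ x)   # shared representation (tanh is an assumed choice)
    return theta_i @ z     # task-specific head

rng = np.random.default_rng(3)
d, r = 6, 2
U = rng.standard_normal((d, r))      # shared across agents (synced by consensus)
theta = rng.standard_normal((1, r))  # private to agent i, never communicated
y_hat = forward(rng.standard_normal(d), U, theta)
```

Only `U` is exchanged between agents; the heads `theta` remain local, which is what gives these schemes their personalization and privacy properties.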
2. Communication Graphs, Task Correlation, and Aggregation Schemes
Dec-MTRL operates over undirected, directed, or time-varying graphs $G = (V, E)$, with decentralized communication protocols enabling only peer-to-peer exchange.
Graph Dynamics and Task Clustering
- Dynamic adaptation: Mixing matrices are iteratively updated via gradient-based spectral clustering, which identifies clusters of positively correlated tasks and isolates negatively correlated ones (Mortaheb et al., 2022).
- Static grouping: ColNet pre-assigns clients to task groups; intra-group backbone aggregation is followed by cross-group leader coordination using conflict-averse schemes (Feng et al., 17 Jan 2025).
- Consensus averaging and gossip protocols: Used to synchronize shared representations in both reinforcement learning and regression settings, enforced via doubly-stochastic matrices (Mortaheb et al., 2022, Zeng et al., 2020, Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
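The gradient-based task-clustering idea can be sketched as follows; the cosine-similarity measure and the leading-eigenvector split are illustrative simplifications, not the exact procedure of Mortaheb et al. (2022):

```python
import numpy as np

def cluster_tasks(grads):
    """Split tasks into two groups by the sign structure of their pairwise
    gradient correlations (a simple spectral heuristic, for illustration)."""
    G = np.stack([g / np.linalg.norm(g) for g in grads])
    S = G @ G.T                  # pairwise cosine similarity ("transference")
    _, vecs = np.linalg.eigh(S)
    v = vecs[:, -1]              # leading eigenvector of the similarity matrix
    return (v > 0).astype(int)   # 0/1 cluster labels

# Two positively correlated task gradients vs. one anti-correlated one
g = np.array([1.0, 0.0])
labels = cluster_tasks([g, g + 0.1, -g])
assert labels[0] == labels[1] != labels[2]
```

Isolating the anti-correlated task in its own cluster prevents its updates from dragging the shared representation away from the other tasks' optima.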
Aggregation Mechanisms
- Gradient exchange and transference matrices: Quantify inter-task similarity, serving as a basis for spectral clustering and dynamic graph updating (Mortaheb et al., 2022).
- HCA aggregation: Hyper conflict-averse aggregation among leaders mitigates gradient conflicts in multi-task federated learning (Feng et al., 17 Jan 2025).
- Diffusion/ATC: Adapt-then-combine strategies diffuse the common component while preserving node-specific terms (Chen et al., 2017).
- Local least-squares followed by decentralized projected GD: Alternating minimization over the shared subspace $U$ and the task-specific coefficients $b_i$ (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
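Consensus averaging with a doubly stochastic mixing matrix, the primitive underlying several of the schemes above, can be sketched as follows (the ring topology and Metropolis-style weights are assumed for illustration):

```python
import numpy as np

def gossip_round(X, W):
    """One consensus round: every node replaces its local value with a
    W-weighted average of its neighbors'. W must be doubly stochastic."""
    return W @ X

# Ring of 4 nodes; rows and columns each sum to 1 (doubly stochastic)
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
X = np.array([[4.0], [0.0], [0.0], [0.0]])  # only node 0 starts with mass
for _ in range(50):
    X = gossip_round(X, W)
# All nodes converge geometrically to the network average (here 1.0)
```

The convergence rate is governed by the spectral gap of `W`, which is why the dynamic-graph methods above try to enlarge that gap within each cluster.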
3. Optimization Algorithms and Convergence Theory
Dec-MTRL employs variants of decentralized stochastic gradient descent (SGD), projected gradient descent (PGD), and policy gradient methods, frequently augmented with consensus/diffusion operations.
Algorithmic Steps
| Approach | Shared Update | Private Update | Communication |
|---|---|---|---|
| Dynamic clustering (Mortaheb et al., 2022) | Gossip + SGD | Local SGD | Gradient similarity, clustering |
| ColNet (Feng et al., 17 Jan 2025) | Leader-based agg. | Local SGD | Leader cross-group polling |
| Policy Gradient (Zeng et al., 2020) | Consensus PG | N/A | Parameter exchange |
| Linear regression (Kang et al., 27 Dec 2025) | Diffusion PGD | Local least-squares | $U$ matrix exchange |
| Online LMS (Chen et al., 2017) | ATC diffusion | LMS with leak | Projection-based sharing |
- Initialization via decentralized spectral/truncated SVD for low-rank models (Kang et al., 29 Dec 2025).
- Alternating minimization over $U$ and $\{b_i\}$, consensus rounds for synchronization, and QR projection to enforce orthonormality (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
- Local step-size tuning and regularization for stability in streaming/online regimes (Chen et al., 2017).
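The steps above can be sketched as one round of a simplified AltGDmin-style update; the step size, the full-averaging mixing matrix, and the dimensions are assumptions, and this is an illustration of the pattern, not the papers' exact algorithm:

```python
import numpy as np

def dec_altmin_round(Us, Xs, ys, W_mix, lr=0.1):
    """One illustrative decentralized alternating-minimization round:
    1) each node solves a local least-squares problem for its b_i,
    2) takes a gradient step on its local copy of U,
    3) averages copies with neighbors (consensus with W_mix),
    4) re-orthonormalizes via QR projection."""
    n = len(Us)
    grads = []
    for i in range(n):
        U, X, y = Us[i], Xs[i], ys[i]
        A = X @ U
        b, *_ = np.linalg.lstsq(A, y, rcond=None)  # local least squares for b_i
        resid = A @ b - y
        grads.append(X.T @ np.outer(resid, b))     # grad of ||X U b - y||^2/2 in U
    new = [Us[i] - lr * grads[i] for i in range(n)]
    mixed = [sum(W_mix[i, j] * new[j] for j in range(n)) for i in range(n)]
    return [np.linalg.qr(M)[0] for M in mixed]     # QR keeps columns orthonormal

# Tiny network: 3 nodes with full averaging (assumed setup)
rng = np.random.default_rng(1)
d, r, m, n = 8, 2, 30, 3
W_mix = np.full((n, n), 1.0 / n)
Us = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(n)]
Xs = [rng.standard_normal((m, d)) for _ in range(n)]
ys = [rng.standard_normal(m) for _ in range(n)]
Us = dec_altmin_round(Us, Xs, ys, W_mix)
```

The QR step plays the role of the projection in PGD: it keeps every local copy of $U$ on the set of orthonormal frames, so subspace distance is well defined across rounds.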
Convergence Properties
- Dynamic graph adaptation increases the spectral gap of each subgraph, empirically resulting in faster convergence than static graphs (Mortaheb et al., 2022).
- Linear convergence in subspace distance with provable sample and communication complexity bounds (see below) (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
- Finite-time $\epsilon$-stationarity in decentralized policy gradient; global optimality under alignment conditions (Zeng et al., 2020).
- Stability and mean-square-error guarantees for both hard orthogonality and regularized models (Chen et al., 2017).
- Communication complexity for recent algorithms is decoupled from target accuracy (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
4. Sample, Time, and Communication Complexity
Recent advances characterize the scaling of complexity parameters for Dec-MTRL.
Key Metrics
- Sample complexity: explicit per-node sample-size bounds suffice for $\epsilon$-accurate feature recovery in low-rank models (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
- Time complexity: each gradient descent iteration has cost linear in the local data size at every node; total runtime is governed by the initialization stage, the number of main iterations, and the consensus rounds per iteration (Kang et al., 27 Dec 2025).
- Communication complexity: each round involves exchanging the nodes' $d \times r$ representation copies. Dif-AltGDmin and similar protocols make the total communication decoupled from the target accuracy and only logarithmically dependent on network/topology parameters (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
- Algorithmic efficiency: For large sparse networks, decentralized protocols surpass centralized federated approaches, with empirical results confirming reduced runtime and communication (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
5. Empirical Results and Application Domains
Empirical studies and benchmarks substantiate Dec-MTRL’s efficacy.
Synthetic and Benchmark Datasets
- Synthetic Gaussian/linear regression: Networks of varying size exhibit robust, rapid convergence, even under sparse communication (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
- CelebA: Face attribute extraction and landmark detection, with dynamically clustered tasks converging 20–30 epochs earlier than baselines (Mortaheb et al., 2022, Feng et al., 17 Jan 2025).
- CIFAR-10: Experiments with label and task heterogeneity demonstrate ColNet's improvements in F1 score and validation loss (Feng et al., 17 Jan 2025).
| Dataset | Tasks / Groups | Key Results |
|---|---|---|
| CelebA | 6 attributes, 2 groups | Dynamic clustering: early convergence, lower final loss |
| CIFAR-10 | 2 groups | ColNet: F1 improvement from 0.69 (FedPer) to 0.77 (ColNet, animal group) |
| Synthetic | up to 800 | Communication-efficient methods outperform centralized approaches at scale |
Reinforcement Learning
- GridWorld: Decentralized policy gradient balances trade-offs among environments, converging near-optimally (Zeng et al., 2020).
- Drone navigation: Agents in diverse environments share a policy representation, obtaining dramatic gains in mean safe flight (Zeng et al., 2020).
Online and Streaming Contexts
- Multitask diffusion LMS: Agents solving regression tasks with latent structure demonstrate quantifiable improvements in mean-square deviation and rapid adaptation, validated by closed-form theory (Chen et al., 2017).
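The adapt-then-combine (ATC) diffusion pattern behind these results can be sketched as follows; the step size, full-averaging topology, and noiseless streaming model are illustrative assumptions:

```python
import numpy as np

def atc_diffusion_lms_step(ws, xs, ds, W_mix, mu=0.05):
    """ATC diffusion LMS: each node first takes a local LMS step on its
    current streaming sample (adapt), then averages its estimate with its
    neighbors using combination weights W_mix (combine)."""
    n = len(ws)
    psi = [ws[i] + mu * xs[i] * (ds[i] - xs[i] @ ws[i]) for i in range(n)]  # adapt
    return [sum(W_mix[i, j] * psi[j] for j in range(n)) for i in range(n)]  # combine

# Three nodes streaming noiseless samples of one common model (assumed setup)
rng = np.random.default_rng(2)
n, d = 3, 4
w_true = rng.standard_normal(d)
W_mix = np.full((n, n), 1.0 / n)
ws = [np.zeros(d) for _ in range(n)]
for _ in range(500):
    xs = [rng.standard_normal(d) for _ in range(n)]
    ds = [x @ w_true for x in xs]
    ws = atc_diffusion_lms_step(ws, xs, ds, W_mix)
# Every node's estimate converges toward w_true
```

In the multitask variants, the combine step would mix only the shared (common-subspace) component while leaving node-specific terms untouched; the fully shared model above is the simplest special case.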
6. Limitations, Open Questions, and Future Directions
The following observations and limitations have emerged from published research:
- Communication overhead: Exchanging shared gradients or representation matrices incurs additional cost, though recent work substantially reduces this dependence (Mortaheb et al., 2022, Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
- Hyperparameter sensitivity: Performance is contingent upon the clustering window size, task groupings, leader rotation frequency, and other algorithmic choices (Mortaheb et al., 2022, Feng et al., 17 Jan 2025).
- Theoretical extensions: Convergence analysis for deep nonconvex architectures and complex multi-agent reinforcement learning regimes remains open (Mortaheb et al., 2022, Feng et al., 17 Jan 2025, Zeng et al., 2020).
- Grouping mechanisms: While ColNet uses static, label-based grouping, clustering algorithms leveraging inter-task distance may further optimize grouping (Feng et al., 17 Jan 2025).
- Assumptions: Most sample and communication complexity results hold under Gaussian input, incoherence, and connected graph assumptions; relaxation to more general settings is an active area.
A plausible implication is that Dec-MTRL, when combined with data-driven task grouping and topology adaptation, promises additional efficiency gains, especially in environments characterized by high task diversity and limited bandwidth.
7. Synthesis and Research Directions
Decentralized Multi-Task Representation Learning constitutes a rapidly maturing paradigm that addresses scalability, heterogeneity, and privacy concerns in distributed learning. Principal advances include:
- Dynamic topology adaptation via gradient-based clustering (Mortaheb et al., 2022)
- Conflict-averse aggregation for federated multi-task scenarios (Feng et al., 17 Jan 2025)
- Provably communication-efficient alternating minimization under low-rank models (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025)
- Modular protocol integration for reinforcement learning and online adaptation (Zeng et al., 2020, Chen et al., 2017)
A plausible direction is the integration of advanced graph neural networks, deeper representation hierarchies, and on-device privacy-preserving computation. Further, rigorous convergence analysis under adversarial or time-varying graphs will be essential to guarantee robustness in next-generation decentralized multi-task systems.