Decentralized Multi-Task Representation Learning

Updated 3 January 2026

Decentralized Multi-Task Representation Learning is a paradigm that jointly learns low-rank shared feature representations across distributed agents without a central server.
It employs alternating minimization and dynamic task-driven communication protocols to optimize both task-specific factors and a common backbone, enhancing scalability and privacy.
Empirical and theoretical insights demonstrate robust convergence rates, efficient communication, and adaptability to heterogeneous data, diverse tasks, and varying local conditions.

Decentralized multi-task representation learning (Dec-MTRL) encompasses a class of algorithms that jointly learn feature representations across a network of agents, each tasked with different but potentially related objectives, under a decentralized (peer-to-peer) communication topology. Unlike centralized federated learning, Dec-MTRL dispenses with a global server, instead employing only local communications between neighbors to achieve collective representation learning. This paradigm is central to enabling scalable, privacy-preserving, and robust learning in scenarios with heterogeneous data, diverse local tasks, and stringent communication constraints.

1. Problem Formulations and Model Structures

Dec-MTRL models are developed to solve collections of $T$ tasks $\{\tau_1,...,\tau_T\}$ distributed over $L$ nodes. Each node $g$ holds data for a subset $S_g \subseteq \{1,...,T\}$, and the communication topology is modeled by a connected undirected graph $G=(\mathcal{V},\mathcal{E})$ with doubly stochastic mixing weights $W$ satisfying $\gamma(W) < 1$.
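As a rough illustration of the mixing-weight condition, the sketch below builds Metropolis weights for a ring of six nodes and checks that they are doubly stochastic. The Metropolis rule, the ring topology, and the reading of $\gamma(W)$ as the second-largest eigenvalue magnitude of $W$ are assumptions made for this example; the cited works may construct and measure $W$ differently.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Symmetric, doubly stochastic mixing weights from a 0/1 adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j] > 0:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        # Put the leftover mass on the diagonal so that row i sums to one.
        W[i, i] = 1.0 - W[i].sum()
    return W

def gamma(W):
    """Second-largest eigenvalue magnitude of a symmetric W (assumed meaning of gamma)."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return magnitudes[-2]

def ring_adjacency(n):
    """Adjacency matrix of an undirected ring with n nodes (hypothetical topology)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1.0
        A[(i + 1) % n, i] = 1.0
    return A

W = metropolis_weights(ring_adjacency(6))
print("row sums:", W.sum(axis=1))
print("gamma(W):", gamma(W))
```

A smaller $\gamma(W)$ means that one round of neighbor averaging removes more of the disagreement between nodes, which is why denser or better-weighted graphs tend to need fewer communication rounds.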

A foundational objective in Dec-MTRL is the joint minimization of local losses: $\min_{U,B} \sum_{g=1}^{L} \sum_{t \in S_g} f_t(U, b_t)$, where $U$ is the low-rank shared representation (feature extractor), and $B = [b_1,...,b_T]$ comprises task-specific factors (one factor $b_t$ per task) (Kang et al., 29 Dec 2025, Kang et al., 27 Dec 2025). In nonlinear settings, backbone networks (e.g., ResNet-18) act as the shared encoder, with task heads providing personalized outputs (Feng et al., 17 Jan 2025).

Key variations of this basic template include:

Personalized layers for client-specific adaptation (Feng et al., 17 Jan 2025, Mortaheb et al., 2022).
Sheaf-theoretic stalks and edge spaces to support variable feature dimensions and complex relations (2502.01145).
Joint optimization over model parameters and task-relationship graphs (e.g., GMRFs with a Laplacian precision matrix) to explicitly learn inter-task couplings (Wan et al., 12 Oct 2025).
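To make the graph-based coupling in the last item concrete, the sketch below evaluates a standard graph-Laplacian penalty on task parameters and checks that it equals a weighted sum of squared pairwise differences. The task graph `A`, the parameter matrix `Theta`, and the function names are hypothetical; the cited work additionally learns the graph itself, which is not reproduced here.

```python
import numpy as np

def graph_laplacian(A):
    """Combinatorial Laplacian D - A of a symmetric, non-negative task graph."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def laplacian_penalty(Theta, Lap):
    """tr(Theta Lap Theta^T), with one task parameter vector per column of Theta."""
    return float(np.trace(Theta @ Lap @ Theta.T))

def pairwise_penalty(Theta, A):
    """Sum over task pairs i < j of A[i, j] * ||theta_i - theta_j||^2."""
    total = 0.0
    n_tasks = Theta.shape[1]
    for i in range(n_tasks):
        for j in range(i + 1, n_tasks):
            total += A[i, j] * np.sum((Theta[:, i] - Theta[:, j]) ** 2)
    return total

# Hypothetical 3-task graph: tasks 0 and 1 strongly related, 1 and 2 weakly related.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
rng = np.random.default_rng(0)
Theta = rng.standard_normal((4, 3))

penalty_trace = laplacian_penalty(Theta, graph_laplacian(A))
penalty_pairs = pairwise_penalty(Theta, A)
print(penalty_trace, penalty_pairs)
```

Under a Gaussian Markov random field whose precision matrix is the Laplacian, this penalty is the negative log-density of the task parameters up to a factor of one half and an additive constant, so larger edge weights pull the corresponding task parameters closer together.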

2. Decentralized Learning Algorithms

Dec-MTRL leverages alternating minimization and decentralized diffusion protocols to solve low-rank multi-task regression, deep representation learning, or graph-based coupling. The principal algorithmic motifs are:

Alternating Projected Gradient and Minimization (AltGDmin): Each node alternates between locally minimizing the loss over the task-specific factors $B$ and updating the shared representation $U$ by projected gradient descent, with local diffusion or consensus to propagate updates (Kang et al., 29 Dec 2025, Kang et al., 27 Dec 2025).
- Diffusion steps mix each node's updated basis with those of its neighbors using the weights $W$ and then re-orthonormalize the result, maintaining orthonormality of the feature basis.
- Communication per iteration stays fixed per node, enabling scalability to large networks.
- The communication cost to reach a given accuracy is independent of the target accuracy $\epsilon$, in contrast to centralized methods, where cost grows with the required accuracy (Kang et al., 27 Dec 2025).
Dynamic Task-Driven Communication Graph Adaptation: Algorithms such as PD-MTL automatically infer beneficial task-to-task connections by periodically computing transference matrices from exchanged gradients, then spectrally clustering nodes to rewire the communication topology (Mortaheb et al., 2022). This can accelerate convergence and prevent negative transfer in highly heterogeneous environments.
Conflict-Averse Aggregation: Frameworks such as ColNet group nodes by task or label, perform intra-group local averaging, and then aggregate across groups with a Hyper-Conflict-Averse (HCA) aggregator, which optimizes a conflict-averse objective over the group backbone deltas (Feng et al., 17 Jan 2025).
Sheaf-Theoretic and Graph-Based Regularization: Using cellular sheaves, Dec-MTRL can encode heterogeneous model dimensions and flexible regularization via the sheaf Laplacian, with updates over both model parameters and edge interaction maps. This formulation subsumes graph-Laplacian-based multi-task learning as a special case (2502.01145).
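The alternating minimization and diffusion motif from the first item can be sketched on a noise-free linear problem as follows. This is a simplified illustration under stated assumptions, not the cited authors' implementation: the step-size rule, the initialization near the true basis, the four-node ring with uniform weights, the sign-fixed QR step, and all names (`dif_altgdmin`, `orthonormalize`, `subspace_distance`) are choices made for this example.

```python
import numpy as np

def orthonormalize(M):
    """QR orthonormalization with a fixed sign convention (positive diagonal of R),
    so that nearly equal inputs give nearly equal bases before averaging."""
    Q, R = np.linalg.qr(M)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def subspace_distance(U, U_ref):
    """Spectral norm of the component of U outside span(U_ref)."""
    return np.linalg.norm(U - U_ref @ (U_ref.T @ U), 2)

def dif_altgdmin(X, y, node_tasks, W, U0, n_iters=200, c=0.4):
    """Illustrative diffusion AltGDmin; X[t], y[t] hold the data of task t."""
    n_nodes = len(node_tasks)
    n = X[0].shape[0]
    U = [U0.copy() for _ in range(n_nodes)]
    for _ in range(n_iters):
        adapted = []
        for g, tasks in enumerate(node_tasks):
            # Minimization step: least squares for each local task factor b_t.
            B = np.column_stack([
                np.linalg.lstsq(X[t] @ U[g], y[t], rcond=None)[0] for t in tasks
            ])
            # Gradient of the local half squared loss with respect to the basis.
            grad = sum(
                X[t].T @ np.outer(X[t] @ U[g] @ B[:, k] - y[t], B[:, k])
                for k, t in enumerate(tasks)
            )
            eta = c / (n * np.linalg.norm(B, 2) ** 2)  # assumed step-size rule
            adapted.append(U[g] - eta * grad)
        # Diffusion step: mix neighbors' adapted bases, then re-orthonormalize.
        U = [
            orthonormalize(sum(W[g, h] * adapted[h] for h in range(n_nodes)))
            for g in range(n_nodes)
        ]
    return U

# Synthetic problem: 4 nodes on a ring, 12 tasks per node, rank-2 shared subspace.
rng = np.random.default_rng(0)
d, r, n, n_nodes, tasks_per_node = 20, 2, 50, 4, 12
T = n_nodes * tasks_per_node
U_star = orthonormalize(rng.standard_normal((d, r)))
B_star = rng.standard_normal((r, T))
X = [rng.standard_normal((n, d)) for _ in range(T)]
y = [X[t] @ U_star @ B_star[:, t] for t in range(T)]
node_tasks = [list(range(g * tasks_per_node, (g + 1) * tasks_per_node))
              for g in range(n_nodes)]
W = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]]) / 3.0
# Initialization near the truth, used here in place of a proper initialization.
U0 = orthonormalize(U_star + (0.2 / np.sqrt(d)) * rng.standard_normal((d, r)))

U_hat = dif_altgdmin(X, y, node_tasks, W, U0)
initial_sd = subspace_distance(U0, U_star)
final_sd = max(subspace_distance(U_g, U_star) for U_g in U_hat)
print("initial subspace distance:", initial_sd)
print("worst final subspace distance:", final_sd)
```

Each node exchanges only its `d x r` basis with its neighbors once per iteration, while the task factors and raw data never leave the node.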

3. Handling Heterogeneity: Data, Task, and Feature Types

Heterogeneity manifests across label distributions (label-skew), feature domains (covariate shift, concept shift), model architectures, and even the availability of particular tasks per node. Dec-MTRL approaches for these include:

Personalization via Model Decomposition: Partitioning the model into a shared encoder/backbone and client- or task-specific heads allows clients to address non-IID data and fundamental task divergence (Feng et al., 17 Jan 2025, Mortaheb et al., 2022).
Flexible Regularization: Employing task graphs (e.g., GMRF-Laplacian), sheaf interaction spaces, or adaptively clustering nodes based on observed task similarity or negative transfer estimates (2502.01145, Wan et al., 12 Oct 2025, Mortaheb et al., 2022).
Static versus Dynamic Group Formation: Some frameworks statically assign groups (e.g., by known task) with exogenous leader rotation (Feng et al., 17 Jan 2025), while others (e.g., PD-MTL) adapt the topology online (Mortaheb et al., 2022).
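The dynamic grouping idea can be illustrated with a small spectral clustering sketch. Here pairwise cosine similarity between node gradients stands in for the transference measure, and a sign split of the Fiedler vector stands in for the clustering step; both are substitutions made for this example, and the gradients are hypothetical values rather than data from the cited work.

```python
import numpy as np

def cosine_affinity(grads):
    """Pairwise cosine similarity of node gradients, clipped to be non-negative."""
    G = np.asarray(grads, dtype=float)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    S = np.clip(G @ G.T, 0.0, None)
    np.fill_diagonal(S, 0.0)
    return S

def spectral_bipartition(S):
    """Split nodes into two groups by the sign of the Fiedler vector,
    i.e. the eigenvector of the second-smallest Laplacian eigenvalue."""
    Lap = np.diag(S.sum(axis=1)) - S
    _, vecs = np.linalg.eigh(Lap)  # eigenvalues are returned in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Hypothetical gradients: nodes 0-2 roughly agree, as do nodes 3-5.
grads = [[1.00, 0.20], [1.00, 0.25], [0.95, 0.20],
         [0.20, 1.00], [0.25, 1.00], [0.20, 0.95]]
labels = spectral_bipartition(cosine_affinity(grads))
print(labels)
```

Nodes that share a label would then be connected to each other in the rewired topology. Which group receives label 0 or 1 is arbitrary, since an eigenvector is only defined up to sign.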

4. Theoretical Guarantees and Complexity

Recent Dec-MTRL analysis establishes sample, time, and communication complexity rates under linear and nonlinear task models:

Sample and Convergence Complexity: For linear Dec-MTRL with a rank-$r$ shared subspace, error $\epsilon$ is achievable once the sample size exceeds a threshold that depends on the condition number $\kappa$ and the incoherence parameter $\mu$ (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025). The numbers of gradient and consensus iterations scale logarithmically in the problem parameters.

Communication Complexity: Diffusion-based protocols keep the number of messages per iteration fixed, independent of the target accuracy, whereas centralized and multiple-consensus methods scale poorly with increasing accuracy or node count (Kang et al., 27 Dec 2025, Kang et al., 29 Dec 2025).
Optimization Guarantees: Subspace error contracts at a geometric rate under standard initialization and network assumptions. Cross-task Laplacian learning schemes come with established asymptotic normality and covariance estimation error rates, with bias that vanishes as the batch size grows (Wan et al., 12 Oct 2025).
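A short worked derivation shows how geometric contraction leads to a logarithmic iteration count. The symbols $\delta_k$ (subspace error after $k$ iterations), $\rho \in (0,1)$ (contraction factor), and $\epsilon$ (target error) are notation assumed for this illustration, and the numbers are hypothetical rather than taken from the cited analyses.

$$
\delta_{k+1} \le \rho\,\delta_k
\;\Longrightarrow\;
\delta_K \le \rho^{K}\delta_0 \le \epsilon
\quad\text{whenever}\quad
K \ge \frac{\log(\delta_0/\epsilon)}{\log(1/\rho)}.
$$

For example, with $\rho = 0.9$, $\delta_0 = 0.5$, and $\epsilon = 10^{-3}$, the bound gives $K \ge \log(500)/\log(1/0.9) \approx 58.99$, so 59 iterations suffice. Tightening $\epsilon$ by a factor of 10 adds only about 22 more iterations.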

5. Experimental Evidence and Empirical Insights

Empirical results demonstrate the effectiveness of Dec-MTRL in both synthetic and real-world scenarios:

Algorithm	Setting	Key Metric	Best Reported Value(s)
ColNet	CIFAR-10, label heterogeneity	F1 (Animal group)	0.769
ColNet	CelebA, task heterogeneity	F1 (Attribute)	0.605
PD-MTL	Synthetic Gaussian	Loss convergence	Clusters found (epoch 15)
Sheaf-FMTL	CIFAR-10, various	Bits saved vs dFedU	Significant savings
Dif-AltGDmin	Synthetic, large-scale	Subspace distance	Matches centralized

ColNet's two-phase aggregation and HCA account for substantial improvements in F1 and loss convergence under both label and task heterogeneity (Feng et al., 17 Jan 2025). PD-MTL dynamically reconfigures network topology for fast convergence and effective clustering in heterogeneous synthetic and CelebA benchmarks (Mortaheb et al., 2022). Sheaf-FMTL provides significant communication savings while robustly handling both feature and sample heterogeneity (2502.01145).

6. Implementation Considerations and Limitations

Several practical factors influence performance and deployment:

Leader Selection and Fault Tolerance: Rotating leaders in static groups improves robustness but is susceptible to stragglers and communication failures, necessitating future exploration of fault tolerance protocols (Feng et al., 17 Jan 2025).
Adaptive Topologies: Dynamic grouping based on data-driven similarity or transference metrics enhances adaptability but may incur computational overhead for frequent clustering and spectral Laplacian computation (Mortaheb et al., 2022).
Feature Heterogeneity: Sheaf-based models natively account for varying model/feature sizes through edge interaction spaces, but introduce storage and computation costs per node, making them most suitable for cross-silo deployments (a small number of powerful nodes) rather than cross-device deployments (2502.01145).
Communication vs Computation Trade-offs: Many methods trade increased local computation (e.g., updating interaction maps or full local covariances) for a reduction in global communication (2502.01145, Kang et al., 29 Dec 2025).
Security and Privacy: Malicious nodes can poison local updates (e.g., backbone deltas); robust aggregation schemes and privacy mechanisms (differential privacy, resource-aware scheduling) are open directions (Feng et al., 17 Jan 2025).

7. Future Directions and Open Challenges

Dec-MTRL is advancing rapidly, but several research horizons remain:

Strong theoretical guarantees for nonlinear and deep architectures (beyond linear subspaces or shallow networks) are not yet established for the fully decentralized setting (Kang et al., 29 Dec 2025).
Asynchronous and time-varying communications, straggler robustness, differential privacy, and resource-aware optimization require further algorithmic development (Feng et al., 17 Jan 2025, Mortaheb et al., 2022).
Universal frameworks that unify graph-based, cluster-based, and sheaf-based regularization—well-suited for broad heterogeneity and scale—have yet to be standardized.
Empirical validation on diverse, non-synthetic benchmarks and at scale (e.g., large networks, real-world data) is limited.
Exploration of dynamic peer-to-peer federated clustering driven directly by observed task transfer, negative correlation, or data similarity metrics is ongoing (Mortaheb et al., 2022).

Decentralized multi-task representation learning constitutes an essential component in next-generation federated and collaborative ML, promising scalability and efficiency under heterogeneity, while posing unique architectural and theoretical challenges distinct from both centralized and conventional decentralized learning paradigms.