Adaptive Task Weighting in MTL
- Adaptive task weighting is a dynamic approach that assigns varying importance to tasks based on loss, gradient, or performance metrics.
- It employs diverse strategies such as loss-based scaling, uncertainty weighting, and gradient conflict resolution to balance contributions in multi-task models.
- The technique improves multi-task learning performance across domains like computer vision and NLP by mitigating negative transfer and promoting faster convergence.
Adaptive task weighting in multi-task learning (MTL) refers to strategies that dynamically modulate the contribution of each task to the shared representation or global loss function, in order to promote positive transfer and mitigate negative transfer. Rather than relying on static, manually chosen task weights, adaptive weighting mechanisms use online signals—typically derived from loss, gradient, or performance dynamics—to automatically allocate learning capacity where it is most beneficial. Adaptive task weighting is central to modern MTL across domains including computer vision, natural language processing, recommendation, and quantitative biology.
1. Conceptual Foundation: The Role of Task Weighting in MTL
In multi-task neural architectures, a single model is usually trained on $T$ tasks, each with its own loss function $\mathcal{L}_t$. The canonical optimization objective is a convex combination:
$$\mathcal{L}_{\text{total}} = \sum_{t=1}^{T} w_t \,\mathcal{L}_t, \qquad w_t \ge 0,$$
where the $w_t$ are the task weights. Static weighting (e.g., $w_t = 1/T$ for all $t$) often leads to suboptimal outcomes due to divergent loss scales, convergence speeds, or intrinsic data difficulty among tasks (Huq et al., 2023). Adaptive weighting mechanisms address these challenges by reallocating weights online, ensuring that difficult or underperforming tasks are not starved, while suppressing signals from tasks exhibiting negative transfer or irreducible error (He et al., 2024).
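The weighted objective above can be sketched in a few lines of pure Python; the loss values here are illustrative, not from any cited experiment:

```python
def total_loss(losses, weights):
    """Convex combination of per-task losses: L_total = sum_t w_t * L_t."""
    assert len(losses) == len(weights)
    assert all(w >= 0 for w in weights)
    return sum(w * l for w, l in zip(weights, losses))

# Static uniform weighting (w_t = 1/T) is the baseline adaptive methods replace.
losses = [0.8, 2.5, 0.1]                  # per-task losses on one batch
uniform = [1 / len(losses)] * len(losses)
print(total_loss(losses, uniform))        # average of the three losses
```

Adaptive schemes differ only in how `weights` is recomputed during training; the combination step itself stays this simple.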
Negative transfer arises when knowledge from one task impairs another, particularly when gradients conflict or auxiliary tasks contain label noise. Adaptive weighting plays a central role in balancing positive and negative transfer, selectively amplifying beneficial signals while diminishing deleterious contributions (Yim et al., 2020).
2. Methodological Landscape of Adaptive Task Weighting
Numerous adaptive weighting paradigms have been established, differing in their analytical motivation, update granularity (task-level, class-level, or sample-level), and computational requirements.
| Method | Weighting Principle | Granularity |
|---|---|---|
| Loss-based Scaling | Proportional to instantaneous loss | Task |
| Uncertainty Weighting | Inverse to predicted uncertainty | Task |
| Excess Risk Balancing | Distance from task optimum | Task |
| Class-wise Differentiation | Per class (bin/label) gradient/impact | Class |
| Sample-level Alignment | Gradient alignment w.r.t. main task | Sample |
| Gradient Norm/Conflict | Magnitude or direction of gradients | Task/Gradient |
| Performance-Driven | Based on task-specific metrics | Task |
| Meta-Learned/Hyper | Outer-loop optimization of weights | Task (or finer) |
2.1 Loss and Performance-based Weighting
Several approaches adjust weights in proportion to the current loss, under the assumption that harder tasks (higher loss) merit greater emphasis. A canonical scheme sets
$$w_t = \frac{T \,\mathcal{L}_t}{\sum_{k=1}^{T} \mathcal{L}_k},$$
where $T$ is the number of tasks, ensuring the weights are normalized to sum to $T$ (Huq et al., 2023). Extensions use performance metrics such as accuracy, updating $w_t$ multiplicatively depending on whether $a_t$ (task $t$'s accuracy) falls below or exceeds the average accuracy (Mohamed et al., 29 May 2025). This is especially useful for multi-task classification settings (e.g., ChestX-ray14).
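A minimal sketch of loss-proportional weighting, assuming the normalization keeps the weights summing to the number of tasks (so the average weight stays 1, matching the uniform baseline):

```python
def loss_proportional_weights(losses):
    """w_t = T * L_t / sum_k L_k, so the weights sum to T (average weight 1)."""
    T = len(losses)
    total = sum(losses)
    return [T * l / total for l in losses]

losses = [0.5, 1.0, 1.5]
w = loss_proportional_weights(losses)
print(w)  # harder tasks (higher loss) receive larger weight
```

Note the failure mode discussed in Section 6: if one task's loss collapses toward zero, its weight does too, effectively orphaning that task.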
2.2 Uncertainty-driven and Analytical Weighting
Uncertainty-based weighting assigns greater focus to tasks with lower estimated prediction noise, as formalized by Kendall et al. (commonly known as "UW") (Kirchdorfer et al., 2024). The analytic derivation in "Soft Optimal Uncertainty Weighting" (UW-SO) replaces iteratively learned uncertainty parameters with a closed-form solution, $w_t \propto 1/\mathcal{L}_t$, followed by tempering and softmax normalization via a temperature hyperparameter $\tau$. The temperature controls the sharpness of the weighting and reduces issues with loss-scale divergence.
2.3 Excess Risk and Trainable Capacity-based Schemes
ExcessMTL (He et al., 2024) defines task weights via excess risk, i.e., the difference between the current and Bayes-optimal loss, approximated via a diagonal Fisher/AdaGrad-like accumulator. Updating the weights $w_t$ via exponentiated gradient ascent on this proxy robustly avoids overweighting noisy or irreducible-error tasks. This approach has demonstrated Pareto stationarity and resilience to label noise.
AdaTask (Yang et al., 2022) targets the “task dominance” problem by separating per-task accumulative gradient statistics for each parameter. It maintains distinct momentum and adaptive learning rate accumulators per task, ensuring that no single task's gradient history overwhelms the optimizer, thereby providing balanced parameter updates even in deep shared architectures.
2.4 Gradient-Conflict and Gradient-Norm Approaches
Gradient-based schemes include GradNorm (controls relative gradient magnitudes), PCGrad (projects away conflicting gradients), and recent extensions such as wPCGrad (Bohn et al., 2024), which modulate the conflicting gradient projection via a probability distribution (parameters set by, e.g., recent task loss raised to a focusing exponent). wPCGrad allows dynamic prioritization: when gradients for two tasks are in conflict, the probability of which task's gradient is preserved is adaptively reweighted.
2.5 Class-wise and Sample-level Weighting
Per-class strategies (as in (Yim et al., 2020)) assign weights to each class within a task, updating these based on the contribution of each class-conditional loss component to the main-task loss. This mechanism, especially for auxiliary tasks, can amplify positive transfer at the class level while suppressing negative or noisy classes.
Sample-level approaches (e.g., SLGrad (Grégoire et al., 2023)) compute per-sample alignment between auxiliary and main-task gradients, assigning zero weight to harmful (negatively aligned) samples. The result is an extremely fine-grained and meta-objective-driven filtering of training data, enabling selective suppression of noisy or adversarial auxiliary samples.
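The core filtering idea behind sample-level alignment can be sketched as follows; using the raw dot product as the alignment measure and a hard zero for negatively aligned samples is an assumption of this sketch, not SLGrad's exact meta-objective:

```python
import numpy as np

def sample_weights(main_grad, aux_sample_grads):
    """Weight each auxiliary sample by its gradient alignment with the main
    task; harmful (negatively aligned) samples are assigned weight zero."""
    weights = []
    for g in aux_sample_grads:
        align = float(np.dot(g, main_grad))  # alignment with main-task gradient
        weights.append(max(0.0, align))      # clip: conflicting samples get 0
    return np.array(weights)

main = np.array([1.0, 0.0])
aux = [np.array([0.5, 0.5]),    # aligned with main task -> positive weight
       np.array([-1.0, 0.2])]   # conflicting -> weight 0
print(sample_weights(main, aux))  # [0.5 0. ]
```

In practice the per-sample gradients would come from an extra backward pass, which is exactly the overhead Section 5 attributes to sample-level methods.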
3. Detailed Algorithms and Training Protocols
A representative selection of algorithms is summarized here.
3.1 Class-wise Finite-Difference Weighting (Yim et al., 2020)
- Decompose the auxiliary loss per class: $\mathcal{L}_{\text{aux}} = \sum_{c} w_c \,\mathcal{L}_{\text{aux}}^{(c)}$, with one weight $w_c$ per class.
- At each mini-batch, for each class $c$:
- Compute a finite-difference estimate of how the main-task loss changes with $w_c$.
- Multiply by the mean class loss for stability.
- Normalize, apply Adam, use ReLU to enforce non-negativity.
- Backpropagate main-task and weighted auxiliary losses jointly.
- Optionally freeze the class weights $w_c$ for warm-up epochs; hyperparameters are tuned on the main-task validation split.
- After training, retain only the refined main-task branch and shared encoder.
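The weight-update step above can be sketched as follows; the plain gradient step stands in for the paper's Adam update, and the shape of the finite-difference signal `delta_main` is an assumption of this sketch:

```python
import numpy as np

def update_class_weights(w, delta_main, class_losses, lr=0.01):
    """One simplified step of the class-wise weighting scheme.

    delta_main[c] approximates the finite-difference change in the main-task
    loss attributable to class c's auxiliary loss component."""
    grad = np.asarray(delta_main) * np.mean(class_losses)  # mean-loss scaling for stability
    w = w - lr * grad               # descend: classes that hurt the main task shrink
    w = np.maximum(w, 0.0)          # ReLU: enforce non-negativity
    return w / max(w.sum(), 1e-12)  # normalize

w = np.array([0.5, 0.5])
# class 0 helps the main task (negative delta), class 1 hurts it
w = update_class_weights(w, delta_main=[-1.0, 1.0], class_losses=[1.0, 1.0], lr=0.1)
print(w)  # weight shifts toward the helpful class
```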
3.2 Analytic Uncertainty-based Softmax Weighting (Kirchdorfer et al., 2024)
- For each batch, compute the per-task losses $\mathcal{L}_t$.
- Set $s_t = 1/\mathcal{L}_t$ (with stop-gradient on $\mathcal{L}_t$).
- Normalize via softmax with temperature $\tau$: $w_t = \dfrac{\exp(s_t/\tau)}{\sum_k \exp(s_k/\tau)}$.
- Form $\mathcal{L}_{\text{total}} = \sum_t w_t \,\mathcal{L}_t$; backpropagate.
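The steps above translate almost directly into code; this is a sketch of the weighting computation only (the stop-gradient is implicit here since the losses enter as plain numbers):

```python
import numpy as np

def uw_so_weights(losses, temperature=1.0):
    """Analytic uncertainty-style weighting: s_t = 1/L_t, then a tempered
    softmax over the s_t. Lower temperature -> sharper weighting."""
    s = 1.0 / np.asarray(losses, dtype=float)
    z = s / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

losses = [0.2, 1.0, 5.0]
w = uw_so_weights(losses, temperature=1.0)
print(w)  # low-loss (low-uncertainty) tasks receive the largest weight
```

Sweeping `temperature` interpolates between near-uniform weights (large `temperature`) and winner-take-all behavior (small `temperature`), which is how the method controls sensitivity to loss-scale divergence.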
3.3 AdaTask: Per-task Adaptive Optimizers (Yang et al., 2022)
- Maintain first- and second-moment accumulators $m_t$, $v_t$ for each parameter and each task, i.e., distinct Adam-style statistics per task.
- Compute per-task updates and sum them: $\Delta\theta = -\eta \sum_t m_t / (\sqrt{v_t} + \epsilon)$.
- Empirically, rAU (the ratio of Average squared Update per task) is balanced across all parameters, especially in the upper shared layers.
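A minimal sketch of the per-task accumulator idea (bias correction omitted for brevity, which is a simplification relative to a full Adam update):

```python
import numpy as np

def adatask_step(theta, task_grads, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaTask-style update: each task keeps its own Adam-style moments,
    and the per-task normalized updates are summed."""
    update = np.zeros_like(theta)
    for t, g in enumerate(task_grads):
        m, v = state[t]
        m = b1 * m + (1 - b1) * g          # per-task first moment
        v = b2 * v + (1 - b2) * g * g      # per-task second moment
        state[t] = (m, v)
        update += m / (np.sqrt(v) + eps)   # task t's adaptively scaled step
    return theta - lr * update

theta = np.zeros(3)
state = {0: (np.zeros(3), np.zeros(3)), 1: (np.zeros(3), np.zeros(3))}
# task 1's gradients are 100x larger, but per-task normalization
# prevents it from dominating the shared parameter update
grads = [np.array([0.01, 0.0, 0.01]), np.array([1.0, 1.0, 0.0])]
theta = adatask_step(theta, grads, state)
print(theta)
```

With a single shared accumulator, task 1 would swamp the statistics; here each task's step is normalized by its own gradient history, so both contribute comparably.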
3.4 ExcessMTL: Excess-Risk-based Exponentiated Gradient (He et al., 2024)
- For each task $t$, compute the per-step gradient $g_t$.
- Accumulate squared gradients, $H_t \leftarrow H_t + g_t \odot g_t$, as a diagonal Fisher approximation.
- Estimate the excess risk via the proxy $\hat{r}_t = g_t^{\top} H_t^{-1} g_t$.
- Update the weight vector by exponentiated gradient ascent, $w_t \leftarrow w_t \exp(\eta\, \hat{r}_t)$, then renormalize.
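A sketch of these steps, assuming the diagonal accumulator and exponentiated-gradient form described above (this condenses the paper's recipe, not reproduces it exactly):

```python
import numpy as np

def excess_mtl_step(weights, grads, H, step=0.1, eps=1e-8):
    """One ExcessMTL-style update: accumulate squared gradients into a
    diagonal H_t, estimate excess risk as g^T H^{-1} g, then take an
    exponentiated-gradient step on the task weights and renormalize."""
    risks = []
    for t, g in enumerate(grads):
        H[t] = H[t] + g * g                              # diagonal Fisher accumulator
        risks.append(float(np.sum(g * g / (H[t] + eps))))  # g^T H^{-1} g, diagonal case
    w = weights * np.exp(step * np.asarray(risks))       # exponentiated gradient ascent
    return w / w.sum(), H

w = np.array([0.5, 0.5])
H = [np.full(2, 0.5), np.full(2, 0.5)]  # accumulated history from earlier steps
# task 0 still has large gradients (far from its optimum); task 1 is near converged
grads = [np.array([1.0, 1.0]), np.array([1e-3, 1e-3])]
w, H = excess_mtl_step(w, grads, H)
print(w)  # the task with the larger excess-risk proxy gains weight
```

A near-converged or irreducible-error task produces tiny gradients relative to its accumulated history, so its excess-risk proxy, and hence its weight, stays low; this is the noise-robustness mechanism the text describes.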
3.5 Gradient Projection with Task Prioritization (Bohn et al., 2024)
- Maintain a probability vector over tasks based on recent task losses (e.g., proportional to $\mathcal{L}_t^{\gamma}$ for a focusing exponent $\gamma$).
- For each batch, sample an anchor task from this distribution.
- When gradient conflicts exist, project conflicting gradients onto the anchor task's gradient; otherwise, sum as usual.
- Update probabilities after each epoch.
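The projection step can be sketched as follows; the exact sampling schedule and projection details here are a simplified reading of the scheme, not the paper's full algorithm:

```python
import numpy as np

def wpcgrad_combine(grads, losses, gamma=2.0, rng=None):
    """PCGrad-style combination with loss-driven anchor sampling: pick an
    anchor task with probability proportional to loss**gamma, then project
    any gradient that conflicts with the anchor onto its normal plane."""
    if rng is None:
        rng = np.random.default_rng(0)
    p = np.asarray(losses, dtype=float) ** gamma
    p = p / p.sum()
    anchor = rng.choice(len(grads), p=p)   # high-loss tasks are favored as anchor
    g_a = grads[anchor]
    combined = np.zeros_like(g_a)
    for t, g in enumerate(grads):
        if t != anchor and np.dot(g, g_a) < 0:                  # conflict detected
            g = g - (np.dot(g, g_a) / np.dot(g_a, g_a)) * g_a   # project away component
        combined += g
    return combined, anchor

gs = [np.array([1.0, 0.0]), np.array([-1.0, 1.0])]  # conflicting task gradients
combined, anchor = wpcgrad_combine(gs, losses=[5.0, 0.1])
print(anchor, combined)
```

After projection, every contribution has a non-negative dot product with the anchor's gradient, so the combined update never moves against the currently prioritized task.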
4. Empirical Results and Comparative Performance
Multiple published results demonstrate the benefits of adaptive task weighting, with consistent superiority over static weighting and baseline heuristics.
| Method / Dataset | Key Metric(s) | Result / Improvement | Reference |
|---|---|---|---|
| Class-wise weighting | Main-task loss (Cityscapes, NYUv2, vKITTI) | Lowest, outperforms Pareto MTL | (Yim et al., 2020) |
| HydaLearn | AUC (MIMIC, Fannie Mae) | 0.839 (vs. GradNorm 0.767) | (Verboven et al., 2020) |
| AdaTask | Avg RMSE (synthetic) | 0.056 (–38% vs GradNorm) | (Yang et al., 2022) |
| DeepChest | Avg Acc (ChestX-ray14) | 94.96% (+7.4pp over prior) | (Mohamed et al., 29 May 2025) |
| ExcessMTL | Pareto-front retention (noisy MTL) | Maintains clean task performance | (He et al., 2024) |
| wPCGrad (DTP) | mAP/NDS (nuScenes); Acc (CelebA) | +7.2% mAP; +0.4%–0.9% Acc. | (Bohn et al., 2024) |
| Soft Uncertainty (UW-SO) | Δₘ NYUv2, Cityscapes | Matches Scalarization, outperforms others | (Kirchdorfer et al., 2024) |
Empirical findings indicate that class-wise and sample-wise schemes (e.g., SLGrad, class-wise Adamized weights) provide additional robustness against label noise and modally different task structures. Gradient conflict-aware approaches (PCGrad, wPCGrad) can further improve performance in highly conflictual multitask setups, yielding gains especially in safety-critical or balancing-sensitive tasks.
5. Hyperparameterization, Implementation, and Practical Insights
Several methods introduce minimal or no tunable hyperparameters (e.g., loss-normalized weighting, analytic uncertainty weighting), while others require careful tuning (learning rate for adaptive weights, Adam parameters, temperature for softmax, exponent for probability-based projection). Augmenting adaptive weight updates with initial warm-up or normalization is frequently recommended to stabilize training curves (Yim et al., 2020, Yang et al., 2022).
Implementation overhead varies: sample-level and gradient-conflict methods (SLGrad, PCGrad, wPCGrad) may incur an extra or double backward pass per batch, while performance-driven and analytic uncertainty methods add negligible cost.
Recommendations based on empirical studies:
- Warm-up adaptive weights for several epochs before enabling dynamic updates (Yim et al., 2020).
- Use mini-batches large enough for reliable per-task or per-class statistics; avoid extremely small batches for sample-level methods.
- For modern large-scale models, the capacity often mitigates the absolute influence of task weighting; however, method choice remains critical in resource-constrained or highly imbalanced tasks (Kirchdorfer et al., 2024).
- For tasks susceptible to significant label noise or Bayes-error variation, excess-risk-based methods are uniquely robust (He et al., 2024).
6. Challenges, Limitations, and Future Directions
Adaptive task weighting is not a panacea; it can occasionally lead to undesirable outcomes:
- If a task's loss collapses to near-zero, loss- or performance-based weighting can orphan that task (weight $w_t \to 0$) (Huq et al., 2023).
- High-variance loss landscapes, especially under strong data augmentation or in low-data regimes, can induce oscillatory or unstable weights.
- Methods reliant on a reference metric (main-task validation) may transfer bias or noise if the reference itself is not reliable (Grégoire et al., 2023).
- Gradient-conflict-based approaches (PCGrad, wPCGrad) provide benefit only when gradient conflicts are sufficiently frequent; otherwise, they reduce to uniform weighting (Bohn et al., 2024).
Suggested directions include better integration of excess-risk and gradient-conflict signals, full meta-learning of both weight update rules and their hyperparameters, and scalable application to highly multi-modal or hierarchical tasks. Theoretical convergence analyses, especially for schemes that adjust optimizer parameters or exploit sample-level statistics, remain open in the non-convex regime (Yang et al., 2022, He et al., 2024).
7. Summary Table of Exemplary Adaptive Weighting Methods
| Approach | Principle | Notable Features | Key Reference |
|---|---|---|---|
| Class-wise finite diff | Class-conditional loss delta | Per-class Adam update, stabilizes transfer | (Yim et al., 2020) |
| HydaLearn | Mini-batch meta-gain for main task | Per-batch dynamic, uses fake look-ahead | (Verboven et al., 2020) |
| AdaTask | Per-task adaptive optimizer stats | Restores per-task rates, balances high layers | (Yang et al., 2022) |
| ExcessMTL | Excess risk (distance to optimum) | Downweights noisy/irreducible tasks | (He et al., 2024) |
| SLGrad | Sample-level gradient alignment | Fine-grained, filters harmful aux samples | (Grégoire et al., 2023) |
| DeepChest | Accuracy-driven, gradient-free | Fast, model-agnostic, memory-efficient | (Mohamed et al., 29 May 2025) |
| wPCGrad | Probabilistic conflict projection | Gradient-level prioritization in conflict | (Bohn et al., 2024) |
Adaptive task weighting is foundational for robust, efficient, and fair multi-task learning, with a spectrum of techniques optimized for different loss structures, computational budgets, and robustness requirements. The methodological diversity—from class-wise Adam updates to fully gradient-free epoch-level heuristics—enables practitioners to select or compose schemes optimized for their specific domain and problem instance.