GradNorm-Based Dynamic Loss Balancing

Updated 26 November 2025

The paper introduces a method that dynamically adapts loss weights by equalizing gradient norms, ensuring balanced training progress across tasks.
GradNorm-based dynamic loss balancing is defined as an algorithm that adapts weights in multi-task deep learning, particularly in PINNs and PDE-constrained problems, using network scaling and physical normalization.
Empirical results show that while GradNorm outperforms static methods, its performance significantly improves when combined with explicit normalization to handle extreme loss scale disparities.

GradNorm-based dynamic loss balancing refers to a class of algorithms that adaptively reweight multiple loss components in deep neural network training, with the specific goal of maintaining balanced optimization progress across tasks. These methods are especially relevant in multi-task and multi-physics contexts, including physics-informed neural networks (PINNs) and multi-objective PDE-constrained learning, where loss terms may differ by several orders of magnitude and simple static weighting schemes fail to achieve satisfactory convergence or generalization. GradNorm is the archetypal algorithm in this class, dynamically tuning loss weights to equalize per-task gradient magnitudes, often in conjunction with network normalization or scaling layers to further address divergent loss scales. Recent developments further integrate explicit physical scaling and scale normalization to yield robust, principled loss balancing across highly heterogeneous domains.

1. Motivation and Problem Formulation

In multi-task learning, PINNs, and multi-physics models, the training objective takes the form of a weighted sum of $N$ losses,

$L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$

where $\theta$ are network parameters, $L_i$ is the $i$ th task or physics loss (e.g., data misfit, PDE residual, boundary/initial condition), and $w_i(t)$ is a task weight, possibly time-dependent or adaptive. The individual losses $L_i$ may arise from fundamentally distinct physical laws, observables, or constraints, and thus exhibit non-comparable scales and gradients. Unbalanced losses often lead to premature overfitting, stagnation, or domination of one task at the expense of others. In poroelastography, for instance, $L_i$ may correspond to real/imaginary components of Biot momentum/mass PDEs, and material parameters may span several decades in value (Xu et al., 2024).

2. GradNorm: Algorithmic Principles

GradNorm, introduced by Chen et al. (Chen et al., 2017), automates loss weighing by equalizing the relative training rates of all loss components via dynamic adaptation of $w_i$ . The approach includes:

Relative training rate: For each task, define the normalized inverse rate,

$r_i(t) = \frac{L_i(t)/L_i(0)}{\sum_{j=1}^N L_j(t)/L_j(0)}$

where $L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 0 is the initial value for normalization.

Gradient norm computation: For every weighted task, compute the gradient norm with respect to shared parameters $L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 1,

$L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 2

The mean gradient norm is $L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 3.

Target matching and auxiliary loss: For a hyperparameter $L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 4,

$L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 5

and define the auxiliary loss,

$L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 6

Weight update: Update $L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 7 by descending $L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 8,

$L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),$ 9

followed by renormalization $\theta$ 0 to preserve loss scale.

The algorithm ensures that tasks with slower progress (higher $\theta$ 1) receive greater gradient magnitude, thereby accelerating their learning; rapidly converging or over-represented tasks have $\theta$ 2 reduced.

3. Integration with Network Scaling and Physical Normalization

A key challenge identified in multi-physics and multi-scale problems is that gradient-based balancing alone is insufficient when loss terms are separated by large intrinsic scale differences (Xu et al., 2024). To address this, the network scaling approach represents each property map $\theta$ 3 as the product of a unit shape function learned by an MLP (with $\theta$ 4 weights) and a scaling factor $\theta$ 5 tailored to the physical property (e.g., permeability, shear modulus). The final output for each property is

$\theta$ 6

with $\theta$ 7 selected from plausible physical scales. This architectural normalization stabilizes parameter magnitudes, enforces explicit correspondence between network output scales and physical reality, and underpins explicit scale estimation for each $\theta$ 8 and its gradient, enabling fair balancing by GradNorm or related methods.

Dynamic scaling (termed "DynScl" in (Xu et al., 2024)) further extends this by analytically setting the weights to equalize both the scale and Lipschitz constants of all loss terms, automatically normalizing derivatives even before dynamic balancing.

4. Implementation Workflows and Algorithmic Details

A canonical training loop for GradNorm-based loss balancing includes:

Forward computation of each $\theta$ 9 and total loss $L_i$ 0.
Backward pass to update standard network parameters $L_i$ 1.
Computation of per-loss weighted gradients, $L_i$ 2.
Calculation of normalized rates $L_i$ 3 and targets $L_i$ 4.
Formulation and backward computation of $L_i$ 5, the auxiliary loss over $L_i$ 6.
Gradient step on $L_i$ 7, followed by normalization.

By contrast, algorithms such as SoftAdapt set $L_i$ 8 using a softmax over recent loss decrease rates, but do not reference gradient norms or physical scale, resulting in heuristic rather than principled balancing (Xu et al., 2024).

5. Empirical Evidence and Comparative Performance

The effectiveness of GradNorm, especially when combined with network scaling, has been evaluated in a variety of test cases (Xu et al., 2024, Chen et al., 2017, Bischof et al., 2021). Notable outcomes include:

Method	Max Rel. Error (High-Permeability Region)	Observed Characteristics
Equal Weights	100%	Divergence or stagnation without scaling
SoftAdapt	9.8%	Unstable; some parameter estimates diverge
GradNorm (GN)	18.4%	O(1) weights; some tasks remain unconverged
Dynamic Scaling	2.2%	Near-uniform convergence, low final error

These results demonstrate that network scaling is essential: without architectural output normalization, all balancing schemes fail. GradNorm outperforms static weights but can leave some tasks unconverged. Purely physics-driven scaling (dynamic scaling) yields the most consistent and robust accuracy, with all sublosses decaying at matched rates and parameter errors $L_i$ 9 (Xu et al., 2024).

In PINN benchmarks (Bischof et al., 2021), GradNorm provides reliable training progress balancing for moderate numbers of comparably scaled losses but incurs extra computational overhead (e.g., $i$ 0130 seconds per 1000 steps for 9 loss terms). Performance deteriorates when loss components differ by $i$ 1 in scale, unless combined with additional normalization.

6. Strengths, Limitations, and Practical Guidance

GradNorm-based dynamic loss balancing offers principled adaptation without ad hoc hyperparameter search for weights:

Strengths:
- Automated rebalancing across loss terms.
- Prevents domination of "easy" or large-scale losses.
- Encourages uniform progress on all objectives (Chen et al., 2017, Xu et al., 2024, Bischof et al., 2021).
Limitations:
- Computational cost increases linearly with the number of losses.
- Efficacy diminishes in the presence of extreme loss scale imbalance unless architectural normalization (network scaling) or explicit physics-based weights ("dynamic scaling") are applied (Xu et al., 2024, Bischof et al., 2021).
- Requires tuning of an additional hyperparameter $i$ 2 (restoring force).
- In larger-scale problems or with highly heterogeneous task scaling, lighter-weight or analytic normalization methods may outperform GradNorm.
Practical advice:
- Initialize all $i$ 3 and set $i$ 4; use lower values if loss scales differ by several orders of magnitude.
- Employ network scaling to ensure all outputs, derivatives, and losses operate on comparable numerical scales.
- When $i$ 5, consider updating $i$ 6 at reduced frequency to control overhead (Bischof et al., 2021).

7. Synthesis and Outlook

GradNorm-based dynamic loss balancing combines adaptive gradient norm matching with, increasingly, explicit scale normalization at both the network and loss levels. This synergy is particularly effective in multi-physics and multi-objective regimes with intrinsic scale disparity, as in poroelastography and PINNs. While GradNorm remains a leading method for dynamic adaptation across tasks, empirical evidence is unequivocal that its real-world applicability hinges on normalization—either via neural architectural design (network scaling), analytic physical weighting, or a hybrid. In scenarios where losses are comparable and the number of tasks is moderate, GradNorm provides an automated and robust solution. With increasing complexity, direct analytic scale-matching or lighter-weight schemes become essential to maintain convergence, stability, and computational efficiency (Xu et al., 2024, Bischof et al., 2021).