Papers
Topics
Authors
Recent
Search
2000 character limit reached

GradNorm-Based Dynamic Loss Balancing

Updated 26 November 2025
  • The paper introduces a method that dynamically adapts loss weights by equalizing gradient norms, ensuring balanced training progress across tasks.
  • GradNorm-based dynamic loss balancing is defined as an algorithm that adapts weights in multi-task deep learning, particularly in PINNs and PDE-constrained problems, using network scaling and physical normalization.
  • Empirical results show that while GradNorm outperforms static methods, its performance significantly improves when combined with explicit normalization to handle extreme loss scale disparities.

GradNorm-based dynamic loss balancing refers to a class of algorithms that adaptively reweight multiple loss components in deep neural network training, with the specific goal of maintaining balanced optimization progress across tasks. These methods are especially relevant in multi-task and multi-physics contexts, including physics-informed neural networks (PINNs) and multi-objective PDE-constrained learning, where loss terms may differ by several orders of magnitude and simple static weighting schemes fail to achieve satisfactory convergence or generalization. GradNorm is the archetypal algorithm in this class, dynamically tuning loss weights to equalize per-task gradient magnitudes, often in conjunction with network normalization or scaling layers to further address divergent loss scales. Recent developments further integrate explicit physical scaling and scale normalization to yield robust, principled loss balancing across highly heterogeneous domains.

1. Motivation and Problem Formulation

In multi-task learning, PINNs, and multi-physics models, the training objective takes the form of a weighted sum of NN losses,

L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),

where θ\theta are network parameters, LiL_i is the iith task or physics loss (e.g., data misfit, PDE residual, boundary/initial condition), and wi(t)w_i(t) is a task weight, possibly time-dependent or adaptive. The individual losses LiL_i may arise from fundamentally distinct physical laws, observables, or constraints, and thus exhibit non-comparable scales and gradients. Unbalanced losses often lead to premature overfitting, stagnation, or domination of one task at the expense of others. In poroelastography, for instance, LiL_i may correspond to real/imaginary components of Biot momentum/mass PDEs, and material parameters may span several decades in value (Xu et al., 2024).

2. GradNorm: Algorithmic Principles

GradNorm, introduced by Chen et al. (Chen et al., 2017), automates loss weighing by equalizing the relative training rates of all loss components via dynamic adaptation of wiw_i. The approach includes:

  • Relative training rate: For each task, define the normalized inverse rate,

ri(t)=Li(t)/Li(0)j=1NLj(t)/Lj(0)r_i(t) = \frac{L_i(t)/L_i(0)}{\sum_{j=1}^N L_j(t)/L_j(0)}

where L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),0 is the initial value for normalization.

  • Gradient norm computation: For every weighted task, compute the gradient norm with respect to shared parameters L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),1,

L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),2

The mean gradient norm is L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),3.

  • Target matching and auxiliary loss: For a hyperparameter L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),4,

L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),5

and define the auxiliary loss,

L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),6

  • Weight update: Update L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),7 by descending L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),8,

L(θ)=i=1Nwi(t)Li(θ),L(\theta) = \sum_{i=1}^N w_i(t) L_i(\theta),9

followed by renormalization θ\theta0 to preserve loss scale.

The algorithm ensures that tasks with slower progress (higher θ\theta1) receive greater gradient magnitude, thereby accelerating their learning; rapidly converging or over-represented tasks have θ\theta2 reduced.

3. Integration with Network Scaling and Physical Normalization

A key challenge identified in multi-physics and multi-scale problems is that gradient-based balancing alone is insufficient when loss terms are separated by large intrinsic scale differences (Xu et al., 2024). To address this, the network scaling approach represents each property map θ\theta3 as the product of a unit shape function learned by an MLP (with θ\theta4 weights) and a scaling factor θ\theta5 tailored to the physical property (e.g., permeability, shear modulus). The final output for each property is

θ\theta6

with θ\theta7 selected from plausible physical scales. This architectural normalization stabilizes parameter magnitudes, enforces explicit correspondence between network output scales and physical reality, and underpins explicit scale estimation for each θ\theta8 and its gradient, enabling fair balancing by GradNorm or related methods.

Dynamic scaling (termed "DynScl" in (Xu et al., 2024)) further extends this by analytically setting the weights to equalize both the scale and Lipschitz constants of all loss terms, automatically normalizing derivatives even before dynamic balancing.

4. Implementation Workflows and Algorithmic Details

A canonical training loop for GradNorm-based loss balancing includes:

  1. Forward computation of each θ\theta9 and total loss LiL_i0.
  2. Backward pass to update standard network parameters LiL_i1.
  3. Computation of per-loss weighted gradients, LiL_i2.
  4. Calculation of normalized rates LiL_i3 and targets LiL_i4.
  5. Formulation and backward computation of LiL_i5, the auxiliary loss over LiL_i6.
  6. Gradient step on LiL_i7, followed by normalization.

By contrast, algorithms such as SoftAdapt set LiL_i8 using a softmax over recent loss decrease rates, but do not reference gradient norms or physical scale, resulting in heuristic rather than principled balancing (Xu et al., 2024).

5. Empirical Evidence and Comparative Performance

The effectiveness of GradNorm, especially when combined with network scaling, has been evaluated in a variety of test cases (Xu et al., 2024, Chen et al., 2017, Bischof et al., 2021). Notable outcomes include:

Method Max Rel. Error (High-Permeability Region) Observed Characteristics
Equal Weights 100% Divergence or stagnation without scaling
SoftAdapt 9.8% Unstable; some parameter estimates diverge
GradNorm (GN) 18.4% O(1) weights; some tasks remain unconverged
Dynamic Scaling 2.2% Near-uniform convergence, low final error

These results demonstrate that network scaling is essential: without architectural output normalization, all balancing schemes fail. GradNorm outperforms static weights but can leave some tasks unconverged. Purely physics-driven scaling (dynamic scaling) yields the most consistent and robust accuracy, with all sublosses decaying at matched rates and parameter errors LiL_i9 (Xu et al., 2024).

In PINN benchmarks (Bischof et al., 2021), GradNorm provides reliable training progress balancing for moderate numbers of comparably scaled losses but incurs extra computational overhead (e.g., ii0130 seconds per 1000 steps for 9 loss terms). Performance deteriorates when loss components differ by ii1 in scale, unless combined with additional normalization.

6. Strengths, Limitations, and Practical Guidance

GradNorm-based dynamic loss balancing offers principled adaptation without ad hoc hyperparameter search for weights:

  • Strengths:
  • Limitations:
    • Computational cost increases linearly with the number of losses.
    • Efficacy diminishes in the presence of extreme loss scale imbalance unless architectural normalization (network scaling) or explicit physics-based weights ("dynamic scaling") are applied (Xu et al., 2024, Bischof et al., 2021).
    • Requires tuning of an additional hyperparameter ii2 (restoring force).
    • In larger-scale problems or with highly heterogeneous task scaling, lighter-weight or analytic normalization methods may outperform GradNorm.
  • Practical advice:
    • Initialize all ii3 and set ii4; use lower values if loss scales differ by several orders of magnitude.
    • Employ network scaling to ensure all outputs, derivatives, and losses operate on comparable numerical scales.
    • When ii5, consider updating ii6 at reduced frequency to control overhead (Bischof et al., 2021).

7. Synthesis and Outlook

GradNorm-based dynamic loss balancing combines adaptive gradient norm matching with, increasingly, explicit scale normalization at both the network and loss levels. This synergy is particularly effective in multi-physics and multi-objective regimes with intrinsic scale disparity, as in poroelastography and PINNs. While GradNorm remains a leading method for dynamic adaptation across tasks, empirical evidence is unequivocal that its real-world applicability hinges on normalization—either via neural architectural design (network scaling), analytic physical weighting, or a hybrid. In scenarios where losses are comparable and the number of tasks is moderate, GradNorm provides an automated and robust solution. With increasing complexity, direct analytic scale-matching or lighter-weight schemes become essential to maintain convergence, stability, and computational efficiency (Xu et al., 2024, Bischof et al., 2021).

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Gradnorm-based Dynamic Loss Balancing.