
Gradient-Aware Logit Adjustment (GALA)

Updated 10 February 2026
  • Gradient-Aware Logit Adjustment (GALA) is a loss formulation for imbalanced, long-tailed classification that dynamically adjusts logits using per-class gradient statistics.
  • It modifies training by incorporating running averages of positive and negative gradient magnitudes to counteract head-class bias in deep networks.
  • A post-hoc L1 normalization strategy further re-balances prediction scores, achieving notable top-1 accuracy improvements on benchmarks like CIFAR100-LT and iNaturalist2018.

Gradient-Aware Logit Adjustment (GALA) is a loss formulation designed for long-tailed classification, where the data distribution is highly imbalanced across classes. In such scenarios, conventional deep classifiers exhibit substantial bias toward head classes because their weights experience disproportionately large positive and negative gradients during training. GALA counters this by dynamically adjusting logits according to per-class accumulated gradient statistics, thus achieving more balanced optimization. Post-training, a test-time normalization strategy can further mitigate residual head-class bias, leading to top-1 accuracy improvements on benchmarks including CIFAR100-LT, Places-LT, and iNaturalist2018 (Zhang et al., 2024).

1. Long-Tailed Classification and Gradient Imbalance

A long-tailed classification setting is defined over $C$ classes, with each class $i$ possessing $n_i$ training samples. The degree of imbalance is quantified via the imbalance factor (IF):

$$\text{IF} = \frac{\max_i n_i}{\min_i n_i}$$
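As a quick illustration, the imbalance factor can be computed directly from per-class sample counts; the sketch below uses an invented toy label set, not data from the paper:

```python
from collections import Counter

def imbalance_factor(labels):
    """Imbalance factor: most-frequent class count divided by least-frequent."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy long-tailed labels: class 0 is a head class, class 2 a tail class.
labels = [0] * 100 + [1] * 10 + [2] * 2
print(imbalance_factor(labels))  # 100 / 2 = 50.0
```

CIFAR100-LT with IF = 100, for example, keeps 100x more samples of its most frequent class than of its rarest.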

Deep networks operating in this regime produce features $x \in \mathbb{R}^d$ and logits

$$z_j = \omega_j^T x + b_j, \quad j = 1, \ldots, C$$

where $\omega_j$ is the $j$th class weight vector and $b_j$ is a bias term.

Standard training with cross-entropy loss

$$\mathcal{L}_{\rm CE}(x, y) = -\log \frac{e^{z_y}}{\sum_{j=1}^C e^{z_j}}$$

yields per-class gradients

$$\frac{\partial \mathcal{L}_{\rm CE}}{\partial \omega_j} = (p_j - \mathbf{1}\{j=y\})\,x, \quad p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

Due to skewed sample counts, class weights for head classes accumulate much larger magnitudes of both positive gradients (from in-class samples) and negative gradients (from other classes). In contrast, tail-class weights experience overwhelming negative gradients originating from frequent head-class samples, exacerbating prediction bias.
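The closed-form gradient above is easy to verify numerically. The following NumPy sketch (toy dimensions and random data, not from the paper) compares $(p_j - \mathbf{1}\{j=y\})\,x$ against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 4, 3                      # number of classes, feature dimension (toy sizes)
W = rng.normal(size=(C, d))      # class weight vectors omega_j as rows
x = rng.normal(size=d)           # feature vector
y = 2                            # true label

def ce_loss(W):
    z = W @ x                    # logits z_j = omega_j^T x
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y])

# Closed-form gradient: dL/d omega_j = (p_j - 1{j = y}) x
z = W @ x
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
grad = (p - np.eye(C)[y])[:, None] * x[None, :]

# Central finite-difference estimate, entry by entry
eps = 1e-6
num = np.zeros_like(W)
for j in range(C):
    for k in range(d):
        Wp, Wm = W.copy(), W.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        num[j, k] = (ce_loss(Wp) - ce_loss(Wm)) / (2 * eps)

print(np.abs(grad - num).max())  # near zero, up to finite-difference error
```

Note that the gradient rows for $j \neq y$ all point along $+x$ scaled by $p_j$; with many head-class samples, these accumulate into the large negative pressure on tail-class weights described above.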

2. GALA Loss Formulation

2.1. From Frequency-based to Gradient-based Adjustment

Prior work (Menon et al., 2021) compensates for imbalance by augmenting logits with terms proportional to $\log n_j$, the log-frequency of each class. GALA generalizes this idea: rather than relying on static class-size statistics, it uses dynamically accumulated gradient magnitudes for logit adjustment.

2.2. Tracking Accumulated Gradient Magnitudes

GALA maintains two momentum-driven statistics for each class $j$:

  • $\theta_j$ – running average of accumulated positive gradient magnitudes (from samples for which $j$ is the correct label)
  • $\phi_j$ – running average of accumulated negative gradient magnitudes that class-$j$ samples induce on all other class weights

The statistics are updated per mini-batch as follows, with momentum parameter $\mu \in [0,1)$:

$$\theta_j \leftarrow \mu\,\theta_j + (1-\mu)\left\|\frac{1}{|B_j|}\sum_{i\in B_j} \frac{\partial\mathcal{L}}{\partial\omega_j}(x_i,y_i)\right\|$$

$$\phi_j \leftarrow \mu\,\phi_j + (1-\mu)\,\frac{1}{|B_j|}\sum_{i\in B_j} \sum_{k\neq j} \left\|\frac{\partial\mathcal{L}}{\partial\omega_k}(x_i,y_i)\right\|$$

where $B_j$ is the set of samples in the current batch with label $j$.
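The two updates can be sketched as follows (NumPy; the function name, array layout, and batch handling are illustrative choices, not the authors' reference implementation):

```python
import numpy as np

def update_gradient_stats(theta, phi, grads, labels, mu=0.99):
    """Update per-class running gradient statistics in place.

    theta[j]: running norm of the mean positive gradient on omega_j
    phi[j]:   running mean of negative-gradient norms induced by class-j samples
    grads:    array of shape (B, C, d), grads[i, k] = dL/d omega_k for sample i
    labels:   array of shape (B,) with class labels y_i
    """
    C = theta.shape[0]
    for j in range(C):
        idx = np.where(labels == j)[0]          # batch subset B_j with label j
        if idx.size == 0:
            continue                            # class absent from this batch
        # Positive statistic: norm of the averaged gradient on omega_j
        g_pos = np.linalg.norm(grads[idx, j].mean(axis=0))
        # Negative statistic: mean over B_j of summed norms on omega_k, k != j
        norms = np.linalg.norm(grads[idx], axis=2)          # shape (|B_j|, C)
        g_neg = (norms.sum(axis=1) - norms[:, j]).mean()
        theta[j] = mu * theta[j] + (1 - mu) * g_pos
        phi[j]   = mu * phi[j]   + (1 - mu) * g_neg
    return theta, phi
```

With a high momentum such as $\mu = 0.99$, $\theta_j$ and $\phi_j$ change slowly, smoothing over batches in which a tail class appears rarely or not at all.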

2.3. Gradient-Aware Logit Adjustment

For a training sample $(x, y)$, the unadjusted logits $z_j$ are modified by

$$\Delta_j(G) = \begin{cases} \alpha\left(\log\theta_j - \log\phi_y\right) & \text{if } j \neq y \\ 0 & \text{if } j = y \end{cases}$$

where $\alpha$ is a scale parameter tuned on validation data. The adjusted logits are then

$$\tilde z_j = z_j + \Delta_j(G)$$

and the GALA loss becomes

$$\mathcal{L}_{\rm GALA}(x, y) = -\log \frac{e^{\tilde z_y}}{\sum_{j=1}^C e^{\tilde z_j}}$$

Adding $\log\theta_j$ to the non-target logits scales each class's negative gradient according to its own positive-gradient history, while subtracting $\log\phi_y$ diminishes the influence of negative gradients originating from the true class, counterbalancing the bias introduced by imbalanced sample contributions.
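A minimal sketch of the adjusted loss for a single sample (NumPy; the function name and argument layout are illustrative assumptions):

```python
import numpy as np

def gala_loss(z, y, theta, phi, alpha=1.0):
    """Cross-entropy on gradient-adjusted logits for one sample.

    z:      raw logits, shape (C,)
    y:      true label
    theta:  accumulated positive-gradient statistics, shape (C,)
    phi:    accumulated negative-gradient statistics, shape (C,)
    """
    delta = alpha * (np.log(theta) - np.log(phi[y]))  # Delta_j for j != y
    delta[y] = 0.0                                    # no adjustment on true class
    zt = z + delta                                    # adjusted logits
    zt = zt - zt.max()                                # numerical stability
    return -np.log(np.exp(zt[y]) / np.exp(zt).sum())
```

When all statistics are equal (e.g., at initialization with $\theta_j = \phi_j = 1$), the adjustment vanishes and the loss reduces to plain cross-entropy; a competitor class with a large positive-gradient history $\theta_j$ has its logit raised, increasing the loss and hence the corrective gradient.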

3. Algorithmic Workflow

The GALA procedure per mini-batch is as follows:

  1. Feature & Logit Computation: for each batch sample $(x_i, y_i)$, compute features $x_i$ and raw logits $z_{i,j} = \omega_j^T x_i$.
  2. Logit Adjustment: for each $j \ne y_i$, set $\Delta_{i,j} \leftarrow \alpha(\log \theta_j - \log \phi_{y_i})$, with $\Delta_{i, y_i} \leftarrow 0$. Form adjusted logits $\tilde z_{i,j} \leftarrow z_{i,j} + \Delta_{i,j}$.
  3. Loss & Gradient Step: Compute

$$\mathcal{L} = -\sum_i \log \frac{e^{\tilde z_{i, y_i}}}{\sum_j e^{\tilde z_{i, j}}}$$

and update the weights $\omega_j$ via standard SGD.

  4. Compute Batch Gradient Statistics:
    • $g^+_j \leftarrow \left\|\text{average of } \partial \mathcal{L}/\partial \omega_j \text{ over } i \text{ with } y_i = j\right\|$
    • $g^-_j \leftarrow$ average over $i$ with $y_i = j$ of $\sum_{k\ne j} \|\partial \mathcal{L}/\partial \omega_k\|$
  5. Update Running Statistics:

$$\theta_j \leftarrow \mu\,\theta_j + (1-\mu)\,g^+_j, \quad \phi_j \leftarrow \mu\,\phi_j + (1-\mu)\,g^-_j$$

Key hyperparameters are the momentum $\mu$ (e.g., 0.99) and the logit-adjustment scale $\alpha$.
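The five steps above can be combined into a single per-batch routine. This self-contained sketch uses a linear classifier with the closed-form cross-entropy gradient; the function name, learning rate, and array layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gala_train_step(W, X, y, theta, phi, alpha=1.0, mu=0.99, lr=0.1):
    """One GALA mini-batch step for a linear classifier (illustrative sketch).

    W: (C, d) class weights; X: (B, d) features; y: (B,) integer labels.
    theta, phi: (C,) running gradient statistics, updated in place.
    """
    B, C = X.shape[0], W.shape[0]
    # Steps 1-2: raw logits, then gradient-aware adjustment
    z = X @ W.T                                          # (B, C)
    delta = alpha * (np.log(theta)[None, :] - np.log(phi)[y][:, None])
    delta[np.arange(B), y] = 0.0                         # Delta_{i, y_i} = 0
    zt = z + delta
    # Step 3: softmax of adjusted logits, closed-form per-sample gradients, SGD
    p = np.exp(zt - zt.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    err = p.copy()
    err[np.arange(B), y] -= 1.0                          # (p_k - 1{k = y_i})
    grads = err[:, :, None] * X[:, None, :]              # (B, C, d)
    W -= lr * grads.mean(axis=0)
    # Steps 4-5: per-class gradient statistics and momentum updates
    for j in range(C):
        idx = np.where(y == j)[0]
        if idx.size == 0:
            continue
        g_pos = np.linalg.norm(grads[idx, j].mean(axis=0))
        norms = np.linalg.norm(grads[idx], axis=2)
        g_neg = (norms.sum(axis=1) - norms[:, j]).mean()
        theta[j] = mu * theta[j] + (1 - mu) * g_pos
        phi[j]   = mu * phi[j]   + (1 - mu) * g_neg
    return W, theta, phi
```

In a deep network the per-sample gradients on the classifier weights are available analytically in the same way, since the classification head is linear in the features.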

4. Post-hoc Prediction Re-balancing

Despite GALA training, models may retain a head-class prediction bias. A post-hoc normalization procedure is therefore applied to the matrix of test-set probability predictions $P = [p_{i,j}]$ of size $B \times C$.

The column $p_{*j} \in \mathbb{R}^B$ for each class $j$ is re-scaled:

$$\widetilde p_{*j} = \frac{p_{*j}}{\|p_{*j}\|_1^{\tau}}, \quad j = 1, \ldots, C$$

where the temperature $\tau$ interpolates between the raw predictions ($\tau = 0$) and fully $L_1$-normalized columns ($\tau = 1$). Predictions are finalized as $\arg\max_j \widetilde p_{i,j}$ for each test example.

When $\tau = 1$, every class receives equal total prediction mass; at $\tau = 0$, the predictions are unchanged.
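The re-balancing step is a few lines of NumPy; the prediction matrix below is invented for illustration:

```python
import numpy as np

def rebalance_predictions(P, tau=1.0):
    """Column-wise L1 re-normalization of test predictions P (B x C).

    tau = 0 leaves P unchanged; tau = 1 gives each class equal total mass.
    """
    col_mass = P.sum(axis=0)                 # L1 norm of each (non-negative) column
    return P / (col_mass ** tau)[None, :]

# Toy head-biased prediction matrix: class 0 dominates every row.
P = np.array([[0.8, 0.1, 0.1],
              [0.7, 0.2, 0.1],
              [0.6, 0.1, 0.3]])
preds = rebalance_predictions(P, tau=1.0).argmax(axis=1)
print(preds)  # [0 1 2] -- raw argmax would pick class 0 for every row
```

The effect is visible even in this tiny example: dividing each column by its total mass penalizes the over-predicted head class, so rows whose second- or third-largest raw score belongs to a rarely predicted class flip to that class.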

5. Empirical Performance

GALA and its prediction re-balancing variant are evaluated against other methods, notably the gradient-corrective loss (GCL), on established long-tailed benchmarks. The experimental results are as follows:

Dataset          Imbalance Factor   Baseline (CE)   GCL      GALA     GALA + Post-norm
CIFAR100-LT      100                38.43%          48.71%   52.10%   52.30%
Places-LT        ~498               –               40.64%   41.00%   41.40%
iNaturalist2018  ~500               –               72.10%   71.20%   73.30%
ImageNet-LT      –                  –               –        54.8%    55.0%

(Entries marked – were not reported in the source.)

GALA with post-hoc normalization outperforms GCL by margins of up to 3.59% on CIFAR100-LT and 1.20% on iNaturalist2018. This indicates that per-class accumulated gradient statistics for logit adjustment, combined with post-hoc normalization, improve top-1 accuracy on severely imbalanced datasets (Zhang et al., 2024).

6. Context within Long-Tailed Learning Research

GALA extends frequency-based logit adjustment strategies by harnessing signal from gradient statistics generated during optimization, enabling dynamic correction of gradient imbalance that is not captured by static sample counts. The approach operates entirely within the standard classification architecture and training regime, requiring only per-class momentum variables and no architectural modifications.

A key observation is that, even with adaptive logit adjustment, prediction bias toward head classes may persist after training, necessitating the post-hoc $L_1$-based normalization strategy. This suggests that some sources of bias in long-tailed recognition remain unaddressed by loss reweighting or margin-based methods alone.

GALA’s empirical performance across multiple long-tailed benchmarks provides evidence supporting gradient-aware mechanisms as a valuable direction for future long-tailed and imbalanced learning research.
