Gradient-Aware Logit Adjustment (GALA)
- Gradient-Aware Logit Adjustment (GALA) is a loss formulation for imbalanced, long-tailed classification that dynamically adjusts logits using per-class gradient statistics.
- It modifies training by incorporating running averages of positive and negative gradient magnitudes to counteract head-class bias in deep networks.
- A post-hoc L1 normalization strategy further re-balances prediction scores, achieving notable top-1 accuracy improvements on benchmarks like CIFAR100-LT and iNaturalist2018.
Gradient-Aware Logit Adjustment (GALA) is a loss formulation designed for long-tailed classification, where the data distribution is highly imbalanced across classes. In such scenarios, conventional deep classifiers exhibit substantial bias toward head classes because their weights experience disproportionately large positive and negative gradients during training. GALA counters this by dynamically adjusting logits according to per-class accumulated gradient statistics, thus achieving more balanced optimization. Post-training, a test-time normalization strategy can further mitigate residual head-class bias, leading to top-1 accuracy improvements on benchmarks including CIFAR100-LT, Places-LT, and iNaturalist2018 (Zhang et al., 2024).
1. Long-Tailed Classification and Gradient Imbalance
A long-tailed classification setting is defined over \(K\) classes, with class \(j\) possessing \(n_j\) training samples. The degree of imbalance is quantified via the imbalance factor (IF):

\[ \mathrm{IF} = \frac{\max_j n_j}{\min_j n_j}. \]
Deep networks operating in this regime produce features \(f \in \mathbb{R}^d\) and logits

\[ z_j = w_j^\top f + b_j, \]

where \(w_j\) is the \(j\)-th class weight vector and \(b_j\) is a bias.
Standard training with cross-entropy loss

\[ \mathcal{L}_{\mathrm{CE}} = -\log p_y, \qquad p_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \]

yields per-class logit gradients

\[ \frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_j} = p_j - \mathbb{1}[j = y]. \]
Due to skewed sample counts, class weights for head classes accumulate much larger magnitudes of both positive gradients (from in-class samples) and negative gradients (from other classes). In contrast, tail-class weights experience overwhelming negative gradients originating from frequent head-class samples, exacerbating prediction bias.
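The skewed accumulation described above can be illustrated with a small NumPy sketch (a toy construction for illustration, not code from the original paper): per-sample cross-entropy logit gradients \(p_j - \mathbb{1}[j = y]\) are computed for an imbalanced batch, then split into the positive pull each class weight receives from its own samples and the negative push it receives from all other classes.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ce_logit_gradients(logits, labels):
    """Per-sample CE gradients w.r.t. logits: dL/dz_j = p_j - 1[j == y]."""
    p = softmax(logits)
    grad = p.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    return grad

# Hypothetical 3-class toy batch with a strong head/tail skew in the labels.
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 4 + [2] * 1)   # head, medium, tail
logits = rng.normal(size=(len(labels), 3))

g = ce_logit_gradients(logits, labels)
# Accumulated positive gradient (in-class pull) vs. negative gradient (out-of-class push)
pos = np.array([-g[labels == j, j].sum() for j in range(3)])
neg = np.array([g[labels != j, j].sum() for j in range(3)])
print("positive pull per class:", pos)   # large for head class 0, tiny for tail class 2
print("negative push per class:", neg)   # tail class 2 is pushed down by 24 foreign samples
```

Even with random logits, the head class accumulates far more positive gradient than the tail class, while the tail class is dominated by negative gradient from foreign samples, which is exactly the imbalance GALA targets.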
2. GALA Loss Formulation
2.1. From Frequency-based to Gradient-based Adjustment
Prior work (Menon et al., 2021) compensates for imbalance by augmenting logits with terms proportional to \(\log \pi_j\), the log-frequency of each class \(j\). GALA generalizes this idea: rather than relying on static class-size statistics, GALA uses dynamically accumulated gradient magnitudes for logit adjustment.
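The frequency-based baseline is simple to state in code. A minimal NumPy sketch (the function name and toy counts are illustrative) adds \(\tau \log \pi_j\) to each logit, where \(\pi_j\) is the empirical class prior:

```python
import numpy as np

def freq_logit_adjust(logits, class_counts, tau=1.0):
    """Frequency-based logit adjustment (Menon et al., 2021):
    add tau * log(pi_j) to each logit, pi_j being the class prior."""
    pi = np.asarray(class_counts, dtype=float)
    pi /= pi.sum()
    return logits + tau * np.log(pi)

# Static priors from a hypothetical 100:10:1 long-tailed training set.
counts = [1000, 100, 10]
logits = np.array([[1.0, 1.0, 1.0]])
adjusted = freq_logit_adjust(logits, counts)
print(adjusted)  # head logit raised most inside the loss, enlarging tail margins
```

Because \(\pi_j\) is fixed before training, this adjustment cannot react to how gradients actually accumulate during optimization; GALA's statistics below are designed to fill that gap.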
2.2. Tracking Accumulated Gradient Magnitudes
GALA maintains two momentum-driven statistics for each class \(j\):
- \(\theta_j\) – running average of positive gradient magnitudes accumulated on \(w_j\) from samples for which \(j\) is the correct label
- \(\phi_j\) – running average of negative gradient magnitudes accumulated on \(w_j\) from samples of all other classes
The statistics are updated per mini-batch as follows, with momentum parameter \(\mu \in [0, 1)\):

\[ \theta_j \leftarrow \mu\,\theta_j + (1-\mu)\,g_j^{+}, \qquad g_j^{+} = \frac{1}{|B_j|} \sum_{i \in B_j} \bigl(1 - p_{i,j}\bigr), \]

\[ \phi_j \leftarrow \mu\,\phi_j + (1-\mu)\,g_j^{-}, \qquad g_j^{-} = \frac{1}{|B \setminus B_j|} \sum_{i \in B \setminus B_j} p_{i,j}, \]

where \(B_j\) is the set of samples in the current batch \(B\) with label \(j\).
2.3. Gradient-Aware Logit Adjustment
For a training sample \((x, y)\), unadjusted logits \(z_j\) are modified per

\[ \delta_j = \tau \bigl( \log \theta_j - \log \phi_j \bigr), \]

where \(\tau > 0\) is a scale parameter tuned on validation data. Adjusted logits are then

\[ z'_j = \begin{cases} z_j, & j = y, \\ z_j + \delta_j, & j \neq y, \end{cases} \]

and the GALA loss becomes

\[ \mathcal{L}_{\mathrm{GALA}} = -\log \frac{e^{z_y}}{e^{z_y} + \sum_{j \neq y} e^{z_j + \delta_j}}. \]

Adding \(\tau \log \theta_j\) to the off-diagonal (non-target) terms scales the negative gradient sent to each class in proportion to that class's own positive-gradient history, so head classes that have already been pulled up strongly are suppressed more. Subtracting \(\tau \log \phi_j\) diminishes the negative gradient sent to classes that have already absorbed heavy negative gradients (predominantly tail classes pushed down by frequent head-class samples), counterbalancing bias from imbalanced sample contributions.
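A minimal NumPy sketch of this loss, assuming the reconstructed form \(z'_j = z_j + \tau(\log\theta_j - \log\phi_j)\) for \(j \neq y\) (the function name, toy \(\theta/\phi\) values, and example logits are all illustrative, not from the paper):

```python
import numpy as np

def gala_loss(logits, labels, theta, phi, tau=1.0, eps=1e-12):
    """GALA loss sketch: shift each non-target logit by tau*(log theta_j - log phi_j),
    then apply standard cross-entropy. theta/phi are per-class gradient statistics."""
    n = len(labels)
    delta = tau * (np.log(theta + eps) - np.log(phi + eps))       # shape (K,)
    adj = logits + delta                                          # broadcast over classes
    adj[np.arange(n), labels] = logits[np.arange(n), labels]      # target logit unadjusted
    # numerically stable log-softmax cross-entropy on adjusted logits
    adj = adj - adj.max(axis=1, keepdims=True)
    log_p = adj - np.log(np.exp(adj).sum(axis=1, keepdims=True))
    return -log_p[np.arange(n), labels].mean()

# Toy check: a head class with large positive-gradient history (theta) is boosted
# inside the loss, so a tail-class sample is penalized harder than under plain CE,
# which enlarges the effective margin protecting the tail class.
theta = np.array([0.9, 0.5, 0.05])   # head -> tail (hypothetical statistics)
phi   = np.array([0.1, 0.3, 0.8])
logits = np.array([[2.0, 1.0, 1.0]])
labels = np.array([2])               # tail-class sample
print(gala_loss(logits, labels, theta, phi, tau=1.0))
```

With \(\tau = 0\) the adjustment vanishes and the function reduces to plain cross-entropy, which makes the sketch easy to sanity-check.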
3. Algorithmic Workflow
The GALA procedure per mini-batch is as follows:
- Feature & Logit Computation: For each batch sample \(x_i\), compute features \(f_i\) and raw logits \(z_{i,j} = w_j^\top f_i + b_j\).
- Logit Adjustment: For each \(j \neq y_i\), set \(\delta_{i,j} = \tau(\log \theta_j - \log \phi_j)\); set \(\delta_{i,y_i} = 0\). Set adjusted logits \(z'_{i,j} = z_{i,j} + \delta_{i,j}\).
- Loss & Gradient Step: Compute

  \[ \mathcal{L}_{\mathrm{GALA}} = -\frac{1}{|B|} \sum_{i \in B} \log \frac{e^{z'_{i,y_i}}}{\sum_{k} e^{z'_{i,k}}} \]

  and update via standard SGD.
- Compute Batch Gradient Statistics:
  - \(g_j^{+}\) – average of \(1 - p_{i,j}\) over samples \(i \in B_j\) (those with \(y_i = j\))
  - \(g_j^{-}\) – average of \(p_{i,j}\) over samples \(i \in B \setminus B_j\) (those with \(y_i \neq j\))
- Update Running Stats:

  \[ \theta_j \leftarrow \mu\,\theta_j + (1-\mu)\,g_j^{+}, \qquad \phi_j \leftarrow \mu\,\phi_j + (1-\mu)\,g_j^{-} \]
Key hyperparameters are the momentum \(\mu\) (e.g., 0.99) and the logit-adjustment scale \(\tau\).
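The full per-batch workflow can be sketched end to end in NumPy for a linear classifier (a hedged toy implementation: the class name, initialization, learning rate, and toy data are all assumptions for illustration, and the adjustment follows the reconstruction used above):

```python
import numpy as np

class GALATrainer:
    """Sketch of the GALA per-batch workflow: linear classifier (w, b),
    per-class EMA stats theta (positive grads) / phi (negative grads),
    adjusted-logit cross-entropy, and a plain SGD step."""

    def __init__(self, dim, num_classes, tau=1.0, mu=0.99, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = 0.01 * rng.normal(size=(num_classes, dim))
        self.b = np.zeros(num_classes)
        self.theta = np.ones(num_classes)   # positive-gradient EMA, neutral init
        self.phi = np.ones(num_classes)     # negative-gradient EMA, neutral init
        self.tau, self.mu, self.lr = tau, mu, lr
        self.k = num_classes

    def step(self, feats, labels):
        n = len(labels)
        z = feats @ self.w.T + self.b                          # raw logits
        delta = self.tau * (np.log(self.theta) - np.log(self.phi))
        z_adj = z + delta                                      # adjust all classes...
        z_adj[np.arange(n), labels] = z[np.arange(n), labels]  # ...but keep target logit
        p = np.exp(z_adj - z_adj.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # CE gradient on adjusted logits: p_j - 1[j = y]
        g = p.copy()
        g[np.arange(n), labels] -= 1.0
        self.w -= self.lr * (g.T @ feats) / n
        self.b -= self.lr * g.mean(axis=0)
        # batch gradient-magnitude statistics, then EMA update
        for j in range(self.k):
            in_j = labels == j
            if in_j.any():
                g_pos = (1.0 - p[in_j, j]).mean()
                self.theta[j] = self.mu * self.theta[j] + (1 - self.mu) * g_pos
            if (~in_j).any():
                g_neg = p[~in_j, j].mean()
                self.phi[j] = self.mu * self.phi[j] + (1 - self.mu) * g_neg
        return -np.log(p[np.arange(n), labels] + 1e-12).mean()

# Hypothetical imbalanced toy run: class 0 is the head, class 2 the tail.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], [50, 10, 2])
feats = rng.normal(size=(len(labels), 5)) + labels[:, None]
trainer = GALATrainer(dim=5, num_classes=3, mu=0.9)
losses = [trainer.step(feats, labels) for _ in range(20)]
print("final loss:", losses[-1])
```

Note the ordering: the statistics consumed by the adjustment in each step are those accumulated from previous batches, and they are refreshed only after the gradient step.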
4. Post-hoc Prediction Re-balancing
Despite GALA training, models may retain a head-class prediction bias. A post-hoc normalization procedure is applied to the matrix of test-set probability predictions \(P \in \mathbb{R}^{N \times K}\) (for \(N\) test examples and \(K\) classes). The column \(P_{:,j}\) for each class \(j\) is re-scaled:

\[ \tilde{P}_{:,j} = \frac{P_{:,j}}{\lVert P_{:,j} \rVert_1^{\gamma}}, \]

where the temperature \(\gamma \in [0, 1]\) interpolates between raw predictions (\(\gamma = 0\)) and fully \(L_1\)-normalized columns (\(\gamma = 1\)). Predictions are finalized as \(\hat{y}_i = \arg\max_j \tilde{P}_{i,j}\) for each test example \(i\). When \(\gamma = 1\), all classes are assigned equal total prediction mass; at \(\gamma = 0\), there is no adjustment.
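This column re-scaling is a one-liner in NumPy. In the sketch below (function name and toy prediction matrix are illustrative), a head-biased prediction matrix flips two borderline test points toward the tail class once the head column's large \(L_1\) mass is divided out:

```python
import numpy as np

def posthoc_rebalance(P, gamma=1.0):
    """Post-hoc re-balancing sketch: divide each class column of the
    N x K prediction matrix P (rows are softmax probabilities) by its
    L1 mass raised to gamma. gamma=0 leaves P unchanged; gamma=1 gives
    every class equal total prediction mass."""
    col_mass = P.sum(axis=0)        # L1 norm of each column (entries are >= 0)
    return P / (col_mass ** gamma)

# Hypothetical head-biased predictions over 4 test points, 2 classes.
P = np.array([[0.90, 0.10],
              [0.80, 0.20],
              [0.70, 0.30],
              [0.55, 0.45]])
raw_pred = P.argmax(axis=1)                              # head class wins everywhere
reb_pred = posthoc_rebalance(P, gamma=1.0).argmax(axis=1)
print(raw_pred, reb_pred)
```

Only the argmax per row changes; no retraining or access to the model is needed, which is what makes the strategy post-hoc.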
5. Empirical Performance
GALA and its prediction re-balancing variant are evaluated against other methods, notably the Gaussian clouded logit (GCL) loss, on established long-tailed benchmarks. The experimental results are as follows:
| Dataset | Imbalance Factor | Baseline (CE) | GCL | GALA | GALA + Post-norm |
|---|---|---|---|---|---|
| CIFAR100-LT | 100 | 38.43% | 48.71% | 52.10% | 52.30% |
| Places-LT | ~498 | — | 40.64% | 41.00% | 41.40% |
| iNaturalist2018 | ~500 | — | 72.10% | 71.20% | 73.30% |
| ImageNet-LT | — | — | 54.8% | — | 55.0% |
With post-hoc normalization, GALA outperforms GCL by 3.59 percentage points on CIFAR100-LT and 1.20 points on iNaturalist2018 (where GALA alone trails GCL by 0.90 points). This indicates that per-class accumulated gradient statistics for logit adjustment, combined with subsequent post-hoc normalization, lead to improved top-1 accuracy on severely imbalanced datasets (Zhang et al., 2024).
6. Context within Long-Tailed Learning Research
GALA extends frequency-based logit adjustment strategies by harnessing signal from gradient statistics generated during optimization, enabling dynamic correction of gradient imbalance that is not captured by static sample counts. The approach operates entirely within the standard classification architecture and training regime, requiring only per-class momentum variables and no architectural modifications.
A key observation is that, even with adaptive logit adjustment, prediction bias toward head classes may persist post-training, necessitating the introduction of the post-hoc \(\gamma\)-based normalization strategy. This suggests that sources of bias in long-tailed recognition may remain partially unaddressed by loss reweighting or margin-based methods alone.
GALA’s empirical performance across multiple long-tailed benchmarks provides evidence supporting gradient-aware mechanisms as a valuable direction for future long-tailed and imbalanced learning research.