
Gradient-Aware Logit Adjustment (GALA)

Updated 10 February 2026
  • Gradient-Aware Logit Adjustment (GALA) is a loss formulation for imbalanced, long-tailed classification that dynamically adjusts logits using per-class gradient statistics.
  • It modifies training by incorporating running averages of positive and negative gradient magnitudes to counteract head-class bias in deep networks.
  • A post-hoc L1 normalization strategy further re-balances prediction scores, achieving notable top-1 accuracy improvements on benchmarks like CIFAR100-LT and iNaturalist2018.

Gradient-Aware Logit Adjustment (GALA) is a loss formulation designed for long-tailed classification, where the data distribution is highly imbalanced across classes. In such scenarios, conventional deep classifiers exhibit substantial bias toward head classes because their weights experience disproportionately large positive and negative gradients during training. GALA counters this by dynamically adjusting logits according to per-class accumulated gradient statistics, thus achieving more balanced optimization. Post-training, a test-time normalization strategy can further mitigate residual head-class bias, leading to top-1 accuracy improvements on benchmarks including CIFAR100-LT, Places-LT, and iNaturalist2018 (Zhang et al., 2024).

1. Long-Tailed Classification and Gradient Imbalance

A long-tailed classification setting is defined over $C$ classes, with each class $i$ possessing $n_i$ training samples. The degree of imbalance is quantified via the imbalance factor (IF):

$$\text{IF} = \frac{\max_i n_i}{\min_i n_i}$$
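As a quick illustration, the imbalance factor can be computed directly from per-class sample counts; the sketch below uses an invented toy label set, not data from the paper:

```python
from collections import Counter

def imbalance_factor(labels):
    """Imbalance factor: most-frequent class count divided by least-frequent."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy long-tailed labels: class 0 is a head class, class 2 a tail class.
labels = [0] * 100 + [1] * 10 + [2] * 2
print(imbalance_factor(labels))  # 100 / 2 = 50.0
```

CIFAR100-LT with IF = 100, for example, keeps 100x more samples of its most frequent class than of its rarest.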

Deep networks operating in this regime produce features $x \in \mathbb{R}^d$ and logits

$$z_j = \omega_j^T x + b_j, \quad j = 1, \ldots, C$$

where $\omega_j$ is the $j$th class weight vector and $b_j$ is a bias term.

Standard training with cross-entropy loss

$$\mathcal{L}_{\rm CE}(x, y) = -\log \frac{e^{z_y}}{\sum_{j=1}^C e^{z_j}}$$

yields per-class gradients

$$\frac{\partial \mathcal{L}_{\rm CE}}{\partial \omega_j} = (p_j - \mathbf{1}\{j=y\})\,x, \quad p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

Due to skewed sample counts, class weights for head classes accumulate much larger magnitudes of both positive gradients (from in-class samples) and negative gradients (from other classes). In contrast, tail-class weights experience overwhelming negative gradients originating from frequent head-class samples, exacerbating prediction bias.
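The closed-form gradient above is easy to verify numerically. The following NumPy sketch (toy dimensions and random data, not from the paper) compares $(p_j - \mathbf{1}\{j=y\})\,x$ against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 4, 3                      # number of classes, feature dimension (toy sizes)
W = rng.normal(size=(C, d))      # class weight vectors omega_j as rows
x = rng.normal(size=d)           # feature vector
y = 2                            # true label

def ce_loss(W):
    z = W @ x                    # logits z_j = omega_j^T x
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y])

# Closed-form gradient: dL/d omega_j = (p_j - 1{j = y}) x
z = W @ x
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
grad = (p - np.eye(C)[y])[:, None] * x[None, :]

# Central finite-difference estimate, entry by entry
eps = 1e-6
num = np.zeros_like(W)
for j in range(C):
    for k in range(d):
        Wp, Wm = W.copy(), W.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        num[j, k] = (ce_loss(Wp) - ce_loss(Wm)) / (2 * eps)

print(np.abs(grad - num).max())  # near zero, up to finite-difference error
```

Note that the gradient rows for $j \neq y$ all point along $+x$ scaled by $p_j$; with many head-class samples, these accumulate into the large negative pressure on tail-class weights described above.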

2. GALA Loss Formulation

2.1. From Frequency-based to Gradient-based Adjustment

Prior work (Menon et al., 2021) compensates for imbalance by augmenting logits with terms proportional to $\log n_j$, the log-frequency of each class. GALA generalizes this idea: rather than relying on static class-size statistics, it uses dynamically accumulated gradient magnitudes for logit adjustment.

2.2. Tracking Accumulated Gradient Magnitudes

GALA maintains two momentum-driven statistics for each class $j$:

  • $\theta_j$ – running average of accumulated positive gradient magnitudes (from samples for which $j$ is the correct label)
  • $\phi_j$ – running average of accumulated negative gradient magnitudes that class-$j$ samples induce on all other class weights

The statistics are updated per mini-batch as follows, with momentum parameter $\mu \in [0,1)$:

$$\theta_j \leftarrow \mu\,\theta_j + (1-\mu)\left\|\frac{1}{|B_j|}\sum_{i\in B_j} \frac{\partial\mathcal{L}}{\partial\omega_j}(x_i,y_i)\right\|$$

$$\phi_j \leftarrow \mu\,\phi_j + (1-\mu)\,\frac{1}{|B_j|}\sum_{i\in B_j} \sum_{k\neq j} \left\|\frac{\partial\mathcal{L}}{\partial\omega_k}(x_i,y_i)\right\|$$

where $B_j$ is the set of samples in the current batch with label $j$.
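The two updates can be sketched as follows (NumPy; the function name, array layout, and batch handling are illustrative choices, not the authors' reference implementation):

```python
import numpy as np

def update_gradient_stats(theta, phi, grads, labels, mu=0.99):
    """Update per-class running gradient statistics in place.

    theta[j]: running norm of the mean positive gradient on omega_j
    phi[j]:   running mean of negative-gradient norms induced by class-j samples
    grads:    array of shape (B, C, d), grads[i, k] = dL/d omega_k for sample i
    labels:   array of shape (B,) with class labels y_i
    """
    C = theta.shape[0]
    for j in range(C):
        idx = np.where(labels == j)[0]          # batch subset B_j with label j
        if idx.size == 0:
            continue                            # class absent from this batch
        # Positive statistic: norm of the averaged gradient on omega_j
        g_pos = np.linalg.norm(grads[idx, j].mean(axis=0))
        # Negative statistic: mean over B_j of summed norms on omega_k, k != j
        norms = np.linalg.norm(grads[idx], axis=2)          # shape (|B_j|, C)
        g_neg = (norms.sum(axis=1) - norms[:, j]).mean()
        theta[j] = mu * theta[j] + (1 - mu) * g_pos
        phi[j]   = mu * phi[j]   + (1 - mu) * g_neg
    return theta, phi
```

With a high momentum such as $\mu = 0.99$, $\theta_j$ and $\phi_j$ change slowly, smoothing over batches in which a tail class appears rarely or not at all.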

2.3. Gradient-Aware Logit Adjustment

For a training sample $(x, y)$, the unadjusted logits $z_j$ are modified by

$$\Delta_j(G) = \begin{cases} \alpha\left(\log\theta_j - \log\phi_y\right) & \text{if } j \neq y \\ 0 & \text{if } j = y \end{cases}$$

where $\alpha$ is a scale parameter tuned on validation data. The adjusted logits are then

$$\tilde z_j = z_j + \Delta_j(G)$$

and the GALA loss becomes

$$\mathcal{L}_{\rm GALA}(x, y) = -\log \frac{e^{\tilde z_y}}{\sum_{j=1}^C e^{\tilde z_j}}$$

Adding $\log\theta_j$ to the non-target logits scales each class's negative gradient according to its own positive-gradient history, while subtracting $\log\phi_y$ diminishes the influence of negative gradients originating from the true class, counterbalancing the bias introduced by imbalanced sample contributions.
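A minimal sketch of the adjusted loss for a single sample (NumPy; the function name and argument layout are illustrative assumptions):

```python
import numpy as np

def gala_loss(z, y, theta, phi, alpha=1.0):
    """Cross-entropy on gradient-adjusted logits for one sample.

    z:      raw logits, shape (C,)
    y:      true label
    theta:  accumulated positive-gradient statistics, shape (C,)
    phi:    accumulated negative-gradient statistics, shape (C,)
    """
    delta = alpha * (np.log(theta) - np.log(phi[y]))  # Delta_j for j != y
    delta[y] = 0.0                                    # no adjustment on true class
    zt = z + delta                                    # adjusted logits
    zt = zt - zt.max()                                # numerical stability
    return -np.log(np.exp(zt[y]) / np.exp(zt).sum())
```

When all statistics are equal (e.g., at initialization with $\theta_j = \phi_j = 1$), the adjustment vanishes and the loss reduces to plain cross-entropy; a competitor class with a large positive-gradient history $\theta_j$ has its logit raised, increasing the loss and hence the corrective gradient.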

3. Algorithmic Workflow

The GALA procedure per mini-batch is as follows:

  1. Feature & Logit Computation: for each batch sample $(x_i, y_i)$, compute features $x_i$ and raw logits $z_{i,j} = \omega_j^T x_i$.
  2. Logit Adjustment: for each $j \ne y_i$, set $\Delta_{i,j} \leftarrow \alpha(\log \theta_j - \log \phi_{y_i})$, with $\Delta_{i, y_i} \leftarrow 0$. Form adjusted logits $\tilde z_{i,j} \leftarrow z_{i,j} + \Delta_{i,j}$.
  3. Loss & Gradient Step: Compute

$$\mathcal{L} = -\sum_i \log \frac{e^{\tilde z_{i, y_i}}}{\sum_j e^{\tilde z_{i, j}}}$$

and update the weights $\omega_j$ via standard SGD.

  4. Compute Batch Gradient Statistics:
    • $g^+_j \leftarrow \left\|\text{average of } \partial \mathcal{L}/\partial \omega_j \text{ over } i \text{ with } y_i = j\right\|$
    • $g^-_j \leftarrow$ average over $i$ with $y_i = j$ of $\sum_{k\ne j} \|\partial \mathcal{L}/\partial \omega_k\|$
  5. Update Running Statistics:

$$\theta_j \leftarrow \mu\,\theta_j + (1-\mu)\,g^+_j, \quad \phi_j \leftarrow \mu\,\phi_j + (1-\mu)\,g^-_j$$

Key hyperparameters are the momentum $\mu$ (e.g., 0.99) and the logit-adjustment scale $\alpha$.
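The five steps above can be combined into a single per-batch routine. This self-contained sketch uses a linear classifier with the closed-form cross-entropy gradient; the function name, learning rate, and array layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gala_train_step(W, X, y, theta, phi, alpha=1.0, mu=0.99, lr=0.1):
    """One GALA mini-batch step for a linear classifier (illustrative sketch).

    W: (C, d) class weights; X: (B, d) features; y: (B,) integer labels.
    theta, phi: (C,) running gradient statistics, updated in place.
    """
    B, C = X.shape[0], W.shape[0]
    # Steps 1-2: raw logits, then gradient-aware adjustment
    z = X @ W.T                                          # (B, C)
    delta = alpha * (np.log(theta)[None, :] - np.log(phi)[y][:, None])
    delta[np.arange(B), y] = 0.0                         # Delta_{i, y_i} = 0
    zt = z + delta
    # Step 3: softmax of adjusted logits, closed-form per-sample gradients, SGD
    p = np.exp(zt - zt.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    err = p.copy()
    err[np.arange(B), y] -= 1.0                          # (p_k - 1{k = y_i})
    grads = err[:, :, None] * X[:, None, :]              # (B, C, d)
    W -= lr * grads.mean(axis=0)
    # Steps 4-5: per-class gradient statistics and momentum updates
    for j in range(C):
        idx = np.where(y == j)[0]
        if idx.size == 0:
            continue
        g_pos = np.linalg.norm(grads[idx, j].mean(axis=0))
        norms = np.linalg.norm(grads[idx], axis=2)
        g_neg = (norms.sum(axis=1) - norms[:, j]).mean()
        theta[j] = mu * theta[j] + (1 - mu) * g_pos
        phi[j]   = mu * phi[j]   + (1 - mu) * g_neg
    return W, theta, phi
```

In a deep network the per-sample gradients on the classifier weights are available analytically in the same way, since the classification head is linear in the features.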

4. Post-hoc Prediction Re-balancing

Despite GALA training, models may retain a head-class prediction bias. A post-hoc normalization procedure is therefore applied to the matrix of test-set probability predictions $P = [p_{i,j}]$ of size $B \times C$.

The column $p_{*j} \in \mathbb{R}^B$ for each class $j$ is re-scaled:

$$\widetilde p_{*j} = \frac{p_{*j}}{\|p_{*j}\|_1^{\tau}}, \quad j = 1, \ldots, C$$

where the temperature $\tau$ interpolates between the raw predictions ($\tau = 0$) and fully $L_1$-normalized columns ($\tau = 1$). Predictions are finalized as $\arg\max_j \widetilde p_{i,j}$ for each test example.

When $\tau = 1$, every class receives equal total prediction mass; at $\tau = 0$, the predictions are unchanged.
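The re-balancing step is a few lines of NumPy; the prediction matrix below is invented for illustration:

```python
import numpy as np

def rebalance_predictions(P, tau=1.0):
    """Column-wise L1 re-normalization of test predictions P (B x C).

    tau = 0 leaves P unchanged; tau = 1 gives each class equal total mass.
    """
    col_mass = P.sum(axis=0)                 # L1 norm of each (non-negative) column
    return P / (col_mass ** tau)[None, :]

# Toy head-biased prediction matrix: class 0 dominates every row.
P = np.array([[0.8, 0.1, 0.1],
              [0.7, 0.2, 0.1],
              [0.6, 0.1, 0.3]])
preds = rebalance_predictions(P, tau=1.0).argmax(axis=1)
print(preds)  # [0 1 2] -- raw argmax would pick class 0 for every row
```

The effect is visible even in this tiny example: dividing each column by its total mass penalizes the over-predicted head class, so rows whose second- or third-largest raw score belongs to a rarely predicted class flip to that class.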

5. Empirical Performance

GALA and its prediction re-balancing variant are evaluated against other methods, notably the gradient-corrective loss (GCL), on established long-tailed benchmarks. The experimental results are as follows:

Dataset          Imbalance Factor   Baseline (CE)   GCL      GALA     GALA + Post-norm
CIFAR100-LT      100                38.43%          48.71%   52.10%   52.30%
Places-LT        ~498               –               40.64%   41.00%   41.40%
iNaturalist2018  ~500               –               72.10%   71.20%   73.30%
ImageNet-LT      –                  –               –        54.8%    55.0%

(Entries marked – were not reported in the source.)

GALA with post-hoc normalization outperforms GCL by margins of up to 3.59% on CIFAR100-LT and 1.20% on iNaturalist2018. This indicates that per-class accumulated gradient statistics for logit adjustment, combined with post-hoc normalization, improve top-1 accuracy on severely imbalanced datasets (Zhang et al., 2024).

6. Context within Long-Tailed Learning Research

GALA extends frequency-based logit adjustment strategies by harnessing signal from gradient statistics generated during optimization, enabling dynamic correction of gradient imbalance that is not captured by static sample counts. The approach operates entirely within the standard classification architecture and training regime, requiring only per-class momentum variables and no architectural modifications.

A key observation is that, even with adaptive logit adjustment, prediction bias toward head classes may persist after training, necessitating the post-hoc $L_1$-based normalization strategy. This suggests that some sources of bias in long-tailed recognition remain unaddressed by loss reweighting or margin-based methods alone.

GALA’s empirical performance across multiple long-tailed benchmarks provides evidence supporting gradient-aware mechanisms as a valuable direction for future long-tailed and imbalanced learning research.
