
Prevalence-Adjusted Softmax (PAS) Score

Updated 9 February 2026
  • Prevalence-Adjusted Softmax (PAS) Score is a technique that adjusts raw logits using estimated class priors to mitigate biases in imbalanced data.
  • It incorporates a tunable parameter and a sliding-window estimator to balance sensitivity and stability, improving model performance in continual learning.
  • Empirical results on benchmarks like CIFAR-10 and CIFAR-100 highlight significant accuracy improvements with negligible computational cost.

The Prevalence-Adjusted Softmax (PAS) Score, also referred to as Logit-Adjusted Softmax, is a method designed to address class-prior imbalance in neural network classifiers, particularly in the context of online continual learning. The approach is grounded in statistical theory, providing a principled corrective to the biases that arise when class distributions shift or are non-uniform over training. PAS works by modifying the softmax logits based on estimates of class prevalence and introduces a tunable mechanism to control the strength of this adjustment, thereby offering a versatile solution with minimal computational overhead (Huang et al., 2023).

1. Class-Prior Imbalance in Softmax Classifiers

Standard softmax classifiers for multiclass prediction are typically optimized via cross-entropy loss on raw logits $z_k(x)$ for each class $k$:

$$p(y=k\mid x) = \frac{\exp\bigl(z_k(x)\bigr)}{\sum_j \exp\bigl(z_j(x)\bigr)}.$$

When classes appear with imbalanced frequencies (e.g., some "head" classes much more common than "tail" classes), the learned logits are biased towards frequent classes. This bias leads models to over-predict head classes and under-predict tail classes. In continual learning, this phenomenon manifests as "recency bias": as new classes dominate the data stream, the model's predictions become increasingly skewed toward recently encountered classes, causing catastrophic forgetting of earlier ones. This challenge is fundamentally an issue of drift in the underlying class-prior probabilities $\pi_k = \mathbb{P}(y=k)$ (Huang et al., 2023).
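The head-class skew can be seen in a toy softmax computation. In this sketch (pure Python, illustrative numbers only), two classes carry identical class-conditional evidence, but the learned logits absorb a log-prior bias toward the head class:

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

phi = [1.0, 1.0]                      # "pure" class-conditional logits: a tie
priors = [0.9, 0.1]                   # head vs. tail prevalence
biased = [p + math.log(pi) for p, pi in zip(phi, priors)]

print(softmax(phi))     # evidence alone says 50/50
print(softmax(biased))  # skewed 90/10 toward the head class
```

With equal evidence, the prior-contaminated logits alone push the predicted probability of the head class from 0.5 to 0.9.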

2. Bayes-Optimal Classification under Non-Uniform Priors

Bayesian decision theory dictates that optimal classification incorporates both class-conditional and class-prior probabilities. For a data-generating process described by $\mathbb{P}(x, y)$, with class priors $\pi_k$ and conditionals $p(x\mid y=k)$,

$$p(y=k\mid x) \propto p(x\mid y=k)\,\pi_k.$$

Suppose a model could output "pure" class-conditional logits $\Phi_k(x)\approx\ln p(x\mid y=k)$; then the posterior is

$$p(y=k\mid x) \propto \exp\bigl(\Phi_k(x) + \ln\pi_k\bigr).$$

The typical classifier optimized under cross-entropy on imbalanced data instead produces an implicit logit of

$$z_k(x) \approx \Phi_k(x) + \ln\pi_k,$$

thereby entangling class-conditional information with the prevalence-induced log-bias. As a result, the learned mapping inherently absorbs the prior imbalance, which distorts predictions for minority classes unless corrected (Huang et al., 2023).
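A quick numeric check of this decomposition (illustrative numbers, not from the paper) shows both halves of the argument: subtracting the log-prior disentangles the evidence exactly, and the entangled logits can rank classes differently than the pure evidence does:

```python
import math

# Suppose training on imbalanced data yields z_k = Phi_k + ln(pi_k).
priors = [0.7, 0.2, 0.1]
phi = [0.3, 0.9, 1.5]                                  # pure class-conditional logits
z = [f + math.log(p) for f, p in zip(phi, priors)]     # what cross-entropy learns

# With the priors known, subtracting ln(pi_k) recovers the evidence exactly.
recovered = [zk - math.log(p) for zk, p in zip(z, priors)]

# The entangled and pure logits disagree on the top class here:
print(max(range(3), key=z.__getitem__))    # head class wins under learned logits
print(max(range(3), key=phi.__getitem__))  # tail class wins under pure evidence
```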

3. PAS Score: Definition, Formula, and Hyperparameters

PAS achieves prior adjustment by modifying the logit for each class as follows:

$$\text{logit}_k(x) = z_k(x) + \tau\,\ln\pi_{k, t},$$

where:

  • $z_k(x)$: raw model logit for class $k$,
  • $\tau > 0$: temperature hyperparameter for adjustment strength (default $\tau=1$),
  • $\pi_{k, t}$: estimated class prior for class $k$ at time $t$.

The PAS cross-entropy loss for a labeled example $(x, y)$ becomes:

$$\mathcal{L}_{\mathrm{PAS}}(y, x) = -\ln\frac{\exp\bigl(z_y(x) + \tau \ln \pi_{y, t}\bigr)}{\sum_j \exp\bigl(z_j(x) + \tau \ln \pi_{j, t}\bigr)} = -\left[z_y(x) + \tau \ln\pi_{y,t} - \ln\sum_j e^{z_j(x)+\tau\ln\pi_{j, t}}\right].$$

Key limiting regimes:

  • $\tau=0$ recovers standard cross-entropy,
  • $\tau\to\infty$ corresponds to an extreme regime analogous to training only on the current classes.

Prior estimation of $\pi_{k, t}$ employs a sliding-window estimator of batch frequencies over the last $l$ timesteps:

$$\pi_{k, t} = \frac{\sum_{i=t-l+1}^t \#\{y_i = k \text{ in batch } i\}}{\sum_{i=t-l+1}^t |\text{batch}_i|}.$$

Here, the window length $l$ adjusts the tradeoff between sensitivity to change and stability (default $l=1$). In practice, $\tau\approx1$ and $l\approx1$ offer an effective balance (Huang et al., 2023).
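The loss and the estimator can be sketched in a few lines of pure Python. This is a minimal illustration, not the paper's implementation; the class and function names are our own, and a small epsilon guards against zero priors:

```python
import math
from collections import Counter, deque

def pas_cross_entropy(logits, label, priors, tau=1.0, eps=1e-8):
    # Shift each logit by tau * ln(prior), then compute stable cross-entropy.
    adj = [z + tau * math.log(p + eps) for z, p in zip(logits, priors)]
    m = max(adj)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adj))
    return -(adj[label] - log_norm)

class SlidingPriorEstimator:
    """Estimate class priors from label counts over the last l batches."""
    def __init__(self, num_classes, window=1):
        self.num_classes = num_classes
        self.batches = deque(maxlen=window)  # old batches fall off automatically

    def update(self, labels):
        self.batches.append(Counter(labels))

    def priors(self):
        counts = [sum(b[k] for b in self.batches) for k in range(self.num_classes)]
        total = sum(counts)
        if total == 0:
            return [1.0 / self.num_classes] * self.num_classes
        return [c / total for c in counts]
```

With `tau=0.0` the adjustment vanishes and `pas_cross_entropy` reduces to plain cross-entropy, matching the limiting regime above.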

4. Integration with Training Pipelines and Inference

PAS can be incorporated into most continual-learning workflows with minimal adaptation. The typical integration (using experience replay as an example) involves:

  1. Forming a training batch that merges new and replay samples.
  2. Updating the set of seen classes.
  3. Estimating class priors $\pi_{k, t}$ with a sliding window.
  4. Computing adjusted logits by augmenting $z_k(x)$ with $\tau\ln\pi_{k, t}$.
  5. Calculating PAS cross-entropy loss and performing backpropagation.
  6. Updating model parameters, as well as the replay buffer.

At inference, omitting the $+\tau\ln\pi_k$ adjustment yields pure class-conditional predictions; including it yields the Bayes-optimal posterior with respect to the estimated priors (Huang et al., 2023).
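The steps above can be condensed into one training step plus an inference toggle. This is a hedged sketch, not the paper's pipeline: `model` is any callable returning per-class logits, the replay-buffer policy is naive, a window of l = 1 is assumed, and the gradient update (plus step 2, tracking seen classes) is omitted:

```python
import math
import random

def _adjust(logits, priors, tau, eps=1e-8):
    # The PAS adjustment: add tau * ln(prior) to each raw logit.
    return [z + tau * math.log(p + eps) for z, p in zip(logits, priors)]

def train_step(model, new_batch, replay_buffer, num_classes, tau=1.0, replay_n=10):
    replay = random.sample(replay_buffer, min(replay_n, len(replay_buffer)))
    batch = new_batch + replay                          # 1. merge new + replay samples
    labels = [y for _, y in batch]
    counts = [labels.count(k) for k in range(num_classes)]
    priors = [c / len(labels) for c in counts]          # 3. window of l = 1: current batch only
    total = 0.0
    for x, y in batch:                                  # 4-5. adjusted logits + PAS loss
        adj = _adjust(model(x), priors, tau)
        m = max(adj)
        total += -(adj[y] - (m + math.log(sum(math.exp(a - m) for a in adj))))
    replay_buffer.extend(new_batch)                     # 6. naive buffer update
    return total / len(batch)                           # gradient step omitted in this sketch

def predict(model, x, priors=None, tau=1.0):
    # Omit priors for pure class-conditional scores; pass them to decide
    # under the estimated class prevalence instead.
    logits = model(x) if priors is None else _adjust(model(x), priors, tau)
    return max(range(len(logits)), key=logits.__getitem__)
```

In a real pipeline the per-example loop would be a batched tensor operation and `train_step` would be followed by an optimizer step; the structure of the six stages is unchanged.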

5. Computational Considerations

PAS introduces negligible computational overhead:

  • Logit adjustment costs $O(C)$ per forward/backward pass for $C$ classes.
  • Maintaining the sliding-window prior estimator costs $O(B+C)$ per step ($B$: batch size).
  • Memory cost is $O(C)$.
  • The approach is orthogonal to and compatible with cross-entropy-based continual-learning methods—rehearsal-oriented or not—and can be dropped in without major changes to the learning pipeline (Huang et al., 2023).

6. Empirical Performance in Continual Learning

PAS demonstrates statistically significant improvements over baseline and state-of-the-art approaches on established continual-learning benchmarks:

  • Online class-incremental CIFAR-10 (5 tasks, buffer $M=500$): the Experience Replay (ER) baseline achieves 40.9%; ER+PAS attains 51.7%, a gain of +10.8 percentage points. For $M=1\text{k}$, ER+PAS reaches 55.3%, matching or exceeding prior bests.
  • CIFAR-100 (10 tasks, $M=2\text{k}$): ER+PAS improves over the prior best by +1.7 pp (25.3% → 27.0%).
  • TinyImageNet (10 tasks): ER+PAS improves accuracy by +1.6 pp.
  • Long sequence (ImageNet-1k, 100 tasks, $M=20\text{k}$): ER baseline 7.5%, ER+PAS 9.9%.
  • Blurry online CL (CIFAR-100): ER+PAS improves from 19.6% to 24.9%.
  • PAS remains effective alongside advanced replay strategies, e.g., MIR, ASER, OCS (gains of +1.7 to +9.1 pp), and knowledge-distillation methods in general continual learning (+2.1 to +4.9 pp).

Ablative analysis indicates:

  • $\tau=0$: recovers baseline ER performance.
  • $\tau\to\infty$: leads to minimal forgetting but lower accuracy.
  • Using random or "macro" (global) priors is inferior to the sliding-window estimator.
  • Empirically, $\tau\approx1$ and $l\approx1$ optimize the trade-off between stability and plasticity (Huang et al., 2023).

7. Limitations and Domain of Applicability

PAS specifically addresses class-prior imbalance. It does not compensate for domain shift in the class-conditional distributions $p(x\mid y)$. In settings where class imbalance is not present (i.e., domain-incremental learning without class skew), PAS affords no benefit. Accurate online estimation of priors is essential: mismatches between the sliding-window length and the true dynamics of the data stream can reduce efficacy. In extremely low-data regimes per class, estimation noise in the log-prior can negatively impact performance and may require additional smoothing. PAS does not replace mechanisms required to address feature collapse or domain drift (Huang et al., 2023).


In summary, the Prevalence-Adjusted Softmax Score systematically corrects for class-prior bias by adding a log-prior adjustment to each class logit before the softmax. This statistically informed, easily implemented modification yields substantial empirical gains in continual learning scenarios with negligible computational cost, provided class-conditional stationarity and reliable prior estimation are maintained (Huang et al., 2023).
