
Prevalence-Adjusted Softmax (PAS) Score

Updated 9 February 2026
  • Prevalence-Adjusted Softmax (PAS) Score is a technique that adjusts raw logits using estimated class priors to mitigate biases in imbalanced data.
  • It incorporates a tunable parameter and a sliding-window estimator to balance sensitivity and stability, improving model performance in continual learning.
  • Empirical results on benchmarks like CIFAR-10 and CIFAR-100 highlight significant accuracy improvements with negligible computational cost.

The Prevalence-Adjusted Softmax (PAS) Score, also referred to as Logit-Adjusted Softmax, is a method designed to address class-prior imbalance in neural network classifiers, particularly in the context of online continual learning. The approach is grounded in statistical theory, providing a principled corrective to the biases that arise when class distributions shift or are non-uniform over training. PAS works by modifying the softmax logits based on estimates of class prevalence and introduces a tunable mechanism to control the strength of this adjustment, thereby offering a versatile solution with minimal computational overhead (Huang et al., 2023).

1. Class-Prior Imbalance in Softmax Classifiers

Standard softmax classifiers for multiclass prediction are typically optimized via cross-entropy loss on raw logits $z_k(x)$ for each class $k$:

$$p(y=k\mid x) = \frac{\exp\bigl(z_k(x)\bigr)}{\sum_j \exp\bigl(z_j(x)\bigr)}.$$

When classes appear with imbalanced frequencies (e.g., some "head" classes much more common than "tail" classes), the learned logits are biased towards frequent classes. This bias leads models to over-predict head classes and under-predict tail classes. In continual learning, this phenomenon manifests as "recency bias": as new classes dominate the data stream, the model's predictions become increasingly skewed toward recently encountered classes, causing catastrophic forgetting of earlier ones. This challenge is fundamentally an issue of drift in the underlying class-prior probabilities $\pi_k = \mathbb{P}(y=k)$ (Huang et al., 2023).
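The head-class skew can be seen in a toy softmax computation. In this sketch (pure Python, illustrative numbers only), two classes carry identical class-conditional evidence, but the learned logits absorb a log-prior bias toward the head class:

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

phi = [1.0, 1.0]                      # "pure" class-conditional logits: a tie
priors = [0.9, 0.1]                   # head vs. tail prevalence
biased = [p + math.log(pi) for p, pi in zip(phi, priors)]

print(softmax(phi))     # evidence alone says 50/50
print(softmax(biased))  # skewed 90/10 toward the head class
```

With equal evidence, the prior-contaminated logits alone push the predicted probability of the head class from 0.5 to 0.9.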

2. Bayes-Optimal Classification under Non-Uniform Priors

Bayesian decision theory dictates that optimal classification incorporates both class-conditional and class-prior probabilities. For a data-generating process described by $\mathbb{P}(x, y)$, with class priors $\pi_k$ and conditionals $p(x\mid y=k)$,

$$p(y=k\mid x) \propto p(x\mid y=k)\,\pi_k.$$

Suppose a model could output "pure" class-conditional logits $\Phi_k(x)\approx\ln p(x\mid y=k)$; then the posterior is

$$p(y=k\mid x) \propto \exp\bigl(\Phi_k(x) + \ln\pi_k\bigr).$$

The typical classifier optimized under cross-entropy on imbalanced data instead produces an implicit logit of

$$z_k(x) \approx \Phi_k(x) + \ln\pi_k,$$

thereby entangling class-conditional information with the prevalence-induced log-bias. As a result, the learned mapping inherently absorbs the prior imbalance, which distorts predictions for minority classes unless corrected (Huang et al., 2023).
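A quick numeric check of this decomposition (illustrative numbers, not from the paper) shows both halves of the argument: subtracting the log-prior disentangles the evidence exactly, and the entangled logits can rank classes differently than the pure evidence does:

```python
import math

# Suppose training on imbalanced data yields z_k = Phi_k + ln(pi_k).
priors = [0.7, 0.2, 0.1]
phi = [0.3, 0.9, 1.5]                                  # pure class-conditional logits
z = [f + math.log(p) for f, p in zip(phi, priors)]     # what cross-entropy learns

# With the priors known, subtracting ln(pi_k) recovers the evidence exactly.
recovered = [zk - math.log(p) for zk, p in zip(z, priors)]

# The entangled and pure logits disagree on the top class here:
print(max(range(3), key=z.__getitem__))    # head class wins under learned logits
print(max(range(3), key=phi.__getitem__))  # tail class wins under pure evidence
```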

3. PAS Score: Definition, Formula, and Hyperparameters

PAS achieves prior adjustment by modifying the logit for each class as follows:

$$\text{logit}_k(x) = z_k(x) + \tau\,\ln\pi_{k, t},$$

where:

  • $z_k(x)$: raw model logit for class $k$,
  • $\tau > 0$: temperature hyperparameter for adjustment strength (default $\tau=1$),
  • $\pi_{k, t}$: estimated class prior for class $k$ at time $t$.

The PAS cross-entropy loss for a labeled example $(x, y)$ becomes:

$$\mathcal{L}_{\mathrm{PAS}}(y, x) = -\ln\frac{\exp\bigl(z_y(x) + \tau \ln \pi_{y, t}\bigr)}{\sum_j \exp\bigl(z_j(x) + \tau \ln \pi_{j, t}\bigr)} = -\left[z_y(x) + \tau \ln\pi_{y,t} - \ln\sum_j e^{z_j(x)+\tau\ln\pi_{j, t}}\right].$$

Key limiting regimes:

  • $\tau=0$ recovers standard cross-entropy,
  • $\tau\to\infty$ corresponds to an extreme regime analogous to training only on the current classes.

Prior estimation of $\pi_{k, t}$ employs a sliding-window estimator of batch frequencies over the last $l$ timesteps:

$$\pi_{k, t} = \frac{\sum_{i=t-l+1}^t \#\{y_i = k \text{ in batch } i\}}{\sum_{i=t-l+1}^t |\text{batch}_i|}.$$

Here, the window length $l$ adjusts the tradeoff between sensitivity to change and stability (default $l=1$). In practice, $\tau\approx1$ and $l\approx1$ offer an effective balance (Huang et al., 2023).
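The loss and the estimator can be sketched in a few lines of pure Python. This is a minimal illustration, not the paper's implementation; the class and function names are our own, and a small epsilon guards against zero priors:

```python
import math
from collections import Counter, deque

def pas_cross_entropy(logits, label, priors, tau=1.0, eps=1e-8):
    # Shift each logit by tau * ln(prior), then compute stable cross-entropy.
    adj = [z + tau * math.log(p + eps) for z, p in zip(logits, priors)]
    m = max(adj)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adj))
    return -(adj[label] - log_norm)

class SlidingPriorEstimator:
    """Estimate class priors from label counts over the last l batches."""
    def __init__(self, num_classes, window=1):
        self.num_classes = num_classes
        self.batches = deque(maxlen=window)  # old batches fall off automatically

    def update(self, labels):
        self.batches.append(Counter(labels))

    def priors(self):
        counts = [sum(b[k] for b in self.batches) for k in range(self.num_classes)]
        total = sum(counts)
        if total == 0:
            return [1.0 / self.num_classes] * self.num_classes
        return [c / total for c in counts]
```

With `tau=0.0` the adjustment vanishes and `pas_cross_entropy` reduces to plain cross-entropy, matching the limiting regime above.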

4. Integration with Training Pipelines and Inference

PAS can be incorporated into most continual-learning workflows with minimal adaptation. The typical integration (using experience replay as an example) involves:

  1. Forming a training batch that merges new and replay samples.
  2. Updating the set of seen classes.
  3. Estimating class priors $\pi_{k, t}$ with a sliding window.
  4. Computing adjusted logits by augmenting $z_k(x)$ with $\tau\ln\pi_{k, t}$.
  5. Calculating PAS cross-entropy loss and performing backpropagation.
  6. Updating model parameters, as well as the replay buffer.

At inference, omitting the $+\tau\ln\pi_k$ adjustment yields pure class-conditional predictions; including it yields the Bayes-optimal posterior with respect to the estimated priors (Huang et al., 2023).
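The steps above can be condensed into one training step plus an inference toggle. This is a hedged sketch, not the paper's pipeline: `model` is any callable returning per-class logits, the replay-buffer policy is naive, a window of l = 1 is assumed, and the gradient update (plus step 2, tracking seen classes) is omitted:

```python
import math
import random

def _adjust(logits, priors, tau, eps=1e-8):
    # The PAS adjustment: add tau * ln(prior) to each raw logit.
    return [z + tau * math.log(p + eps) for z, p in zip(logits, priors)]

def train_step(model, new_batch, replay_buffer, num_classes, tau=1.0, replay_n=10):
    replay = random.sample(replay_buffer, min(replay_n, len(replay_buffer)))
    batch = new_batch + replay                          # 1. merge new + replay samples
    labels = [y for _, y in batch]
    counts = [labels.count(k) for k in range(num_classes)]
    priors = [c / len(labels) for c in counts]          # 3. window of l = 1: current batch only
    total = 0.0
    for x, y in batch:                                  # 4-5. adjusted logits + PAS loss
        adj = _adjust(model(x), priors, tau)
        m = max(adj)
        total += -(adj[y] - (m + math.log(sum(math.exp(a - m) for a in adj))))
    replay_buffer.extend(new_batch)                     # 6. naive buffer update
    return total / len(batch)                           # gradient step omitted in this sketch

def predict(model, x, priors=None, tau=1.0):
    # Omit priors for pure class-conditional scores; pass them to decide
    # under the estimated class prevalence instead.
    logits = model(x) if priors is None else _adjust(model(x), priors, tau)
    return max(range(len(logits)), key=logits.__getitem__)
```

In a real pipeline the per-example loop would be a batched tensor operation and `train_step` would be followed by an optimizer step; the structure of the six stages is unchanged.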

5. Computational Considerations

PAS introduces negligible computational overhead:

  • Logit adjustment costs $O(C)$ per forward/backward pass for $C$ classes.
  • Maintaining the sliding-window prior estimator costs $O(B+C)$ per step ($B$: batch size).
  • Memory cost is $O(C)$.
  • The approach is orthogonal to and compatible with cross-entropy-based continual-learning methods—rehearsal-oriented or not—and can be dropped in without major changes to the learning pipeline (Huang et al., 2023).

6. Empirical Performance in Continual Learning

PAS demonstrates statistically significant improvements over baseline and state-of-the-art approaches on established continual-learning benchmarks:

  • Online class-incremental CIFAR-10 (5 tasks, buffer $M=500$): the Experience Replay (ER) baseline achieves 40.9%; ER+PAS attains 51.7%, a gain of +10.8 percentage points. For $M=1\text{k}$, ER+PAS reaches 55.3%, matching or exceeding prior bests.
  • CIFAR-100 (10 tasks, $M=2\text{k}$): ER+PAS improves over the prior best by +1.7 pp (25.3% → 27.0%).
  • TinyImageNet (10 tasks): ER+PAS improves accuracy by +1.6 pp.
  • Long sequence (ImageNet-1k, 100 tasks, $M=20\text{k}$): ER baseline 7.5%, ER+PAS 9.9%.
  • Blurry online CL (CIFAR-100): ER+PAS improves from 19.6% to 24.9%.
  • PAS remains effective alongside advanced replay strategies, e.g., MIR, ASER, OCS (gains of +1.7 to +9.1 pp), and knowledge-distillation methods in general continual learning (+2.1 to +4.9 pp).

Ablative analysis indicates:

  • $\tau=0$: recovers baseline ER performance.
  • $\tau\to\infty$: leads to minimal forgetting but lower accuracy.
  • Using random or "macro" (global) priors is inferior to the sliding-window estimator.
  • Empirically, $\tau\approx1$ and $l\approx1$ optimize the trade-off between stability and plasticity (Huang et al., 2023).

7. Limitations and Domain of Applicability

PAS specifically addresses class-prior imbalance. It does not compensate for domain shift in the class-conditional distributions $p(x\mid y)$. In settings where class imbalance is not present (i.e., domain-incremental learning without class skew), PAS affords no benefit. Accurate online estimation of priors is essential: mismatches between the sliding-window length and the true dynamics of the data stream can reduce efficacy. In extremely low-data regimes per class, estimation noise in the log-prior can negatively impact performance and may require additional smoothing. PAS does not replace mechanisms required to address feature collapse or domain drift (Huang et al., 2023).


In summary, the Prevalence-Adjusted Softmax Score systematically corrects for class-prior bias by adding a log-prior adjustment to each class logit before the softmax. This statistically informed, easily implemented modification yields substantial empirical gains in continual learning scenarios with negligible computational cost, provided class-conditional stationarity and reliable prior estimation are maintained (Huang et al., 2023).
