Virtual Label-distribution-aware Learning (VILL)
- Virtual Label-distribution-aware Learning (VILL) is an unsupervised domain adaptation module that enhances category fairness by targeting poor performance in underrepresented classes.
- It employs an adaptive re-weighting mechanism that amplifies the influence of minority classes based on a virtual label distribution derived from pseudo-labels.
- A KL-divergence-based re-balancing strategy on target predictions promotes a uniform class distribution, yielding 3–7 percentage point gains in worst-class accuracies.
Virtual Label-distribution-aware Learning (VILL) is an architecture-agnostic module for Unsupervised Domain Adaptation (UDA) designed to address category fairness, i.e., performance disparities across classes when transferring models between domains. Traditional UDA approaches optimize for global accuracy but frequently overlook the challenge of maintaining high accuracy for "hard" categories, leading to significant variance in per-class performance. VILL augments arbitrary UDA methods with plug-and-play components that combine adaptive re-weighting and KL-divergence-based re-balancing, emphasizing worst-class improvements without sacrificing overall performance (Zhang et al., 26 Jan 2026).
1. Motivation and Definition: Category Fairness in UDA
Standard UDA training optimizes the joint loss

$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{DA}}\,\mathcal{L}_{\mathrm{DA}},$

where $\mathcal{L}_{\mathrm{CE}}$ is the supervised cross-entropy over the source domain $\mathcal{D}_s$, and $\mathcal{L}_{\mathrm{DA}}$ enhances domain invariance. Empirical analysis demonstrates that UDA classifiers often exhibit disproportionate accuracy across classes; "easy" classes (those with good domain alignment) attain high accuracies, while "hard" classes lag, a phenomenon quantifiable via Worst-N accuracy (the mean accuracy of the N lowest-performing target classes). For example, "Worst-5" or "Worst-10" accuracy can fall well below the average, highlighting a lack of fairness in standard UDA deployments.
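Worst-N accuracy is simple to compute from per-class accuracies; a minimal sketch (the helper name is illustrative, not from the paper):

```python
import numpy as np

def worst_n_accuracy(per_class_acc, n):
    """Mean accuracy over the n lowest-performing classes."""
    acc = np.sort(np.asarray(per_class_acc, dtype=float))  # ascending order
    return acc[:n].mean()

# Example: a 6-class problem with two lagging "hard" classes.
accs = [0.95, 0.92, 0.88, 0.85, 0.40, 0.30]
print(worst_n_accuracy(accs, 2))  # 0.35, far below the overall mean
```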
VILL directly targets this discrepancy: its adaptive architecture elevates the model’s sensitivity to underperforming classes, combining (1) source loss re-weighting to emphasize hard categories and (2) target output re-balancing that nudges predictions toward uniformity. These components are entirely unsupervised with respect to the target domain and require no modification of underlying network architectures.
2. Adaptive Re-weighting via Virtual Label Distribution
At every epoch, pseudo-labels $\hat y_j^t = \arg\max_i\, p_i(x_j^t)$ are constructed for the $N_t$ target samples over $C$ classes. The "virtual label distribution" is then defined as

$v_i = \frac{N_i}{N_t}, \qquad N_i = \sum_{j=1}^{N_t} \mathbbm{1}(\hat y_j^t = i),$

which is scaled as $E_i = v_i \cdot C$ for stabilization. The adaptive category weights use a smoothed negative-exponential transform:

$\omega_i = \frac{1 + \alpha\, e^{-E_i}}{\sum_{k=1}^{C} \left(1 + \alpha\, e^{-E_k}\right)},$

with typical $\alpha = 5$. Low-frequency (minority) classes receive amplified weights. The re-weighted source cross-entropy loss for a source minibatch $\mathcal{B}_s$ becomes

$\mathcal{L}_{\mathrm{RW}} = \sum_{i=1}^{C} \omega_i \sum_{(x,\,y) \in \mathcal{B}_s^{(i)}} \ell_{\mathrm{CE}}\big(F(G(x)),\, y\big),$

where $\ell_{\mathrm{CE}}$ is the cross-entropy and $\mathcal{B}_s^{(i)}$ is the set of source examples in the batch with label $i$. This mechanism ensures that categories under-represented in the pseudo-label set receive greater influence during network updates.
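Under the definitions above, the virtual label distribution and adaptive weights can be sketched as follows (a NumPy illustration; the function name is mine, not from the paper):

```python
import numpy as np

def adaptive_weights(pseudo_labels, num_classes, alpha=5.0):
    """Adaptive class weights from target pseudo-labels:
    v_i = N_i / N_t, E_i = v_i * C, ω_i ∝ 1 + α·exp(−E_i)."""
    counts = np.bincount(pseudo_labels, minlength=num_classes)  # N_i
    v = counts / counts.sum()                                   # virtual label distribution
    E = v * num_classes                                         # scaled frequencies
    w = 1.0 + alpha * np.exp(-E)                                # minorities get larger weights
    return w / w.sum()                                          # normalize to a distribution

# A 4-class example where class 3 is rare among the pseudo-labels.
labels = np.array([0] * 40 + [1] * 30 + [2] * 25 + [3] * 5)
w = adaptive_weights(labels, num_classes=4)
print(w.argmax())  # 3 — the rarest class receives the largest weight
```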
3. KL-divergence-based Re-balancing on Target Predictions
Complementing re-weighting, VILL introduces a loss on target predictions that promotes distributional uniformity. For a target minibatch $\mathcal{B}_t$, predicted probabilities are computed as $p(x) = \mathrm{softmax}(F(G(x)))$, and their batch average as

$\bar p = \frac{1}{|\mathcal{B}_t|} \sum_{x \in \mathcal{B}_t} p(x).$

The KL divergence from $\bar p$ to the desired uniform distribution $u = (1/C, \dots, 1/C)$ is calculated as

$D_{\mathrm{KL}}(\bar p \,\|\, u) = \sum_{i=1}^{C} \bar p_i \log \frac{\bar p_i}{1/C} = \sum_{i=1}^{C} \bar p_i \log \bar p_i + \log C;$

minimizing it penalizes excessive average confidence in majority classes by maximizing the entropy of the mean prediction, countering output collapse. The induced loss is

$\mathcal{L}_{\mathrm{RB}} = D_{\mathrm{KL}}(\bar p \,\|\, u),$

which acts only on target batches and requires no target labels.
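The re-balancing loss can be sketched on a batch of target logits (a NumPy stand-in for the framework tensors; names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def rebalancing_loss(target_logits):
    """L_RB = D_KL(batch-average prediction || uniform)."""
    p_bar = softmax(target_logits).mean(axis=0)       # batch-average prediction
    C = p_bar.shape[0]
    return float(np.sum(p_bar * np.log(p_bar * C)))   # Σ p̄_i · log(p̄_i / (1/C))

# Predictions collapsed onto class 0 incur a high loss;
# a class-balanced batch drives the loss toward zero.
collapsed = np.array([[9.0, 0.0, 0.0], [8.0, 0.0, 0.0]])
balanced  = np.array([[9.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 9.0]])
print(rebalancing_loss(collapsed) > rebalancing_loss(balanced))  # True
```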
4. Combined Objective and Implementation Details
VILL generalizes arbitrary base UDA methods, substituting $\mathcal{L}_{\mathrm{CE}}$ with $\mathcal{L}_{\mathrm{RW}}$ and adding the KL re-balancing term:

$\mathcal{L} = \mathcal{L}_{\mathrm{RW}} + \lambda_{\mathrm{DA}}\, \mathcal{L}_{\mathrm{DA}} + \lambda_{\mathrm{KL}}\, \mathcal{L}_{\mathrm{RB}},$

with $\lambda_{\mathrm{DA}}$ inherited from the base method and $\lambda_{\mathrm{KL}}$ (typically $0.01$–$0.1$) regulating the trade-off.
The training loop consists of:
- Forward pass on target samples, generating pseudo-labels.
- Re-computation of $E_i$ and $\omega_i$ per class.
- Minibatch processing:
  - Source: compute $\mathcal{L}_{\mathrm{RW}}$;
  - Domain alignment: compute $\mathcal{L}_{\mathrm{DA}}$ (on both domains);
  - Target: compute $\mathcal{L}_{\mathrm{RB}}$;
  - Update parameters via gradients of $\mathcal{L} = \mathcal{L}_{\mathrm{RW}} + \lambda_{\mathrm{DA}} \mathcal{L}_{\mathrm{DA}} + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{RB}}$.
- End-of-epoch target relabeling for refreshed $\omega_i$.
A pseudocode realization follows:
```
initialize G, F                      # feature extractor G, classifier F
initialize ω_i = 1/C for all i
for epoch = 1…max_epochs:
    # 1) update pseudo-labels and ω
    for each x in D_t:
        p = softmax(F(G(x)))
        pseudo-label ŷ = argmax p
    compute counts N_i and E_i = (N_i/N_t)·C
    ω_i ← (1 + α·e^{−E_i}) / Σ_k (1 + α·e^{−E_k})
    # 2) train with VILL
    for each minibatch (X^s, Y^s), X^t:
        L_RW = reweighted_CE(X^s, Y^s; ω)
        L_DA = domain_alignment_loss(X^s, X^t)
        p_t  = softmax(F(G(X^t)))
        L_RB = D_KL(mean(p_t) || u)  # u = uniform distribution (1/C, …, 1/C)
        L    = L_RW + λ_DA·L_DA + λ_KL·L_RB
        backpropagate L, update G, F
```
5. Theoretical Foundations and Hyperparameter Choices
The adaptive weight vector $\omega$ approximates the inverse class frequency derived from noisy target pseudo-labels, connecting VILL to established long-tailed learning fairness strategies. The KL-divergence component minimizes an $f$-divergence between the network's average predictive distribution and a balanced target, compelling decision boundaries to expand into under-represented classes. The procedure omits formal convergence proofs, but all loss terms are continuous, differentiable, and bounded, so training remains stable under common stochastic gradient updates.
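Minimizing the KL divergence to the uniform distribution is equivalent to maximizing the entropy of the averaged prediction, since $D_{\mathrm{KL}}(p \,\|\, u) = -H(p) + \log C$; a quick numerical sanity check:

```python
import numpy as np

# Verify D_KL(p || u) = −H(p) + log C for a random distribution p, uniform u.
rng = np.random.default_rng(0)
p = rng.random(10)
p /= p.sum()                                   # a random probability vector
C = p.size

kl = float(np.sum(p * np.log(p * C)))          # D_KL(p || u), with u_i = 1/C
neg_entropy_plus_logC = float(np.sum(p * np.log(p)) + np.log(C))
print(np.isclose(kl, neg_entropy_plus_logC))   # True
```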
Typical hyperparameter ranges are:
- $\alpha$ (disparity strength): 1.0–10.0, default 5.0 (balances fairness vs. stability).
- $\lambda_{\mathrm{KL}}$ (re-balancing weight): 0.01–0.1, default 0.05 (controls fairness impact).
- $\lambda_{\mathrm{DA}}$: set as in the original UDA baseline. Warm-up strategies exist, in which $\mathcal{L}_{\mathrm{RB}}$ is activated only after pseudo-label maturation (usually 1–2 epochs).
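The warm-up can be sketched as a simple gating of $\lambda_{\mathrm{KL}}$ (the function name and the two-epoch threshold are illustrative choices, not prescribed by the paper):

```python
def kl_weight(epoch, lam_kl=0.05, warmup_epochs=2):
    """Gate the re-balancing weight: zero until pseudo-labels mature."""
    return lam_kl if epoch >= warmup_epochs else 0.0

print([kl_weight(e) for e in range(4)])  # [0.0, 0.0, 0.05, 0.05]
```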
6. Empirical Performance and Ablation Studies
Experiments on OfficeHome (65 classes, 4 domains) and Office-31 (31 classes, 3 domains) confirm VILL's effectiveness:
- Integration into CDAN yields a Worst-5 accuracy increase from 20.3% to 26.8% (OfficeHome), with mean accuracy stable around 68%.
- Across baselines (MDD, ATDOC, CLIP-based PDA), VILL raises Worst-5/10 accuracy by 3–7 pts, sometimes improving overall accuracy by ≈0.5 pts.
- On Office-31, Worst-3 accuracy gains 4–6 pts.
Ablation (OfficeHome, CDAN backbone):
| Method | Worst-5 (%) | Worst-10 (%) | Avg (%) |
|---|---|---|---|
| Baseline | 20.3 | 28.7 | 68.0 |
| +Re-weighting | 22.4 | 31.2 | 68.7 |
| +Re-balancing | 26.0 | 33.7 | 68.0 |
| VILL (full) | 26.8 | 34.5 | 67.9 |
The KL re-balancing confers the greatest single boost, but maximal improvement comes from their combination, indicating both mechanisms contribute orthogonally to fairness enhancement.
7. Significance and Integration
VILL stands out for its ease of integration—no architectural changes, minimal computational overhead, and applicability to any UDA baseline. The explicit focus on category fairness distinguishes it from prior UDA protocols prioritizing only mean accuracy. A plausible implication is that future domain adaptation research might standardize worst-class metrics to report the effectiveness of fairness strategies such as VILL (Zhang et al., 26 Jan 2026).