
Adversarial Representation Learning for Fairness

Updated 9 February 2026
  • Adversarial representation learning removes the influence of sensitive attributes via minimax games, enforcing group fairness criteria such as demographic parity.
  • The model architecture integrates an encoder, predictor, and adversary (or selector) to balance prediction accuracy and fairness by eliminating direct and proxy signals.
  • Empirical results demonstrate that this approach improves fairness metrics on datasets from high-stakes domains like criminal justice and finance without sacrificing performance.

Algorithmic fairness via adversarial representation learning refers to a class of methodologies that construct data representations invariant to protected attributes (e.g., race, gender) using minimax optimization games involving adversarial networks. These approaches ensure that downstream predictions cannot exploit direct or proxy-sensitive information, thus providing statistical guarantees (e.g., group fairness, demographic parity) and improving trustworthiness in high-stakes domains such as criminal justice and finance.

1. Fundamental Problem and Minimax Objective

The core objective is to remove or minimize the influence of sensitive features in a learned representation $\varphi(x)$. In a standard adversarial representation learning framework, an encoder (feature extractor) learns representations from input data, while an adversary is trained simultaneously to predict the sensitive attribute from the learned features. The encoder is optimized so as to “fool” the adversary, making it difficult or impossible to recover the sensitive attribute, while retaining as much information as possible for the target prediction task.

Formalization

In the typical minimax setting:

$$\min_{\varphi, C} \max_{D} \; \Big[ L_y(C(\varphi(x)), y) - \lambda \, L_s(D(\varphi(x)), s) \Big],$$

where

  • $L_y$ is the prediction loss (e.g., cross-entropy for classification),
  • $L_s$ is the adversary loss (e.g., cross-entropy for sensitive attribute prediction),
  • $\lambda$ is a hyperparameter trading off accuracy and fairness.
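As a concrete reading of this objective, the sketch below (numpy only; all names and numbers are illustrative assumptions, not from the cited work) evaluates the encoder-side quantity $L_y - \lambda L_s$ on a toy batch. A confident, correct predictor combined with a chance-level adversary yields a low (here negative) value, which is exactly what the encoder is driven toward.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean binary cross-entropy; probs are predicted P(label = 1)."""
    eps = 1e-12
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))

def combined_objective(p_y, y, p_s, s, lam=1.0):
    """Encoder/predictor side of the minimax game: task loss minus
    the (weighted) adversary loss, L_y - lam * L_s."""
    L_y = cross_entropy(p_y, y)   # predictor's task loss
    L_s = cross_entropy(p_s, s)   # adversary's loss on the sensitive attribute
    return L_y - lam * L_s        # encoder minimizes this quantity

# Toy batch: predictor is confident and correct, adversary is at chance.
y   = np.array([1, 0, 1, 0]); p_y = np.array([0.9, 0.1, 0.8, 0.2])
s   = np.array([0, 1, 0, 1]); p_s = np.array([0.5, 0.5, 0.5, 0.5])
print(combined_objective(p_y, y, p_s, s, lam=1.0))  # negative: fair and accurate
```

With a stronger adversary (p_s closer to s), $L_s$ drops and the objective rises, pressuring the encoder to strip sensitive information from the representation.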

Distinct variants modify the adversarial architecture and loss function to target more specific forms of proxy leakage or to accommodate conditional or more stringent forms of independence.

For instance, in the FAIAS framework, the adversarial objective is redefined to maximize the sensitivity of the predictive model to the sensitive feature, and the encoder is optimized to suppress this influence, leading to the minimax objective (Wang et al., 2019):

$$\min_\varphi \max_\theta \; \big[ l_y(\theta, \varphi) + \lambda \, l_s(\theta, \varphi) \big].$$

2. Model Architectures and Feature Selection

Encoder/Feature Extractor

Rather than directly handcrafting the mapping $\varphi(x)$, the encoder may employ a “selector” network $g_\theta(x) \in [0,1]^d$ that assigns each input feature $x_j$ a probability $p_j$ of being included. A binary mask $s_j \sim \text{Bernoulli}(p_j)$ is drawn per feature, producing the latent representation:

$$\varphi(x) = x \odot s,$$

where unselected features are zeroed out.
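A minimal sketch of this masking step (numpy only; the selector probabilities are taken as given here rather than produced by a trained network, so the values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_features(x, probs, rng):
    """Draw a Bernoulli mask s_j ~ Bernoulli(p_j) per feature and
    zero out unselected features: phi(x) = x * s (elementwise)."""
    s = rng.binomial(1, probs, size=x.shape)
    return x * s, s

# Toy example: p_j near 1 almost always keeps a feature, p_j near 0 drops it.
x = np.array([2.0, -1.0, 3.0, 0.5])
probs = np.array([0.99, 0.01, 0.99, 0.01])   # hypothetical selector output g_theta(x)
phi, mask = select_features(x, probs, rng)
```

Because the mask is sampled, gradients do not flow through it directly; in practice the selector is trained with score-function or relaxation-based estimators, a detail omitted from this sketch.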

Predictor and Adversarial Network

  • The predictor $C$ (or $f_\phi$) is a neural network mapping $\varphi(x)$ (with or without the sensitive feature) to the target label score.
  • The adversarial component can take the form of a network $D$ that tries to recover the sensitive attribute from $\varphi(x)$ or, as in selector-based setups, a network $g$ that selects features to maximize the predictor’s sensitivity to the protected attribute.

This architecture ensures the representation discards direct and proxy-sensitive signals while maintaining predictive utility (Wang et al., 2019).

3. Optimization Strategies and Training Algorithms

The training procedure generally alternates between two updates:

  • Adversary (or Selector) Update (Maximization): Parameters $\theta$ are updated to maximize the predictor’s sensitivity to the sensitive attribute (or the adversarial loss, if a discriminator tries to decode the protected group from the representation).
  • Encoder and/or Predictor Update (Minimization): Parameters $\varphi$ are optimized to minimize both the prediction loss and the adversarial sensitivity.

For a minibatch $\{(x_i, y_i)\}$, the procedure is as follows (Wang et al., 2019):

  1. Compute feature-selection probabilities $p_i = g_\theta(x_i)$.
  2. Draw binary masks $s_i$ and form representations $\varphi(x_i) = x_i \odot s_i$.
  3. For each sample, compute model outputs with and without the sensitive feature and record the difference $\Delta_i$.
  4. Update $\theta$ by gradient ascent on the sensitivity term.
  5. Update $\varphi$ by gradient descent on both the sensitivity and prediction losses.

Hyperparameters, such as the learning rates ($\alpha_\theta$, $\alpha_\varphi$), batch size, and the Lagrange multiplier $\lambda$, are tuned on a validation set.
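The alternating scheme can be sketched in a generic form. The toy implementation below is *not* the FAIAS selector variant: it substitutes a linear encoder $\varphi(x) = xW$, a logistic predictor, and a logistic adversary, alternating plain gradient steps; every name, hyperparameter value, and dataset here is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    # Clip logits for numerical stability.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_fair_representation(x, y, s, k=2, lam=1.0,
                              lr_enc=0.1, lr_adv=0.1, steps=500, seed=0):
    """Alternate between (i) an adversary step that descends its loss L_s
    and (ii) an encoder/predictor step that descends L_y - lam * L_s.
    Encoder: phi(x) = x @ W; predictor weights c; adversary weights a."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(d, k))
    c = np.zeros(k)
    a = np.zeros(k)
    for _ in range(steps):
        phi = x @ W
        # Adversary update (maximization side of the game):
        # improve prediction of the sensitive attribute s from phi.
        p_s = sigmoid(phi @ a)
        a -= lr_adv * phi.T @ (p_s - s) / n
        # Encoder/predictor update (minimization side):
        # descend L_y while ascending the adversary's loss L_s.
        p_y = sigmoid(phi @ c)
        p_s = sigmoid(phi @ a)
        g_phi = (np.outer(p_y - y, c) - lam * np.outer(p_s - s, a)) / n
        W -= lr_enc * x.T @ g_phi
        c -= lr_enc * phi.T @ (p_y - y) / n
    return W, c, a

# Toy data: feature 0 drives the label; feature 1 *is* the sensitive attribute.
rng = np.random.default_rng(1)
s = rng.integers(0, 2, 200).astype(float)
u = rng.normal(size=200)
y = (u + 0.3 * rng.normal(size=200) > 0).astype(float)
x = np.column_stack([u, s])
W, c, a = train_fair_representation(x, y, s)
```

Because the label depends only on feature 0, the encoder can satisfy both sides of the game: it keeps the predictive direction while suppressing the column carrying $s$.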

4. Fairness Metrics and Theoretical Guarantees

Common group fairness metrics evaluated include:

  • Demographic Parity Difference: $|\Pr(\hat{y}=1 \mid s=0) - \Pr(\hat{y}=1 \mid s=1)|$
  • Equalized Odds: $\frac{1}{2}\,(|FPR_0 - FPR_1| + |TPR_0 - TPR_1|)$, where $FPR$ and $TPR$ are per-group false and true positive rates.
  • Equal Opportunity: $|TPR_0 - TPR_1|$
  • Theil Index: an information-theoretic inequality measure, minimized at zero (Wang et al., 2019).
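The first two group metrics are straightforward to compute from binary predictions. The sketch below implements them directly from their definitions; the toy data are an illustrative assumption.

```python
import numpy as np

def demographic_parity_diff(y_hat, s):
    """|P(yhat=1 | s=0) - P(yhat=1 | s=1)|."""
    return abs(y_hat[s == 0].mean() - y_hat[s == 1].mean())

def equalized_odds_diff(y_hat, y, s):
    """0.5 * (|FPR_0 - FPR_1| + |TPR_0 - TPR_1|)."""
    def rates(group):
        yh, yt = y_hat[s == group], y[s == group]
        fpr = yh[yt == 0].mean()   # positives predicted among true negatives
        tpr = yh[yt == 1].mean()   # positives predicted among true positives
        return fpr, tpr
    fpr0, tpr0 = rates(0)
    fpr1, tpr1 = rates(1)
    return 0.5 * (abs(fpr0 - fpr1) + abs(tpr0 - tpr1))

# Toy predictions: group 0 receives positives twice as often as group 1.
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y     = np.array([1, 0, 1, 0, 1, 0, 1, 0])
s     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_diff(y_hat, s))  # -> 0.25
```

Both metrics are zero for a perfectly group-fair classifier, which is the regime the adversarial training pushes toward.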

The FAIAS architecture provably forces learned representations to discard information about the sensitive feature and its proxies, controlling these metrics. For instance, as the adversary is trained to maximize the predictor’s group sensitivity and the encoder is trained to minimize it, the learned representation approaches group-indistinguishability.

5. Representative Experimental Results

Extensive testing on benchmark datasets such as German Credit, COMPAS recidivism, and Bank Marketing demonstrates that adversarial representation learning matches or outperforms state-of-the-art baselines (logistic regression, in-processing adversarial debiasing, post-processing equalized odds, disparate-impact removers, and reweighting approaches) on raw accuracy, balanced accuracy, and all major group fairness metrics. The key empirical finding is that the “selector vs. predictor” minimax game efficiently removes proxy information without sacrificing predictive power (Wang et al., 2019).

6. Variations and Extensions

Variations and extensions of adversarial representation learning for fairness encompass:

  • Domain Generalization via Adversarial Invariance: Viewing “domain” as a protected attribute, minimax objectives can enforce invariance across source domains, thus producing representations that transfer equitably to new domains (Deng et al., 2020).
  • Stacked Adversarial Autoencoders: Imposing fairness constraints at multiple hierarchical representation levels (latent spaces) recursively tightens fairness guarantees (e.g., demographic parity, equalized odds) (Kenfack et al., 2021).
  • Alternative Independence Measures: Replacing adversarial losses with objectives such as the Hirschfeld–Gebelein–Rényi maximal correlation yields representations with stronger independence properties in terms of nonlinear dependencies (Grari et al., 2020).
  • Semi-supervised and Instance-weighting Variants: Frameworks such as Semi-FairVAE leverage scarce labeled sensitive attributes by structuring the encoder/adversary branches to exploit unlabeled data and promote orthogonality between bias-aware and bias-free representations (Wu et al., 2022); instance-reweighting methods learn a fairness-weight per sample in an adversarial saddle-point game (Petrović et al., 2020).

7. Practical Implications and Significance

Adversarial representation learning architectures offer a principled and empirically validated strategy to obtain fair machine learning models with formal guarantees. The minimax formulation not only blocks direct leakage of sensitive attributes but also systematically eliminates proxies. These methods are flexible enough to accommodate different notions of group fairness, scale across problem domains, and deliver competitive predictive performance. Empirical and theoretical results indicate robust performance on real-world datasets and show superiority or parity with previous in-processing, pre-processing, and post-processing methods (Wang et al., 2019, Deng et al., 2020, Kenfack et al., 2021).

In summary, algorithmic fairness via adversarial representation learning constitutes a rigorous and effective class of minimax procedures to guarantee group fairness properties in machine learning, directly addressing regulatory and societal concerns about disparate impact and discrimination in automated decision-making (Wang et al., 2019).
