
Factorizable Joint Shift (FJS)

Updated 28 January 2026
  • Factorizable Joint Shift (FJS) is a framework that factorizes the density ratio between source and target distributions into separate functions of inputs and labels, generalizing covariate and label shift.
  • Estimation procedures using Joint Importance Aligning (JIA) and EM-style algorithms enable practical correction of posterior probabilities with theoretical guarantees.
  • FJS-based methods empirically outperform traditional correction schemes in transfer learning, though the framework introduces challenges such as non-uniqueness of the factorization and numerical complexity.

Factorizable Joint Shift (FJS) is a statistical assumption and modeling framework for domain adaptation and dataset shift, unifying and generalizing covariate shift, label shift, and more general forms of non-stationarity encountered when learning under distributional mismatch between the training (source) and test (target) domains. Under FJS, the change from source to target joint distribution is described by a density ratio that factorizes into decoupled, multiplicative functions of the covariates and the labels, respectively. This ansatz enables principled importance-weighting, the formulation of both theoretical guarantees and practical estimation procedures, and correction formulae for posterior probabilities under shift in both classification and regression settings (He et al., 2022, 2207.14514, Tasche, 21 Jan 2026).

1. Formal Definition and Structural Properties

The canonical setup considers random variables $(X, Y)$ on input space $\mathcal X$ and label space $\mathcal Y$, and two probability distributions: the source $P_{(X,Y)}$ and the target $Q_{(X,Y)}$ (or, equivalently, densities $p_S(x, y)$ and $p_T(x, y)$). The target is assumed absolutely continuous with respect to the source, so the Radon–Nikodym derivative exists:

f(x, y) = \frac{dQ}{dP}(x, y).

The source and target distributions satisfy a factorizable joint shift if

f(x, y) = h(x)\, g(y), \quad \text{for measurable, non-negative } h: \mathcal X \to [0, \infty),\; g: \mathcal Y \to [0, \infty),

equivalently,

p_T(x, y) = h(x)\, g(y)\, p_S(x, y).

This strictly generalizes both:

  • Covariate shift: $g(y) \equiv 1$.
  • Label shift: $h(x) \equiv 1$.

In multiclass classification ($Y \in \{1, \dots, d\}$), one may alternatively write:

p_T(x, y) = g(x)\, b(y)\, p_S(x, y),

for $g: \mathcal X \to \mathbb R_{\ge 0}$ and $b: \{1, \dots, d\} \to \mathbb R_{\ge 0}$. The normalization $\int g(x)\, b(y)\, p_S(x, y)\, dx\, dy = 1$ ensures $p_T$ is a valid probability density (2207.14514).

FJS is non-unique: the functions $h, g$ are determined only up to scaling, $(h', g') = (c\, h,\; g/c)$ for any $c > 0$.
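As a concrete sanity check, the factorized shift and its scaling non-uniqueness can be exercised on a small discrete example. The distributions and factor values below are illustrative, not taken from the cited papers:

```python
# Toy FJS on a discrete space: the target joint is the source joint multiplied
# by a factorized weight h(x) * g(y), then renormalized.  All numbers and names
# (pS, h, g, fjs_target) are illustrative assumptions for this sketch.

# Source joint p_S(x, y) on X = {0, 1, 2}, Y = {0, 1}.
pS = {(0, 0): 0.20, (0, 1): 0.10,
      (1, 0): 0.15, (1, 1): 0.25,
      (2, 0): 0.05, (2, 1): 0.25}

h = {0: 1.0, 1: 2.0, 2: 0.5}   # covariate factor h(x)
g = {0: 1.5, 1: 0.8}           # label factor g(y)

def fjs_target(pS, h, g):
    """Target joint p_T(x, y) proportional to h(x) g(y) p_S(x, y), renormalized."""
    unnorm = {xy: h[xy[0]] * g[xy[1]] * p for xy, p in pS.items()}
    Z = sum(unnorm.values())
    return {xy: p / Z for xy, p in unnorm.items()}

pT = fjs_target(pS, h, g)
assert abs(sum(pT.values()) - 1.0) < 1e-12   # valid probability distribution

# Non-uniqueness: (c*h, g/c) induces exactly the same target distribution.
c = 3.7
pT2 = fjs_target(pS, {x: c * v for x, v in h.items()},
                     {y: v / c for y, v in g.items()})
assert all(abs(pT[xy] - pT2[xy]) < 1e-12 for xy in pS)
```

Only the product $h(x)g(y)$ matters for the induced target distribution, which is exactly the identifiability caveat stated above.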

2. Relationship to Classical Shift Models and Generalizations

FJS encompasses and strictly contains several classical shift regimes. This is summarized in the following table:

| Assumption | Definition | Factorization |
|---|---|---|
| Covariate shift | $p_S(y \mid x) = p_T(y \mid x)$ | $U(x) = p_T(x)/p_S(x)$, $V(y) = 1$ |
| Label shift | $p_S(x \mid y) = p_T(x \mid y)$ | $U(x) = 1$, $V(y) = p_T(y)/p_S(y)$ |
| Domain-invariance | $\exists\, E:\ p_S(Z, Y) = p_T(Z, Y)$ for $Z = E(x)$ | $U(x) = p_T(x)/p_S(x)$, $V(y) = 1$ |
| Generalized LS | $U(x) = p_T(x \mid Z)/p_S(x \mid Z)$, $V(y) = p_T(y)/p_S(y)$ | As indicated |
| FJS | No assumption beyond factorization | $U(x), V(y)$ unrestricted |

In deterministic labeling settings (i.e., $Y = f(X)$), FJS degenerates to Generalized Label Shift (GLS); in truly stochastic or regression settings, FJS is strictly more general (He et al., 2022, Tasche, 21 Jan 2026).

FJS also admits a sequential interpretation: joint shift that factorizes can arise from applying covariate and label shift consecutively, regardless of order (Tasche, 21 Jan 2026).
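The sequential interpretation can be verified numerically: applying a covariate shift and then a label shift to a discrete source joint yields a density ratio whose matrix is rank-one, i.e., factorizable. The setup below is illustrative:

```python
# Sanity check of the sequential interpretation: a covariate shift followed by
# a label shift produces a joint density ratio that factorizes.  The 2x2
# "cross-ratio" test at the end is one way to detect a rank-one ratio matrix.
# All distributions below are illustrative assumptions.

pS = {(0, 0): 0.20, (0, 1): 0.10,
      (1, 0): 0.15, (1, 1): 0.25,
      (2, 0): 0.05, (2, 1): 0.25}
xs, ys = [0, 1, 2], [0, 1]

def marginal_x(p): return {x: sum(p[(x, y)] for y in ys) for x in xs}
def marginal_y(p): return {y: sum(p[(x, y)] for x in xs) for y in ys}

# Step 1: covariate shift -- impose a new X-marginal, keep p(y|x) fixed.
new_px = {0: 0.5, 1: 0.2, 2: 0.3}
px = marginal_x(pS)
p1 = {(x, y): new_px[x] / px[x] * pS[(x, y)] for (x, y) in pS}

# Step 2: label shift on the intermediate -- new Y-marginal, keep p(x|y) fixed.
new_py = {0: 0.6, 1: 0.4}
p1y = marginal_y(p1)
pT = {(x, y): new_py[y] / p1y[y] * p1[(x, y)] for (x, y) in p1}
assert abs(sum(pT.values()) - 1.0) < 1e-12

ratio = {xy: pT[xy] / pS[xy] for xy in pS}
# Factorizable (rank-one) iff all 2x2 cross-ratios agree:
for x in xs:
    for y in ys:
        assert abs(ratio[(x, y)] * ratio[(0, 0)]
                   - ratio[(x, 0)] * ratio[(0, y)]) < 1e-12
```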

3. Estimation Procedures and the Joint Importance Aligning Framework

Under FJS, importance weighting requires estimation of the joint ratio $w(x, y) = p_T(x, y)/p_S(x, y) = h(x)\, g(y)$. Two main strategies prevail:

3.1 Joint Importance Aligning (JIA)

The JIA estimator seeks $U, V$ such that $U(x)V(y) \approx p_T(x, y)/p_S(x, y)$. In the supervised (fully-labeled) setting, the JIA objective is:

\ell_{\rm sup}(U, V) = \mathbb{E}_{(x,y)\sim p_S}\big[\log\big(1 + U(x)V(y)\big)\big] + \mathbb{E}_{(x,y)\sim p_T}\left[\log\left(1 + \frac{1}{U(x)V(y)}\right)\right].

The unique minimizer satisfies $U^*(x)V^*(y) = p_T(x, y)/p_S(x, y)$ (He et al., 2022).
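This minimizer property can be checked exactly on a discrete problem, where both expectations are finite sums. The sketch below evaluates the supervised objective at the true factorized ratio and at perturbed candidates; all distributions and factors are illustrative:

```python
# Numerical check of the JIA supervised objective: evaluated exactly under
# p_S and p_T on a discrete grid, the objective is smaller at the true
# factorized ratio U(x)V(y) = p_T/p_S than at perturbed factorized candidates.
# The discrete setup and all numbers are illustrative assumptions.
import math

pS = {(0, 0): 0.20, (0, 1): 0.10,
      (1, 0): 0.15, (1, 1): 0.25,
      (2, 0): 0.05, (2, 1): 0.25}
h = {0: 1.0, 1: 2.0, 2: 0.5}
g = {0: 1.5, 1: 0.8}
Z = sum(h[x] * g[y] * p for (x, y), p in pS.items())
pT = {(x, y): h[x] * g[y] * p / Z for (x, y), p in pS.items()}

def jia_sup(U, V):
    """Exact JIA supervised objective on the discrete grid."""
    return sum(pS[(x, y)] * math.log(1 + U[x] * V[y]) +
               pT[(x, y)] * math.log(1 + 1 / (U[x] * V[y]))
               for (x, y) in pS)

U_star = {x: h[x] / Z for x in h}   # chosen so that U*V* = pT/pS exactly
V_star = dict(g)
best = jia_sup(U_star, V_star)

# Any factorized perturbation away from the true ratio increases the objective.
for bump in (0.5, 2.0):
    U_bad = dict(U_star)
    U_bad[0] *= bump
    assert jia_sup(U_bad, V_star) > best
```

The pointwise integrand $p_S \log(1+w) + p_T \log(1+1/w)$ is minimized at $w = p_T/p_S$, which is why the factorized truth wins.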

For the unsupervised case, observing only $x$, one optimizes:

\ell_{\rm unsup}(U, V) = \mathbb{E}_{x\sim p_S}\big[\log\big(1 + U(x)\widetilde V(x)\big)\big] + \mathbb{E}_{x\sim p_T}\left[\log\left(1 + \frac{1}{U(x)\widetilde V(x)}\right)\right],

with $\widetilde V(x) = \mathbb{E}_{y\sim p_S(y|x)}[V(y)]$. This constrains only the marginal ratio: $U(x)\widetilde V(x) = p_T(x)/p_S(x)$. To prevent trivial solutions ($V \equiv 1$), regularization is imposed, e.g., via a subdomain-clustering parameterization of $U(x)$ (He et al., 2022).

3.2 EM-Style Algorithms and Alternatives

For general (possibly continuous) label spaces, $g(y)$ may be estimated using an EM-like recursion. Starting from $g^{(0)}(y) \equiv 1$,

  • E-step: $q^{(n)}_{Y|X=x}(y) = g^{(n)}(y) \big/ \int g^{(n)}(z)\, P_{Y|X=x}(dz)$,
  • M-step: $g^{(n+1)}(y) = \int q^{(n)}_{Y|X=x}(y)\, h_X(x)\, P_{X|Y=y}(dx)$.

This procedure generalizes the Saerens–Jacobs EM algorithm for class priors to arbitrary label spaces and FJS (Tasche, 21 Jan 2026, 2207.14514).
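In the discrete-label, pure label-shift special case (covariate factor identically one), the recursion reduces to the classical Saerens–Jacobs prior-adjustment EM. A minimal population-level sketch, using exact rather than estimated quantities and illustrative numbers throughout:

```python
# Saerens-Jacobs-style prior-adjustment EM in the discrete, label-shift
# special case: iterate target-prior estimates until the implied posterior
# averages reproduce themselves.  Population (exact) quantities are used, so
# the iteration converges to the true target priors.  All numbers are
# illustrative assumptions.

xs, ys = [0, 1, 2], [0, 1]
pS = {(0, 0): 0.20, (0, 1): 0.10,
      (1, 0): 0.15, (1, 1): 0.25,
      (2, 0): 0.05, (2, 1): 0.25}
piS = {y: sum(pS[(x, y)] for x in xs) for y in ys}           # source priors
post_S = {(x, y): pS[(x, y)] / sum(pS[(x, z)] for z in ys)   # P_S(y | x)
          for (x, y) in pS}

# True target: pure label shift with priors piT (to be recovered by EM).
piT = {0: 0.7, 1: 0.3}
pT_x = {x: sum(piT[y] / piS[y] * pS[(x, y)] for y in ys) for x in xs}

pi = dict(piS)   # EM iterate, initialized at the source priors
for _ in range(5000):
    new = {y: 0.0 for y in ys}
    for x in xs:
        # E-step: reweighted posterior under the current prior estimate.
        denom = sum(pi[z] / piS[z] * post_S[(x, z)] for z in ys)
        for y in ys:
            new[y] += pT_x[x] * (pi[y] / piS[y] * post_S[(x, y)]) / denom
    pi = new     # M-step: average the adjusted posteriors under p_T(x)

assert all(abs(pi[y] - piT[y]) < 1e-6 for y in ys)
```

With finite target samples, the sum over $x$ weighted by $p_T(x)$ is replaced by an empirical average over unlabeled target features.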

3.3 Identifiability and Uniqueness

FJS is not fully identifiable from unlabeled test features alone unless additional assumptions are made. For $d = 2$, identifiability up to scale holds; for $d > 2$, extra normalization or external information (e.g., knowledge of target priors) is required (2207.14514).

4. Posterior Correction and Predictive Inference under FJS

FJS admits closed-form correction formulae for the posterior $p_T(y|x)$ under shift.

  • General distribution shift: with class-dependent factors $h_i(x)$, target priors $Q[Y=i]$, source priors $P[Y=i]$, and source posteriors $P[Y=i|x]$,

p_T(y|x) = \frac{h_y(x)\, \frac{Q[Y=y]}{P[Y=y]}\, P[Y=y|x]}{\sum_{j=1}^{d} h_j(x)\, \frac{Q[Y=j]}{P[Y=j]}\, P[Y=j|x]}.

  • FJS (factorizable case): for $h_y(x) \equiv g(x)$,

p_T(y|x) = \frac{b(y)\, P[Y=y|x]}{\sum_{j=1}^{d} b(j)\, P[Y=j|x]},

with $b(y) \propto Q[Y=y]/P[Y=y]$ up to scale (2207.14514, Tasche, 21 Jan 2026).
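A notable feature of the factorizable case is that the covariate factor cancels: the target posterior depends only on $b(y)$. The discrete sketch below, with illustrative numbers, verifies the correction formula against the posterior computed directly from the target joint:

```python
# Check of the FJS posterior-correction formula: the target posterior computed
# via b(y)-reweighting of the source posterior matches the posterior computed
# directly from the target joint, independently of the covariate factor g(x).
# Discrete toy setup; all numbers are illustrative assumptions.

xs, ys = [0, 1, 2], [0, 1]
pS = {(0, 0): 0.20, (0, 1): 0.10,
      (1, 0): 0.15, (1, 1): 0.25,
      (2, 0): 0.05, (2, 1): 0.25}
gx = {0: 1.0, 1: 2.0, 2: 0.5}   # covariate factor g(x)
by = {0: 1.5, 1: 0.8}           # label factor b(y)
Z = sum(gx[x] * by[y] * p for (x, y), p in pS.items())
pT = {(x, y): gx[x] * by[y] * p / Z for (x, y), p in pS.items()}

post_S = {(x, y): pS[(x, y)] / sum(pS[(x, z)] for z in ys) for (x, y) in pS}

for x in xs:
    denom = sum(by[j] * post_S[(x, j)] for j in ys)
    for y in ys:
        corrected = by[y] * post_S[(x, y)] / denom          # correction formula
        direct = pT[(x, y)] / sum(pT[(x, z)] for z in ys)   # from target joint
        assert abs(corrected - direct) < 1e-12
```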

For regression, the FJS-corrected regression function is

\mathbb{E}_Q[Y \mid X=x] = \frac{\int y\, g(y)\, P_{Y|X=x}(dy)}{\int g(y)\, P_{Y|X=x}(dy)}.

This correction governs both predictive mean and uncertainty when $Y$ is continuous (Tasche, 21 Jan 2026, He et al., 2022).
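The regression correction is the same $g(y)$-tilting applied to the conditional mean. A discrete-$Y$ sketch (with $Y$ taking the real values $-1$ and $+1$; all numbers illustrative) confirms it against the conditional mean computed directly under the target joint:

```python
# The FJS regression correction: E_Q[Y|X=x] is a g(y)-tilted average of Y
# under the source conditional P_{Y|X=x}.  Discrete-Y sketch; the joint,
# factors, and label values are illustrative assumptions.

xs, ys = [0, 1], [-1.0, 1.0]
pS = {(0, -1.0): 0.3, (0, 1.0): 0.2,
      (1, -1.0): 0.1, (1, 1.0): 0.4}
h = {0: 2.0, 1: 0.5}
g = {-1.0: 0.7, 1.0: 1.4}
Z = sum(h[x] * g[y] * p for (x, y), p in pS.items())
pT = {(x, y): h[x] * g[y] * p / Z for (x, y), p in pS.items()}

for x in xs:
    cond_S = {y: pS[(x, y)] / sum(pS[(x, z)] for z in ys) for y in ys}
    # g-tilted conditional mean under the source conditional:
    corrected = (sum(y * g[y] * cond_S[y] for y in ys) /
                 sum(g[y] * cond_S[y] for y in ys))
    # Direct conditional mean under the target joint:
    direct = (sum(y * pT[(x, y)] for y in ys) /
              sum(pT[(x, y)] for y in ys))
    assert abs(corrected - direct) < 1e-12
```

As in the classification case, $h(x)$ and the global normalization cancel inside the conditional expectation.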

5. Empirical Illustration and Applications

Synthetic experiments demonstrate the necessity of FJS over traditional shift models. In one example, the target distribution is uniform over a hexagon, while the source is biased toward certain subregions. The induced importance weights are piecewise-constant but factorize along orthogonal axes (e.g., income and health status), thus violating standard covariate, label, or GLS assumptions yet satisfying FJS (He et al., 2022).

Quantitative comparison of negative log-likelihood (NLL) on this dataset:

| Method | Target NLL |
|---|---|
| Target Only | $0.600 \pm 0.002$ |
| Source Only | $0.72 \pm 0.01$ |
| CS (SSBC) | $0.71 \pm 0.01$ |
| LS (BBSC) | $0.73 \pm 0.01$ |
| DANN | $0.70 \pm 0.01$ |
| IWDAN (GLS) | $0.69 \pm 0.01$ |
| JIADA (FJS) | $0.61 \pm 0.01$ |

The FJS-based JIADA method nearly matches target-only performance and significantly outperforms other importance-weighting schemes (He et al., 2022). Performance is only weakly sensitive to the choice of cluster number $K$ in the $U(x)$ parameterization.

6. Connections to Sample Selection Bias

FJS naturally models situations where sample selection occurs with a factorizable probability $\varphi(x, y) = g(x)\, b(y)$. In this context, the selected distribution $Q$ relates to the population $P$ via

\frac{dQ}{dP}(x, y) = \frac{g(x)\, b(y)}{P[S]},

where $P[S] = \mathbb{E}_P[\varphi(X, Y)]$ is a normalization constant (the overall selection probability). This yields explicit class-wise and point-wise selection-bias formulas and correction terms. In particular, one can recover the source (population) posterior via

P[Y=y|x] = \frac{b(y)^{-1}\, Q[Y=y|x]}{\sum_{j=1}^{d} b(j)^{-1}\, Q[Y=j|x]}.

Bounds and admissibility conditions for candidate FJS solutions are also available in this setting (2207.14514).
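The inversion formula can be verified on a discrete example: applying the $b(y)^{-1}$ reweighting to the selected-sample posterior recovers the population posterior exactly. The selection probabilities and distributions below are illustrative:

```python
# Inverting factorizable selection bias: with selection probability
# phi(x, y) = g(x) b(y) <= 1, the selected distribution Q is a g(x)b(y)
# reweighting of the population P, and the b(y)^{-1} correction recovers the
# population posterior.  All numbers are illustrative assumptions.

xs, ys = [0, 1, 2], [0, 1]
P = {(0, 0): 0.20, (0, 1): 0.10,
     (1, 0): 0.15, (1, 1): 0.25,
     (2, 0): 0.05, (2, 1): 0.25}
g = {0: 0.9, 1: 0.4, 2: 0.6}   # covariate selection factor, all <= 1
b = {0: 0.8, 1: 1.0}           # label selection factor, all <= 1
PS = sum(g[x] * b[y] * p for (x, y), p in P.items())   # P[S], selection prob.
Q = {(x, y): g[x] * b[y] * p / PS for (x, y), p in P.items()}

post_P = {(x, y): P[(x, y)] / sum(P[(x, z)] for z in ys) for (x, y) in P}
post_Q = {(x, y): Q[(x, y)] / sum(Q[(x, z)] for z in ys) for (x, y) in Q}

for x in xs:
    denom = sum(post_Q[(x, j)] / b[j] for j in ys)
    for y in ys:
        recovered = (post_Q[(x, y)] / b[y]) / denom
        assert abs(recovered - post_P[(x, y)]) < 1e-12
```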

7. Limitations and Future Directions

  • Non-uniqueness of parametrization: $h$ and $g$ are determined only up to multiplicative scaling; only their product is identifiable.
  • Rigidity of the factorization: The requirement that the joint density ratio factorizes may be unrealistic in some real-world domains, especially if $P$ and $Q$ have non-overlapping support or intricate dependencies.
  • Estimation challenges: Solving the consistency equations or the EM recursion for continuous labels can be numerically challenging, particularly in high dimensions (Tasche, 21 Jan 2026).
  • Open problems: Future work includes studying statistical rates for the generalized EM approach, flexible (e.g., nonparametric or normalizing flow-based) estimation for $h$ and $g$ under consistency constraints, extensions to "sparse joint shift" and other composite models, and systematic empirical validation on real data (Tasche, 21 Jan 2026).

FJS provides an analytically tractable and practically significant extension of domain adaptation methodology, enabling principled correction for joint covariate and label bias when the shifts are independent and multiplicative. Its theoretical framework, estimation algorithms, and correction formulae are foundational tools for modern transfer learning, especially in settings where both feature and label shifts are present and cannot be decoupled by simpler covariate- or label-shift-only models (He et al., 2022, 2207.14514, Tasche, 21 Jan 2026).
