f-Divergences: Definitions and Applications
- f-Divergences are statistical distances defined via a convex function, unifying multiple classical divergences such as KL, χ², and Jensen–Shannon.
- They obey properties like nonnegativity, joint convexity, and data-processing inequality, and admit variational dual forms that enable efficient estimation.
- Their framework extends to quantum states and operator algebras, influencing methods in hypothesis testing, generative modeling, and learning algorithms.
An f-divergence is a parametric class of statistical distances between probability distributions, parameterized by a convex function $f$ with $f(1) = 0$. This framework subsumes and unifies many classical divergences (Kullback–Leibler, $\chi^2$, Hellinger, total variation, Jensen–Shannon, and Rényi), and extends, via functional calculus, to quantum states and operator algebras. f-divergences have central roles in information theory, statistics, learning theory, optimization, hypothesis testing, quantum information, geometric analysis, and algorithmic diagnostics.
1. Formal Definition and Core Properties
Let $P$ and $Q$ be probability measures on a measurable space $(\mathcal{X}, \mathcal{F})$, and let $f : (0,\infty) \to \mathbb{R}$ be convex with $f(1) = 0$. The f-divergence of $P$ from $Q$ is
$$D_f(P \| Q) = \int_{\mathcal{X}} f\!\left(\frac{dP}{dQ}\right) dQ$$
when $P \ll Q$, with the standard conventions $f(0) = \lim_{t \downarrow 0} f(t)$ and $0 \cdot f(a/0) = a \lim_{t \to \infty} f(t)/t$ otherwise. In the discrete case, $D_f(P \| Q) = \sum_x q(x)\, f\!\big(p(x)/q(x)\big)$ (Harremoës et al., 2010, Masiha et al., 2022).
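The discrete formula translates directly into code. The sketch below (illustrative, with hypothetical helper names) evaluates the generic f-divergence for several canonical generators:

```python
import math

def f_divergence(p, q, f):
    """Discrete f-divergence: D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)).
    Assumes q(x) > 0 everywhere, so no absolute-continuity conventions are needed."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

# Canonical generators from the definition above (names are illustrative).
kl_gen   = lambda t: t * math.log(t)      # Kullback-Leibler
chi2_gen = lambda t: (t - 1.0) ** 2       # Pearson chi-squared
tv_gen   = lambda t: 0.5 * abs(t - 1.0)   # total variation

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

d_kl, d_chi2, d_tv = (f_divergence(p, q, g) for g in (kl_gen, chi2_gen, tv_gen))
```

With these distributions, the total variation value equals half the $L^1$ distance, as the generator choice guarantees.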
Key properties:
- Nonnegativity and equality: $D_f(P\|Q) \ge 0$, with equality iff $P = Q$ under mild regularity (e.g., $f$ strictly convex at $1$).
- Joint convexity: $(P, Q) \mapsto D_f(P\|Q)$ is jointly convex.
- Data-processing inequality: for any Markov kernel (stochastic map) $W$, $D_f(PW \| QW) \le D_f(P\|Q)$. This generalizes to quantum channels for operator-convex $f$ (Hiai et al., 2010).
- Dual formulation: $D_f(P\|Q) = \sup_g \big[\mathbb{E}_P[g] - \mathbb{E}_Q[f^*(g)]\big]$, where $f^*$ is the convex conjugate of $f$ (Shannon, 2020).
- Special cases (canonical choices and resulting divergences):
| Name | $f(t)$ | Divergence |
|---|---|---|
| Relative entropy | $t \log t$ | Kullback–Leibler |
| Total variation | $\tfrac{1}{2}\lvert t - 1 \rvert$ | Total variation distance |
| Pearson $\chi^2$ | $(t - 1)^2$ | $\chi^2$-divergence |
| Squared Hellinger | $(\sqrt{t} - 1)^2$ | Hellinger distance |
| Jensen–Shannon | $\tfrac{t}{2}\log\tfrac{2t}{1+t} + \tfrac{1}{2}\log\tfrac{2}{1+t}$ | Jensen–Shannon divergence |
| Hellinger integral (order $\alpha$) | $\tfrac{t^\alpha - 1}{\alpha(\alpha - 1)}$ | Rényi (via a monotone transform) |
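The generators in the table can be sanity-checked against the generic discrete formula; the self-contained sketch below verifies that the Jensen–Shannon generator reproduces the mixture-of-KL form and that the squared-Hellinger generator reproduces the direct sum:

```python
import math

def f_div(p, q, f):
    # Generic discrete f-divergence: sum_x q(x) f(p(x)/q(x)).
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def js_gen(t):
    # Jensen-Shannon generator: (t/2) log(2t/(1+t)) + (1/2) log(2/(1+t)).
    return 0.5 * t * math.log(2 * t / (1 + t)) + 0.5 * math.log(2 / (1 + t))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
m = [(a + b) / 2 for a, b in zip(p, q)]   # the mixture (P + Q)/2

kl_gen = lambda t: t * math.log(t)
js_via_table   = f_div(p, q, js_gen)
js_via_mixture = 0.5 * f_div(p, m, kl_gen) + 0.5 * f_div(q, m, kl_gen)

hell_gen   = lambda t: (math.sqrt(t) - 1) ** 2
h2_via_table = f_div(p, q, hell_gen)
h2_direct    = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
```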
These properties extend naturally to infinite-dimensional and quantum generalizations under additional assumptions (Hiai et al., 2010, Matsumoto, 2013).
2. Inequalities and Sharp Bounds Between f-Divergences
A unifying principle is that the joint range of two f-divergences over all pairs $(P, Q)$ is the convex hull of the joint range attained on two-point spaces. Consequently, all extremal inequalities between any pair of f-divergences are determined by binary distributions (Harremoës et al., 2010, Guntuboyina et al., 2013).
- Sharp inequalities: for any pair of f-divergences, optimal comparison inequalities with explicit universal constants hold under mild conditions. Examples include:
- Pinsker’s inequality: $\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \| Q)}$ (Harremoës et al., 2010, 0903.1765)
- Hellinger–TV bounds: $\tfrac{1}{2} H^2(P, Q) \le \mathrm{TV}(P, Q) \le H(P, Q)\sqrt{1 - \tfrac{1}{4} H^2(P, Q)}$
The exact tradeoff curves, as well as minimax and maximin relationships between divergences subject to constraints on others, reduce in general to finite-dimensional optimization over distributions supported on two points (or on finitely many points when additional divergence constraints are imposed) (Guntuboyina et al., 2013).
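Because extremal inequalities are determined by binary distributions, they are easy to probe numerically. The sketch below samples random two-point distributions and checks Pinsker's inequality and the Hellinger–TV sandwich, using the conventions $\mathrm{TV} = \tfrac{1}{2}\sum|p - q|$ and $H^2 = \sum(\sqrt{p} - \sqrt{q})^2$:

```python
import math, random

def tv(p, q):   return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
def kl(p, q):   return sum(a * math.log(a / b) for a, b in zip(p, q))
def hell(p, q): return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                                     for a, b in zip(p, q)))

random.seed(0)
violations = 0
for _ in range(10000):
    a, b = random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)
    P, Q = [a, 1 - a], [b, 1 - b]
    d, D, H = tv(P, Q), kl(P, Q), hell(P, Q)
    if not d <= math.sqrt(0.5 * D) + 1e-9:           violations += 1  # Pinsker
    if not H * H / 2 <= d + 1e-9:                    violations += 1  # Hellinger lower bound
    if not d <= H * math.sqrt(1 - H * H / 4) + 1e-9: violations += 1  # Le Cam upper bound
```

No violations occur, consistent with the sharp bounds above; the upper Hellinger–TV bound is attained with equality for well-separated binary pairs.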
3. Variational Representations and Duality
f-divergences admit a Fenchel dual (convex conjugate) formulation, foundational for both theoretical results and practical estimation algorithms (Shannon, 2020, Im et al., 2018). The Fenchel conjugate is $f^*(y) = \sup_{t > 0} \{ ty - f(t) \}$.
- Dual form:
$$D_f(P \| Q) = \sup_{g : \mathcal{X} \to \mathbb{R}} \Big[ \mathbb{E}_P[g(X)] - \mathbb{E}_Q[f^*(g(X))] \Big]$$
This underpins divergence estimation via adversarial training, the f-GAN variational framework, and the derivation of gradient expressions in learning (Shannon, 2020, Im et al., 2018, Leadbeater et al., 2021).
- Gradient-matching property: When the "critic" or test function is optimal, the gradient of the variational lower bound with respect to a parameterized model matches the true gradient of the -divergence (Shannon, 2020).
- Second-order local equivalence: for nearby distributions, all f-divergences agree to second order up to the scalar $f''(1)$, recovering the Fisher information metric (Shannon, 2020). That is, for a smooth parametric family $\{P_\theta\}$,
$$D_f(P_\theta \,\|\, P_{\theta + \delta}) = \frac{f''(1)}{2}\, \delta^\top I(\theta)\, \delta + o(\|\delta\|^2).$$
This justifies the use of -divergences as local metrics in information geometry (Nishiyama, 2018).
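To make the dual form concrete, the sketch below evaluates the variational objective for KL, where $f(t) = t\log t$ gives $f^*(y) = e^{y-1}$, and checks that the optimal critic $g = 1 + \log(p/q)$ attains the divergence while perturbed critics give strict lower bounds (illustrative code, not taken from the cited works):

```python
import math

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# For f(t) = t log t, the convex conjugate is f*(y) = exp(y - 1).
fstar = lambda y: math.exp(y - 1.0)

def dual_value(g):
    """Variational objective E_P[g] - E_Q[f*(g)]: a lower bound on KL(P||Q)
    for every critic g, tight at g = 1 + log(p/q)."""
    return (sum(px * gx for px, gx in zip(p, g))
            - sum(qx * fstar(gx) for qx, gx in zip(q, g)))

kl_true = sum(px * math.log(px / qx) for px, qx in zip(p, q))
g_opt   = [1.0 + math.log(px / qx) for px, qx in zip(p, q)]

bound_opt    = dual_value(g_opt)                 # attains KL exactly
bound_subopt = dual_value([0.9 * x for x in g_opt])  # any other critic lower-bounds it
```

This is the mechanism behind f-GAN training: the critic is parameterized by a network and the lower bound is maximized by gradient ascent.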
4. Quantum f-Divergences and Non-Commutative Generalizations
Classical f-divergences admit several quantum analogues, most notably:
- Petz quantum f-divergence (quasi-entropy): for density operators $\rho$, $\sigma$ on a Hilbert space $\mathcal{H}$,
$$D_f(\rho \| \sigma) = \big\langle \sigma^{1/2},\, f(\Delta_{\rho,\sigma})\, \sigma^{1/2} \big\rangle_{\mathrm{HS}},$$
where $\Delta_{\rho,\sigma} = L_\rho R_\sigma^{-1}$ is the relative modular operator (left multiplication by $\rho$ composed with right multiplication by $\sigma^{-1}$) and $\langle \cdot, \cdot \rangle_{\mathrm{HS}}$ is the Hilbert–Schmidt inner product.
Fundamental properties:
- Monotonicity under CPTP (quantum channel) maps: if $f$ is operator-convex, then $D_f(\Phi(\rho) \| \Phi(\sigma)) \le D_f(\rho \| \sigma)$.
- Equality case and Petz recovery: equality for a single non-linear operator-convex $f$ entails the existence of a recovery channel reversing $\Phi$ on both states (Hiai et al., 2010, Hiai et al., 2016).
- Maximal quantum f-divergence (Matsumoto): the largest operationally justifiable quantum extension of $D_f$, satisfying data-processing for all positive trace-preserving maps and reducing to the classical $D_f$ on commuting operators (Matsumoto, 2013, Hiai et al., 2016).
- Measured/minimal quantum f-divergence: the supremum of the classical f-divergence between measurement outcome distributions, taken over all measurements (POVMs).
- Sandwiched and $\alpha$-$z$ Rényi divergences: these generalize the classical Rényi divergence and interpolate between various quantum divergences depending on the parameter regime.
- Multi-state quantum f-divergences and their monotonicity correspond to generalizations via Tomita–Takesaki modular theory and Kubo–Ando operator means (Furuya et al., 2021).
Quantum f-divergences play central roles in quantum hypothesis testing, error correction, channel discrimination, and operational resource theories (Hiai et al., 2010, Matsumoto, 2013, Beigi et al., 7 Jan 2025).
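A minimal numerical illustration of monotonicity, using the Umegaki relative entropy $\mathrm{Tr}[\rho(\log\rho - \log\sigma)]$ (the Petz f-divergence for $f(t) = t\log t$) and a depolarizing channel; the states and channel parameter below are arbitrary examples:

```python
import numpy as np

def qrel_entropy(rho, sigma):
    """Umegaki relative entropy Tr[rho (log rho - log sigma)],
    assuming both density matrices are full rank."""
    def logm(A):
        # Matrix logarithm of a Hermitian positive-definite matrix via eigendecomposition.
        w, V = np.linalg.eigh(A)
        return V @ np.diag(np.log(w)) @ V.conj().T
    return float(np.real(np.trace(rho @ (logm(rho) - logm(sigma)))))

rho   = np.array([[0.7, 0.2], [0.2, 0.3]])   # arbitrary full-rank qubit state
sigma = np.eye(2) / 2                        # maximally mixed state

def depolarize(r, lam=0.5):
    """Depolarizing channel: a CPTP map mixing with the maximally mixed state."""
    return (1 - lam) * r + lam * np.eye(2) / 2

d_before = qrel_entropy(rho, sigma)
d_after  = qrel_entropy(depolarize(rho), depolarize(sigma))
# Data-processing: d_after <= d_before.
```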
5. Estimation, Statistical Limits, and Applications
Statistical Estimation
- Estimation: while nonparametric estimation of f-divergences is subject to slow (worse-than-parametric) rates without further structure, modern representation learning setups (e.g., variational autoencoders, latent variable models) enable estimators achieving parametric rates via "random mixture" (RAM) or importance-weighted Monte Carlo approximations (Rubenstein et al., 2019).
- Limit theory: the asymptotic distribution of empirical f-divergence estimators is governed by the functional delta method and Hadamard differentiability, yielding explicit limiting distributions under weak regularity conditions (Sreekumar et al., 2022).
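As a baseline illustration of empirical estimation, the toy plug-in estimator below computes KL from samples on a finite alphabet, with add-one smoothing (a standard but here arbitrary regularization) to keep the empirical ratios finite. This is a sketch of the plug-in baseline, not the RAM estimator of the cited paper:

```python
import math, random

def plugin_kl(xs, ys, k):
    """Plug-in KL estimator on alphabet {0,...,k-1} with add-one (Laplace)
    smoothing applied to both empirical distributions."""
    def freq(zs):
        c = [1.0] * k            # add-one smoothing
        for z in zs:
            c[z] += 1.0
        tot = sum(c)
        return [x / tot for x in c]
    p, q = freq(xs), freq(ys)
    return sum(a * math.log(a / b) for a, b in zip(p, q))

random.seed(1)
P, Q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
true_kl = sum(a * math.log(a / b) for a, b in zip(P, Q))

n = 100_000
est = plugin_kl(random.choices(range(3), weights=P, k=n),
                random.choices(range(3), weights=Q, k=n), 3)
err = abs(est - true_kl)
```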
Applications
- Hypothesis testing: f-divergences control error exponents and tight bounds in binary and multi-hypothesis testing, often yielding sharp rate constraints (Sanov-type bounds, Chernoff error, Pinsker-type inequalities) (Masiha et al., 2022, Beigi et al., 7 Jan 2025).
- Lossy compression and learning theory: mutual f-information yields generalized rate-distortion functions and improved generalization error bounds, especially via super-modular f-divergences (Masiha et al., 2022).
- Generative modeling: GANs (notably f-GAN), variational autoencoders, quantum generative models, and dimension reduction techniques (e.g., f-SNE/t-SNE) all exploit f-divergences and their duals for robust optimization (Im et al., 2018, Leadbeater et al., 2021).
6. Extensions, Mixed f-Divergences, and Geometric Aspects
- Mixed f-divergences: generalize to joint measurement of differences across multiple pairs of probability measures or log-concave functions, yielding vectorized divergence inequalities of Alexandrov–Fenchel and isoperimetric type. This extends classical information-theoretic bounds to affine-invariant settings and convex geometry (Caglar et al., 2014).
- Generalized Bregman geometries: many f-divergences can be embedded into the Bregman divergence framework via appropriate reparameterizations, inheriting geometric properties such as explicit centroids, projection algorithms, and centroidal clustering (Nishiyama, 2018).
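As a concrete instance of the Bregman embedding, KL divergence arises as the Bregman divergence of negative entropy; the short check below (a sketch, with hypothetical helper names) verifies this on a discrete example:

```python
import math

def bregman(phi, grad_phi, x, y):
    """Bregman divergence B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return (phi(x) - phi(y)
            - sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y)))

# Negative entropy on the simplex; its Bregman divergence is exactly KL.
neg_ent = lambda p: sum(a * math.log(a) for a in p)
grad_ne = lambda p: [math.log(a) + 1.0 for a in p]

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

b_val  = bregman(neg_ent, grad_ne, p, q)
kl_val = sum(a * math.log(a / b) for a, b in zip(p, q))
```

The identity holds because the first-order terms $\sum(p_x - q_x)$ cancel on the simplex, leaving $\sum p_x \log(p_x/q_x)$.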
- Diagnostic tools: coupling-based diagnostics for Markov chain Monte Carlo convergence based on f-divergences are computable, provide provable monotonic upper bounds, and converge to zero as the chains mix (Corenflos et al., 8 Oct 2025).
7. Future Directions and Open Problems
Research continues into:
- New quantum f-divergence representations: integral definitions via quantum hockey-stick divergences have enabled strengthened trace inequalities, monotonicity and channel contraction coefficients, and new proof techniques for the achievability of quantum hypothesis testing bounds (Beigi et al., 7 Jan 2025).
- Operational interpretations: multi-state and generalized quantum f-divergences are conjectured to provide optimal error exponents in asymmetric hypothesis testing and resource protocols (Furuya et al., 2021).
- Refined estimation and limit theory: developing general trace formulas for quantum f-divergences, tightening concentration inequalities, and supporting practical computation in high-dimensional settings (Rubenstein et al., 2019, Sreekumar et al., 2022, Beigi et al., 7 Jan 2025).
- Algorithmic exploitation: Customized variational forms, divergence-switching, and local divergence constraints are active topics in both classical and quantum learning algorithm development (Leadbeater et al., 2021, Shannon, 2020).
f-divergences thus constitute a central unifying pillar in both classical and quantum information theory, offering a flexible, sharp, and operationally meaningful framework for quantifying distributional discrepancies across a wide spectrum of mathematical, algorithmic, and physical theories.