
Deviation Inequalities for Self-Normalized Averages

Updated 25 January 2026
  • The paper introduces sharp large-deviation asymptotics and moderate deviation theorems for self-normalized averages under minimal moment conditions.
  • It employs advanced probabilistic techniques including exponential supermartingales, PAC-Bayesian methods, and saddlepoint approximations to establish dimension-free concentration bounds.
  • The methodology has broad applications in statistical inference, high-dimensional data analysis, adaptive policy learning, and robust hypothesis testing.

Deviation inequalities for self-normalized averages govern tail and moderate deviation behaviors of random sums normalized by random scale or variance proxies, encompassing both classical Student-type statistics and complex self-normalizations relevant for dependent, heavy-tailed, or vector-valued data. This theory combines sharp large-deviation asymptotics, optimal-moment-based moderate deviation theorems, and dimension-free concentration bounds under minimal conditions, providing a unified probabilistic toolkit with direct statistical and algorithmic applications.

1. Definitions and Self-Normalized Statistics

Let $(X_1, X_2, \dots)$ be independent or weakly dependent real random variables. The classical self-normalized sum and its generalizations take the form

$$W_n = \frac{S_n}{V_n}, \qquad S_n = \sum_{i=1}^n X_i, \quad V_n = \biggl(\sum_{i=1}^n |X_i|^p\biggr)^{1/p} \quad (p > 1)$$

or, for multidimensional/vector-valued $(X_i)$,

$$S_n \in \mathbb{R}^d, \qquad V_n = \sum_{i=1}^n X_i X_i^\top$$

with normalization in Mahalanobis or operator norm. More general forms involve convex scale functions $u(x)$, empirical variances, or adaptive normalizers in block schemes for dependent data.

Scaled by $n^{1-1/p}$, the statistic $W_{n,p} = S_n / (n^{1-1/p} V_n)$ satisfies $|W_{n,p}| \le 1$ via Hölder's inequality, and its deviation properties rely only on mild exponential-moment or finite $p$-th-moment conditions on $(X_i)$ or $(X_i, |X_i|^p)$ (Borovkov, 21 Jan 2025).
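As a minimal computational sketch (my own illustration, not the paper's code), the statistic and its Hölder bound can be checked directly; the Cauchy sample is an arbitrary heavy-tailed choice:

```python
# Compute the scaled self-normalized statistic W_{n,p} and verify the
# Hölder bound |W_{n,p}| <= 1 numerically (illustrative sketch).
import numpy as np

def self_normalized_stat(x: np.ndarray, p: float = 2.0) -> float:
    """Return W_{n,p} = S_n / (n^{1-1/p} V_n) for a sample x."""
    n = len(x)
    s_n = x.sum()                                # S_n: plain sum
    v_n = (np.abs(x) ** p).sum() ** (1 / p)      # V_n = (sum |X_i|^p)^{1/p}
    return s_n / (n ** (1 - 1 / p) * v_n)

rng = np.random.default_rng(0)
x = rng.standard_cauchy(1000)    # heavy tails: no moment assumptions needed here
w = self_normalized_stat(x)
assert abs(w) <= 1.0             # Hölder: |S_n| <= n^{1-1/p} V_n
print(f"W_n,2 = {w:.4f}")
```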

2. Large-Deviation Asymptotics

The large-deviation regime for self-normalized averages was reduced to a bivariate random walk problem (Borovkov, 21 Jan 2025). For i.i.d. $(X_j)$, define the bivariate sum $Z_n = \sum_{j=1}^n (X_j, |X_j|^p)$. The key rate function $I(x)$ arises from the Legendre–Fenchel transform of the cumulant generating function:

$$A(\theta) = \ln \mathbb{E}\exp\{\theta_1 X + \theta_2 |X|^p\}, \qquad I(x) = \sup_\theta \{\langle \theta, x\rangle - A(\theta)\}.$$

For a normalized threshold $z$, consider the admissible wedge $B_z = \{(x_1, x_2) : x_1 \geq z\, x_2^{1/p},\ x_2 \geq 0\}$. Then,

$$\lim_{n\to\infty} \frac{1}{n} \log \mathbb{P}(W_{n,p} \geq z) = -\inf_{x\in B_z} I(x).$$

The boundary calculus yields the extremal point, giving the explicit rate via a Shao-type formula:

$$I_z = -\sup_{c \geq 0} \inf_{t \geq 0} \log \mathbb{E}\exp\bigl\{ t\bigl[c X - (z/p)\bigl(|X|^p + (p-1)\, c^{p/(p-1)}\bigr)\bigr] \bigr\}.$$

(The sign convention makes $I_z \geq 0$, consistent with the limit above.) In the non-degenerate Cramér case, exact non-logarithmic tail asymptotics are available:

$$\mathbb{P}(W_{n,p} \geq z) = C(z)\, n^{-1/2} \exp(-n I_z)\,[1 + o(1)] \quad \text{as } n \to \infty,$$

with $C(z)$ expressed via the normal direction, curvature, and covariance under the tilted law.
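As a hedged numerical illustration (mine, not the paper's), the Shao-type variational formula can be evaluated by grid search; for $p = 2$ and $X \sim N(0,1)$ the inner expectation is a Gaussian integral with a closed form, so $I_z$ reduces to a two-dimensional optimization:

```python
# Grid-search sketch for the Shao-type rate I_z, X ~ N(0,1), p = 2.
# For Gaussian X: E exp(aX - bX^2) = exp(a^2 / (2(2b+1))) / sqrt(2b+1), b >= 0.
import numpy as np

def log_mgf(t, c, z):
    """log E exp{ t [ cX - (z/2)(X^2 + c^2) ] } for X ~ N(0,1)."""
    a, b = t * c, t * z / 2.0
    return a * a / (2 * (2 * b + 1)) - 0.5 * np.log(2 * b + 1) - t * z * c * c / 2.0

def rate_iz(z, cs=np.linspace(0, 5, 201), ts=np.linspace(0, 10, 401)):
    # I_z = - sup_{c >= 0} inf_{t >= 0} log-MGF, approximated on a grid;
    # t = 0 contributes 0, so the sup-inf is <= 0 and I_z >= 0.
    return -max(min(log_mgf(t, c, z) for t in ts) for c in cs)

for z in (0.2, 0.4, 0.6):
    print(f"z = {z:.1f}:  I_z ≈ {rate_iz(z):.4f}")
```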

Multivariate and general convex self-normalizers are handled identically, via embedding into higher-dimensional random walks and bivariate (or multivariate) large-deviation theory (Borovkov, 21 Jan 2025).

3. Moderate Deviations and Optimal Moment Conditions

Self-normalized moderate deviation theorems capture the accuracy of normal approximations for sums normalized by random standard deviations (Shao et al., 2014, Gao et al., 2021). For independent or weakly dependent $(X_i)$, the moderate deviation for $W_n = S_n/V_n$ is

$$\frac{\mathbb{P}(W_n \geq x)}{1 - \Phi(x)} = 1 + O\bigl((1+x)^3 L_{n,x}\bigr)$$

where $L_{n,x}$ encodes truncated third moments:

$$L_{n,x} = \sum_i \mathbb{E}\bigl[|X_i|^3 \mathbf{1}(|X_i| \leq 1/x) + X_i^2\, \mathbf{1}(|X_i| > 1/x)\bigr].$$

The optimal range for the normal approximation is $x = o(n^{1/6})$ under a finite third moment. For finite moments of order $p \in (2,3]$, the corresponding range is $x = o(n^{1/2-1/p})$, with $L_{n,x} = O(n\, \mathbb{E}[|X_1|^p]\, x^p)$ (Shao et al., 2014, Gao et al., 2021).
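A quick Monte Carlo sanity check (my own illustration, with standard normal data as an arbitrary choice) of the ratio $\mathbb{P}(W_n \geq x)/(1-\Phi(x))$ in the moderate range:

```python
# For i.i.d. N(0,1) data and n = 200, the self-normalized tail ratio should
# stay close to 1 for x well inside the range x = o(n^{1/6}).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps, chunk = 200, 200_000, 20_000
x_grid = np.array([1.0, 1.5, 2.0, 2.5])
hits = np.zeros_like(x_grid)

for _ in range(reps // chunk):
    xs = rng.standard_normal((chunk, n))
    w = xs.sum(axis=1) / np.sqrt((xs ** 2).sum(axis=1))   # W_n = S_n / V_n, p = 2
    hits += (w[:, None] >= x_grid).sum(axis=0)

for x, h in zip(x_grid, hits):
    print(f"x = {x:.1f}:  P(W_n >= x)/(1 - Phi(x)) ≈ {h / reps / norm.sf(x):.3f}")
```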

For weak dependence (e.g., geometric $\beta$-mixing or geometric-moment contraction (GMC)), block schemes such as big-block/small-block and interlacing schemes yield analogous results, with the dependence rate and moment assumptions determining the deviation range and error rate (Chen et al., 2014, Gao et al., 2021). Blockwise self-normalized statistics are robust to dependence, and interlacing schemes outperform big-block/small-block methods in finite samples.
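The following sketch, under assumed AR(1) dependence and illustrative block lengths (my choices, not the cited papers' tuning), shows how a big-block/small-block scheme turns a dependent sequence into nearly independent block sums that are then self-normalized:

```python
# Big-block/small-block self-normalization for an AR(1) sequence (sketch).
import numpy as np

def ar1(n, phi=0.5, rng=None):
    """Generate x_t = phi * x_{t-1} + e_t with standard normal innovations."""
    if rng is None:
        rng = np.random.default_rng(2)
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

n = 4096
big, small = 64, 8      # big blocks carry the signal; small blocks are
x = ar1(n)              # discarded to (approximately) decouple the big ones
period = big + small
starts = np.arange(0, n - period + 1, period)
block_sums = np.array([x[s:s + big].sum() for s in starts])

# Blockwise self-normalization: treat the nearly independent big-block sums
# as the new "observations".
w = block_sums.sum() / np.sqrt((block_sums ** 2).sum())
print(f"blockwise self-normalized statistic: {w:.3f}")
```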

4. Dimension-Free and Vector-Valued Concentration Inequalities

Modern deviation inequalities extend to vector-valued and infinite-dimensional settings with empirical or predictable quadratic-variation normalizers (Akhavan et al., 28 Jul 2025, Metelli et al., 3 Aug 2025, Ziemann, 2024, Martinez-Taboada et al., 5 Nov 2025, Whitehouse et al., 2023, Chugg et al., 8 Aug 2025). For a martingale sum $S_n = \sum_j Y_j X_j$ in a Hilbert space $\mathcal{H}$, normalized by the quadratic variation $\langle S \rangle_n$,

$$\mathbb{P}\left\{ \exists\, n: \bigl\|(\langle S \rangle_n + \rho^\star_n I)^{-1/2} S_n\bigr\| \geq \sqrt{2(\rho^\star_n + y + \iota_n)} + \frac{y + \iota_n}{3\sqrt{\rho^\star_n}} \right\} \leq e^{-y}$$

with explicit control via the Gaussian-width/information-gain term $\gamma(\rho^{-1} V_n)$. This removes ambient-dimension dependence, yielding bounds in terms of log-determinant and spectral quantities. For vector-valued self-normalized Bernstein-type inequalities, PAC-Bayesian analysis yields concentration bounds matching multivariate CLTs up to constants, with full adaptation to variance and block structure (Ziemann, 2024, Whitehouse et al., 2023, Chugg et al., 8 Aug 2025).
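To make the log-determinant dependence concrete, here is a hedged numerical sketch (my own illustration, not any cited paper's code): it computes the self-normalized statistic $\|(V_n + \rho I)^{-1/2} S_n\|$ for a simulated vector martingale and compares it with the classical Abbasi-Yadkori-style threshold $\sqrt{2\ln(1/\delta) + \ln\det(I + V_n/\rho)}$ for 1-sub-Gaussian noise, a simpler cousin of the refined bound displayed above.

```python
# Self-normalized vector statistic vs. a classical log-det threshold (sketch).
import numpy as np

rng = np.random.default_rng(3)
d, n, rho, delta = 20, 5000, 1.0, 0.05

S = np.zeros(d)
V = np.zeros((d, d))
for _ in range(n):
    x = rng.standard_normal(d) / np.sqrt(d)   # feature vector X_j
    y = rng.standard_normal()                 # 1-sub-Gaussian noise Y_j
    S += y * x                                # martingale sum S_n
    V += np.outer(x, x)                       # quadratic variation V_n

M = V + rho * np.eye(d)
stat = np.sqrt(S @ np.linalg.solve(M, S))     # ||(V_n + rho I)^{-1/2} S_n||
logdet = np.linalg.slogdet(np.eye(d) + V / rho)[1]
bound = np.sqrt(2 * np.log(1 / delta) + logdet)
print(f"statistic = {stat:.3f},  log-det threshold = {bound:.3f}")
```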

Bernstein- and Bennett-type vector self-normalized bounds extend Freedman's classical inequality to light-tailed, kernelized, or heteroscedastic settings, using empirical and predictable covariance structure (Metelli et al., 3 Aug 2025, Martinez-Taboada et al., 5 Nov 2025, Whitehouse et al., 2023). Empirical Bernstein bounds are available with sample-variance-only normalization, giving sharp, dimension-free, and data-adaptive concentration for means under both martingale and weakly mixing structures (Yuan, 1 Dec 2025).
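As one concrete instance of sample-variance-only normalization, the sketch below implements a classical empirical-Bernstein confidence interval with Maurer-Pontil-style constants for i.i.d. $[0,1]$-valued data; this is a fixed-$n$ illustration under assumed boundedness, not the time-uniform martingale bounds of the cited works.

```python
# Empirical-Bernstein CI using only the sample variance (illustrative sketch).
import numpy as np

def empirical_bernstein_ci(x: np.ndarray, delta: float = 0.05):
    """Two-sided CI for the mean of [0,1]-valued i.i.d. data."""
    n = len(x)
    mean = x.mean()
    var = x.var(ddof=1)                      # sample variance, the only normalizer
    log_term = np.log(2 / delta)
    half_width = np.sqrt(2 * var * log_term / n) + 7 * log_term / (3 * (n - 1))
    return mean - half_width, mean + half_width

rng = np.random.default_rng(4)
x = rng.beta(2, 5, size=2000)                # bounded data in [0,1]
lo, hi = empirical_bernstein_ci(x)
print(f"95% empirical-Bernstein CI: [{lo:.4f}, {hi:.4f}]  (true mean = {2/7:.4f})")
```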

5. Applications and Extensions

Self-normalized deviation inequalities underlie key statistical methodologies, including:

  • Student-type and studentized tests, where self-normalization removes nuisance scale parameters and tolerates heavy-tailed data
  • Time-uniform confidence sequences for sequential and anytime-valid inference, built via stitching/peeling arguments
  • Adaptive policy learning and online decision-making, where self-normalized bounds calibrate exploration
  • High-dimensional and kernel (RKHS) inference, using dimension-free, variance-adaptive bounds
  • Robust hypothesis testing under model uncertainty, via capacity/G-expectation formulations

6. Proof Techniques and Conceptual Advances

Deviation inequalities for self-normalized averages exploit exponential supermartingale constructions, PAC-Bayes/variational principles, stitching/peeling arguments, change-of-measure and saddlepoint approximations, blocking/coupling to isolate dependence, and explicit handling of remainder and quadratic variation terms (Borovkov, 21 Jan 2025, Shao et al., 2014, Ziemann, 2024, Chugg et al., 8 Aug 2025, Martinez-Taboada et al., 5 Nov 2025, Yuan, 1 Dec 2025). The theory emphasizes:

  • Tight error control in Cramér-type theorems via explicit error factors and optimal-moment truncation
  • Sharp moderate deviation range corresponding exactly to available moments (no exponential moments required)
  • Dimension-free, variance-adaptive bounds in high-dimensional and kernel (RKHS) settings
  • Applicability to mixing and dependent data via block schemes and coboundary martingale decompositions
  • Robustification to model uncertainty via G-expectation/capacity arguments
  • Uniform time/epoch concentration via stitching/peeling, suitable for time-uniform confidence sequences

These tools provide theoretical justification for statistical procedures in estimation, online learning, and adaptive policy optimization, matching normal approximations up to sharp constants and extending to settings with minimal distributional assumptions.
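A minimal worked instance of the exponential-supermartingale plus Ville route, written as a sketch for conditionally 1-sub-Gaussian martingale increments (a standard derivation, not any specific cited paper's argument):

```latex
% Assume conditionally 1-sub-Gaussian increments d_j of S_n = d_1 + ... + d_n.
\[
  \mathbb{E}\bigl[e^{\lambda d_j} \mid \mathcal{F}_{j-1}\bigr] \le e^{\lambda^2/2}
  \;\Longrightarrow\;
  M_n := \exp\!\Bigl(\lambda S_n - \tfrac{\lambda^2 n}{2}\Bigr)
  \text{ is a nonnegative supermartingale with } \mathbb{E}[M_0] = 1.
\]
\[
  \text{Ville: } \mathbb{P}\Bigl(\exists\, n : M_n \ge \tfrac{1}{\delta}\Bigr) \le \delta
  \;\Longrightarrow\;
  \forall n:\; S_n \le \frac{\lambda n}{2} + \frac{\log(1/\delta)}{\lambda}
  \quad \text{with probability} \ge 1 - \delta .
\]
% Optimizing \lambda at a target time n_0 gives \sqrt{2 n_0 \log(1/\delta)};
% stitching/peeling over geometric epochs then yields time-uniform bounds
% with iterated-logarithm overhead.
```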

7. Summary Table: Main Classes of Deviation Inequalities

Class/Setting | Main Deviation Bound | Assumptions
i.i.d. self-normalized | $\mathbb{P}(W_n \geq x)/(1-\Phi(x)) \to 1$ | $\mathbb{E}|X|^3 < \infty$; $x = o(n^{1/6})$
Vector, sub-Gaussian | $\|S_n\|_{V_n^{-1}}^2 \leq$ det-based terms | Conditional sub-Gaussianity
Bernstein/Bennett type | $\|S_n\|_{V_n^{-1}} \lesssim \sqrt{\mathrm{Tr}(V_n^2)\ln(1/\delta)} + B\ln(1/\delta)/3$ | Bounded variance; Bernstein condition
Block-dependent | $\mathbb{P}(W_n \geq x)/(1-\Phi(x)) = 1 + O(\cdot)$ | Absolute regularity; GMC mixing
Sample variance only | $|\bar Z_n - \mu| \leq \nu_n(\delta)\sqrt{2[V]_n \ln(1/\delta)/n^2}$ | Martingale differences; bounded increments
Capacity/G-expectation | $\lim \tfrac{1}{x_n^2} \ln \mathcal{V}\{S_n/V_n \geq x_n\} = -1/2$ | Sub-linear expectation; slow tail decay

Self-normalized deviation inequalities form a cornerstone of modern probability and statistics: they deliver theory for adaptive, robust, and dimension-free control of statistical risk under weak assumptions, serving as a technical backbone for methodologies ranging from classical hypothesis testing to contemporary high-dimensional and online learning frameworks (Borovkov, 21 Jan 2025, Shao et al., 2014, Gao et al., 2021, Ziemann, 2024, Girard et al., 17 Oct 2025, Whitehouse et al., 2023, Metelli et al., 3 Aug 2025, Yuan, 1 Dec 2025, Chen et al., 2014, Akhavan et al., 28 Jul 2025, Fan, 2016, Zhang, 2015, Martinez-Taboada et al., 5 Nov 2025, Chugg et al., 8 Aug 2025).
