Bit-Wise Rademacher Complexity
- Bit-wise Rademacher complexity is a measure of how well multi-bit classifiers can fit random noise, linking each bit's VC dimension to overall generalization capacity.
- The framework uses the ℓ∞-vector contraction theorem to obtain sharp bounds that scale as O(L√K · maxₖ Radₙ(𝓕|ₖ)) with only polylogarithmic overhead.
- It provides actionable insights for designing multi-label predictors, ensuring per-bit error rates decay at O(√(d/n)) even as the number of output bits increases.
Bit-wise Rademacher complexity characterizes the ability of classes of multiple binary predictors to fit random noise, providing a quantitative measure of the generalization capacity of multi-bit classification systems. In the context of vector-valued function classes mapping into $\mathbb{R}^K$, bit-wise Rademacher complexity enables sharp generalization bounds for predictors that simultaneously output $K$ binary labels, clarifying how complexity scales with the number of output bits and with properties such as the per-bit VC dimension.
1. Formal Definition and Notation
Given a fixed sample $x_{1:n} = (x_1, \dots, x_n) \in \mathcal{X}^n$, let $\mathcal{F} \subseteq \{ f : \mathcal{X} \to \mathbb{R}^K \}$ be a class of $K$-dimensional vector-valued predictors. For each $k \in \{1, \dots, K\}$, denote the $k$-th coordinate class as $\mathcal{F}|_k = \{ x \mapsto f_k(x) : f \in \mathcal{F} \}$, the set of all real-valued functions formed by projecting each $f \in \mathcal{F}$ onto its $k$-th coordinate. The empirical Rademacher complexity of a real-valued class $\mathcal{F}$ on $x_{1:n}$ is
$\Rad_n(\mathcal{F}; x_{1:n}) = \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t f(x_t) \right],$
where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher variables (uniform on $\{\pm 1\}$). The worst-case (global) Rademacher complexity is $\Rad_n(\mathcal{F}) = \sup_{x_{1:n} \in \mathcal{X}^n} \Rad_n(\mathcal{F}; x_{1:n})$. For function composition, given a sequence of real-valued functions $\phi_1, \dots, \phi_n : \mathbb{R}^K \to \mathbb{R}$, each $L$-Lipschitz with respect to the $\ell_\infty$ norm, the composed class $\phi \circ \mathcal{F} = \{ x_t \mapsto \phi_t(f(x_t)) : f \in \mathcal{F} \}$ can be analyzed via Rademacher complexity under specific contraction principles (Foster et al., 2019).
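For a finite class represented by its evaluation matrix on the fixed sample, the definition above can be estimated directly by Monte Carlo. The sketch below (function name and sizes are illustrative, not from the source) uses the normalized convention $\frac{1}{n}\sum_t \epsilon_t f(x_t)$:

```python
from itertools import product

import numpy as np

def empirical_rademacher(preds, n_draws=2_000, seed=0):
    """Monte Carlo estimate of Rad_n(F; x_{1:n}) = E_eps[ sup_f (1/n) sum_t eps_t f(x_t) ].

    preds is an (|F|, n) array whose row i holds (f_i(x_1), ..., f_i(x_n)):
    the finite class is identified with its evaluations on the fixed sample.
    """
    rng = np.random.default_rng(seed)
    n = preds.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))  # i.i.d. Rademacher signs
    # For each sign draw take the sup over the class, then average over draws.
    return ((eps @ preds.T).max(axis=1) / n).mean()

# A singleton class cannot fit noise: its complexity is 0 in expectation.
singleton = np.ones((1, 50))

# The class of all sign patterns on n = 10 points fits every noise realization,
# so the supremum equals 1 on every draw and the complexity is exactly 1.
all_signs = np.array(list(product([-1.0, 1.0], repeat=10)))
```

Here `empirical_rademacher(all_signs)` returns exactly 1.0, while `empirical_rademacher(singleton, n_draws=20_000)` is within Monte Carlo error of 0, matching the two extremes of noise-fitting ability.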
2. Vector Contraction and its Application
The principal advance is the $\ell_\infty$-vector-contraction theorem for Rademacher complexity. For a class $\mathcal{F}$ of predictors $f : \mathcal{X} \to \mathbb{R}^K$ with $\sup_x \|f(x)\|_\infty \le \beta$ and functions $\phi_t : \mathbb{R}^K \to \mathbb{R}$ that are $L$-Lipschitz with respect to the $\ell_\infty$ norm for all $t$, the following upper bound holds for any $\delta > 0$, for a constant $C = C_\delta$ depending only on $\delta$:
$\Rad_n\left( \{ (x_t) \mapsto \phi_t \circ f(x_t) \mid f \in \mathcal{F} \} \right) \le C L \sqrt{K} \left(\max_{1 \le k \le K} \Rad_n(\mathcal{F}|_k)\right) \cdot \log^{3/2+\delta}\left(\frac{\beta n}{\max_k \Rad_n(\mathcal{F}|_k)}\right).$
Neglecting polylogarithmic terms yields the clean statement:
$\Rad_n(\phi \circ \mathcal{F}) = O\left(L \sqrt{K} \cdot \max_k \Rad_n(\mathcal{F}|_k)\right).$
This result highlights the scaling behavior with respect to $K$ and shows that the Rademacher complexity of the multi-bit class, after Lipschitz transformation, is controlled by the maximal per-coordinate complexity, up to a $\sqrt{K}$ factor and logarithmic factors (Foster et al., 2019).
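The clean form of the bound can be sanity-checked by Monte Carlo on a small random finite class, taking $\phi$ to be the coordinate-wise maximum, which is $1$-Lipschitz with respect to $\ell_\infty$ (so $L = 1$). All names and sizes below are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def rad_mc(preds, eps):
    """Monte Carlo Rad_n from an (m, n) evaluation matrix and (draws, n) sign matrix."""
    return ((eps @ preds.T).max(axis=1) / eps.shape[1]).mean()

m, K, n = 64, 16, 100                 # |F|, number of output bits, sample size
F = rng.standard_normal((m, K, n))    # F[i, k, t] = k-th coordinate of f_i(x_t)
eps = rng.choice([-1.0, 1.0], size=(5_000, n))

# phi(u) = max_k u_k is 1-Lipschitz w.r.t. the l_infinity norm, so L = 1.
rad_composed = rad_mc(F.max(axis=1), eps)
rad_per_bit = max(rad_mc(F[:, k, :], eps) for k in range(K))

# Clean form of the contraction bound, constants and log factors dropped:
assert rad_composed <= np.sqrt(K) * rad_per_bit
```

On this example the composed complexity exceeds the largest per-coordinate complexity (composition with the max genuinely inflates noise-fitting ability) yet stays well below the $\sqrt{K}$-inflated bound.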
3. Specialization to Bit-Wise Classifiers
For bit-wise classifiers, each coordinate class $\mathcal{F}|_k$ consists of binary ($\{\pm 1\}$-valued) predictors. The standard VC/Rademacher theory gives
$\Rad_n(\mathcal{F}|_k) \le \sqrt{ \frac{ 2 \, \VCdim(\mathcal{F}|_k) \ln(e n / \VCdim(\mathcal{F}|_k)) }{ n } } \;\; {=: \rho_k }.$
This leads to the bound
$\Rad_n(\phi \circ \mathcal{F}) \le C L \sqrt{K} \max_k \rho_k \; \ln^{3/2 + \delta} \left( \frac{n}{\max_k \rho_k} \right).$
If $\VCdim(\mathcal{F}|_k) \le d$ for all $k$, then, ignoring log factors:
$\Rad_n(\phi \circ \mathcal{F}) = O\left( L \sqrt{ \frac{K d}{n} } \right).$
This yields generalization guarantees for the average 0–1 bit-wise loss: with probability at least $1 - \gamma$, the population average of the bit-wise errors exceeds its empirical counterpart by at most $2 \Rad_n(\phi \circ \mathcal{F}) + O(\sqrt{\ln(1/\gamma)/n}) = \tilde{O}(L \sqrt{Kd/n})$.
Even though there are $K$ bits, all of them generalize simultaneously at rate $\tilde{O}(\sqrt{d/n})$, implying that the per-bit error rate decays as $\tilde{O}(\sqrt{d/n})$ even as the number of output bits grows (Foster et al., 2019).
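Plugging the VC-based estimate into the clean contraction bound gives concrete numbers. A minimal sketch (function names are ours; constants and log factors hidden by $\tilde{O}$ are dropped):

```python
import numpy as np

def vc_rad_bound(d, n):
    """VC-based bound on Rad_n for a binary class: rho = sqrt(2 d ln(e n / d) / n)."""
    return np.sqrt(2.0 * d * np.log(np.e * n / d) / n)

def composed_bound(L, K, d, n):
    """Clean form L * sqrt(K) * max_k rho_k of the contraction bound (logs dropped)."""
    return L * np.sqrt(K) * vc_rad_bound(d, n)

# The per-bit rate depends on d and n but not on K; the composed bound pays sqrt(K).
for n in (10_000, 40_000, 160_000):
    print(f"n={n:6d}  per-bit={vc_rad_bound(10, n):.4f}  "
          f"K=64 composed={composed_bound(1.0, 64, 10, n):.4f}")
```

Quadrupling $n$ roughly halves both rates (up to the logarithmic factor), while moving from one bit to $K = 64$ bits inflates the composed bound only by $\sqrt{64} = 8$.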
4. Proof Outline and Vector Contraction Mechanism
The proof strategy proceeds through the following critical methodological elements:
- Dudley chaining bound: The classic Dudley entropy integral provides an upper bound in terms of empirical covering numbers:
$\Rad_n(\phi \circ \mathcal{F}; x_{1:n}) \le \inf_{0 < \alpha < 1} \left\{ 4 \alpha + \frac{12}{\sqrt{n}} \int_\alpha^1 \sqrt{ \ln \mathcal{N}_2( \phi \circ \mathcal{F}, \epsilon, x_{1:n} ) } \, d\epsilon \right\}.$
- Exploiting the $\ell_\infty$-Lipschitz property: By the Lipschitz condition, an $\epsilon$-covering of the composed class is induced by an $(\epsilon/L)$-covering of the vector-valued class in the (stronger) empirical $\ell_\infty$ metric, leading to $\mathcal{N}_2(\phi \circ \mathcal{F}, \epsilon, x_{1:n}) \le \mathcal{N}_\infty(\mathcal{F}, \epsilon/L, x_{1:n})$.
- Combinatorial covering and fat-shattering: The covering number is then bounded using the Rudelson–Vershynin inequality in terms of the fat-shattering dimension, and standard results relate fat-shattering to Rademacher complexity. Together, this machinery produces the $\sqrt{K} \max_k \Rad_n(\mathcal{F}|_k)$ dependence, with only polylogarithmic slack in $n / \max_k \Rad_n(\mathcal{F}|_k)$ (Foster et al., 2019).
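The first chaining step above can be made concrete numerically. The sketch below evaluates the Dudley bound over a grid of cutoffs $\alpha$, under the illustrative assumption (ours, not from the source) of a VC-type covering number $\ln \mathcal{N}_2(\epsilon) \approx d \ln(2/\epsilon)$, with the integral replaced by the trapezoid rule:

```python
import numpy as np

def dudley_bound(ln_cover, n, n_grid=2_000):
    """Evaluate inf_alpha { 4 alpha + (12/sqrt(n)) int_alpha^1 sqrt(ln N(eps)) d eps }
    on a grid of alpha values, approximating the integral by the trapezoid rule."""
    grid = np.linspace(1e-4, 1.0, n_grid)
    vals = np.sqrt(np.maximum(ln_cover(grid), 0.0))
    seg = 0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)          # trapezoid segments
    tail = np.concatenate([np.cumsum(seg[::-1])[::-1], [0.0]])  # int_{grid[i]}^1
    return float((4.0 * grid + (12.0 / np.sqrt(n)) * tail).min())

# VC-type covering number with d = 10 (an illustrative assumption).
d = 10
bound_small = dudley_bound(lambda e: d * np.log(2.0 / e), n=10_000)
bound_large = dudley_bound(lambda e: d * np.log(2.0 / e), n=160_000)
```

Only the entropy-integral term carries the $1/\sqrt{n}$ factor, so the bound shrinks as the sample grows; the $4\alpha$ discretization term is what the infimum over $\alpha$ trades off against.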
5. Tightness, Constants, and Lower Bounds
The hidden constant $C$ in the contraction bound depends only on the auxiliary parameter $\delta > 0$ introduced in the covering-number argument. Importantly, $C$ does not scale with $n$ or $K$. The polylogarithmic term arises from the Dudley integral and the Rudelson–Vershynin bound. In applications, the complexity is commonly written as $\tilde{O}( L \sqrt{K} \max_k \Rad_n(\mathcal{F}|_k) )$, suppressing polylogarithmic factors.
A matching lower bound (Proposition 1 of Foster et al., 2019) establishes that the $\sqrt{K}$ dependence is unavoidable if control over the worst-case Rademacher complexity is required. For instance, with $1$-Lipschitz maps $\phi_t$ and a specifically constructed class $\mathcal{F}$, the lower bound
$\Rad_n(\phi \circ \mathcal{F}; x_{1:n}) \ge \frac{ \sqrt{K} }{ 2 } \max_{k} \Rad_n(\mathcal{F}|_k; x_{1:n})$
holds. On the other hand, Maurer's $\ell_2$-vector-contraction yields a factor of $K$ in the worst case (through a sum over coordinates, bounded by $K \max_k \Rad_n(\mathcal{F}|_k)$) but applies directly to the empirical Rademacher complexity. There is thus an inherent trade-off between the sharpness of the $K$-dependence and whether the bound concerns empirical or worst-case Rademacher complexity. The $\ell_\infty$-vector-contraction provided by Foster & Rakhlin is the first near-optimal contraction bound for $\ell_\infty$-Lipschitz functions, exhibiting only polylogarithmic overhead in the sample size (Foster et al., 2019).
6. Implications and Broader Significance
The bit-wise Rademacher complexity framework establishes principled generalization guarantees for multi-label binary classification and related vector prediction problems. A plausible implication is that even as the number of output bits increases, global generalization rates are optimally controlled—up to $\sqrt{K}$ and polylogarithmic factors—by the hardest bit-wise subproblem. This enables the design and analysis of high-dimensional predictors (e.g., in multi-label learning, error-correcting output codes, or structured prediction) with explicit control over both sample complexity and per-bit performance. Furthermore, the results clarify the boundary between worst-case generalization ($\sqrt{K}$ dependence) and empirical generalization ($K$ dependence), guiding model selection and analysis strategies where output dimensionality is large.