Codeword Uniformity Regularization
- The paper introduces a method that applies a layer-wise hyperspherical (Riesz) energy penalty to separate normalized weight vectors and enhance representation diversity.
- It integrates into self-supervised frameworks like BYOL, consistently reducing feature redundancy and yielding measurable accuracy gains on datasets such as CIFAR-10 and ImageNet.
- Inspired by potential theory and the discrete Thomson problem, the approach offers a geometric strategy to achieve uniform weight distributions, improving robustness and downstream task performance.
Codeword uniformity regularization is a technique for encouraging neural network weight vectors to be maximally separated on the hypersphere, with the goal of improving representation diversity and uniformity in self-supervised learning. It is implemented as a layer-wise penalty based on minimizing the hyperspherical (Riesz) energy among normalized neuron weights (“codewords”). This approach was introduced in the context of Bootstrap Your Own Latent (BYOL), a non-contrastive self-supervised learning framework, to address the empirically observed issue that BYOL’s representations are less uniformly distributed in feature space compared to contrastive methods. Codeword uniformity regularization systematically drives network kernels apart on the sphere, thereby reducing feature redundancy, increasing entropy, and ultimately supporting improved accuracy and robustness in downstream tasks (Durrant et al., 2021).
1. Mathematical Formulation of Codeword Uniformity
Let a neural network layer contain $N$ weight-vectors (“codewords”), unit-normalized as

$$\hat{w}_i = \frac{w_i}{\|w_i\|_2} \in S^{d-1}, \qquad i = 1, \dots, N.$$

The Riesz-$s$ hyperspherical energy quantifies the pairwise “repulsion” between these unit-normalized vectors:

$$E_s(\hat{w}_1, \dots, \hat{w}_N) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} f_s\big(\|\hat{w}_i - \hat{w}_j\|\big),$$

where the kernel function is

$$f_s(z) = \begin{cases} z^{-s}, & s > 0, \\ \log z^{-1}, & s = 0. \end{cases}$$

In many experiments, the “angular” variant is used, substituting Euclidean distance with geodesic (angular) distance on the sphere:

$$E_s^{a}(\hat{w}_1, \dots, \hat{w}_N) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} f_s\big(\arccos(\hat{w}_i^{\top}\hat{w}_j)\big).$$

This energy is strictly a sum of pairwise terms with no higher-order interactions or additive constants.
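As a concrete illustration, the energy can be computed in a few lines of NumPy. The sketch below is illustrative (the function name `riesz_energy` and the `angular` flag are this sketch's own, not identifiers from the paper):

```python
import numpy as np

def riesz_energy(W, s=1.0, angular=False):
    """Riesz-s energy of the rows of W after unit normalization.

    W       : (N, d) array of layer weight vectors ("codewords").
    s       : kernel power; s == 0 selects the logarithmic kernel log(1/z).
    angular : if True, use geodesic distance arccos(w_i . w_j) in place
              of Euclidean distance ||w_i - w_j||.
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)        # project onto the sphere
    cos = np.clip(Wn @ Wn.T, -1.0, 1.0)                      # pairwise cosine similarities
    i, j = np.triu_indices(len(Wn), k=1)                     # all pairs i < j
    if angular:
        d = np.arccos(cos[i, j])                             # geodesic distance
    else:
        d = np.sqrt(np.maximum(2.0 - 2.0 * cos[i, j], 0.0))  # Euclidean (chordal) distance
    d = np.maximum(d, 1e-12)                                 # guard against division by zero
    return float(np.sum(np.log(1.0 / d)) if s == 0 else np.sum(d ** (-s)))
```

For example, two antipodal codewords in 2-D have chordal distance 2, so their $s=1$ energy is $1/2$; the angular variant gives $1/\pi$ instead.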
2. Integration into Self-Supervised Learning Objectives
In the standard BYOL framework, parameterized by online encoder $f_\theta$, target encoder $f_\xi$, and predictor $q_\theta$, the objective is

$$\mathcal{L}_{\mathrm{BYOL}} = \big\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \big\|_2^2 = 2 - 2\,\big\langle \bar{q}_\theta(z_\theta),\, \bar{z}'_\xi \big\rangle,$$

with the target parameters updated as an exponential moving average, $\xi \leftarrow \tau\xi + (1-\tau)\theta$. The codeword uniformity regularization augments this objective as

$$\mathcal{L} = \mathcal{L}_{\mathrm{BYOL}} + \lambda \sum_{l \in \mathcal{R}} \frac{1}{N_l} E_s(W_l),$$

where $\mathcal{R}$ indexes the regularized layers, $N_l$ is the width of layer $l$, and $\lambda$ modulates the penalty strength (Durrant et al., 2021).
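A minimal NumPy sketch of the combined objective follows; the function names are illustrative, and a real implementation would compute these terms inside an autograd framework rather than in NumPy:

```python
import numpy as np

def _unit(X):
    """L2-normalize rows."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def byol_loss(q, z_prime):
    """Normalized regression loss: mean over the batch of 2 - 2<q_bar, z_bar'>."""
    return float(np.mean(2.0 - 2.0 * np.sum(_unit(q) * _unit(z_prime), axis=1)))

def riesz_penalty(W, s=1.0):
    """Riesz-s energy of unit-normalized rows of W (Euclidean kernel)."""
    Wn = _unit(W)
    i, j = np.triu_indices(len(Wn), k=1)
    d = np.maximum(np.linalg.norm(Wn[i] - Wn[j], axis=1), 1e-12)
    return float(np.sum(d ** (-s)))

def total_loss(q, z_prime, reg_layers, lam=1.0, s=1.0):
    """L_BYOL + lam * sum over layers of E_s(W_l) / N_l, with N_l = rows of W_l."""
    reg = sum(riesz_penalty(W, s) / len(W) for W in reg_layers)
    return byol_loss(q, z_prime) + lam * reg
```

If prediction and target projection coincide, the BYOL term vanishes and only the width-normalized energy penalty remains.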
3. Training Algorithm and Implementation Considerations
During each training iteration, for every regularized layer:
- Compute the set of normalized weight-vectors $\hat{w}_i = w_i / \|w_i\|_2$.
- Calculate all pairwise distances (Euclidean or geodesic) and form the energy sum $E_s$ or $E_s^{a}$.
- Add the width-normalized penalty $\lambda E_s / N_l$ to the total loss.
- Exclude batch-norm and bias parameters from the minimum hyperspherical energy (MHE) penalty.
- Perform backpropagation to compute the penalty gradients jointly with the conventional BYOL gradients, and update the weights accordingly, using optimizers such as LARS/SGD and an exponential moving average (EMA) for the target encoder.
Implementation details include dividing each layer’s energy by $N_l$ to maintain scale consistency across layers, and optionally selecting the angular variant, which offers empirically superior uniformity at additional computational cost (Durrant et al., 2021).
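The per-iteration bookkeeping above can be sketched as follows, assuming parameters arrive as a name-to-array mapping (as with PyTorch's `named_parameters`); one-dimensional tensors (biases, batch-norm scales and offsets) are skipped, and convolutional kernels are flattened to one row per output filter. Names and structure here are illustrative assumptions:

```python
import numpy as np

def layer_penalties(named_params, s=1.0, angular=False):
    """Per-layer width-normalized Riesz energies, E_s(W_l) / N_l.

    named_params: dict mapping parameter name -> weight array.
    One-dimensional parameters (biases, batch-norm parameters) are
    excluded from the penalty, matching the procedure above.
    """
    penalties = {}
    for name, W in named_params.items():
        if W.ndim < 2:                                # skip bias / batch-norm params
            continue
        Wf = W.reshape(W.shape[0], -1)                # flatten kernels to (N_l, d)
        Wn = Wf / np.linalg.norm(Wf, axis=1, keepdims=True)
        i, j = np.triu_indices(len(Wn), k=1)
        if angular:
            d = np.arccos(np.clip(np.sum(Wn[i] * Wn[j], axis=1), -1.0, 1.0))
        else:
            d = np.linalg.norm(Wn[i] - Wn[j], axis=1)
        d = np.maximum(d, 1e-12)
        e = np.sum(np.log(1.0 / d)) if s == 0 else np.sum(d ** (-s))
        penalties[name] = float(e) / len(Wn)          # divide by layer width N_l
    return penalties
```

In an autograd framework, `lam * sum(layer_penalties(...).values())` would simply be added to the loss before the backward pass, so the penalty gradients flow with the conventional ones.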
4. Theoretical Rationale for Uniformity Enhancement
Minimizing the Riesz energy among codewords directly relates to the discrete Thomson problem, which seeks point configurations on a sphere that maximize mutual separation. For $s > 0$, the global minimum maximizes pairwise distances; for $s = 0$, the formulation is equivalent to maximizing the product of all pairwise distances, yielding maximal minimal angular separation. The resulting uniform spread of weights reduces feature redundancy and increases representation entropy. Empirically, this effect is tracked via the downstream uniformity metric

$$\mathcal{L}_{\mathrm{uniform}} = \log \mathbb{E}_{x, y \sim p_{\mathrm{data}}}\Big[ e^{-t \,\|f(x) - f(y)\|_2^2} \Big],$$

which is consistently lower (more uniform) throughout training when the penalty is applied (Durrant et al., 2021).
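This metric has a direct empirical estimate over the pairwise distances within a batch of features. A NumPy sketch, using $t = 2$ as a default (an assumption here, following common practice for this metric rather than a value stated above):

```python
import numpy as np

def uniformity_metric(features, t=2.0):
    """log E[exp(-t * ||f(x) - f(y)||^2)] over distinct pairs of
    L2-normalized feature vectors. Lower (more negative) values
    indicate a more uniform distribution on the hypersphere.
    t=2.0 is an assumed default."""
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    i, j = np.triu_indices(len(F), k=1)
    sq = np.sum((F[i] - F[j]) ** 2, axis=1)           # squared pairwise distances
    return float(np.log(np.mean(np.exp(-t * sq))))
```

Two coincident features give the metric's maximum of 0, while an antipodal pair (squared distance 4) gives $-8$ at $t = 2$, illustrating how spreading features out drives the metric down.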
5. Empirical Outcomes: Feature Distribution and Task Performance
Quantitative experiments using BYOL with and without the hyperspherical energy regularizer demonstrate:
- More uniform feature representation: On CIFAR-10, features mapped to the unit circle $S^1$ exhibit a near-uniform ring when regularized (vs. clumping without).
- Lower layer-wise hyperspherical energy on all blocks of ResNet-18.
- Improved uniformity metrics ($\mathcal{L}_{\mathrm{uniform}}$) during training.
Performance comparisons reveal consistent improvements:
| Dataset & Setting | BYOL Baseline | BYOL + MHE | Gain |
|---|---|---|---|
| ImageNet 1000 epochs | 74.1% | 74.4% | +0.3% |
| ImageNet 300 epochs | 71.9% | 72.4% | +0.5% |
| CIFAR-10 (ResNet-50) | 94.46% | 94.78% | +0.32% |
| CIFAR-100 (ResNet-50) | 72.10% | 72.56% | +0.46% |
| STL-10 | 82.81% | 83.96% | +1.15% |
Additional findings:
- Enhanced robustness to small batch sizes: accuracy drops by only ~5.8% with MHE when reducing the batch size from 1024 to 128 on CIFAR-10, compared to ~9.1% when using UniformityLoss (Durrant et al., 2021).
- Layer ablation studies: regularizing encoder, projector, or predictor (alone or combined) always improves over the BYOL baseline.
- Lower energy and uniformity metrics translate to observable gains in linear and k-NN evaluations.
6. Hyperparameter Tuning
The optimal values of the regularization strength $\lambda$ and the Riesz kernel power $s$ are dataset-dependent:
- ImageNet (1000 epochs): a single fixed $\lambda$ is used for the full run.
- CIFAR: $\lambda$ is selected by grid search over a small candidate set.
- The angular variant of the kernel yielded the best results in uniformity and accuracy (Durrant et al., 2021).
A plausible implication is that both the penalty coefficient and the distance metric (Euclidean vs. angular) may require task-specific configuration to realize maximal gains.
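Such task-specific configuration amounts to a small grid search over the penalty coefficient and kernel choice. The skeleton below is a generic sketch: the candidate grids and the `evaluate` callback (standing in for training a regularized model and returning its linear-evaluation accuracy) are placeholders, not values or procedures from the paper:

```python
def grid_search(lambdas, kernels, evaluate):
    """Pick the (lambda, kernel) pair maximizing a user-supplied score.

    evaluate(lam, kernel) is a placeholder for training a regularized
    model with that configuration and returning a validation score.
    """
    best, best_score = None, float("-inf")
    for lam in lambdas:
        for kernel in kernels:
            score = evaluate(lam, kernel)
            if score > best_score:
                best, best_score = (lam, kernel), score
    return best, best_score
```

In practice each `evaluate` call is a full pre-training run, so the candidate grids are kept small.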
7. Context and Broader Significance
Codeword uniformity regularization addresses the representation collapse and clumping observed in non-contrastive self-supervised methods, offering principled geometric encouragement for diversity of learned features. Its pairwise Riesz kernel formulation is directly motivated by classical potential theory, yielding interpretable control over the uniformity of weight distributions. The method demonstrates not only modest but consistent gains in linear evaluation accuracy and k-NN performance, but also robustness to variations in batch size and architectural configuration. Its relevance is not restricted to image domains or specific backbone architectures, indicating a general mechanism for augmenting non-contrastive learning pipelines (Durrant et al., 2021).