He (Kaiming) Initialization

Updated 20 January 2026
  • He (Kaiming) initialization is a method that sets weight variances using fan-in scaling to counteract vanishing or exploding gradients in deep neural networks with ReLU activations.
  • It mitigates gradient instability, yielding 20–30% faster convergence and lower per-epoch loss variance compared to traditional methods like Xavier/Glorot initialization.
  • The approach has been adapted for CNNs, transformers, GNNs, and even quantum circuits, making it a versatile tool for stable training in various deep learning architectures.

He or Kaiming Initialization is a rectifier-aware weight initialization scheme foundational to the training of deep neural networks with rectified linear unit (ReLU) and related activations. Its core goal is to preserve signal and gradient magnitude across layers, supporting stable information propagation and addressing variance shrinkage associated with deep architectures. He initialization is based on variance calculations that compensate for the partial suppression of activations by the ReLU function, yielding principled guidelines for initializing weights in multilayer perceptrons (MLPs), convolutional neural networks (CNNs), transformers, and other classes of deep models.

1. Theoretical Basis and Formulation

He initialization is derived from the requirement that variance be preserved in the forward and backward passes through a network stack with ReLU nonlinearities. For a feedforward or convolutional layer, the weight matrix $W\in\mathbb{R}^{n_{out}\times n_{in}}$ is initialized with entries as i.i.d. Gaussian variables of zero mean and variance:

$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{in}}$$

This “fan-in” formulation ensures that, under forward propagation, the variance of the post-activation output matches the variance of the preceding layer. The critical factor of $2$ arises because the ReLU nonlinearity zeroes out half of its incoming values in expectation ($c_{\mathrm{relu}} = 1/2$), thus halving the second moment unless compensated by scaling. An analogous prescription, $\mathrm{Var}(W_{ij}) = 2/n_{out}$, preserves gradient variance in the backward pass, but forward stability is prioritized in practice (Han, 10 Oct 2025, Steinwart, 2019).

Uniform variants are also used, distributing weights as $W_{ij}\sim U\left[-\sqrt{6/n_{in}},\,+\sqrt{6/n_{in}}\right]$, or with corresponding modifications for “fan-out” versions.
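
Both variants can be sketched in a few lines of NumPy (an illustrative sketch; the function names are ours, and deep learning frameworks ship equivalent built-in initializers, e.g. PyTorch's `kaiming_normal_`):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(n_in, n_out):
    # Fan-in Gaussian variant: Var(W_ij) = 2 / n_in
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def he_uniform(n_in, n_out):
    # U[-a, a] with a = sqrt(6/n_in) has variance a^2/3 = 2/n_in,
    # matching the Gaussian variant's second moment
    a = np.sqrt(6.0 / n_in)
    return rng.uniform(-a, a, size=(n_out, n_in))

W = he_normal(1024, 512)
print(W.var())  # close to 2/1024 ≈ 0.00195
```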

2. Variance Dynamics and Training Stability

The overarching claim is that He initialization counteracts both vanishing and exploding signal phenomena across deep stacks of layers. Empirical investigations in MLPs present three distinct regimes parametrized by the initial standard deviation $\sigma$:

  • $\sigma \ll \sqrt{2/n_{in}}$: leads to vanishing signals and stagnant training.
  • $\sigma \approx \sqrt{2/n_{in}}$: yields stable, “edge-of-chaos” signal flow and efficient training.
  • $\sigma \gg \sqrt{2/n_{in}}$: leads to exploding signals, unstable loss, and training divergence.
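
These three regimes can be observed with a small simulation (illustrative sketch; `mean_sq_after_depth` and the scale factors are ours): one input is propagated through a deep ReLU stack with the weight standard deviation scaled relative to $\sqrt{2/n_{in}}$.

```python
import numpy as np

def mean_sq_after_depth(scale, n=256, depth=50, seed=0):
    # Propagate one input through `depth` ReLU layers whose weights are
    # drawn as N(0, (scale * sqrt(2/n))^2); return the mean squared activation.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    for _ in range(depth):
        W = rng.normal(0.0, scale * np.sqrt(2.0 / n), size=(n, n))
        x = np.maximum(W @ x, 0.0)  # ReLU halves the second moment in expectation
    return float(np.mean(x ** 2))

# scale < 1: vanishing; scale = 1 (exact He): roughly preserved; scale > 1: exploding
for scale in (0.5, 1.0, 2.0):
    print(scale, mean_sq_after_depth(scale))
```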

Experiments show that, with He initialization, convergence is 20–30% faster and loss variance per epoch is lower compared to Xavier/Glorot initialization under ReLU, with modest but statistically significant improvements in final validation accuracy (Han, 10 Oct 2025).

In the context of very deep networks, an important nuance is that while He initialization preserves the total variance of activations (over network parameters and samples), the sample variance (over data samples for a fixed random network) decays toward zero with increasing depth. This leads to all inputs in a single network realizing nearly the same high-magnitude pre-activation vector—an effect counteracted by Batch Normalization, which enforces unit sample variance per activation, re-awakening nonlinearity in each layer (Luther et al., 2019).

3. Extensions to Architectures Beyond Standard MLPs

Convolutional Networks

In convolutional nets, He initialization is generalized for kernels, with $\mathrm{Var}(W) = 2/(k^2 d)$ for $k\times k$ kernels with $d$ input channels. However, this classical formulation ignores additional signal-modifying operations such as pooling and padding. The Adaptive Signal Variance (ASV) framework introduces closed-form expressions incorporating pooling corrections:

$$\sigma_w^2 = \frac{M'_\ell}{\tau_{\ell-1}\varepsilon_\ell}$$

(forward), or

$$\sigma_w^2 = \frac{M_{\ell-1}}{\gamma_\ell \varepsilon_\ell}$$

(backward), where $\tau$ and $\gamma$ encode the pooling window effects and $\varepsilon_\ell$ the effective connectivity (Henmi et al., 2020).

ASV-backward, in particular, empirically stabilizes gradient flow and enables deeper architectures with non-trivial pooling layers to train without early vanishing/exploding gradients, outperforming classical He initialization.
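
The classical convolutional rule is ordinary fan-in scaling with $\mathrm{fan\_in} = k^2 d$; a minimal NumPy sketch follows (the ASV corrections themselves depend on paper-specific pooling quantities and are not reproduced here):

```python
import numpy as np

def he_conv_kernel(k, d_in, d_out, seed=0):
    # Classical He rule for a k x k kernel with d_in input channels:
    # fan_in = k * k * d_in, so Var(W) = 2 / (k^2 * d_in).
    fan_in = k * k * d_in
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(d_out, d_in, k, k))

W = he_conv_kernel(3, 64, 128)
print(W.shape, W.var())  # variance ≈ 2/(9*64) ≈ 0.00347
```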

Transformers

In transformer architectures, He initialization with small $\sigma$ (e.g., 0.02) for MLP and Q/K/V projection matrices is the de facto standard, ensuring that all layers begin with weights close to their steady-state distribution. Empirical studies show that shallow transformer layers rapidly expand their weight standard deviation due to larger gradient signal-to-noise, while deeper layers equilibrate more gradually—both settling into narrow “operating bands.” Adjusting initialization scales, warmup schedules, and per-layer scaling is recommended for robust transformer training (Han, 10 Oct 2025).
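
A minimal sketch of this convention (shapes and parameter names are illustrative; the common 4x hidden expansion for the MLP is assumed):

```python
import numpy as np

def init_transformer_block(d_model, sigma=0.02, seed=0):
    # Fixed small-sigma Gaussian init (sigma = 0.02 is the common default)
    # for the attention projections and the MLP of one block.
    rng = np.random.default_rng(seed)
    shapes = {
        "W_q": (d_model, d_model), "W_k": (d_model, d_model),
        "W_v": (d_model, d_model), "W_o": (d_model, d_model),
        "W_mlp_in": (4 * d_model, d_model), "W_mlp_out": (d_model, 4 * d_model),
    }
    return {name: rng.normal(0.0, sigma, size=s) for name, s in shapes.items()}

params = init_transformer_block(512)
```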

Graph Neural Networks

The Graph-Init (“G-Init”) extension incorporates local graph degree, initializing each layer’s weights as $\mathrm{Var}(W^{(\ell)}) = 2d_i/n_\ell$, where $d_i$ is a proxy for node degree. This adjustment directly addresses the oversmoothing problem in deep GNNs by preserving signal and gradient variance across layers and is empirically validated to maintain accuracy in deep GCNs far beyond the collapse point of standard He initialization (Kelesis et al., 2024).
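
A hedged sketch of the degree-scaled rule (using the mean degree as the proxy $d_i$ is our illustrative choice, not necessarily the paper's exact prescription):

```python
import numpy as np

def g_init_layer(n_layer, degrees, seed=0):
    # Degree-scaled variance Var(W) = 2 * d / n_layer, following the
    # G-Init rule quoted above; the mean degree serves as the proxy d_i here.
    d = float(np.mean(degrees))
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 * d / n_layer), size=(n_layer, n_layer))

W = g_init_layer(256, degrees=[2, 4, 6])
print(W.var())  # ≈ 2 * 4 / 256 = 0.03125
```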

Quantum Circuits

He initialization is transportable to parameterized quantum circuits (PQCs), where rotation gate angles $\theta_k$ are initialized as $\theta_k\sim \mathcal{N}(0,\,2/n)$, with $n$ reflecting the relevant circuit fan-in. This increases the initial gradient variance, partially mitigating the barren plateau phenomenon in QNN training and accelerating convergence by ~30% relative to random initialization (Kashif et al., 2023).
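
A sketch of the corresponding angle initialization (plain NumPy; attaching the angles to an actual circuit is framework-specific and omitted):

```python
import numpy as np

def he_pqc_angles(n_layers, n_qubits, rot_per_qubit=3, seed=0):
    # theta_k ~ N(0, 2/n) with n = rotations-per-layer * number-of-qubits,
    # i.e. the "fan-in" of one circuit layer.
    n = rot_per_qubit * n_qubits
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / n), size=(n_layers, n_qubits, rot_per_qubit))

theta = he_pqc_angles(400, 8)
```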

4. Principle of Optimality and Alternatives

Viewing stochastic gradient descent (SGD) as a Langevin process, a KL-divergence-based argument produces an explicit bound for the expected final loss in terms of the initialization variance. Minimization of this bound yields an optimal initialization variance matching the long-run per-coordinate variance of the SGD stochastic process:

$$\sigma_0^\ast = \sqrt{\frac{\mathbb{E}_{ss}[\|W\|^2]}{K}}$$

where $K$ is the parameter count. This variance depends on learning rate, batch size, noise level, and local Hessian curvature and is empirically found to further improve training loss and test accuracy beyond He-normal initialization (Horii et al., 18 Aug 2025). This approach generalizes He’s heuristic rule, providing an explicit algorithmic prescription for tuning initialization scales.
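
The rule can be sketched as follows, assuming samples of the flattened weight vector from an approximate SGD steady state are already available (estimating that steady state is the hard part and is not shown):

```python
import numpy as np

def optimal_init_std(steady_state_weights):
    # sigma_0^* = sqrt(E_ss[||W||^2] / K): match the initialization variance
    # to the long-run per-coordinate variance of the SGD process.
    W = np.asarray(steady_state_weights)          # shape (n_samples, K)
    K = W.shape[1]
    return float(np.sqrt(np.mean(np.sum(W ** 2, axis=1)) / K))

# Example: if steady-state coordinates fluctuate with std 0.1,
# the rule recovers sigma_0^* ≈ 0.1.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 0.1, size=(200, 1000))
print(optimal_init_std(samples))
```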

5. Mathematical Properties in Deep Overparameterized Regimes

Under He initialization, not only are forward and backward variances approximately preserved in the infinite-width limit, but it can be shown (under modest over-parameterization) that the squared norm of each layer's hidden activation equals the input norm and that the norm of the gradient with respect to each layer’s weights equals the product of the input norm and the output layer error norm—with high probability over the weight initialization. These properties guarantee stable information propagation throughout depth, provided each hidden width scales logarithmically with the number of samples and depth (Arpit et al., 2019).
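
This norm-preservation property can be checked empirically (illustrative sketch): under exact He scaling, $\mathbb{E}\,\|\mathrm{ReLU}(Wx)\|^2 = \|x\|^2$, so per-layer norm ratios should hover near 1 across depth.

```python
import numpy as np

def norm_ratios(n=2048, depth=10, seed=0):
    # Under He init, E||ReLU(W x)||^2 = ||x||^2, so the hidden-layer norm
    # stays close to the input norm with high probability at every depth.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    in_norm = np.linalg.norm(x)
    ratios = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
        x = np.maximum(W @ x, 0.0)
        ratios.append(float(np.linalg.norm(x) / in_norm))
    return ratios

print(norm_ratios())  # ratios stay near 1 across depth
```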

6. Practical Guidelines and Implementation

Practical recommendations consistently emerge from theoretical and empirical investigations:

  • For ReLU and similar half-rectifier activations in MLPs and CNNs, initialize $W\sim\mathcal{N}(0,\,2/n_{in})$ or, equivalently, use the uniform variant with $a=\sqrt{6/n_{in}}$.
  • In transformer models, prefer $W\sim\mathcal{N}(0,\sigma^2)$ with $\sigma\in[10^{-2},10^{-1}]$; $\sigma=0.02$ is standard for Q/K/V and MLP weights.
  • In GNNs, adopt degree-scaled G-Init: $W^{(\ell)}\sim\mathcal{N}(0,\,2d_i/n_\ell)$, using suitable proxies for node degree.
  • For PQCs, initialize angles $\theta_k\sim\mathcal{N}(0,\,2/n)$, where $n =$ (rotations per layer) $\times$ (number of qubits).
  • Adapt initialization scales using SGD dynamics if curvature and optimizer settings are known, using $\sigma_0^\ast$ as a scaling factor on traditional He initialization if desired.
  • Monitor early-training activation variances and adjust initialization if rapid collapse or explosion is observed.
  • In modern CNNs with complex architectures, use ASV-backward or ASV-forward initialization for layers with pooling, padding, or map-size changes; otherwise standard He is sufficient.

7. Limitations and Further Considerations

  • Preservation of total variance by He initialization does not guarantee preservation of sample variance (over data), which can asymptotically decay with depth, causing functional degeneracy of deep random ReLU networks. Batch Normalization or explicit data-dependent centering and scaling are necessary to maintain information-carrying capacity across all layers for deep networks (Luther et al., 2019).
  • In empirical large-scale experiments, alternative initializations (e.g., hull-type methods) can yield lower test error and faster convergence in some settings. Nonetheless, He-normal remains the empirical standard for ReLU networks due to stability and predictable behavior (Steinwart, 2019).
  • For models with activations not exhibiting strong rectification, such as tanh, Xavier initialization (variance $2/(n_{in}+n_{out})$) may be preferable.
  • In transformer and GNN training, downstream normalization schemes (LayerNorm, AdamW) interact with initial variance transients and may require fine-tuning of initialization parameters for optimal downstream convergence (Han, 10 Oct 2025, Kelesis et al., 2024).
  • Optimal initialization in the sense of minimizing final loss bound depends on local curvature, optimizer configuration, and SGD hyperparameters—these dependencies are not captured by the classical He prescription (Horii et al., 18 Aug 2025).

Each of these results attests to the criticality of rectifier-aware variance dynamics at initialization in deep learning, the flexibility and limitations of the classic He rule, and the evolution of initialization practices to more complex and theory-guided regimes. The He (Kaiming) initialization remains a cornerstone, but is now frequently refined or adapted to architecture class, objective, and optimizer specification for state-of-the-art model training and stability.
