Muon-VS: Variance Scaling & Vertex Reconstruction
- Muon-VS denotes two unrelated methodologies: a variance-adaptive scaling variant of the Muon optimizer for LLM pretraining, and a vertex-reconstruction technique for muon-induced showers in astroparticle experiments.
- It applies variance scaling to matrix-valued momentum updates, improving convergence efficiency in large language model training without extra sensitivity tuning.
- In astroparticle experiments, Muon-VS uses waveform clustering and time-of-flight modeling to accurately localize muon shower vertices, significantly reducing background noise.
Muon-VS refers to two distinct methodologies, each with foundational importance in its respective field. In large-scale LLM optimization, Muon-VS denotes "Variance-Scaled Muon," an optimizer variant introducing variance-adaptive scaling into matrix-valued momentum updates for improved efficiency in model pretraining. Separately, in the context of neutrino and astroparticle experiments, "Muon-VS" appears as an abbreviation in the phrase "Muon shower Vertex reconstruction with waveform information," defining a vertex-reconstruction algorithm critical for reconstructing secondary interaction points (shower vertices) of muons traversing massive scintillator detectors. Both applications center on mathematical strategies for extracting fine-grained structural information from high-dimensional or high-rate data streams but operate in disjoint disciplinary regimes.
1. Variance-Scaled Muon Optimizer: Principle and Motivation
The Variance-Scaled Muon optimizer ("Muon-VS") is an extension of the Muon optimizer designed for LLM pretraining (Li et al., 21 Jan 2026). Its core innovation lies in integrating variance-adaptive scaling, inspired by the perspective that Adam acts as a variance-adaptive sign update—a mechanism that scales coordinate-wise updates according to local noise-to-signal ratio. In standard Muon, the update direction is given by a matrix-valued "sign" (via singular value normalization), treating all update directions equally and thus potentially overshooting in high-variance directions. Muon-VS remediates this by applying per-coordinate variance scaling prior to orthogonalization, dampening noisy updates and producing more stable and efficient convergence dynamics.
2. Mathematical Formulation and Update Rule
The Muon-VS update rule operates on matrix parameters $W \in \mathbb{R}^{n \times m}$. The construction involves the following steps:
- Exponential Moving Statistics:
- Compute exponential moving averages (EMAs) of the gradients, $M_t = \beta_1 M_{t-1} + (1-\beta_1)\,G_t$, and of the pre-orthogonalization variance, $V_t = \beta_2 V_{t-1} + (1-\beta_2)\,(G_t - M_{t-1})^2$, which involves the element-wise squared difference between the previous EMA and the current gradient.
Bias corrections $\hat{M}_t = M_t / (1-\beta_1^t)$ and $\hat{V}_t = V_t / (1-\beta_2^t)$ are applied in the usual manner.
- Nesterov-Style Extrapolation: $\tilde{M}_t = \beta_1 \hat{M}_t + (1-\beta_1)\,G_t$.
- Variance-Scaled Normalization: $S_t = \tilde{M}_t \,/\, \big(\sqrt{\hat{V}_t} + \epsilon\big)$, where $\epsilon$ is a stability constant.
- Orthogonalization and Parameter Update:
- Apply Newton–Schulz iterations to approximate the orthogonal polar factor $O_t \approx \mathrm{polar}(S_t)$.
- Perform the final parameter update using $W_t = W_{t-1} - \eta_t\,(\gamma\,O_t + \lambda\,W_{t-1})$,
where $\eta_t$ is the learning rate, $\lambda$ is the weight decay, and $\gamma$ is a scale term dependent on the parameter matrix size.
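The steps above can be sketched in NumPy. This is a minimal illustration, assuming Adam-style EMA conventions and the quintic Newton–Schulz coefficients common in open-source Muon implementations; the constants, the exact Nesterov form, and the shape-dependent scale are illustrative, not taken from (Li et al., 21 Jan 2026).

```python
import numpy as np

def newton_schulz(M, steps=5):
    """Approximate the orthogonal polar factor of M (the matrix "sign")
    via a quintic Newton-Schulz iteration; assumes rows <= cols
    (transpose first otherwise, as Muon implementations do)."""
    a, b, c = 3.4445, -4.7750, 2.0315      # common quintic coefficients
    X = M / (np.linalg.norm(M) + 1e-7)     # pre-scale so spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_vs_step(W, g, state, lr=0.02, beta1=0.95, beta2=0.95,
                 wd=0.0, eps=1e-8):
    """One hypothetical Muon-VS update on a matrix parameter W."""
    t = state["t"] = state["t"] + 1
    # Variance EMA uses the *previous* gradient EMA, per the text above.
    state["V"] = beta2 * state["V"] + (1 - beta2) * (g - state["M"]) ** 2
    state["M"] = beta1 * state["M"] + (1 - beta1) * g
    m_hat = state["M"] / (1 - beta1 ** t)      # bias corrections
    v_hat = state["V"] / (1 - beta2 ** t)
    m_nes = beta1 * m_hat + (1 - beta1) * g    # Nesterov-style extrapolation
    S = m_nes / (np.sqrt(v_hat) + eps)         # variance-scaled normalization
    O = newton_schulz(S)                       # orthogonal polar factor
    gamma = np.sqrt(max(W.shape))              # illustrative shape-based scale
    return W - lr * (gamma * O + wd * W)
```

Note that the variance scaling is applied element-wise *before* orthogonalization, so the subsequent Newton–Schulz step still equalizes the singular values of the damped update.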
No new tunable sensitivity hyperparameter is introduced; $\epsilon$ serves a similar role to the $\epsilon$ stabilizer in Adam (Li et al., 21 Jan 2026).
3. Empirical Performance in LLM Pretraining
Muon-VS has been benchmarked on GPT-2 (124M/350M) and LLaMA-1.2B pretraining tasks using large-scale corpora. In these settings:
- Muon-VS exhibits consistently superior wall-clock convergence and sample efficiency compared to Muon, AdaMuon, and AdamW.
- On the LLaMA-1.2B benchmark, Muon-VS and Muon-NSR markedly reduce the number of optimization iterations needed to reach a target validation loss relative to the best-tuned Muon baseline (see Figure 1 of (Li et al., 21 Jan 2026)).
- Ablation studies show that Muon-VS matches the improvements of Muon-NSR while avoiding the need to tune an NSR sensitivity coefficient.
- The improvements persist across a wide range of batch sizes and typical transformer-scale pretraining setups.
These results highlight that variance scaling prior to orthogonalization robustly accelerates and stabilizes pretraining trajectories, providing significant gains without increasing computational complexity (Li et al., 21 Jan 2026).
4. Implementation and Hyperparameter Considerations
Muon-VS preserves most of the original Muon hyperparameter structure. Key points are:
- Required hyperparameters: the learning rate $\eta$, EMA coefficients $\beta_1$ and $\beta_2$, weight decay $\lambda$, and stabilizer $\epsilon$.
- The only new term is the stabilizer $\epsilon$, whose value is chosen according to the floating-point precision in use.
- No extra buffer or tunable coefficients (e.g., variance dampening factors) beyond those already present in Muon.
- Implementation requires a single additional EMA buffer $V_t$.
- Main computational cost is a negligible increment for the variance buffer; all spectral and matrix operations are unchanged (Li et al., 21 Jan 2026).
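The memory accounting described above can be made explicit. A small sketch (the state layout and `variant` flag are assumptions for illustration, not the paper's API):

```python
import numpy as np

def init_state(W, variant="muon-vs"):
    """Optimizer state per matrix parameter: plain Muon keeps one
    momentum buffer M; the Muon-VS sketch adds a single variance EMA
    buffer V -- the only extra memory cost noted above."""
    state = {"t": 0, "M": np.zeros_like(W)}
    if variant == "muon-vs":
        state["V"] = np.zeros_like(W)   # the one additional buffer
    return state

W = np.zeros((1024, 4096), dtype=np.float32)
extra_bytes = init_state(W)["V"].nbytes  # one fp32 buffer the size of W
```

For this 1024×4096 fp32 matrix the overhead is 16 MiB, i.e. exactly one extra parameter-sized buffer per matrix.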
5. Contrasts and Relationship to Related Methods
Muon-VS is part of a family of methods that generalize optimizer updates via matrix-valued transformations. Its relation to other approaches is summarized below:
| Optimizer | Update Direction | Variance/Noise Adaptation | Extra Hyperparameters |
|---|---|---|---|
| AdamW | Coordinate-wise, sign-scaled | Yes (via local RMS) | Stabilizer $\epsilon$ |
| Muon | Matrix "sign" via normalization | No (equalizes all singular values) | None |
| Muon-NSR | Matrix "sign" + NSR scaling | Yes (noise-to-signal ratio) | Sensitivity coefficient |
| Muon-VS | Matrix "sign" + variance scaling | Yes (variance only) | Stabilizer $\epsilon$ |
Muon-VS allows soft damping of noisy coordinates, mitigating the risk of overshooting in stochastic regimes while imposing a consistent spectral structure on parameter updates. Omitting mean-normalization and extra multiplicative sensitivity factors is a simplification that yields more robust, hyperparameter-insensitive performance (Li et al., 21 Jan 2026).
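The contrast between the coordinate-wise "sign" in the Adam view and the matrix "sign" used by the Muon family can be made concrete with a small NumPy illustration (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 6))   # a gradient-like matrix

# Coordinate-wise sign (Adam-like view): every entry forced to +/-1,
# ignoring any correlation structure across the matrix.
elementwise_sign = np.sign(G)

# Matrix sign (Muon family): the orthogonal polar factor, which forces
# every *singular value* to 1 while preserving the singular directions.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
matrix_sign = U @ Vt

# The polar factor has orthonormal rows: matrix_sign @ matrix_sign.T == I.
print(np.allclose(matrix_sign @ matrix_sign.T, np.eye(4)))  # True
```

In practice the Muon family approximates this polar factor with Newton–Schulz iterations rather than an explicit SVD, since the iteration uses only matrix multiplies.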
6. Broader Usage: Muon Shower Vertex Reconstruction (JUNO Context)
In high-energy neutrino experiments such as JUNO, "Muon-VS" refers to the methodology for reconstructing muon-induced shower vertices using photomultiplier tube (PMT) waveform information (Zhang, 2022). This technique involves:
- Identifying localized energy-deposition bursts (showers) along through-going muon tracks by detecting multi-peak structures in PMT output.
- Employing time-of-flight modeling and waveform clustering to localize shower vertices with sub-meter spatial resolution.
- Constructing and minimizing a $\chi^2$ function comparing observed peak times to model predictions for candidate vertex positions.
- Achieving rejection of cosmogenic background isotopes ($^{9}$Li/$^{8}$He) with significantly reduced dead volume compared to full-track-based vetoes.
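The time-of-flight fit behind such a vertex reconstruction can be sketched as follows. This is a toy model under simplifying assumptions (point-like light emission, a single effective light speed, Gaussian timing errors, and a coarse grid scan instead of the experiment's actual minimizer); none of the constants or function names come from the JUNO analysis.

```python
import numpy as np

C_EFF = 0.194  # assumed effective light speed in scintillator, m/ns

def chi2(vertex, pmt_pos, t_obs, sigma_t=1.0):
    """Time-of-flight chi^2 for a candidate shower vertex: compare
    observed first-peak times at each PMT with predicted light-travel
    times from the vertex. Emission time t0 is profiled out."""
    d = np.linalg.norm(pmt_pos - vertex, axis=1)  # vertex-to-PMT distances, m
    t_pred = d / C_EFF                             # predicted arrival times, ns
    t0 = np.mean(t_obs - t_pred)                   # best-fit emission time
    return np.sum(((t_obs - t_pred - t0) / sigma_t) ** 2)

def grid_search_vertex(pmt_pos, t_obs, half=5.0, step=0.25):
    """Coarse grid scan for the chi^2 minimum; a real reconstruction
    would refine the best grid point with a local minimizer."""
    axis = np.arange(-half, half + step, step)
    best, best_v = np.inf, None
    for x in axis:
        for y in axis:
            for z in axis:
                v = np.array([x, y, z])
                c = chi2(v, pmt_pos, t_obs)
                if c < best:
                    best, best_v = c, v
    return best_v
```

With noiseless synthetic peak times the scan recovers the generating vertex exactly at grid resolution, which is the mechanism behind the sub-meter resolution quoted above.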
This application, while unrelated to optimizer development, emphasizes "VS" in the context of "Vertex reconstruction with waveform information," demonstrating the versatility of muon-based analysis methods in very different scientific domains (Zhang, 2022).
7. Implications and Outlook
Muon-VS, in its optimizer context, represents a principled fusion of matrix optimization and adaptive normalization, delivering practical gains in LLM pretraining by leveraging variance information while maintaining spectral regularity. The simplicity of implementation, hyperparameter efficiency, and robustness to batch size make it suitable for modern large-scale training pipelines. In experimental astroparticle physics, the Muon-VS (vertex reconstruction) methodology enables precise spatial localization of secondary muon interactions for background rejection, enhancing the scientific reach of large-volume detectors. Across both domains, these Muon-VS approaches exemplify the utility of statistical variance-adaptation and matrix analysis in handling complex, high-dimensional data (Li et al., 21 Jan 2026, Zhang, 2022).