
Muon-VS: Variance Scaling & Vertex Reconstruction

Updated 28 January 2026
  • Muon-VS names two unrelated methods: in LLM pretraining, a variance-adaptive scaling extension of the Muon optimizer; in astroparticle physics, a vertex-reconstruction technique for analyzing muon-induced showers.
  • It applies variance scaling to matrix-valued momentum updates, improving convergence efficiency in large language model training without extra sensitivity tuning.
  • In astroparticle experiments, Muon-VS uses waveform clustering and time-of-flight modeling to accurately localize muon shower vertices, significantly reducing background noise.

Muon-VS refers to two distinct methodologies, each with foundational importance in its respective field. In large-scale LLM optimization, Muon-VS denotes "Variance-Scaled Muon," an optimizer variant introducing variance-adaptive scaling into matrix-valued momentum updates for improved efficiency in model pretraining. Separately, in the context of neutrino and astroparticle experiments, "Muon-VS" appears as an abbreviation in the phrase "Muon shower Vertex reconstruction with waveform information," defining a vertex-reconstruction algorithm critical for reconstructing secondary interaction points (shower vertices) of muons traversing massive scintillator detectors. Both applications center on mathematical strategies for extracting fine-grained structural information from high-dimensional or high-rate data streams but operate in disjoint disciplinary regimes.

1. Variance-Scaled Muon Optimizer: Principle and Motivation

The Variance-Scaled Muon optimizer ("Muon-VS") is an extension of the Muon optimizer designed for LLM pretraining (Li et al., 21 Jan 2026). Its core innovation lies in integrating variance-adaptive scaling, inspired by the perspective that Adam acts as a variance-adaptive sign update—a mechanism that scales coordinate-wise updates according to the local noise-to-signal ratio. In standard Muon, the update direction is given by a matrix-valued "sign" (via singular value normalization), treating all update directions equally and thus potentially overshooting in high-variance directions. Muon-VS addresses this by applying per-coordinate variance scaling prior to orthogonalization, dampening noisy updates and producing more stable and efficient convergence dynamics.

2. Mathematical Formulation and Update Rule

The Muon-VS update rule operates on matrix parameters $W_t \in \mathbb{R}^{m \times n}$. The construction involves the following steps:

  1. Exponential Moving Statistics:
    • Compute exponential moving averages (EMAs) of the gradients, $M_t$, and of the pre-orthogonalization variance, $\Gamma_t$, the latter built from the squared difference between the previous EMA and the current gradient:

$$
\begin{aligned}
M_t &= \beta\,M_{t-1} + (1-\beta)\,G_t \\
\Gamma_t &= \beta\,\Gamma_{t-1} + \beta(1-\beta)\,(M_{t-1}-G_t)^{\odot 2}
\end{aligned}
$$

Bias-corrected estimates $\widehat{M}_t$ and $\widehat{\Gamma}_t$ are obtained in the usual manner.

  2. Nesterov-Style Extrapolation:

$$\widetilde{M}_t = G_t + \frac{\beta}{1-\beta}\,\widehat{M}_t$$

  3. Variance-Scaled Normalization:

$$\overline{M}_{\mathrm{VS},t} = \frac{\widetilde{M}_t}{\sqrt{\widehat{\Gamma}_t} + \varepsilon}$$

where $\varepsilon$ is a stability constant.

  4. Orthogonalization and Parameter Update:
    • Apply $K$ Newton–Schulz iterations to approximate the orthogonal polar factor $O_t = \mathbf{NS}_K(\overline{M}_{\mathrm{VS},t})$.
    • Perform the final parameter update:

$$W_t \leftarrow W_{t-1}(1-\eta\lambda) - \eta\,s_{\mathrm{scale}}\,O_t$$

where $\eta$ is the learning rate, $\lambda$ is the weight-decay coefficient, and $s_{\mathrm{scale}}$ is a scale factor that depends on the parameter-matrix dimensions.

No new tunable sensitivity hyperparameter is introduced; $\varepsilon$ plays a role analogous to the stabilizer in Adam (Li et al., 21 Jan 2026).
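As a concrete illustration, the four steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the reference implementation: the bias-correction form for $\Gamma_t$, the choice of $s_{\mathrm{scale}}$, and the cubic Newton–Schulz polynomial (Muon implementations typically use a tuned higher-order polynomial) are all assumptions.

```python
import numpy as np

def newton_schulz(M, K=5):
    """Approximate the orthogonal polar factor of M with K Newton-Schulz
    iterations (cubic variant for illustration)."""
    X = M / (np.linalg.norm(M) + 1e-12)  # pre-scale so the iteration converges
    for _ in range(K):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_vs_step(W, G, M, Gamma, t, beta=0.95, eta=0.02, lam=0.01,
                 eps=1e-8, K=5):
    """One Muon-VS update for a matrix parameter W given gradient G."""
    # 1. EMAs of the gradient and the pre-orthogonalization variance
    #    (Gamma uses the *previous* M, so it is updated first)
    Gamma = beta * Gamma + beta * (1 - beta) * (M - G) ** 2
    M = beta * M + (1 - beta) * G
    M_hat = M / (1 - beta ** t)          # standard bias correction
    Gamma_hat = Gamma / (1 - beta ** t)  # assumed correction form for Gamma
    # 2. Nesterov-style extrapolation
    M_tilde = G + beta / (1 - beta) * M_hat
    # 3. variance-scaled normalization
    M_vs = M_tilde / (np.sqrt(Gamma_hat) + eps)
    # 4. orthogonalize, then decoupled weight decay plus the scaled update
    O = newton_schulz(M_vs, K)
    s_scale = np.sqrt(max(W.shape))      # assumed shape-dependent scale
    W = W * (1 - eta * lam) - eta * s_scale * O
    return W, M, Gamma
```

A full optimizer would keep one `M` and one `Gamma` buffer per 2D weight matrix and apply this step to each in turn.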

3. Empirical Performance in LLM Pretraining

Muon-VS has been benchmarked on GPT-2 (124M/350M) and LLaMA-1.2B pretraining tasks using large-scale corpora. In these settings:

  • Muon-VS exhibits consistently superior wall-clock convergence and sample efficiency compared to Muon, AdaMuon, and AdamW.
  • On the LLaMA-1.2B benchmark, Muon-VS and Muon-NSR reduce the number of optimization iterations needed to reach a target validation loss by $1.36\times$ relative to the best-tuned Muon baseline (see Figure 1 of (Li et al., 21 Jan 2026)).
  • Ablation studies show that Muon-VS matches the improvements of Muon-NSR while avoiding the need to tune an NSR sensitivity coefficient $\gamma$.
  • The improvements persist across a wide range of batch sizes and typical transformer-scale pretraining setups.

These results highlight that variance scaling prior to orthogonalization robustly accelerates and stabilizes pretraining trajectories, providing significant gains without increasing computational complexity (Li et al., 21 Jan 2026).

4. Implementation and Hyperparameter Considerations

Muon-VS preserves most of the original Muon hyperparameter structure. Key points are:

  • Required hyperparameters: $\beta$, $\eta$, $\lambda$, $K$, $s_{\mathrm{scale}}$.
  • The only new term is the stabilizer $\varepsilon$, typically chosen in the range $10^{-8}$ to $10^{-25}$ depending on floating-point precision.
  • No extra buffer or tunable coefficients (e.g., variance dampening factors) beyond those already present in Muon.
  • Implementation requires a single additional EMA buffer $\Gamma_t$.
  • Main computational cost is a negligible increment for the variance buffer; all spectral and matrix operations are unchanged (Li et al., 21 Jan 2026).

5. Relation to Other Optimizers

Muon-VS is part of a family of methods that generalize optimizer updates via matrix-valued transformations. Its relation to other approaches is summarized below:

| Optimizer | Update Direction | Variance/Noise Adaptation | Extra Hyperparameters |
| --- | --- | --- | --- |
| AdamW | Coordinate-wise, sign-scaled | Yes (via local RMS) | $\epsilon$ |
| Muon | Matrix "sign" via normalization | No (equalizes all singular values) | None |
| Muon-NSR | Matrix "sign" + NSR scaling | Yes (noise-to-signal ratio) | Sensitivity $\gamma$ |
| Muon-VS | Matrix "sign" + variance scaling | Yes (variance only) | Stabilizer $\epsilon$ |

Muon-VS allows for soft damping of noisy coordinates, mitigating the risk of overshooting in stochastic regimes, while imposing a consistent spectral structure on parameter updates. The omission of mean-normalization and extra multiplicative sensitivity factors constitutes a simplification that leads to more robust hyperparameter-insensitive performance (Li et al., 21 Jan 2026).
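The damping behavior can be seen in a small numerical sketch (an illustration, not taken from the paper): a coordinate with a steady gradient accumulates a small variance EMA and keeps a large effective step, while a coordinate with a noisy gradient is strongly damped before orthogonalization.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, eps = 0.9, 1e-8
M = np.zeros((2, 2))
Gamma = np.zeros((2, 2))

# coordinate (0,0): steady gradient of 1.0; coordinate (1,1): zero-mean
# gradient with standard deviation 5.0
for t in range(1, 200):
    G = np.array([[1.0, 0.0], [0.0, rng.normal(0.0, 5.0)]])
    Gamma = beta * Gamma + beta * (1 - beta) * (M - G) ** 2  # uses previous M
    M = beta * M + (1 - beta) * G

# per-coordinate pre-orthogonalization scaling, as in Muon-VS step 3
scale = 1.0 / (np.sqrt(Gamma) + eps)
damping_ratio = scale[0, 0] / scale[1, 1]  # steady coordinate steps far larger
```

The ratio is large because the steady coordinate's variance EMA decays toward zero, while the noisy coordinate's stays on the order of its gradient variance.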

6. Broader Usage: Muon Shower Vertex Reconstruction (JUNO Context)

In high-energy neutrino experiments such as JUNO, "Muon-VS" refers to the methodology for reconstructing muon-induced shower vertices using photomultiplier tube (PMT) waveform information (Zhang, 2022). This technique involves:

  • Identifying localized energy-deposition bursts (showers) along through-going muon tracks by detecting multi-peak structures in PMT output.
  • Employing time-of-flight modeling and waveform clustering to localize shower vertices with sub-meter spatial resolution.
  • Constructing and minimizing a $\chi^2$ function comparing observed peak times to model predictions for candidate vertex positions.
  • Achieving $>94\%$ rejection of cosmogenic background isotopes ($^9$Li/$^8$He) with significantly reduced dead volume compared to full-track-based vetoes.
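The $\chi^2$ time-of-flight fit can be sketched on toy data; the detector geometry, effective light speed, timing resolution, and the coarse grid search standing in for a proper minimizer are all illustrative assumptions, not details of the JUNO reconstruction.

```python
import numpy as np

C_EFF = 0.194  # m/ns; assumed effective light speed in liquid scintillator

def chi2_vertex(vertex, pmts, t_obs, sigma=1.5):
    """Chi-square of observed PMT hit times against a point-vertex
    time-of-flight model; the emission time t0 is profiled out analytically."""
    d = np.linalg.norm(pmts - vertex, axis=1)
    resid = t_obs - d / C_EFF
    t0 = resid.mean()  # best-fit emission time for this vertex
    return np.sum(((resid - t0) / sigma) ** 2)

# Toy setup: 200 PMTs on a 17.7 m sphere, one shower vertex inside
rng = np.random.default_rng(1)
pmts = rng.normal(size=(200, 3))
pmts *= 17.7 / np.linalg.norm(pmts, axis=1, keepdims=True)
true_v = np.array([1.0, -2.0, 0.5])
t_obs = (100.0 + np.linalg.norm(pmts - true_v, axis=1) / C_EFF
         + rng.normal(0.0, 1.5, size=200))  # 1.5 ns timing smearing

# Coarse 1 m grid search over candidate vertices
axis = np.linspace(-4.0, 4.0, 9)
grid = np.array([[x, y, z] for x in axis for y in axis for z in axis])
best = grid[np.argmin([chi2_vertex(v, pmts, t_obs) for v in grid])]
```

Even this coarse search recovers the vertex to within the grid spacing, since a 1 m mislocation shifts predicted arrival times by several nanoseconds, well above the assumed timing noise.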

This application, while unrelated to optimizer development, emphasizes "VS" in the context of "Vertex reconstruction with waveform information," demonstrating the versatility of muon-based analysis methods in very different scientific domains (Zhang, 2022).

7. Implications and Outlook

Muon-VS, in its optimizer context, represents a principled fusion of matrix optimization and adaptive normalization, delivering practical gains in LLM pretraining by leveraging variance information while maintaining spectral regularity. The simplicity of implementation, hyperparameter efficiency, and robustness to batch size make it suitable for modern large-scale training pipelines. In experimental astroparticle physics, the Muon-VS (vertex reconstruction) methodology enables precise spatial localization of secondary muon interactions for background rejection, enhancing the scientific reach of large-volume detectors. Across both domains, these Muon-VS approaches exemplify the utility of statistical variance-adaptation and matrix analysis in handling complex, high-dimensional data (Li et al., 21 Jan 2026, Zhang, 2022).
