Muon-VS: Variance Scaling & Vertex Reconstruction
- Muon-VS denotes two unrelated methodologies: a variance-adaptive scaling variant of the Muon optimizer for LLM pretraining, and a vertex-reconstruction technique for muon-induced showers in astroparticle experiments.
- It applies variance scaling to matrix-valued momentum updates, improving convergence efficiency in large language model training without extra sensitivity tuning.
- In astroparticle experiments, Muon-VS uses waveform clustering and time-of-flight modeling to accurately localize muon shower vertices, significantly reducing background noise.
Muon-VS refers to two distinct methodologies, each with foundational importance in its respective field. In large-scale LLM optimization, Muon-VS denotes "Variance-Scaled Muon," an optimizer variant introducing variance-adaptive scaling into matrix-valued momentum updates for improved efficiency in model pretraining. Separately, in the context of neutrino and astroparticle experiments, "Muon-VS" appears as an abbreviation in the phrase "Muon shower Vertex reconstruction with waveform information," defining a vertex-reconstruction algorithm critical for reconstructing secondary interaction points (shower vertices) of muons traversing massive scintillator detectors. Both applications center on mathematical strategies for extracting fine-grained structural information from high-dimensional or high-rate data streams but operate in disjoint disciplinary regimes.
1. Variance-Scaled Muon Optimizer: Principle and Motivation
The Variance-Scaled Muon optimizer ("Muon-VS") is an extension of the Muon optimizer designed for LLM pretraining (Li et al., 21 Jan 2026). Its core innovation lies in integrating variance-adaptive scaling, inspired by the perspective that Adam acts as a variance-adaptive sign update—a mechanism that scales coordinate-wise updates according to local noise-to-signal ratio. In standard Muon, the update direction is given by a matrix-valued "sign" (via singular value normalization), treating all update directions equally and thus potentially overshooting in high-variance directions. Muon-VS remediates this by applying per-coordinate variance scaling prior to orthogonalization, dampening noisy updates and producing more stable and efficient convergence dynamics.
2. Mathematical Formulation and Update Rule
The Muon-VS update rule operates on matrix parameters $W \in \mathbb{R}^{n \times m}$. The construction involves the following steps:
- Exponential Moving Statistics:
- Compute exponential moving averages (EMAs) of the gradients, $M_t = \beta_1 M_{t-1} + (1-\beta_1)\,G_t$, and of the pre-orthogonalization variance, $V_t = \beta_2 V_{t-1} + (1-\beta_2)\,(G_t - M_{t-1})^2$, which involves the element-wise squared difference between the previous EMA and the current gradient.
Bias corrections $\hat{M}_t = M_t / (1-\beta_1^t)$ and $\hat{V}_t = V_t / (1-\beta_2^t)$ are applied in the usual manner.
- Nesterov-Style Extrapolation: $\tilde{M}_t = \beta_1 \hat{M}_t + (1-\beta_1)\,G_t$.
- Variance-Scaled Normalization: $S_t = \tilde{M}_t \,/\, \big(\sqrt{\hat{V}_t} + \epsilon\big)$, where $\epsilon$ is a stability constant.
- Orthogonalization and Parameter Update:
- Apply Newton–Schulz iterations to approximate the orthogonal polar factor $O_t \approx \mathrm{polar}(S_t)$.
- Perform the final parameter update using $W_t = W_{t-1} - \eta_t\,(\gamma\,O_t + \lambda\,W_{t-1})$,
where $\eta_t$ is the learning rate, $\lambda$ is the weight decay, and $\gamma$ is a scale term dependent on the parameter matrix size.
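The steps above can be sketched in NumPy. This is a minimal illustration, assuming Adam-style EMA conventions and the quintic Newton–Schulz coefficients common in open-source Muon implementations; the constants, the exact Nesterov form, and the shape-dependent scale are illustrative, not taken from (Li et al., 21 Jan 2026).

```python
import numpy as np

def newton_schulz(M, steps=5):
    """Approximate the orthogonal polar factor of M (the matrix "sign")
    via a quintic Newton-Schulz iteration; assumes rows <= cols
    (transpose first otherwise, as Muon implementations do)."""
    a, b, c = 3.4445, -4.7750, 2.0315      # common quintic coefficients
    X = M / (np.linalg.norm(M) + 1e-7)     # pre-scale so spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_vs_step(W, g, state, lr=0.02, beta1=0.95, beta2=0.95,
                 wd=0.0, eps=1e-8):
    """One hypothetical Muon-VS update on a matrix parameter W."""
    t = state["t"] = state["t"] + 1
    # Variance EMA uses the *previous* gradient EMA, per the text above.
    state["V"] = beta2 * state["V"] + (1 - beta2) * (g - state["M"]) ** 2
    state["M"] = beta1 * state["M"] + (1 - beta1) * g
    m_hat = state["M"] / (1 - beta1 ** t)      # bias corrections
    v_hat = state["V"] / (1 - beta2 ** t)
    m_nes = beta1 * m_hat + (1 - beta1) * g    # Nesterov-style extrapolation
    S = m_nes / (np.sqrt(v_hat) + eps)         # variance-scaled normalization
    O = newton_schulz(S)                       # orthogonal polar factor
    gamma = np.sqrt(max(W.shape))              # illustrative shape-based scale
    return W - lr * (gamma * O + wd * W)
```

Note that the variance scaling is applied element-wise *before* orthogonalization, so the subsequent Newton–Schulz step still equalizes the singular values of the damped update.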
No new tunable sensitivity hyperparameter is introduced; $\epsilon$ serves a similar role to the $\epsilon$ stabilizer in Adam (Li et al., 21 Jan 2026).
3. Empirical Performance in LLM Pretraining
Muon-VS has been benchmarked on GPT-2 (124M/350M) and LLaMA-1.2B pretraining tasks using large-scale corpora. In these settings:
- Muon-VS exhibits consistently superior wall-clock convergence and sample efficiency compared to Muon, AdaMuon, and AdamW.
- On the LLaMA-1.2B benchmark, Muon-VS and Muon-NSR markedly reduce the number of optimization iterations needed to reach a target validation loss relative to the best-tuned Muon baseline (see Figure 1 of (Li et al., 21 Jan 2026)).
- Ablation studies show that Muon-VS matches the improvements of Muon-NSR while avoiding the need to tune an NSR sensitivity coefficient.
- The improvements persist across a wide range of batch sizes and typical transformer-scale pretraining setups.
These results highlight that variance scaling prior to orthogonalization robustly accelerates and stabilizes pretraining trajectories, providing significant gains without increasing computational complexity (Li et al., 21 Jan 2026).
4. Implementation and Hyperparameter Considerations
Muon-VS preserves most of the original Muon hyperparameter structure. Key points are:
- Required hyperparameters: the learning rate $\eta$, EMA coefficients $\beta_1$ and $\beta_2$, weight decay $\lambda$, and stabilizer $\epsilon$.
- The only new term is the stabilizer $\epsilon$, whose value is chosen according to the floating-point precision in use.
- No extra buffer or tunable coefficients (e.g., variance dampening factors) beyond those already present in Muon.
- Implementation requires a single additional EMA buffer $V_t$.
- Main computational cost is a negligible increment for the variance buffer; all spectral and matrix operations are unchanged (Li et al., 21 Jan 2026).
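The memory accounting described above can be made explicit. A small sketch (the state layout and `variant` flag are assumptions for illustration, not the paper's API):

```python
import numpy as np

def init_state(W, variant="muon-vs"):
    """Optimizer state per matrix parameter: plain Muon keeps one
    momentum buffer M; the Muon-VS sketch adds a single variance EMA
    buffer V -- the only extra memory cost noted above."""
    state = {"t": 0, "M": np.zeros_like(W)}
    if variant == "muon-vs":
        state["V"] = np.zeros_like(W)   # the one additional buffer
    return state

W = np.zeros((1024, 4096), dtype=np.float32)
extra_bytes = init_state(W)["V"].nbytes  # one fp32 buffer the size of W
```

For this 1024×4096 fp32 matrix the overhead is 16 MiB, i.e. exactly one extra parameter-sized buffer per matrix.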
5. Contrasts and Relationship to Related Methods
Muon-VS is part of a family of methods that generalize optimizer updates via matrix-valued transformations. Its relation to other approaches is summarized below:
| Optimizer | Update Direction | Variance/Noise Adaptation | Extra Hyperparameters |
|---|---|---|---|
| AdamW | Coordinate-wise, sign-scaled | Yes (via local RMS) | Stabilizer $\epsilon$ |
| Muon | Matrix "sign" via normalization | No (equalizes all singular values) | None |
| Muon-NSR | Matrix "sign" + NSR scaling | Yes (noise-to-signal ratio) | Sensitivity coefficient |
| Muon-VS | Matrix "sign" + variance scaling | Yes (variance only) | Stabilizer $\epsilon$ |
Muon-VS allows soft damping of noisy coordinates, mitigating the risk of overshooting in stochastic regimes while imposing a consistent spectral structure on parameter updates. Omitting mean-normalization and extra multiplicative sensitivity factors is a simplification that yields more robust, hyperparameter-insensitive performance (Li et al., 21 Jan 2026).
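The contrast between the coordinate-wise "sign" in the Adam view and the matrix "sign" used by the Muon family can be made concrete with a small NumPy illustration (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 6))   # a gradient-like matrix

# Coordinate-wise sign (Adam-like view): every entry forced to +/-1,
# ignoring any correlation structure across the matrix.
elementwise_sign = np.sign(G)

# Matrix sign (Muon family): the orthogonal polar factor, which forces
# every *singular value* to 1 while preserving the singular directions.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
matrix_sign = U @ Vt

# The polar factor has orthonormal rows: matrix_sign @ matrix_sign.T == I.
print(np.allclose(matrix_sign @ matrix_sign.T, np.eye(4)))  # True
```

In practice the Muon family approximates this polar factor with Newton–Schulz iterations rather than an explicit SVD, since the iteration uses only matrix multiplies.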
6. Broader Usage: Muon Shower Vertex Reconstruction (JUNO Context)
In high-energy neutrino experiments such as JUNO, "Muon-VS" refers to the methodology for reconstructing muon-induced shower vertices using photomultiplier tube (PMT) waveform information (Zhang, 2022). This technique involves:
- Identifying localized energy-deposition bursts (showers) along through-going muon tracks by detecting multi-peak structures in PMT output.
- Employing time-of-flight modeling and waveform clustering to localize shower vertices with sub-meter spatial resolution.
- Constructing and minimizing a $\chi^2$ function comparing observed peak times to model predictions for candidate vertex positions.
- Achieving rejection of cosmogenic background isotopes ($^{9}$Li/$^{8}$He) with significantly reduced dead volume compared to full-track-based vetoes.
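The time-of-flight fit behind such a vertex reconstruction can be sketched as follows. This is a toy model under simplifying assumptions (point-like light emission, a single effective light speed, Gaussian timing errors, and a coarse grid scan instead of the experiment's actual minimizer); none of the constants or function names come from the JUNO analysis.

```python
import numpy as np

C_EFF = 0.194  # assumed effective light speed in scintillator, m/ns

def chi2(vertex, pmt_pos, t_obs, sigma_t=1.0):
    """Time-of-flight chi^2 for a candidate shower vertex: compare
    observed first-peak times at each PMT with predicted light-travel
    times from the vertex. Emission time t0 is profiled out."""
    d = np.linalg.norm(pmt_pos - vertex, axis=1)  # vertex-to-PMT distances, m
    t_pred = d / C_EFF                             # predicted arrival times, ns
    t0 = np.mean(t_obs - t_pred)                   # best-fit emission time
    return np.sum(((t_obs - t_pred - t0) / sigma_t) ** 2)

def grid_search_vertex(pmt_pos, t_obs, half=5.0, step=0.25):
    """Coarse grid scan for the chi^2 minimum; a real reconstruction
    would refine the best grid point with a local minimizer."""
    axis = np.arange(-half, half + step, step)
    best, best_v = np.inf, None
    for x in axis:
        for y in axis:
            for z in axis:
                v = np.array([x, y, z])
                c = chi2(v, pmt_pos, t_obs)
                if c < best:
                    best, best_v = c, v
    return best_v
```

With noiseless synthetic peak times the scan recovers the generating vertex exactly at grid resolution, which is the mechanism behind the sub-meter resolution quoted above.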
This application, while unrelated to optimizer development, emphasizes "VS" in the context of "Vertex reconstruction with waveform information," demonstrating the versatility of muon-based analysis methods in very different scientific domains (Zhang, 2022).
7. Implications and Outlook
Muon-VS, in its optimizer context, represents a principled fusion of matrix optimization and adaptive normalization, delivering practical gains in LLM pretraining by leveraging variance information while maintaining spectral regularity. The simplicity of implementation, hyperparameter efficiency, and robustness to batch size make it suitable for modern large-scale training pipelines. In experimental astroparticle physics, the Muon-VS (vertex reconstruction) methodology enables precise spatial localization of secondary muon interactions for background rejection, enhancing the scientific reach of large-volume detectors. Across both domains, these Muon-VS approaches exemplify the utility of statistical variance-adaptation and matrix analysis in handling complex, high-dimensional data (Li et al., 21 Jan 2026, Zhang, 2022).