
Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

Published 4 Jun 2025 in cs.LG, cs.AI, and stat.ML | (2506.03595v2)

Abstract: The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.

Summary

  • The paper demonstrates that eigenvalue corrections can replace learning rate grafting, leading to improved optimization performance.
  • It introduces an adaptive strategy using a warm-started QR algorithm to dynamically adjust eigenbasis updates based on training stages.
  • Empirical results on Imagewoof show that Shampoo variants without grafting can match or surpass the performance of grafted Shampoo while improving computational efficiency.

Investigating Shampoo's Heuristics by Decomposing its Preconditioner

Introduction

The paper "Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner" (2506.03595) examines the success and complexities of the Shampoo algorithm, a Kronecker-factorization-based optimization method for neural networks. Shampoo's victory in the AlgoPerf contest has rekindled interest in such structured preconditioned gradient techniques over standard methods like Adam. However, its reliance on heuristics such as learning rate grafting and stale preconditioners raises questions about whether they are necessary and whether principled alternatives exist.

Decomposition and Heuristics

Shampoo's performance relies heavily on learning rate grafting, a heuristic that adapts update magnitudes per layer to stabilize training in the face of varying layer-wise eigenspectra and infrequent preconditioner computations. The paper decouples the preconditioner's eigenvalue and eigenbasis updates in order to scrutinize them separately, challenging the need for grafting by showing that directly correcting the eigenvalues offsets their staleness and mis-scaling (Figure 1).

Figure 1: Shampoo with stale preconditioner without grafting shows performance issues with various hyperparameters on Imagewoof, compared to AdamW with grafting.
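The grafting heuristic discussed above can be summarized compactly: Shampoo supplies the update *direction*, while a concurrently maintained Adam update supplies the per-layer *magnitude*. A minimal sketch of this rescaling (function names are illustrative, not from the paper's implementation):

```python
import numpy as np

def graft_from_adam(shampoo_dir, adam_dir, eps=1e-16):
    """Learning rate grafting: keep Shampoo's update direction but
    rescale it to the norm of the Adam update for the same layer.
    A minimal sketch of the heuristic, not a tuned implementation."""
    norm_shampoo = np.linalg.norm(shampoo_dir)
    norm_adam = np.linalg.norm(adam_dir)
    # Rescale so the grafted update has Adam's magnitude.
    return shampoo_dir * (norm_adam / (norm_shampoo + eps))
```

Because the grafted update's norm equals the Adam update's norm, mis-scaled Shampoo eigenvalues no longer distort per-layer step sizes, which is exactly the failure mode the eigenvalue correction later addresses directly.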

Experimental Insights and Adaptivity

Empirical investigations demonstrate that correcting the preconditioner's eigenvalues can eliminate the need for grafting. In experiments comparing Shampoo variants, eigenvalue-corrected versions without grafting not only match but sometimes surpass grafted Shampoo, indicating that the heuristic and its associated hyperparameters can be removed under these conditions (Figure 2).

Figure 2: Training results comparing different Shampoo variants on Imagewoof, highlighting superiority in training loss and learning rate transfer across models.
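The eigenvalue-correction idea can be sketched as follows: rotate the gradient into the Kronecker eigenbasis, maintain an Adam-style second-moment estimate there to serve as the corrected eigenvalues, and rotate the normalized step back. This is a simplified illustration of the general technique; the paper's exact variant may differ in details, and all names here are illustrative:

```python
import numpy as np

def corrected_update(G, QL, QR, V, beta2=0.999, eps=1e-8):
    """Eigenvalue-corrected preconditioning sketch for a matrix gradient G.

    QL, QR: (possibly stale) eigenbases of the left/right Kronecker factors.
    V: running elementwise second-moment estimate in the rotated basis,
       acting as freshly corrected eigenvalues."""
    Gt = QL.T @ G @ QR                    # rotate gradient into the eigenbasis
    V = beta2 * V + (1 - beta2) * Gt**2   # Adam-style second moment = corrected eigenvalues
    step = Gt / (np.sqrt(V) + eps)        # normalize in-basis, as Adam would
    return QL @ step @ QR.T, V            # rotate the update back
```

Because the eigenvalues are re-estimated every step from fresh gradients, only the eigenbases can go stale, which motivates the adaptive eigenbasis update strategy below.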

Eigenbasis Update Strategy

The paper proposes a novel adaptive strategy for controlling eigenbasis approximation error: a relative error criterion, motivated by terminating a warm-started QR algorithm, determines when each eigenbasis is recomputed. This decouples the update frequencies of different preconditioner matrices and adapts them to the training stage, improving computational efficiency and convergence relative to fixed-frequency schedules, as the benchmarks show (Figure 3).

Figure 3: Adaptive eigendecomposition in EShampoo improves wall-clock efficiency on training tasks compared to static methods.
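The adaptive criterion can be illustrated with a small sketch: measure how far the current basis is from diagonalizing the updated factor matrix (zero off-diagonal mass means the basis is exact), and refresh it with a warm-started orthogonal-iteration (QR) step only when that relative error exceeds a tolerance. The error measure and tolerance here are hypothetical stand-ins for the paper's criterion:

```python
import numpy as np

def eigenbasis_error(A, Q):
    """Relative off-diagonal mass of Q^T A Q: zero when Q exactly
    diagonalizes the symmetric factor matrix A."""
    M = Q.T @ A @ Q
    off = M - np.diag(np.diag(M))
    return np.linalg.norm(off) / (np.linalg.norm(M) + 1e-16)

def maybe_refresh_basis(A, Q, tol=0.1):
    """Refresh the eigenbasis with one warm-started QR (orthogonal
    iteration) step only when the current basis is too stale for A."""
    if eigenbasis_error(A, Q) > tol:
        Q, _ = np.linalg.qr(A @ Q)  # warm start from the previous basis
    return Q
```

Since each preconditioner matrix drifts at its own rate, this per-matrix check naturally yields the decoupled, heterogeneous update frequencies the paper reports.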

Conclusions and Future Directions

The study conducts a comprehensive evaluation of Shampoo's heuristics, providing a principled pathway to simplifying the algorithm by eliminating grafting and adaptively scheduling eigenbasis updates. This carries broader implications for structured gradient methods and their scalability across diverse model architectures. Moving forward, the work raises prospects for exploring similar adaptivity in other optimizers such as K-FAC and for theoretical analyses that integrate approximation quality into regret bounds (Figure 4).

Figure 4: Analysis indicating variable update frequencies of different preconditioner matrices within a model, highlighting distinct computational dynamics.

Overall, this endeavor not only demystifies heuristics of Kronecker-factored optimizers but sets a cornerstone for future research aimed at refining large-scale learning algorithms, particularly in enhancing the balance between convergence quality and computational cost.
