
SOAP: Improving and Stabilizing Shampoo using Adam

Published 17 Sep 2024 in cs.LG and cs.AI | (2409.11321v2)

Abstract: There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on LLM pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.


Summary

  • The paper demonstrates that SOAP reduces training iterations by over 40% relative to AdamW by performing updates in a rotated eigenbasis that combines Shampoo's second-order information with Adam's adaptive learning rates.
  • It introduces a dual-layer update mechanism that maintains preconditioner matrices while executing efficient AdamW updates, simplifying hyperparameter tuning with a single additional parameter (the preconditioning frequency).
  • Experimental results show SOAP outperforms AdamW and the original Shampoo, with approximately 20% improvements in both iterations and wall clock time over Shampoo.


Introduction

The paper "SOAP: Improving and Stabilizing Shampoo using Adam" proposes a novel optimization algorithm aimed at improving the efficiency of LLM training. The algorithm, SOAP (ShampoO with Adam in the Preconditioner's eigenbasis), builds on the efficacy of the Shampoo optimizer, addressing its computational overhead by leveraging the adaptive qualities of Adam in a rotated coordinate space. By combining Shampoo's second-order information with Adam's adaptive learning rates, SOAP significantly reduces both the number of training iterations and the wall clock time required for LLM pre-training compared to traditional methods such as AdamW and the original Shampoo.

Technical Contributions

One of the paper's primary contributions is establishing a formal connection between Shampoo and Adafactor by demonstrating their equivalence in the eigenspace provided by Shampoo's preconditioner. This insight underpins the derivation of SOAP, which effectively executes AdamW updates in this eigenbasis—a more computationally feasible and efficient method. Crucially, SOAP introduces a single additional hyperparameter, preconditioning frequency, simplifying hyperparameter tuning compared to the multiple parameters required by Shampoo. Figure 1

Figure 1: Comparing performance of tuned runs for AdamW, Shampoo, and SOAP, highlighting SOAP's efficiency advantages.
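In symbols, the connection can be sketched as follows (a simplified rendering of the claim, assuming Shampoo's preconditioners $L$ and $R$ have eigendecompositions $L = Q_L \Lambda_L Q_L^\top$ and $R = Q_R \Lambda_R Q_R^\top$). The Shampoo update with the $1/2$ power then factors as

$$L^{-1/2}\, G\, R^{-1/2} \;=\; Q_L \Big( \Lambda_L^{-1/2}\, \big(Q_L^\top G\, Q_R\big)\, \Lambda_R^{-1/2} \Big)\, Q_R^\top,$$

so the rotated gradient $Q_L^\top G\, Q_R$ is rescaled by per-row and per-column diagonal factors, the same row/column second-moment structure that Adafactor maintains, before being rotated back.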

SOAP's performance was empirically validated on LLM tasks, showcasing a reduction in the required number of iterations by over 40% and a decrease in wall clock time by over 35% compared to AdamW. These improvements are achieved without sacrificing training outcomes, further emphasizing SOAP's potential in real-world applications.

Algorithmic Design

The SOAP algorithm is underpinned by a dual-layer update mechanism: it maintains Shampoo-like second-order information while benefiting from Adam-like adaptivity. Specifically, SOAP performs updates in a rotated space defined by the eigenvectors of the Shampoo preconditioner matrices L and R. These matrices are updated less frequently to reduce computational cost, while Adam's running averages are continually adapted, ensuring robustness across different preconditioning frequencies. Figure 2
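To make the mechanism concrete, here is a minimal NumPy sketch of one SOAP-style step for a single 2-D parameter. All names and defaults are illustrative, and it omits details the full algorithm handles (bias correction, weight decay, and re-projecting Adam's moments when the eigenbasis is refreshed); see the authors' repository for the real implementation.

```python
import numpy as np

def soap_step(W, G, state, lr=3e-4, b1=0.95, b2=0.95, eps=1e-8, precond_freq=10):
    """One simplified SOAP-style step for a single 2-D parameter W (sketch only)."""
    # Shampoo-style second-order statistics over rows (L) and columns (R).
    state["L"] = b2 * state["L"] + (1 - b2) * G @ G.T
    state["R"] = b2 * state["R"] + (1 - b2) * G.T @ G
    # Refresh the eigenbasis only every `precond_freq` steps (the one extra knob).
    if state["t"] % precond_freq == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    QL, QR = state["QL"], state["QR"]
    # Rotate the gradient into the (slowly changing) eigenbasis.
    Gr = QL.T @ G @ QR
    # Adam-style moments maintained in the rotated space, updated every step.
    state["m"] = b1 * state["m"] + (1 - b1) * Gr
    state["v"] = b2 * state["v"] + (1 - b2) * Gr**2
    update = state["m"] / (np.sqrt(state["v"]) + eps)
    state["t"] += 1
    # Rotate the update back to the original coordinates and apply it.
    return W - lr * QL @ update @ QR.T

def init_state(m, n):
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "QL": np.eye(m), "QR": np.eye(n),
            "m": np.zeros((m, n)), "v": np.zeros((m, n)), "t": 0}
```

The single extra knob relative to Adam is `precond_freq`: the eigendecompositions run only every `precond_freq` steps, while the Adam-style moments `m` and `v` are updated at every step in the current basis.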

Figure 2: Precise efficiency benefits of SOAP over AdamW and Shampoo for different model and batch sizes.

Maintaining L and R requires storing and manipulating additional matrices, but the resultant benefits in optimization efficiency justify this overhead. The trade-off between computational savings and memory usage can be managed through various implementation strategies, including the use of reduced precision for certain operations.

Experimental Analysis

SOAP's efficacy was demonstrated through extensive experiments on LLM pre-training tasks, involving models with 360 million and 660 million parameters. Comparisons against AdamW and DistributedShampoo highlighted significant efficiency improvements:

  • Approximately 20% fewer iterations and approximately 20% less wall clock time than Shampoo.
  • A consistent performance advantage over a range of batch sizes and preconditioning frequencies.

These results underscore SOAP's robustness and adaptability, particularly in scenarios with large batch sizes where second-order methods are traditionally disadvantaged due to higher per-iteration costs. Figure 3

Figure 3: Performance of SOAP variants focused on optimizing space or time usage, offering insights into potential efficiency improvements.

Implementation Considerations

Implementing SOAP involves careful attention to computational details, such as eigenvector computation frequency and the precision of matrix operations. The algorithm benefits from modern hardware configurations, such as distributed computing environments where the overhead of matrix operations can be distributed across multiple GPUs. Additionally, the paper suggests potential for further optimization through one-sided projections or low-rank approximations for matrix updates, which could yield further savings in both time and space complexities.
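As a concrete illustration of amortizing the eigenvector computation, the sketch below implements a warm-started refresh in NumPy: one step of block power iteration started from the previous basis, re-orthonormalized with a QR decomposition, instead of a full eigendecomposition. Names are illustrative, not the paper's.

```python
import numpy as np

def refresh_basis(L, Q_prev):
    """Approximately update the eigenbasis of a symmetric PSD matrix L.

    One step of block power iteration warm-started from the previous basis,
    re-orthonormalized with QR. This is much cheaper than a full
    eigendecomposition when the basis drifts slowly between refreshes.
    """
    Q, _ = np.linalg.qr(L @ Q_prev)
    return Q
```

Because the preconditioner drifts slowly between refreshes, the previous basis `Q_prev` is already close to the new eigenbasis, so a single multiply-and-QR step is typically enough to track it.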

Conclusion

The SOAP algorithm offers a compelling advancement in the field of deep learning optimization, particularly for LLM training tasks. By marrying the strengths of Shampoo's second-order preconditioning with Adam's adaptivity, SOAP addresses key challenges in optimization efficiency. Future work could explore enhancements in SOAP's implementation, such as integrating distributed computation or exploring alternative low-precision storage techniques, to further extend its applicability and resource efficiency in large-scale AI applications.


Explain it Like I'm 14

Overview

Training a big AI model is like trying to walk downhill to the lowest point on a huge, bumpy landscape. Optimizers are the “walking rules” that decide how big a step to take and in which direction, based on the slope under your feet.

This paper introduces an optimizer called SOAP. It combines two popular ideas:

  • Adam (a fast, simple optimizer)
  • Shampoo (a smarter optimizer that understands the landscape’s shape but is more expensive to run)

SOAP runs Adam in a smart, rotated coordinate system chosen by Shampoo. Think of it like turning your map so that “downhill” lines up neatly with the axes, making your steps more efficient.

Key Questions the Paper Answers

  • Can we connect Shampoo to a simpler method so we keep its benefits but make it cheaper and easier to use?
  • Can we build a practical optimizer (SOAP) that uses Adam in Shampoo’s “best directions” to speed up training?
  • Does SOAP train LLMs faster than AdamW and Shampoo?
  • Is SOAP simpler to tune and more stable when we do the expensive calculations less often?

Methods and Approach (explained simply)

  • The authors first show a neat math connection: Shampoo with a specific setting (using the “1/2 power”) is equivalent to running Adafactor (a memory-saving version of Adam) in Shampoo’s special coordinate system, called its “eigenbasis.”
    • Eigenbasis: Imagine rotating your map so the steepest and flattest directions line up with the x- and y-axes. That makes it easier to choose step sizes in each direction.
    • Preconditioner: A tool that adjusts step sizes depending on direction—smaller steps on steep slopes, bigger steps on gentle slopes.
  • Based on this insight, they design SOAP:
    • Compute the “smart directions” (eigenvectors) from Shampoo every so often, not every step, to save time.
    • At every step, run Adam in this rotated space where directions are lined up nicely.
    • Keep updating “running averages” (Adam’s memory of past gradients) so the optimizer stays adaptive and stable even if the directions change slowly.
    • Rotate the updates back to the original space and apply them to the model weights.
  • Experiments:
    • Train LLMs (about 360 million and 660 million parameters) on standard data, with large batches of tokens.
    • Compare SOAP to AdamW and Shampoo.
    • Measure:
      • Training speed (how many steps needed and total time)
      • Stability when the expensive “direction finding” step is done less often (this is the preconditioning frequency).
    • Use careful tuning and a standard learning rate schedule to make fair comparisons.
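To see the "rotate the map" idea in action, here is a tiny NumPy toy (not the paper's code): a tilted bowl-shaped loss whose steep and flat directions do not line up with the axes. Rotating the gradient into the bowl's own axes lets Adam-style per-direction step sizes do their job.

```python
import numpy as np

# Toy "bumpy landscape": a tilted bowl, very steep along one diagonal
# direction and much flatter along the other.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # the "smart directions"
H = Q @ np.diag([100.0, 1.0]) @ Q.T              # curvature of the bowl

def loss(x):
    return 0.5 * x @ H @ x

x = np.array([1.0, 0.3])
v = np.zeros(2)  # running average of squared gradients, per direction
for _ in range(100):
    g_rot = Q.T @ (H @ x)               # rotate the slope into the smart axes
    v = 0.9 * v + 0.1 * g_rot**2        # Adam-style memory, per direction
    step = g_rot / (np.sqrt(v) + 1e-8)  # small steps where steep, big where flat
    x = x - 0.05 * (Q @ step)           # rotate back and take the step
```

Here the rotation is known exactly; the real algorithm has to estimate these directions from the gradients it sees, which is what the Shampoo preconditioner provides.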

Main Findings and Why They Matter

Here are the main results from large-batch training (around 2 million tokens per step):

  • SOAP is faster than AdamW and Shampoo:
    • About 40% fewer training steps than AdamW.
    • About 35% less wall-clock time than AdamW.
    • About 20% fewer steps and time than Shampoo.
  • SOAP stays strong even if you update the “smart directions” less often:
    • When the preconditioning frequency is lower (you compute eigenvectors rarely), Shampoo’s performance drops more noticeably.
    • SOAP’s performance drops much more slowly, so it’s more robust.
  • SOAP is simpler to use:
    • Compared to AdamW, SOAP adds just one extra knob: how often to update the smart directions (the preconditioning frequency).
    • Compared to Shampoo, SOAP has fewer hyperparameters to tune.
  • Smaller batch sizes:
    • With smaller batches (256k tokens), the speedup is smaller but still positive: roughly 25% fewer steps than AdamW and about 10–12% fewer than Shampoo, with around 15% wall-clock improvement over AdamW.
  • Practical engineering choices:
    • They use a faster way to estimate eigenvectors (one-step power iteration plus QR decomposition) rather than a slower, exact method.
    • They explore variations that trade a tiny bit of performance for lower memory and compute, like using Adafactor inside SOAP or rotating only one side of a layer.

These results matter because training LLMs is very expensive. If you can cut training time by 20–40%, you save lots of money and can iterate faster on research.

Implications and Potential Impact

  • Faster, cheaper training: SOAP can reduce the time and compute needed to train big models, which helps both research labs and companies.
  • Simpler and more stable: Because SOAP is robust when you don’t update the “smart directions” very often, it’s easier to deploy at scale.
  • A useful design idea: Running a simple optimizer (like Adam) in a smarter coordinate system (from Shampoo) is a powerful combination. This approach could be applied to other optimizers and tasks.
  • Future improvements: The paper suggests speeding SOAP up even more by using lower precision for certain calculations and better distributed implementations. It could also be tested in areas beyond language, like vision models.

In short, SOAP blends the best of both worlds—Adam’s simplicity and Shampoo’s smarts—to train large models faster with fewer tuning headaches.
