
Newton–Schulz Iteration in Matrix Computations

Updated 9 January 2026
  • Newton–Schulz iteration is a matrix polynomial method that computes the inverse, inverse square root, and polar factors of matrices using only multiplications.
  • The method achieves quadratic or higher-order convergence, significantly accelerating computations in large-scale systems and machine learning applications.
  • Advanced variants integrate preconditioning, Chebyshev optimization, and parallelism to enhance stability and reduce the computational load per iteration.

The Newton–Schulz iteration is a matrix polynomial method for computing the inverse, inverse square root, and polar factors of matrices using only matrix multiplications. Its efficiency, quadratic or higher-order convergence, and suitability for large-scale parallel systems have made it a core algorithm in numerical linear algebra and machine learning, particularly in applications involving matrix orthogonalization, SPD matrix inversion, and optimization on matrix manifolds. Its contemporary relevance is underscored by recent advances in preconditioning, spectral acceleration, and integration into large-scale computational pipelines (Boissin et al., 4 Dec 2025, Grishina et al., 12 Jun 2025, Stotsky, 2022, Stotsky, 2020, Challacombe et al., 2015).

1. Core Principles and Iteration Schemes

The classical Newton–Schulz (NS) iteration seeks to compute the inverse or inverse square root of a matrix $A\in\mathbb{R}^{n\times n}$, or the closest orthogonal factor $Q$ (the polar factor) of a general matrix $X$. For positive definite $A$, NS can be written as

$X_{k+1} = \tfrac{1}{2} X_k (3I - A X_k X_k)$

or, in its standard form,

$X_{k+1} = \tfrac{1}{2} X_k (3I - X_k^T X_k)$

with quadratic convergence provided $\|X_0\|_2 \leq 1$ (Boissin et al., 4 Dec 2025). For inversion, the update is typically

$G_k = 2G_{k-1} - G_{k-1} A G_{k-1}$

(Stotsky, 2022, Stotsky, 2020), which is equivalent to

$G_k = G_{k-1}(2I - A G_{k-1}).$

In terms of error, the method achieves

$\|F_{k+1}\| \leq \|F_k\|^2 \quad \text{with } F_k = I - G_k A,$

indicating error squaring (quadratic convergence) once sufficiently close to the solution. The approach generalizes to higher-order (e.g., cubic or quintic) variants by composing suitable polynomials of the residual.
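The quadratic inversion update above can be sketched in a few lines of NumPy. The initialization $G_0 = A^T/(\|A\|_1\|A\|_\infty)$ is a standard safe choice (an assumption here, not specified in the text) that places the spectral radius of $I - G_0 A$ strictly below one:

```python
import numpy as np

def newton_schulz_inverse(A, num_iters=20):
    """Quadratic Newton-Schulz approximation of A^{-1}:
    G <- G (2I - A G), so the residual F = I - G A squares each step."""
    n = A.shape[0]
    I = np.eye(n)
    # Classical safe start: G0 = A^T / (||A||_1 ||A||_inf) puts the
    # spectral radius of I - G0 A strictly below 1 for nonsingular A.
    G = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(num_iters):
        G = G @ (2 * I - A @ G)
    return G

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T / 50 + np.eye(50)              # well-conditioned SPD test matrix
G = newton_schulz_inverse(A)
print(np.linalg.norm(np.eye(50) - G @ A))  # residual near machine precision
```

After a handful of slow early steps, the error-squaring regime drives the residual to machine precision in roughly a dozen iterations.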

For the matrix square root and inverse square root, the Newton–Schulz map takes the form

$h_\alpha(X) = \frac{\sqrt{\alpha}}{2}(3I - \alpha X)$

(Challacombe et al., 2015), and dual- or single-channel updates may be used for $(Y_k, Z_k)$, converging respectively to $A^{1/2}$ and $A^{-1/2}$. These iterations are stable and quadratic in the “basin of contraction.”
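As an illustration, here is a minimal NumPy sketch of a coupled two-channel Newton–Schulz square-root iteration for SPD matrices. The pre-scaling by the spectral norm and the SPD test matrix are assumptions chosen to land inside the basin of contraction, not prescriptions from the cited paper:

```python
import numpy as np

def coupled_newton_schulz_sqrt(A, num_iters=30):
    """Coupled two-channel Newton-Schulz iteration for SPD A:
    Y_k -> A^{1/2} and Z_k -> A^{-1/2}, using matmuls only."""
    n = A.shape[0]
    c = np.linalg.norm(A, 2)      # scale so eigenvalues of A/c lie in (0, 1]
    I = np.eye(n)
    Y, Z = A / c, I.copy()
    for _ in range(num_iters):
        T = 0.5 * (3 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return np.sqrt(c) * Y, Z / np.sqrt(c)   # undo the initial scaling

rng = np.random.default_rng(1)
B = rng.standard_normal((40, 40))
A = B @ B.T / 40 + np.eye(40)     # SPD test matrix
S, Sinv = coupled_newton_schulz_sqrt(A)
print(np.linalg.norm(S @ S - A), np.linalg.norm(S @ Sinv - np.eye(40)))
```

Both channels share the single polynomial factor $T_k = \tfrac{1}{2}(3I - Z_k Y_k)$, which is what keeps the per-iteration cost to a few matrix products.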

2. High-Order, Spectral, and Chebyshev-Optimized Variants

High-order Newton–Schulz schemes generalize the quadratic iteration to convergence of order $n$ by using Neumann-series-inspired polynomials: $G_k = \left( \sum_{j=0}^{n-1} F_{k-1}^j \right) G_{k-1}$ with $F_{k-1} = I - G_{k-1} A$, achieving $F_k = F_{k-1}^n$ and thus higher-order error reduction (Stotsky, 2020, Stotsky, 2022). Factorization theory reduces the number of matrix products per iteration by decomposing the power-series polynomials.

Recent advances include Chebyshev-optimized Newton–Schulz (CANS) (Grishina et al., 12 Jun 2025), which systematically minimizes the worst-case approximation error over the residual singular value interval by selecting polynomial coefficients via Chebyshev alternance or Remez algorithms: $p(x) = \alpha_1 x + \alpha_3 x^3$ (degree 3), with closed-form expressions for the optimal coefficients. CANS significantly accelerates convergence and reduces the number of required matmuls in orthogonalization and Riemannian retractions.
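The effect of such odd polynomials on the singular values can be illustrated with the classical NS coefficients $(\alpha_1, \alpha_3) = (3/2, -1/2)$ as a stand-in; the CANS-optimal coefficients from the paper would simply replace these defaults, and the iteration counts below are illustrative:

```python
import numpy as np

def odd_poly_step(X, a1=1.5, a3=-0.5):
    """One matmul-only application of the odd polynomial
    p(x) = a1*x + a3*x^3 to the singular values of X:
    p(X) = a1*X + a3 * X (X^T X).
    Defaults are the classical Newton-Schulz coefficients (3/2, -1/2);
    a CANS step would substitute the minimax-optimal (a1, a3)."""
    return a1 * X + a3 * (X @ (X.T @ X))

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 60))
X = X / np.linalg.norm(X, 2)    # pre-scale so all singular values are <= 1
for _ in range(30):
    X = odd_poly_step(X)
s = np.linalg.svd(X, compute_uv=False)
print(s.min(), s.max())         # singular values pushed toward 1
```

Since $X = U\Sigma V^T$ gives $p(X) = U\,p(\Sigma)\,V^T$, each step acts independently on the singular values, which is exactly the structure CANS exploits when optimizing $p$ over the singular-value interval.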

3. Preconditioning, Initialization, and Spectral Scaling

The rate of Newton–Schulz convergence is highly sensitive to the spectrum of the initial iterate. Poor conditioning (large $\kappa(X_0) = \sigma_{\max}/\sigma_{\min}$) slows convergence; improper scaling can even cause divergence.

A notable improvement is the “Almost-Orthogonal Layer” (AOL) preconditioner (Boissin et al., 4 Dec 2025), which replaces naive Frobenius-norm normalization with a data-dependent diagonal scaling: $s_i = \left( \sum_{j=1}^n |A_0^{ij}| \right)^{-1/2}$, $X_1 = X_0 \operatorname{diag}(s)$, where $A_0 = X_0^T X_0$. This procedure reduces the initial polar error by an order of magnitude, tightens spectral bounds via the Gershgorin theorem, and enables the quadratic regime to be entered earlier. Preconditioning steps incur negligible overhead (one matrix multiplication reused in the first NS step plus cheap elementwise operations), thus speeding up the entire sequence.
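A minimal sketch of this diagonal scaling (the function name and test setup are illustrative, not from the paper):

```python
import numpy as np

def aol_precondition(X0):
    """AOL-style diagonal preconditioning (sketch): with A0 = X0^T X0,
    set s_i = (sum_j |A0_ij|)^(-1/2) and return X1 = X0 diag(s).
    Gershgorin-type bounds then give ||X1||_2 <= 1."""
    A0 = X0.T @ X0                 # this product is reused by the first NS step
    s = 1.0 / np.sqrt(np.abs(A0).sum(axis=1))
    return X0 * s                  # broadcasting = right-multiply by diag(s)

rng = np.random.default_rng(4)
X0 = rng.standard_normal((100, 100)) / 100
X1 = aol_precondition(X0)
print(np.linalg.norm(X1, 2))       # spectral norm is now at most 1
```

Because $\|X_1\|_2 \leq 1$ is guaranteed, the preconditioned iterate always starts inside the convergence region of the polar NS iteration.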

4. Algorithmic Variants, Complexity, and Parallelism

For the classical and high-order NS, each iteration requires several matrix–matrix products. For instance, in the polar factor quintic scheme (Boissin et al., 4 Dec 2025):

  • One $X_k^T X_k$ matmul,
  • One $X_k B_k$ matmul,
  • One $A_k^2$ matmul (small dimension).

Total complexity per iteration is $O(n^3)$, with three multiplies per step for large square matrices. High-order/factorized forms reduce the total number of products per effective update. Unified factorization approaches (Stotsky, 2022, Stotsky, 2020), e.g., sum-to-factors decomposition, enable multiplications to be minimized or executed in parallel, guided by the efficiency index $\mathrm{EI} = n^{1/\#\text{products}}$.

Parallel implementations exploit the compositional structure of high-order schemes, with distinct stages running concurrently. Composite power-series expansions (Stotsky, 2020) allow for further speedup by unrolling multiple inner expansions over different computational units, each tuned to the local architecture.

5. Applications in Optimization and Scientific Computing

The NS iteration is foundational for a variety of computational tasks:

  • Matrix orthogonalization: Used in optimizers such as Muon and Turbo-Muon for gradient step orthogonalization (Boissin et al., 4 Dec 2025, Grishina et al., 12 Jun 2025).
  • Riemannian optimization: Polar retraction on the Stiefel manifold leverages NS/CANS as a differentiable and efficient alternative to QR or Cayley retractions (Grishina et al., 12 Jun 2025).
  • Matrix inversion/parameter estimation: Integrated with Richardson iteration, Newton–Schulz accelerates convergence in solving linear systems and estimation problems, particularly when robust treatment of rank-deficient matrices is required (Stotsky, 2022, Stotsky, 2020).
  • Sparse/Massively parallel workloads: In settings with structured decay, the combination with SpAMM (Sparse Approximate Matrix Multiply) enables sub-cubic algorithmic complexity for large SPD systems (Challacombe et al., 2015).

6. Numerical Stability, Error Analysis, and Regularization

Quadratic convergence of NS is local; for global reliability, spectral scaling and condition monitoring are required. Error analysis distinguishes between orientational (derivative-driven) and accumulation (approximation/occlusion) errors. Algorithmic modifications, such as dual-channel variants for inverse square root or tighter control of sensitive channels in SpAMM-NS, remedy propagation of instability (Challacombe et al., 2015).

For ill-conditioned scenarios, regularization ($A_\mu = A + \mu I$) combined with nested product representations ensures both tractability and accuracy, as large condition numbers can erode the practical gains of approximate matmul-driven NS. This telescoping approach enables each subproblem to be solved efficiently at a lower tolerance, yielding composite solutions with reductions in work of up to two orders of magnitude for extremely ill-conditioned matrices.
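A rough sketch of the regularized idea, realized here as a decreasing $\mu$-schedule with warm-started NS solves (the schedule, iteration counts, and warm-starting are illustrative assumptions, not the nested product representation of the cited work):

```python
import numpy as np

def inverse_via_regularized_ns(A, mus=(1.0, 0.1, 0.01, 0.0), iters=15):
    """Invert an ill-conditioned A by Newton-Schulz on A_mu = A + mu*I
    for a decreasing mu-schedule, warm-starting each level from the
    previous level's approximate inverse."""
    n = A.shape[0]
    I = np.eye(n)
    A0 = A + mus[0] * I
    G = A0.T / (np.linalg.norm(A0, 1) * np.linalg.norm(A0, np.inf))
    for mu in mus:
        Amu = A + mu * I
        for _ in range(iters):
            G = G @ (2 * I - Amu @ G)   # quadratic NS at this level
    return G

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
A = Q @ np.diag(np.logspace(0, -4, 30)) @ Q.T   # condition number 1e4
G = inverse_via_regularized_ns(A)
print(np.linalg.norm(G @ A - np.eye(30)))       # small residual
```

Each level is well conditioned relative to its warm start, so no single NS solve ever has to contract from a residual arbitrarily close to one.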

7. Empirical Results and Practical Guidance

Empirical benchmarks report:

  • AOL preconditioning in Turbo-Muon reduces the polar error after the first NS iteration by over an order of magnitude (for $n = 8192$) and enables one fewer iteration for comparable precision, yielding a 2.8× speedup on A100 bfloat16 hardware (Boissin et al., 4 Dec 2025).
  • In LLM training, the orthogonalization overhead of Muon is reduced from 10–20% to 3%, delivering 8–10% end-to-end speed gains. Similar improvements are reported for vision tasks.
  • CANS-accelerated polar retraction achieves 30–40% per-epoch time reductions versus QR/Cayley in Wide ResNet/CIFAR-10 training, with equal or better model accuracy (Grishina et al., 12 Jun 2025).
  • In parameter estimation with failure-prone sensor data, Newton–Schulz-augmented Richardson iteration robustly outperforms LU-based solvers, particularly under extreme rank-deficiency (Stotsky, 2022).

Practical takeaways include the robust applicability of AOL-preconditioned NS as a drop-in replacement (no hyperparameter retuning), immediate and architecture-independent speedups, and preservation of model or system performance for a broad spectrum of gradient and matrix statistics (Gaussian, heavy-tailed, etc.).

