
Newton–Schulz Iteration in Matrix Computations

Updated 9 January 2026
  • Newton–Schulz iteration is a matrix polynomial method that computes the inverse, inverse square root, and polar factors of matrices using only multiplications.
  • The method achieves quadratic or higher-order convergence, significantly accelerating computations in large-scale systems and machine learning applications.
  • Advanced variants integrate preconditioning, Chebyshev optimization, and parallelism to enhance stability and reduce the computational load per iteration.

The Newton–Schulz iteration is a matrix polynomial method for computing the inverse, inverse square root, and polar factors of matrices using only matrix multiplications. Its efficiency, quadratic or higher-order convergence, and suitability for large-scale parallel systems have made it a core algorithm in numerical linear algebra and machine learning, particularly in applications involving matrix orthogonalization, SPD matrix inversion, and optimization on matrix manifolds. Its contemporary relevance is underscored by recent advances in preconditioning, spectral acceleration, and integration into large-scale computational pipelines (Boissin et al., 4 Dec 2025, Grishina et al., 12 Jun 2025, Stotsky, 2022, Stotsky, 2020, Challacombe et al., 2015).

1. Core Principles and Iteration Schemes

The classical Newton–Schulz (NS) iteration seeks to compute the inverse or inverse square root of a matrix $A\in\mathbb{R}^{n\times n}$, or the closest orthogonal factor $Q$ (the polar factor) of a general matrix $X$. For positive definite $A$, NS can be written as

$X_{k+1} = \tfrac{1}{2} X_k (3I - A X_k X_k)$

or, in its standard form,

$X_{k+1} = \tfrac{1}{2} X_k (3I - X_k^T X_k)$

with quadratic convergence provided $\|X_0\|_2 \leq 1$ (Boissin et al., 4 Dec 2025). For inversion, the update is typically

$G_k = 2G_{k-1} - G_{k-1} A G_{k-1}$

(Stotsky, 2022, Stotsky, 2020), which is equivalent to

$G_k = G_{k-1}(2I - A G_{k-1}).$

In terms of error, the method achieves

$\|F_{k+1}\| \leq \|F_k\|^2 \quad \text{with } F_k = I - G_k A,$

indicating error squaring (quadratic convergence) once sufficiently close to the solution. The approach generalizes to higher-order (e.g., cubic or quintic) variants by composing suitable polynomials of the residual.
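The quadratic inversion update above can be sketched in a few lines of NumPy. The initialization $G_0 = A^T/(\|A\|_1\|A\|_\infty)$ is a standard safe choice (an assumption here, not specified in the text) that places the spectral radius of $I - G_0 A$ strictly below one:

```python
import numpy as np

def newton_schulz_inverse(A, num_iters=20):
    """Quadratic Newton-Schulz approximation of A^{-1}:
    G <- G (2I - A G), so the residual F = I - G A squares each step."""
    n = A.shape[0]
    I = np.eye(n)
    # Classical safe start: G0 = A^T / (||A||_1 ||A||_inf) puts the
    # spectral radius of I - G0 A strictly below 1 for nonsingular A.
    G = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(num_iters):
        G = G @ (2 * I - A @ G)
    return G

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T / 50 + np.eye(50)              # well-conditioned SPD test matrix
G = newton_schulz_inverse(A)
print(np.linalg.norm(np.eye(50) - G @ A))  # residual near machine precision
```

After a handful of slow early steps, the error-squaring regime drives the residual to machine precision in roughly a dozen iterations.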

For the matrix square root and inverse square root, the Newton–Schulz map takes the form

$h_\alpha(X) = \frac{\sqrt{\alpha}}{2}(3I - \alpha X)$

(Challacombe et al., 2015), and dual- or single-channel updates may be used for $(Y_k, Z_k)$, converging respectively to $A^{1/2}$ and $A^{-1/2}$. These iterations are stable and quadratic in the “basin of contraction.”
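As an illustration, here is a minimal NumPy sketch of a coupled two-channel Newton–Schulz square-root iteration for SPD matrices. The pre-scaling by the spectral norm and the SPD test matrix are assumptions chosen to land inside the basin of contraction, not prescriptions from the cited paper:

```python
import numpy as np

def coupled_newton_schulz_sqrt(A, num_iters=30):
    """Coupled two-channel Newton-Schulz iteration for SPD A:
    Y_k -> A^{1/2} and Z_k -> A^{-1/2}, using matmuls only."""
    n = A.shape[0]
    c = np.linalg.norm(A, 2)      # scale so eigenvalues of A/c lie in (0, 1]
    I = np.eye(n)
    Y, Z = A / c, I.copy()
    for _ in range(num_iters):
        T = 0.5 * (3 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return np.sqrt(c) * Y, Z / np.sqrt(c)   # undo the initial scaling

rng = np.random.default_rng(1)
B = rng.standard_normal((40, 40))
A = B @ B.T / 40 + np.eye(40)     # SPD test matrix
S, Sinv = coupled_newton_schulz_sqrt(A)
print(np.linalg.norm(S @ S - A), np.linalg.norm(S @ Sinv - np.eye(40)))
```

Both channels share the single polynomial factor $T_k = \tfrac{1}{2}(3I - Z_k Y_k)$, which is what keeps the per-iteration cost to a few matrix products.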

2. High-Order, Spectral, and Chebyshev-Optimized Variants

High-order Newton–Schulz schemes generalize the quadratic iteration to convergence of order $n$ by using Neumann-series-inspired polynomials: $G_k = \left( \sum_{j=0}^{n-1} F_{k-1}^j \right) G_{k-1}$ with $F_{k-1} = I - G_{k-1} A$, achieving $F_k = F_{k-1}^n$ and thus higher-order error reduction (Stotsky, 2020, Stotsky, 2022). Factorization theory reduces the number of matrix products per iteration by decomposing the power-series polynomials.

Recent advances include Chebyshev-optimized Newton–Schulz (CANS) (Grishina et al., 12 Jun 2025), which systematically minimizes the worst-case approximation error over the residual singular value interval by selecting polynomial coefficients via Chebyshev alternance or Remez algorithms: $p(x) = \alpha_1 x + \alpha_3 x^3$ (degree 3), with closed-form expressions for the optimal coefficients. CANS significantly accelerates convergence and reduces the number of required matmuls in orthogonalization and Riemannian retractions.
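The effect of such odd polynomials on the singular values can be illustrated with the classical NS coefficients $(\alpha_1, \alpha_3) = (3/2, -1/2)$ as a stand-in; the CANS-optimal coefficients from the paper would simply replace these defaults, and the iteration counts below are illustrative:

```python
import numpy as np

def odd_poly_step(X, a1=1.5, a3=-0.5):
    """One matmul-only application of the odd polynomial
    p(x) = a1*x + a3*x^3 to the singular values of X:
    p(X) = a1*X + a3 * X (X^T X).
    Defaults are the classical Newton-Schulz coefficients (3/2, -1/2);
    a CANS step would substitute the minimax-optimal (a1, a3)."""
    return a1 * X + a3 * (X @ (X.T @ X))

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 60))
X = X / np.linalg.norm(X, 2)    # pre-scale so all singular values are <= 1
for _ in range(30):
    X = odd_poly_step(X)
s = np.linalg.svd(X, compute_uv=False)
print(s.min(), s.max())         # singular values pushed toward 1
```

Since $X = U\Sigma V^T$ gives $p(X) = U\,p(\Sigma)\,V^T$, each step acts independently on the singular values, which is exactly the structure CANS exploits when optimizing $p$ over the singular-value interval.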

3. Preconditioning, Initialization, and Spectral Scaling

The rate of Newton–Schulz convergence is highly sensitive to the spectrum of the initial iterate. Poor conditioning (large $\kappa(X_0) = \sigma_{\max}/\sigma_{\min}$) slows convergence; improper scaling can even cause divergence.

A notable improvement is the “Almost-Orthogonal Layer” (AOL) preconditioner (Boissin et al., 4 Dec 2025), which replaces naive Frobenius-norm normalization with a data-dependent diagonal scaling: $s_i = \left( \sum_{j=1}^n |A_0^{ij}| \right)^{-1/2}$, $X_1 = X_0 \operatorname{diag}(s)$, where $A_0 = X_0^T X_0$. This procedure reduces the initial polar error by an order of magnitude, tightens spectral bounds via the Gershgorin theorem, and enables the quadratic regime to be entered earlier. Preconditioning steps incur negligible overhead (one matrix multiplication reused in the first NS step plus cheap elementwise operations), thus speeding up the entire sequence.
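A minimal sketch of this diagonal scaling (the function name and test setup are illustrative, not from the paper):

```python
import numpy as np

def aol_precondition(X0):
    """AOL-style diagonal preconditioning (sketch): with A0 = X0^T X0,
    set s_i = (sum_j |A0_ij|)^(-1/2) and return X1 = X0 diag(s).
    Gershgorin-type bounds then give ||X1||_2 <= 1."""
    A0 = X0.T @ X0                 # this product is reused by the first NS step
    s = 1.0 / np.sqrt(np.abs(A0).sum(axis=1))
    return X0 * s                  # broadcasting = right-multiply by diag(s)

rng = np.random.default_rng(4)
X0 = rng.standard_normal((100, 100)) / 100
X1 = aol_precondition(X0)
print(np.linalg.norm(X1, 2))       # spectral norm is now at most 1
```

Because $\|X_1\|_2 \leq 1$ is guaranteed, the preconditioned iterate always starts inside the convergence region of the polar NS iteration.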

4. Algorithmic Variants, Complexity, and Parallelism

For the classical and high-order NS, each iteration requires several matrix–matrix products. For instance, in the polar factor quintic scheme (Boissin et al., 4 Dec 2025):

  • One $X_k^T X_k$ matmul,
  • One $X_k B_k$ matmul,
  • One $A_k^2$ matmul (small dimension).

Total complexity per iteration is $O(n^3)$, with three multiplies per step for large square matrices. High-order/factorized forms reduce the total number of products per effective update. Unified factorization approaches (Stotsky, 2022, Stotsky, 2020), e.g., sum-to-factors decomposition, enable multiplications to be minimized or executed in parallel, guided by the efficiency index $\mathrm{EI} = n^{1/\#\text{products}}$.

Parallel implementations exploit the compositional structure of high-order schemes, with distinct stages running concurrently. Composite power-series expansions (Stotsky, 2020) allow for further speedup by unrolling multiple inner expansions over different computational units, each tuned to the local architecture.

5. Applications in Optimization and Scientific Computing

The NS iteration is foundational for a variety of computational tasks:

  • Matrix orthogonalization: Used in optimizers such as Muon and Turbo-Muon for gradient step orthogonalization (Boissin et al., 4 Dec 2025, Grishina et al., 12 Jun 2025).
  • Riemannian optimization: Polar retraction on the Stiefel manifold leverages NS/CANS as a differentiable and efficient alternative to QR or Cayley retractions (Grishina et al., 12 Jun 2025).
  • Matrix inversion/parameter estimation: Integrated with Richardson iteration, Newton–Schulz accelerates convergence in solving linear systems and estimation problems, particularly when robust treatment of rank-deficient matrices is required (Stotsky, 2022, Stotsky, 2020).
  • Sparse/Massively parallel workloads: In settings with structured decay, the combination with SpAMM (Sparse Approximate Matrix Multiply) enables sub-cubic algorithmic complexity for large SPD systems (Challacombe et al., 2015).

6. Numerical Stability, Error Analysis, and Regularization

Quadratic convergence of NS is local; for global reliability, spectral scaling and condition monitoring are required. Error analysis distinguishes between orientational (derivative-driven) and accumulation (approximation/occlusion) errors. Algorithmic modifications, such as dual-channel variants for inverse square root or tighter control of sensitive channels in SpAMM-NS, remedy propagation of instability (Challacombe et al., 2015).

For ill-conditioned scenarios, regularization ($A_\mu = A + \mu I$) combined with nested product representations ensures both tractability and accuracy, as large condition numbers can erode the practical gains of approximate matmul-driven NS. This telescoping approach enables each subproblem to be solved efficiently at a lower tolerance, yielding composite solutions with reductions in work of up to two orders of magnitude for extremely ill-conditioned matrices.
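A rough sketch of the regularized idea, realized here as a decreasing $\mu$-schedule with warm-started NS solves (the schedule, iteration counts, and warm-starting are illustrative assumptions, not the nested product representation of the cited work):

```python
import numpy as np

def inverse_via_regularized_ns(A, mus=(1.0, 0.1, 0.01, 0.0), iters=15):
    """Invert an ill-conditioned A by Newton-Schulz on A_mu = A + mu*I
    for a decreasing mu-schedule, warm-starting each level from the
    previous level's approximate inverse."""
    n = A.shape[0]
    I = np.eye(n)
    A0 = A + mus[0] * I
    G = A0.T / (np.linalg.norm(A0, 1) * np.linalg.norm(A0, np.inf))
    for mu in mus:
        Amu = A + mu * I
        for _ in range(iters):
            G = G @ (2 * I - Amu @ G)   # quadratic NS at this level
    return G

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
A = Q @ np.diag(np.logspace(0, -4, 30)) @ Q.T   # condition number 1e4
G = inverse_via_regularized_ns(A)
print(np.linalg.norm(G @ A - np.eye(30)))       # small residual
```

Each level is well conditioned relative to its warm start, so no single NS solve ever has to contract from a residual arbitrarily close to one.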

7. Empirical Results and Practical Guidance

Empirical benchmarks report:

  • AOL preconditioning in Turbo-Muon reduces the polar error after the first NS iteration by over an order of magnitude (for $n = 8192$) and enables one fewer iteration for comparable precision, yielding a 2.8× speedup on A100 bfloat16 hardware (Boissin et al., 4 Dec 2025).
  • In LLM training, the orthogonalization overhead of Muon is reduced from 10–20% to 3%, delivering 8–10% end-to-end speed gains. Similar improvements are reported for vision tasks.
  • CANS-accelerated polar retraction achieves 30–40% per-epoch time reductions versus QR/Cayley in Wide ResNet/CIFAR-10 training, with equal or better model accuracy (Grishina et al., 12 Jun 2025).
  • In parameter estimation with failure-prone sensor data, Newton–Schulz-augmented Richardson iteration robustly outperforms LU-based solvers, particularly under extreme rank-deficiency (Stotsky, 2022).

Practical takeaways include the robust applicability of AOL-preconditioned NS as a drop-in replacement (no hyperparameter retuning), immediate and architecture-independent speedups, and preservation of model or system performance for a broad spectrum of gradient and matrix statistics (Gaussian, heavy-tailed, etc.).

