A priori determination of Newton iterations for parallelizing non-linear RNNs

For the Newton's method approach that parallelizes non-linear recurrent neural networks by casting the recurrence over a length-L sequence as a system of L non-linear equations (as in Danieli et al., 2025), determine the number of Newton iterations required for convergence for a specified non-linear RNN architecture before running the algorithm, so that the compute cost can be assessed a priori.

Background

Non-linear RNNs cannot be parallelized across the sequence length with the parallel scan algorithm, because the non-linearity breaks the associativity that scan requires; this typically forces sequential computation over time steps. A recent line of work proposes a parallel alternative: reformulate the recurrence h_t = f(h_{t-1}, x_t) over a length-L sequence as a system of L non-linear equations G(h_1, ..., h_L) = 0 in all hidden states, and solve that system with Newton's method.
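To make the reformulation concrete, here is a minimal NumPy sketch under assumptions of my own (the tanh recurrence h_t = tanh(W h_{t-1} + u_t), the dimensions, and the weight scale are all illustrative, not the architecture of any cited work). It stacks the recurrence into a residual system G(H) = 0, applies Newton's method with the block-bidiagonal Jacobian, and counts the iterations actually taken:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 4                       # sequence length and hidden size (illustrative)
W = rng.normal(scale=0.5 / np.sqrt(d), size=(d, d))  # recurrent weights
u = rng.normal(size=(L, d))        # pre-computed input terms (stand-in for U @ x_t)
h0 = np.zeros(d)

def residual(H):
    # G_t(H) = h_t - tanh(W h_{t-1} + u_t), with h_0 fixed; H has shape (L, d)
    prev = np.vstack([h0[None, :], H[:-1]])
    return H - np.tanh(prev @ W.T + u)

def jacobian(H):
    # Block lower-bidiagonal: identity blocks on the diagonal,
    # -diag(1 - tanh^2(z_t)) @ W blocks on the sub-diagonal
    prev = np.vstack([h0[None, :], H[:-1]])
    fp = 1.0 - np.tanh(prev @ W.T + u) ** 2
    J = np.eye(L * d)
    for t in range(1, L):
        J[t * d:(t + 1) * d, (t - 1) * d:t * d] = -(fp[t][:, None] * W)
    return J

H = np.zeros((L, d))               # initial guess for all hidden states at once
iters = 0
while np.linalg.norm(residual(H)) > 1e-10 and iters < L:
    step = np.linalg.solve(jacobian(H), -residual(H).ravel())
    H += step.reshape(L, d)
    iters += 1

# Cross-check against the usual sequential rollout
h, seq = h0, []
for t in range(L):
    h = np.tanh(W @ h + u[t])
    seq.append(h)
assert np.allclose(H, np.array(seq), atol=1e-8)
print(f"converged in {iters} Newton iterations for L = {L}")
```

Because the stacked system is block lower-triangular, exact Newton terminates within L iterations in the worst case, but in practice it converges in far fewer; the point of the task is to predict that number without running the loop.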

However, this Newton-based approach introduces practical challenges: storing the Jacobians can be memory-intensive, and, crucially, the number of Newton iterations needed to reach convergence is not known in advance for a given architecture, which makes it difficult to budget compute and to compare against straightforward sequential execution.
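The dependence of the iteration count on the architecture is easy to observe empirically, but only after the fact. As a hypothetical scalar illustration (the recurrence h_t = tanh(w * h_{t-1} + x_t), the sequence length, and the swept weight values are all assumptions for this sketch), varying the recurrent weight w changes the observed count:

```python
import numpy as np

def newton_iters(w, L=64, tol=1e-10, seed=0):
    """Newton iterations to solve h_t = tanh(w*h_{t-1} + x_t), t = 1..L, h_0 = 0."""
    x = np.random.default_rng(seed).normal(size=L)
    h = np.zeros(L)                           # guess for all L states at once
    for k in range(L + 1):                    # exact Newton needs <= L steps here
        prev = np.concatenate(([0.0], h[:-1]))
        g = h - np.tanh(w * prev + x)         # residual of the L stacked equations
        if np.linalg.norm(g) < tol:
            return k
        # Jacobian is bidiagonal: 1 on the diagonal, -(1 - tanh^2)*w below it;
        # solve J @ step = -g by forward substitution
        sub = -(1.0 - np.tanh(w * prev + x) ** 2) * w
        step = np.empty(L)
        step[0] = -g[0]
        for t in range(1, L):
            step[t] = -g[t] - sub[t] * step[t - 1]
        h += step
    return L  # fallback; not expected for this triangular system

for w in (0.1, 0.9, 2.0):
    print(f"w = {w}: {newton_iters(w)} iterations")
```

Sweeps like this only yield post-hoc measurements for particular weights and inputs; the task above asks for the count as a function of the architecture alone, before any solve is run.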

References

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling (2603.14360 - Mishra et al., 15 Mar 2026), in "Limitations of Non-Linear RNNs — Training Inefficiency: Non-Parallelizability on Sequence Length": "Moreover, each forward pass requires multiple Newton iterations, and the number of iterations needed for convergence is not known a priori for a given architecture."