Connecting randomized iterative methods with Krylov subspaces
Abstract: Randomized iterative methods, such as the randomized Kaczmarz method, have gained significant attention for solving large-scale linear systems due to their simplicity and efficiency. Meanwhile, Krylov subspace methods have emerged as a powerful class of algorithms, known for their robust theoretical foundations and rapid convergence properties. Despite the individual successes of these two paradigms, their underlying connection has remained largely unexplored. In this paper, we develop a unified framework that bridges randomized iterative methods and Krylov subspace techniques, supported by both rigorous theoretical analysis and practical implementation. The core idea is to formulate each iteration as an adaptively weighted linear combination of the sketched normal vector and previous iterates, with the weights optimally determined via a projection-based mechanism. This formulation not only reveals how subspace techniques can enhance the efficiency of randomized iterative methods, but also enables the design of a new class of iterative-sketching-based Krylov subspace algorithms. We prove that our method converges linearly in expectation and validate our findings with numerical experiments.
Explain it Like I'm 14
Overview
This paper is about solving big sets of linear equations, written as Ax = b. These show up everywhere—from machine learning to scientific computing. There are two popular ways to solve them:
- Randomized iterative methods, like the randomized Kaczmarz method, which take quick, small steps using random pieces of the data.
- Krylov subspace methods, like CG and GMRES, which use very smart directions and often converge faster.
The paper builds a clear bridge between these two worlds. It shows a single framework that can behave like either method (or a mix of both), and it explains why this works and how to do it efficiently.
Key Questions and Goals
In simple terms, the paper asks:
- Can we connect fast, random step methods with smarter “subspace” methods?
- Can we design a new method that uses the best parts of both?
- Can we prove it gets you to the right answer quickly?
- Can we implement it so each step is cheap and practical for large problems?
How the Method Works (with everyday analogies)
Think of solving Ax = b like trying to guess a secret number by asking questions. You start with a guess x^0. Then:
- You measure how wrong your current guess is. This “wrongness” is called the residual r = Ax - b.
- Instead of looking at the whole residual, you randomly “sketch” it using a small matrix S. This is like listening to a few helpful hints rather than reading the whole textbook.
- You build a small “working space” (an affine subspace) that includes:
  - Your recent guesses (you remember the last ℓ of them), and
  - A new correction direction pointing toward improvement, made from S and the residual.
- Inside that small space, you choose the point closest to the true solution. This is done by a projection, like dropping a perpendicular to find the closest point.
Key idea: the paper shows how to get this “closest point” without directly knowing the exact solution (which is called the pseudoinverse solution A^† b). That makes the method practical.
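To make this concrete, here is a minimal numpy sketch of the projection step for a consistent system. It is an illustration under assumptions, not the paper's implementation: it remembers the last ℓ sketched directions (the paper remembers past iterates, which differs in detail), and all names and sizes are made up. What it demonstrates is the key trick: every direction has the form A^T u, so the inner products the projection needs reduce to u^T (Ax - b), and the unknown solution never has to be touched.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q, ell = 200, 50, 5, 4           # sizes and memory are illustrative choices
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                          # consistent system by construction

x, U = np.zeros(n), []                  # U holds pre-images u_j of directions d_j = A^T u_j
for k in range(100):
    S = rng.standard_normal((m, q)) / np.sqrt(q)   # fresh Gaussian sketch (one possible choice)
    r = A @ x - b                       # residual of the current guess
    U.append(S @ (S.T @ r))             # pre-image of the sketched normal vector A^T S S^T r
    U = U[-ell:]                        # keep only the last ell directions (the "memory")
    Umat = np.column_stack(U)
    D = A.T @ Umat                      # directions spanning the small working space
    # Projection: choose weights c minimizing ||x + D c - x_star||^2.
    # The normal equations only need D^T (x - x_star) = U^T (A x - b), so x_star never appears.
    c = np.linalg.lstsq(D.T @ D, -Umat.T @ r, rcond=None)[0]
    x = x + D @ c

print(np.linalg.norm(x - x_star))       # the error shrinks steadily toward zero
```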
A few more helpful comparisons:
- Randomized iterative method: taking quick steps using randomly sampled information (the classic example, randomized Kaczmarz, is sketched in code after this list).
- Krylov subspace method: building a smart collection of directions (like a refined toolbox) by repeatedly using the matrix to improve your guess.
- Memory length ℓ: how many past attempts you remember. More memory can lead to smarter steps.
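For reference, the classical randomized Kaczmarz update is simple enough to write out in full. This is the standard textbook method (each step projects the guess onto the hyperplane of one randomly chosen equation), not the paper's generalization:

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, iters, rng):
    """Standard randomized Kaczmarz: each step enforces one random equation
    a_i^T x = b_i by projecting x onto that row's hyperplane."""
    x = x0.astype(float).copy()
    row_norms = np.einsum('ij,ij->i', A, A)     # squared row norms ||a_i||^2
    probs = row_norms / row_norms.sum()         # the usual norm-proportional sampling
    for _ in range(iters):
        i = rng.choice(A.shape[0], p=probs)
        x += (b[i] - A[i] @ x) / row_norms[i] * A[i]
    return x
```

In the unified framework, this corresponds to memory ℓ = 1 with a sketch that selects a single row.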
Main Findings and Why They Matter
Here’s what the authors proved and demonstrated:
- Linear convergence in expectation: On average, each step gets you closer by a fixed percentage. That means predictable, steady progress (stated schematically in the display after this list).
- Using more memory helps: Increasing ℓ (remembering more past iterates) improves the convergence rate. In other words, remembering more good hints makes you learn faster.
- Finite termination when ℓ ≥ rank(A): If you remember enough (at least as many past iterates as the matrix's rank), the method finishes in a finite number of steps, assuming no round-off errors.
- A unifying bridge:
  - If ℓ = 1, the method reduces to familiar randomized methods (like randomized Kaczmarz).
  - If ℓ = ∞ and S = I (no sketching), the method becomes a Krylov subspace method.
- Efficient implementation: They designed an algorithm where each step is cheap to compute. The cost grows roughly linearly with the problem size and the number of remembered steps. This is key for large-scale problems.
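In symbols, linear convergence in expectation means the expected squared error shrinks by a fixed factor at every step. The display below is a schematic statement in generic notation; the paper's theorem pins down the contraction factor in terms of the sketch distribution and the memory ℓ:

```latex
% Schematic only: the exact form of rho is given in the paper, not reproduced here.
\mathbb{E}\left[ \| x^{k+1} - x^\star \|^2 \right]
  \;\le\; \rho \, \| x^{k} - x^\star \|^2,
\qquad 0 < \rho < 1.
```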
They also propose a new class of iterative-sketching-based Krylov methods (“IS-Krylov”), which build the smart Krylov directions using fresh random sketches at each step. That adds flexibility and can reduce the amount of data processed per step.
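The paper's Algorithm 3 is not fully specified in the text summarized here (see the Knowledge Gaps below), so the following is only a hypothetical illustration of the iterative-sketching idea, with every detail (orthogonalization, line search, names) an assumption rather than the authors' method: each new direction is built from a freshly sketched normal vector and orthogonalized against the previous ones.

```python
import numpy as np

def is_krylov_illustration(A, b, x0, iters, q, rng):
    """Hypothetical sketch (NOT the paper's Algorithm 3): Krylov-style search
    directions built from fresh sketched normal vectors, orthogonalized by
    Gram-Schmidt, with an exact line search on the residual norm."""
    x = x0.astype(float).copy()
    basis = []                                     # orthonormal past directions
    for _ in range(iters):
        S = rng.standard_normal((A.shape[0], q)) / np.sqrt(q)   # fresh sketch each step
        r = A @ x - b
        p = A.T @ (S @ (S.T @ r))                  # sketched normal vector
        for v in basis:                            # orthogonalize against the past
            p -= (v @ p) * v
        norm = np.linalg.norm(p)
        if norm < 1e-12:                           # degenerate draw; try again
            continue
        p /= norm
        basis.append(p)
        Ap = A @ p
        x -= (Ap @ r) / (Ap @ Ap) * p              # minimizes ||A x - b|| along p
    return x
```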
Why This Is Important
- Unification leads to new algorithms: By connecting the two families, we can design hybrid methods that are both fast per step (like randomized methods) and smart in direction choice (like Krylov methods).
- Flexibility for big data: Using random sketches means you don’t have to look at the entire dataset in every step—great for large-scale problems in machine learning and scientific computing.
- Practical and scalable: The method avoids expensive computations (like full matrix inverses) and uses limited memory (ℓ), making it suitable for real-world systems with millions of variables.
Simple Takeaway
Imagine you’re trying to find the best answer as quickly as possible:
- Random methods give you cheap, quick nudges.
- Krylov methods give you powerful, well-aimed pushes.
- This paper shows how to combine both: take smart, quick nudges inside a small remembered space, and do it efficiently.
- The result is a method that steadily gets closer to the right answer, faster if you remember more, and sometimes finishes in a fixed number of steps.
In short, the paper provides a clear, unified way to solve big systems of equations more efficiently, with strong theory and practical algorithms that can make a real difference in modern applications.
Knowledge Gaps
Below is a consolidated list of the main unresolved issues and opportunities for further research identified in the paper. Each item is phrased to be concrete and actionable for future work.
- Extension beyond consistent systems: Theoretical guarantees and algorithmic modifications for inconsistent or noisy linear systems (least-squares setting) are missing; key lemmas used (e.g., equivalence of nonzero sketched residual and sketched normal vector) rely on consistency and do not directly generalize.
- Quantifying and optimizing the convergence factor: The rate depends on the term σ_min(H^{1/2}A) and the memory-dependent factor q_k, but there are no explicit lower bounds or closed-form estimates for these quantities under common sketching distributions (row/block sampling, Gaussian, SRHT, CountSketch). Deriving distribution-specific bounds and design rules to maximize q_k is an open problem.
- Acceptance probability and rejection sampling cost: The method resamples until S_k^T(Ax^k − b) ≠ 0, yet the expected number of trials and its impact on per-iteration complexity are not analyzed. Bounding P(Q_k) and designing Ω to guarantee high acceptance rates would make the method more predictable.
- Adaptive selection of the memory parameter ℓ: There is no principled strategy for choosing or adapting ℓ online to balance memory, runtime, and convergence speed. Criteria based on curvature/residual trends or saturation detection could improve efficiency.
- Stability and breakdown safeguards: The efficient update uses a Schur complement and the denominator c_k − w_k^T h_k; the paper does not analyze near-breakdowns or numerical instabilities (e.g., when this denominator approaches zero) nor propose safeguards (regularization, reorthogonalization, restarts).
- Exploiting structure in C_k: The matrix C_k is tridiagonal, but the complexity accounting treats h_k = C_k w_k as O((k−j_k)^2) flops. Leveraging tridiagonal structure should reduce this to O(k−j_k); deriving and implementing true linear-time updates is an open improvement (a generic O(n) tridiagonal matvec is sketched after this list).
- Preconditioning: No preconditioned variants (left/right or normal-equation preconditioning) are developed. Introducing preconditioning into both the randomized and IS-Krylov formulations and analyzing its effect on σ_min(H^{1/2}A) and q_k could substantially accelerate convergence.
- Block size q and sketch design: The influence of sketch dimension q on convergence and complexity is not quantified. Guidelines for choosing q (and the sketch type) under sparsity/density constraints to optimize the runtime–accuracy trade-off are needed.
- Impact of finite precision: All orthogonality and affine independence arguments assume exact arithmetic. There is no analysis of loss of orthogonality, drift in subspace quality, or remedies (selective reorthogonalization, basis conditioning) in finite precision.
- Full specification and guarantees for IS-Krylov: The proposed IS-Krylov algorithm (Algorithm 3) is incomplete in the paper, with missing update rules and no convergence guarantees. A complete description (including p_k updates, orthogonalization, restarts) and theoretical analysis (residual minimization properties, rates) remain to be provided.
- Precise connection to CG/GMRES: While the paper claims recovery of Krylov methods when ℓ = ∞ and Ω = {I}, it does not rigorously show when the directions become conjugate (CG for SPD) or when residual minimization (GMRES) is achieved. Formal equivalence conditions and cases where deviations occur are open.
- Stopping criteria: The algorithm references a stopping rule but does not propose robust criteria (e.g., residual-based, error-based, or probabilistic) or analyze their behavior under noise/inconsistency.
- Exploiting sparsity and data locality: Complexity accounts use generic costs W(S_k^T A) and W(S_k^T b). Concrete sparse implementations (row/block sketches, cache-aware updates, incremental residual maintenance) and their complexity gains are not detailed.
- Distribution design over Ω or {Ω_k}: The paper suggests using time-varying probability spaces but provides no principled design (e.g., leverage-score sampling, adaptive residual-focused sampling, mixing strategies) or convergence analysis under nonstationary Ω_k.
- Global complexity and comparisons: There is no iteration-complexity bound (iterations-to-accuracy) as a function of spectrum, ℓ, q, and sketch distribution, nor theoretical or empirical comparisons against RK, block RK, CG, GMRES, ASHBM, and Gearhart–Koshy accelerations on standard benchmarks.
- Handling rank-deficiency and underdetermined systems: Although finite termination for ℓ ≥ rank(A) is shown in exact arithmetic, the behavior for rank-deficient, underdetermined, or nearly rank-deficient systems (including detection and mitigation of ill-conditioning) is not analyzed.
- Robustness to round-off error in finite termination claims: The finite termination assertion (for ℓ ≥ rank(A)) lacks a finite-precision analysis. Conditions and practical strategies to retain fast termination (e.g., basis conditioning, scaling, restarts) are open.
- Objective variants and regularization: Extensions to regularized problems (ridge/Tikhonov), constrained least squares, or alternative projection targets (e.g., (A^T A + λI)^{-1} A^T b) are not explored; integrating regularization into the subspace-projection framework is an opportunity.
- High-probability and tail bounds: Convergence is shown in expectation, but high-probability guarantees, concentration results, and robustness to heavy-tailed sketches/residuals are missing.
- Empirical reproducibility: The numerical experiments are referenced but not fully specified in the text provided (data, parameter settings, sketch types, ℓ and q choices). A reproducible evaluation suite with ablations isolating the effects of ℓ, q, and Ω would strengthen practical guidance.
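On the C_k item above: multiplying by a tridiagonal matrix stored as its three diagonals costs O(n) rather than O(n^2). A generic sketch of this standard technique (not tied to the paper's C_k) is shown below.

```python
import numpy as np

def tridiag_matvec(sub, diag, sup, w):
    """O(n) product C @ w for a tridiagonal C given by its three diagonals:
    sub = subdiagonal (length n-1), diag = main diagonal (length n),
    sup = superdiagonal (length n-1)."""
    h = diag * w
    h[:-1] += sup * w[1:]      # superdiagonal couples w[i+1] into h[i]
    h[1:] += sub * w[:-1]      # subdiagonal couples w[i-1] into h[i]
    return h
```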