MRRR: Robust Eigenpair Computation
- Multiple Relatively Robust Representations (MRRR) is an algorithmic framework that efficiently solves symmetric tridiagonal eigenproblems using recursively generated robust representations.
- It leverages a spectrum-shifting and task-queue paradigm to break eigenvalue clusters, achieving O(kn) computational complexity and excellent parallel scalability.
- Mixed-precision implementations within MRRR enhance numerical stability and orthogonality, offering accuracy comparable to traditional methods while maintaining speed.
Multiple Relatively Robust Representations (MRRR) is an algorithmic framework for the efficient solution of the real symmetric tridiagonal eigenproblem (STEP), a key subproblem arising in the computation of eigenvalues and eigenvectors of dense Hermitian matrices after tridiagonal reduction. MRRR is distinguished by its use of recursively generated, carefully factored matrix representations—each being a "Relatively Robust Representation" (RRR)—to extract eigenpairs accurately and efficiently. Its primary contributions are a computational complexity of O(kn) for k eigenpairs of an n×n matrix (with k = n in typical full-spectrum computations), modest memory requirements, and strong parallelizability. MRRR achieves these through a spectrum-shifting and task-queue paradigm that breaks clusters of close eigenvalues via recursively constructed RRRs. The method’s accuracy, performance, and scalability have been the subject of extensive theoretical and empirical investigation, especially in recent mixed-precision variants.
1. Mathematical Foundations and Classical Algorithm
The STEP requires the solution of
T z_i = λ_i z_i,  i = 1, …, n,
for a real, symmetric, tridiagonal T ∈ R^{n×n}. The MRRR approach is built upon the following core concepts:
- Relatively Robust Representation (RRR): An RRR for a subset of eigenvalues indexed by Γ is a factored form of a shifted tridiagonal,
L D L^T = T − σI,
where, for any small element-wise relative perturbation of the entries of L and D, the associated eigenvalues and invariant subspaces indexed by Γ remain well-conditioned. Specifically,
|δ(λ_i − σ)| ≤ C n ξ |λ_i − σ| for all i ∈ Γ,
for moderate C and small element-wise perturbations of relative size ξ.
- Algorithmic Structure: MRRR constructs a queue of "tasks," each consisting of a representation (L, D, σ) and an index set Γ. Singleton tasks (with |Γ| = 1) yield eigenpairs via Rayleigh-quotient iteration (RQI) and twisted factorizations; clustered tasks (|Γ| > 1) trigger a shift, produce a new RRR, and subdivide the cluster by recursively building child tasks.
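As a concrete illustration of the factored form underlying an RRR, the following minimal pure-Python sketch (the function name and code are illustrative, not taken from any cited implementation) computes L D L^T = T − σI for a symmetric tridiagonal T:

```python
def ldlt_shifted(d, e, sigma):
    """Factor T - sigma*I = L*D*L^T for a symmetric tridiagonal T with
    diagonal d (length n) and off-diagonal e (length n-1).
    Returns the subdiagonal of the unit lower-bidiagonal L and the
    diagonal of D. Production codes would reject or perturb a shift
    sigma for which a pivot D[i] vanishes or element growth occurs."""
    n = len(d)
    D = [0.0] * n
    L = [0.0] * (n - 1)
    D[0] = d[0] - sigma
    for i in range(1, n):
        L[i - 1] = e[i - 1] / D[i - 1]                 # bidiagonal entry
        D[i] = (d[i] - sigma) - L[i - 1] * e[i - 1]    # next pivot
    return L, D
```

Multiplying the factors back reproduces T − σI in exact arithmetic; a real implementation additionally tests the candidate factorization for element growth and relative robustness before accepting it as an RRR.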
Algorithmic steps:
- Preprocessing: scale and split at tiny off-diagonals.
- Choose an initial root shift σ_0, form T − σ_0 I = L_0 D_0 L_0^T, and perturb the factors slightly.
- Compute eigenvalue estimates λ_i for i ∈ Γ_0 to the relative accuracy set by gaptol.
- Initialize a task queue with the root task (L_0, D_0, Γ_0 = {1, …, n}).
- Iteratively process queue:
- Partition Γ into singletons and clusters according to relative gaps and gaptol.
- For a singleton, execute RQI and back-shift the computed eigenvalue.
- For a cluster, select a shift σ close to the cluster, form a new RRR, and enqueue the subtask.
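The eigenvalue-estimate step above can be sketched with Sturm-count bisection; this is a simplified pure-Python stand-in (illustrative names) for the bisection/dqds machinery of production codes:

```python
def neg_count(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal T (diagonal d,
    off-diagonal e) strictly less than x, via the Sturm sequence of
    T - x*I (count of negative pivots in its LDL^T factorization)."""
    count = 0
    q = d[0] - x
    if q < 0.0:
        count += 1
    for i in range(1, len(d)):
        if q == 0.0:
            q = 1e-300          # avoid division by zero on exact hits
        q = d[i] - x - e[i - 1] ** 2 / q
        if q < 0.0:
            count += 1
    return count

def bisect_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    """Refine the k-th smallest eigenvalue (k = 1, 2, ...) inside [lo, hi]."""
    while hi - lo > tol * max(abs(lo), abs(hi), 1.0):
        mid = 0.5 * (lo + hi)
        if neg_count(d, e, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For the 2×2 matrix with diagonal (2, 2) and off-diagonal 1, whose eigenvalues are 1 and 3, `bisect_eigenvalue(d, e, 1, 0.0, 4.0)` converges to 1.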
The arithmetic complexity is O(kn) for k eigenpairs, with O(n) auxiliary storage, and the algorithm naturally supports both depth- and breadth-first traversals of the computation tree (Petschow et al., 2013, Petschow, 2014).
2. Theoretical Properties and Error Analysis
The quality and stability of MRRR depend on several analytical guarantees:
- Residual and Orthogonality Bounds: If all RRRs meet relative robustness and conditional element growth, then for unit roundoff ε and deepest cluster-tree depth d the computed eigenpairs satisfy bounds of the form
||T z_i − λ_i z_i|| = O(n ε ||T||),  |z_i^T z_j| = O(d n ε / gaptol) for i ≠ j,
where gaptol sets the cluster-splitting sensitivity and ||T z_i − λ_i z_i|| is the local residual (cf. (Petschow et al., 2013, Petschow, 2014)).
- Comparison to Other Methods: Classical MRRR yields worst-case orthogonality O(n ε / gaptol), as opposed to O(n ε) for QR and Divide–Conquer methods. With the customary gaptol ≈ 10⁻³, MRRR may thus lose orthogonality by up to three orders of magnitude relative to the alternatives, which remain at the n ε level in empirical tests (Petschow et al., 2013).
- Cluster Control via gaptol: Tuning gaptol manages the trade-off between breaking clusters (a smaller gaptol declares fewer eigenvalues clustered, improving performance and parallelism) and orthogonality (whose bound degrades as gaptol shrinks).
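The effect of the cluster-splitting tolerance can be demonstrated with a small self-contained routine (illustrative, not from the cited implementations): adjacent eigenvalue approximations whose relative gap falls below the tolerance are grouped into one cluster.

```python
def partition_clusters(lams, gaptol):
    """Partition sorted eigenvalue approximations into clusters:
    adjacent values whose relative gap is below gaptol share a cluster.
    Returns a list of index lists."""
    clusters = [[0]]
    for i in range(1, len(lams)):
        gap = abs(lams[i] - lams[i - 1])
        scale = max(abs(lams[i]), abs(lams[i - 1]), 1e-300)
        if gap / scale < gaptol:
            clusters[-1].append(i)   # relative gap too small: same cluster
        else:
            clusters.append([i])     # well separated: start a new cluster
    return clusters
```

For `lams = [1.0, 1.0005, 2.0]`, a tolerance of `1e-3` groups the first two values into one cluster, while `1e-4` yields three singletons; a smaller tolerance thus creates more independent tasks, at the price of a weaker orthogonality bound.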
3. Mixed-Precision MRRR and Improved Accuracy
Mixed-precision innovations address MRRR’s classical orthogonality limitations:
- Precision Levels: Input/output is held in the working precision (unit roundoff ε); internal sensitive steps run in a higher precision (unit roundoff ε̄ ≪ ε). Examples: single → double or double → quadruple.
- Critical Operations: All spectrum shifts, RRR constructions, and RQI are conducted in the higher precision, driving the dominant error term ε̄ / gaptol down to the order of ε. Non-critical routines (e.g., bisection for eigenvalue refinement) may remain in the working precision to minimize performance cost (Petschow et al., 2013).
- Enhanced Error Bounds: With ε̄ chosen so that ε̄ / gaptol = O(ε), and the typically shallow cluster trees this induces, mixed-precision MRRR achieves
||T z_i − λ_i z_i|| = O(n ε ||T||)
and
|z_i^T z_j| = O(n ε), i ≠ j.
This matches the orthogonality of QR/Divide–Conquer methods while preserving the speed and scalability of MRRR (Petschow et al., 2013).
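Pure Python computes in binary64, but the single → double configuration can be modeled by explicitly rounding through binary32 with the standard `struct` module. This is only a toy model of the precision split, not the cited implementation:

```python
import struct

def to_binary32(x):
    """Round a binary64 value to the nearest binary32 and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Working-precision (single) input: each entry carries an O(2^-24) error.
d_single = [to_binary32(v) for v in [2.0, 2.5, 3.0]]

# Higher-precision (double) internal work: the binary32 inputs are exact
# in binary64, so a critical operation such as the shift d[i] - sigma
# commits only O(2^-53) error here, which is what buys back the
# 1/gaptol accuracy loss of the classical algorithm.
sigma = to_binary32(2.0)
shifted = [di - sigma for di in d_single]

# Backconvert the result to the working precision for output.
out = [to_binary32(s) for s in shifted]
```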
4. Scalability, Parallel Performance, and Implementation
MRRR is designed for high scalability in both shared-memory and distributed environments:
- Multi-core (SMP) Parallelism: The algorithm is expressed in terms of independent tasks (singleton and cluster), enabling a shared FIFO queue processed by concurrent threads. Load balancing is enhanced via R-tasks which subdivide large clusters’ eigenvalue refinement (Petschow, 2014).
- Distributed-Memory (MPI) Parallelism: The k eigenpairs are distributed over the p processes (approximately k/p per process). Each process operates on its own set of indices, with minimal communication except when clusters span process boundaries. Nonblocking MPI is used for the necessary broadcasts and to overlap computation with communication (Petschow, 2014).
- Memory and Data Layout: MRRR requires only O(n) workspace, substantially less than the O(n²) needed by Divide–Conquer for explicit eigenvector backtransforms. Mixed precision adds a negligible memory overhead (Petschow, 2014).
- Performance Highlights:
- On platforms like Intel Xeon X7550 (32 cores), single→double mixed-precision MRRR outperforms LAPACK’s single/double-precision MRRR (`SSTEMR`, `DSTEMR`) and is either faster than or comparable to Divide–Conquer, with substantially improved orthogonality (Petschow et al., 2013).
- In large dense eigensolvers (including the tridiagonal reduction stage), the mixed-precision tridiagonal solution is not rate limiting but produces vectors matching the quality of QR and DC (Petschow et al., 2013).
- Parallel implementations such as PMRRR and MR3SMP achieve 80–90% efficiency scaling to thousands of cores (Petschow, 2014).
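The distribution of eigenpairs over processes can be sketched as a standard block partition (illustrative only; PMRRR's actual assignment additionally accounts for cluster boundaries):

```python
def local_range(k, p, rank):
    """Half-open index range [lo, hi) of eigenpairs owned by process
    `rank` when k eigenpairs are block-distributed over p processes;
    the first k mod p ranks each receive one extra index."""
    base, rem = divmod(k, p)
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    return lo, hi
```

For example, 10 eigenpairs over 3 processes gives the ranges [0, 4), [4, 7), [7, 10); a cluster whose index set crosses one of these boundaries is precisely the case that requires inter-process communication.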
5. Algorithmic and Practical Considerations
Table 1 summarizes recommended parameter choices and practical guidelines for mixed-precision MRRR-based eigensolvers, as given in (Petschow et al., 2013):
| Component | Recommendation / Notes |
|---|---|
| gaptol | ≈ 10⁻³ by default; push lower for parallelism |
| RRR constants | moderate relative condition numbers; bounded local element growth (accept a factorization only when both hold) |
| RQI stopping | stop once the residual falls below its gap-based threshold, evaluated in the higher precision |
| Orthogonality | expect O(n ε) with proper gaptol and ε̄ |
Key workflow steps:
- Input: Read in the working precision; convert to the higher precision for RRR work if mixed arithmetic is not hardware-accelerated.
- Preprocessing: Scale/split in the working precision.
- RRR Core: All sensitive operations in the higher precision; apply perturbations of relative magnitude O(ε̄).
- Eigenvalue Refinement: Coarse and fine bisection in the working precision.
- Main Loop: Queue tasks; for singletons, twisted-factorization RQI in the higher precision until the residual threshold is met; for clusters, form child RRRs.
- Backconvert: Round the final eigenpairs to the working precision; enforce orthonormality if desired.
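The singleton step (the twisted-factorization solve, without the surrounding RQI loop) can be sketched as follows. For brevity this illustrative code factors T − λI directly instead of working from an L D L^T representation, and it does not guard against zero pivots as production codes do:

```python
def twisted_eigenvector(d, e, lam):
    """Eigenvector of the symmetric tridiagonal T (diagonal d,
    off-diagonal e) for an accurate eigenvalue approximation lam, via a
    twisted factorization: forward and backward pivot sequences of
    T - lam*I are combined at the index k minimizing |gamma_k|."""
    n = len(d)
    # Forward pivots: T - lam*I = L+ D+ L+^T
    dp = [0.0] * n
    dp[0] = d[0] - lam
    for i in range(1, n):
        dp[i] = (d[i] - lam) - e[i - 1] ** 2 / dp[i - 1]
    # Backward pivots: T - lam*I = U- D- U-^T
    dm = [0.0] * n
    dm[n - 1] = d[n - 1] - lam
    for i in range(n - 2, -1, -1):
        dm[i] = (d[i] - lam) - e[i] ** 2 / dm[i + 1]
    # Twist index: where the residual indicator gamma_k is smallest.
    gamma = [dp[k] + dm[k] - (d[k] - lam) for k in range(n)]
    k = min(range(n), key=lambda j: abs(gamma[j]))
    # Solve the twisted system with z[k] = 1 (Dhillon's "getvec" step).
    z = [0.0] * n
    z[k] = 1.0
    for i in range(k - 1, -1, -1):      # sweep upward through L+
        z[i] = -(e[i] / dp[i]) * z[i + 1]
    for i in range(k, n - 1):           # sweep downward through U-
        z[i + 1] = -(e[i] / dm[i + 1]) * z[i]
    norm = sum(v * v for v in z) ** 0.5
    return [v / norm for v in z]
```

For the 3×3 matrix with diagonal (2, 2, 2) and off-diagonals 1, the eigenvalue 2 has eigenvector (1, 0, −1)/√2; feeding a nearby approximation such as 2 + 10⁻⁸ recovers it to high accuracy.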
6. Comparative Landscape and Applicability
MRRR is contrasted with alternatives as follows:
- Divide–Conquer (DC): O(n³) worst-case arithmetic cost, O(n²) storage, high BLAS-3 intensity; sometimes fastest for well-deflated, full-spectrum cases at moderate core counts (≈16), but loses to MRRR for larger parallel jobs (Petschow, 2014).
- QR Iteration: Also O(n³); less favorable for large matrices or partial spectra.
- Bisection + Inverse Iteration: up to O(nk²) cost due to reorthogonalization, and poor scalability in the presence of eigenvalue clusters.
- MRRR: O(kn) cost, minimal memory, scalable; especially advantageous when k ≪ n (partial spectrum), on massive or hybrid compute resources, or when orthogonality requirements demand mixed precision (Petschow et al., 2013, Petschow, 2014).
Typical recommended scenarios for MRRR or mixed-precision MRRR:
- Large-scale problems, where MRRR is fastest and high-quality eigenvectors are needed for spectral clustering or as Rayleigh–Ritz bases.
- Multi-threaded/distributed settings facing load imbalance from eigenvalue clustering; an aggressively small gaptol disaggregates clusters and greatly enhances parallelism.
- Emerging mixed-precision hardware where higher-precision operations are no longer performance limiting relative to the working precision (Petschow et al., 2013).
7. Implementation Notes and References
- Practical implementations utilize explicit factor storage (lower-bidiagonal or twisted-factorization representations), with recommended block sizes for the dense stages (96–128 in Elemental or ScaLAPACK) (Petschow, 2014).
- Careful attention to stable spectrum shifting (typically via the DSTQDS/DQDS transforms) is required to maintain element-wise mixed relative stability.
- Depth-first recursion is particularly efficient in mixed precision, as it collapses the cluster-tree depth to a small constant (Petschow, 2014).
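The differential stationary qd transform with shift mentioned above can be sketched in a few lines (illustrative pure Python; real implementations add element-growth monitoring and IEEE exception handling):

```python
def dstqds(L, D, tau):
    """Differential stationary qd transform with shift: given the
    representation L*D*L^T (subdiagonal of unit lower-bidiagonal L,
    diagonal of D), compute Lp, Dp with
        Lp*Dp*Lp^T = L*D*L^T - tau*I,
    accumulating the shift through the auxiliary variable s so that
    each output entry is obtained with small element-wise error."""
    n = len(D)
    Lp = [0.0] * (n - 1)
    Dp = [0.0] * n
    s = -tau
    for i in range(n - 1):
        Dp[i] = D[i] + s
        Lp[i] = D[i] * L[i] / Dp[i]
        s = Lp[i] * L[i] * s - tau
    Dp[n - 1] = D[n - 1] + s
    return Lp, Dp
```

For L = [0.5], D = [2.0, 1.5] (so that L D L^T is the 2×2 matrix with diagonal (2, 2) and off-diagonal 1), a shift of τ = 0.5 yields factors that multiply back to exactly that matrix minus 0.5·I.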
Principal references include:
- Dhillon & Parlett, SIAM J. Matrix Anal. Appl. 2004 (original algorithm).
- Willems & Lang, BIT 2011 (perturbation theory).
- Bientinesi et al., PMRRR (parallel implementation).
- Vömel, LAPACK Working Note 2010 (ScaLAPACK’s MRRR).
- (Petschow et al., 2013) for mixed-precision methodology and extensive accuracy/performance data.
- (Petschow, 2014) for multi-core/distributed algorithms, tuning, and comparative studies.
MRRR, particularly when augmented with mixed-precision techniques, now provides eigensolvers that are simultaneously scalable, accurate, and memory-efficient, enabling robust large-scale spectral computations in both traditional and emerging computational environments.