Two-Level QR/Givens Factorization
- Two-Level QR/Givens Factorization is a hierarchical method that splits the QR process into local (per-block) and global reduction phases to minimize communication.
- It features parallel and sequential implementations, using binary and flat tree structures to optimize computations across distributed and out-of-core environments.
- The approach integrates with CAQR and leverages both Householder and Givens rotations, ensuring numerical stability while approaching lower bounds for data movement.
A two-level QR/Givens factorization refers to hierarchical algorithms for computing the QR factorization of matrices—specifically, approaches such as the Tall Skinny QR (TSQR) and its Givens-rotation variant—which minimize communication and optimize computational efficiency for both parallel and sequential platforms. These methods are constructed from sequences of orthogonal transformations and are foundational in broader communication-avoiding factorizations such as CAQR (Communication-Avoiding QR), which targets distributed and large-scale linear algebra problems (0806.2159).
1. Structural Hierarchy of Two-Level TSQR
TSQR factors an $m \times n$ matrix $A$ with $m \gg n$ (tall and skinny) using two levels of orthogonal transformations. The matrix is partitioned into $P$ block-rows $A_0, A_1, \ldots, A_{P-1}$, each of size $(m/P) \times n$. The algorithm comprises:
First level (local QR): Each processor $i$ performs a Householder QR factorization on its block $A_i$: $A_i = Q_{i,0} R_{i,0}$, where $Q_{i,0}$ is stored implicitly as the local Householder vectors $Y_i$ and their scalars $\tau_i$.
Second level (global reduction): The upper triangular factors $R_{i,k}$ from each block-row are stacked in a pairwise manner along a binary tree structure. At each level $k$, neighboring factors $R_{i,k}$ and $R_{j,k}$ are concatenated, $\begin{pmatrix} R_{i,k} \\ R_{j,k} \end{pmatrix}$, and a new QR factorization is computed: $\begin{pmatrix} R_{i,k} \\ R_{j,k} \end{pmatrix} = Q_{i,k+1} R_{i,k+1}$. The process recurses until the root yields the global $R$.
The overall block notation for the two-level factorization (shown for $P = 4$) is:
$$A = \left(Q_{0,0} \oplus Q_{1,0} \oplus Q_{2,0} \oplus Q_{3,0}\right)\left(Q_{0,1} \oplus Q_{1,1}\right) Q_{0,2}\, R_{0,2},$$
where $\oplus$ denotes the block-diagonal (direct) sum. This design explicitly exposes the hierarchical communication and computation pattern inherent in TSQR (0806.2159).
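The two-level structure can be sketched in a few lines of NumPy. This is illustrative only: `np.linalg.qr` stands in for the local Householder kernel, the implicit representation of $Q$ is discarded, and the binary-tree reduction runs serially rather than across processors.

```python
import numpy as np

def tsqr_r(A, P=4):
    """Two-level TSQR sketch: local QR on each block-row, then a pairwise
    binary-tree reduction of the stacked R factors. Returns the global R.
    (Illustrative; the implicit Q representation is not retained.)"""
    blocks = np.array_split(A, P, axis=0)
    # First level: local QR on each block-row A_i.
    Rs = [np.linalg.qr(Ai, mode="r") for Ai in blocks]
    # Second level: stack neighboring R factors pairwise and re-factor,
    # halving the list at each tree level until the root remains.
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode="r")
              for i in range(0, len(Rs), 2)]
    return Rs[0]

A = np.random.default_rng(0).normal(size=(1000, 8))
R_tsqr = tsqr_r(A)
R_ref = np.linalg.qr(A, mode="r")
# R is unique up to row signs, so compare absolute values.
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))
```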
2. Algorithmic Implementations and Pseudocode
Two principal implementations leverage this structure: a parallel version utilizing a binary reduction tree and a sequential “flat tree” version optimal for out-of-core settings.
Parallel TSQR: The matrix data are distributed in a 1-D block row layout, with the reduction following a binary tree across the $P$ processors. The essential steps are:
- Local QR on each block $A_i$ to obtain $R_{i,0}$.
- For $k = 1$ to $\log_2 P$:
- If processor $i$ is first in its pair at level $k$, it receives its partner's factor $R_{j,k-1}$, stacks it with its own $R_{i,k-1}$, computes the QR factorization of the pair, and sends the result upward unless it is at the root.
- Otherwise, it sends $R_{i,k-1}$ and exits.
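The pairing logic can be made concrete with a small schedule generator. `parallel_tsqr_schedule` is a hypothetical helper (not from the paper) that lists, for each tree level $k$, which processor sends its $R$ factor to which surviving partner, assuming $P$ is a power of two:

```python
def parallel_tsqr_schedule(P):
    """Binary-tree message pattern of parallel TSQR: at round k, processor
    p with p % 2**k == 0 receives from p + 2**(k-1); the sender then exits.
    Returns a list of (round, sender, receiver) triples."""
    msgs = []
    k = 1
    while 2 ** k <= P:
        for p in range(0, P, 2 ** k):
            partner = p + 2 ** (k - 1)
            if partner < P:
                msgs.append((k, partner, p))
        k += 1
    return msgs

# For P = 4: round 1 pairs (1 -> 0) and (3 -> 2); round 2 pairs (2 -> 0).
print(parallel_tsqr_schedule(4))  # [(1, 1, 0), (1, 3, 2), (2, 2, 0)]
```

Each processor appears as a sender exactly once (then exits), and processor 0 ends the reduction holding the global $R$, matching the $P - 1$ total messages of the tree.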
Sequential TSQR: Optimized for disk-resident data, the process reads one block at a time into fast memory, factors the first block, and then successively stacks each subsequent block with the previous $R$ and re-factors:
- Read $A_0$; compute its QR factorization to obtain $R_0$.
- For $i = 1$ to $P - 1$:
- Read $A_i$, stack it with $R_{i-1}$, and compute a QR factorization to obtain $R_i$.
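The flat-tree loop can be sketched as follows, with an in-memory iterable of blocks standing in for reads from disk:

```python
import numpy as np

def sequential_tsqr_r(block_reader):
    """Flat-tree (sequential) TSQR sketch: hold only the current R and one
    block in fast memory; stack and re-factor as each block streams in."""
    R = None
    for Ai in block_reader:  # in practice, blocks read from disk one at a time
        stacked = Ai if R is None else np.vstack([R, Ai])
        R = np.linalg.qr(stacked, mode="r")
    return R

rng = np.random.default_rng(1)
A = rng.normal(size=(600, 5))
R_seq = sequential_tsqr_r(np.array_split(A, 6, axis=0))
# R is unique up to row signs, so compare absolute values.
assert np.allclose(np.abs(R_seq), np.abs(np.linalg.qr(A, mode="r")))
```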
Both versions retain the Householder vectors $Y$ and scalars $\tau$ from every local factorization, thus storing $Q$ in a compact, implicit form.
3. Communication and Computational Complexity
Letting $\alpha$ denote latency, $\beta$ inverse bandwidth, and $\gamma$ the time per floating-point operation, TSQR optimally minimizes data movement within the constraints of hierarchical memory or distributed-memory settings:
Parallel TSQR (on $P$ processors, with $m \gg n$):
- Floating-point operations: $\approx \frac{2mn^2}{P} + \frac{2}{3} n^3 \log_2 P$
- Words moved: $\approx \frac{n^2}{2} \log_2 P$
- Messages: $\log_2 P$
Sequential TSQR (fast memory of size $W$):
- Flops: $\approx 2mn^2 - \frac{2}{3} n^3$
- Words: $O(mn)$ (the matrix is transferred a constant number of times)
- Messages: $O\!\left(\frac{mn}{W}\right)$
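To get a feel for the scale of these quantities, the leading-order parallel-TSQR costs (roughly $2mn^2/P + \tfrac{2}{3}n^3 \log_2 P$ flops, $\tfrac{n^2}{2}\log_2 P$ words, and $\log_2 P$ messages per processor) can be evaluated numerically; the configuration below is an arbitrary illustrative choice, not one from the paper:

```python
from math import log2

# Leading-order parallel-TSQR cost model; m, n, P are arbitrary
# illustrative values, not taken from the paper.
m, n, P = 1_000_000, 50, 1024
flops = 2 * m * n**2 / P + (2 / 3) * n**3 * log2(P)
words = (n**2 / 2) * log2(P)   # words sent up the reduction tree
msgs = log2(P)                 # one message per tree level
print(f"flops ~ {flops:.3g}, words ~ {words:.0f}, messages = {msgs:.0f}")
```

Note how the communication terms depend only on $n$ and $P$, not on $m$: the tall dimension is handled entirely by the local, communication-free first level.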
These asymptotic bounds show that TSQR and its variants approach known lower bounds for communication volume and latency, up to polylogarithmic factors (0806.2159).
4. Numerical Stability and Comparison
TSQR is composed strictly of orthogonal transformations—either Householder reflectors or Givens rotations—guaranteeing the same backward error bounds as classical Householder QR:
- $\|A - QR\| = O(\varepsilon)\,\|A\|$, where $\varepsilon$ is machine precision; orthogonality is preserved to machine precision, $\|Q^{T}Q - I\| = O(\varepsilon)$.
In contrast, alternative QR algorithms have weaker stability properties:
- CholeskyQR: $O(\varepsilon\,\kappa(A)^2)$ loss of orthogonality.
- Modified Gram-Schmidt: $O(\varepsilon\,\kappa(A))$ loss of orthogonality.
- Classical Gram-Schmidt: arbitrarily poor for ill-conditioned matrices.
TSQR is thus especially preferable for ill-conditioned problems where loss of orthogonality is a concern (0806.2159).
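This difference is easy to observe numerically. The sketch below builds a matrix with condition number $10^6$ (the sizes and conditioning are illustrative choices) and compares the orthogonality loss of CholeskyQR against LAPACK's Householder QR as exposed by NumPy:

```python
import numpy as np

# Construct A = U * diag(sigma) * V^T with kappa(A) = 1e6.
rng = np.random.default_rng(2)
m, n, kappa = 500, 20, 1e6
U, _ = np.linalg.qr(rng.normal(size=(m, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = U @ np.diag(np.logspace(0, -np.log10(kappa), n)) @ V.T

# CholeskyQR: R = chol(A^T A), Q = A R^{-1}. Forming A^T A squares the
# condition number, which is the source of the O(eps * kappa^2) loss.
R_chol = np.linalg.cholesky(A.T @ A).T
Q_chol = A @ np.linalg.inv(R_chol)

Q_house, _ = np.linalg.qr(A)   # Householder QR via LAPACK

err_chol = np.linalg.norm(Q_chol.T @ Q_chol - np.eye(n))
err_house = np.linalg.norm(Q_house.T @ Q_house - np.eye(n))
print(f"CholeskyQR loss: {err_chol:.1e}, Householder loss: {err_house:.1e}")
```

On this matrix the Householder loss stays near machine precision while the CholeskyQR loss is inflated by roughly $\kappa(A)^2$, many orders of magnitude larger.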
5. Givens-Rotation Variant of Two-Level Factorization
The TSQR scheme can be implemented with Givens rotations instead of Householder reflectors. For each local block $A_i$, Givens rotations are applied sequentially to eliminate sub-diagonal entries. The reduction phase replaces the local QR with the application of a block of Givens rotations to stacked pairs of upper-triangular factors, with the rotation parameters recorded so that $Q$ or $Q^{T}$ can be reconstructed or applied via forward or reverse application.
Critically, the communication profile (in terms of words moved and messages) remains unchanged in the Givens variant relative to the Householder approach, affecting only local computational patterns (0806.2159).
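A minimal sketch of the local Givens elimination, recording the rotation parameters as it goes (simplified: each rotation is applied to full rows rather than exploiting any triangular structure):

```python
import numpy as np

def givens(a, b):
    """Rotation (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

def givens_qr_r(A):
    """Eliminate sub-diagonal entries of A column by column, bottom-up,
    recording (c, s, row_i, row_j) so Q or Q^T can be re-applied later."""
    R = A.astype(float).copy()
    m, n = R.shape
    rots = []
    for j in range(n):
        for i in range(m - 1, j, -1):   # zero R[i, j] against R[i-1, j]
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[i - 1:i + 1, j:] = G @ R[i - 1:i + 1, j:]
            rots.append((c, s, i - 1, i))
    return R, rots

A = np.random.default_rng(3).normal(size=(6, 3))
R, rots = givens_qr_r(A)
assert np.allclose(np.tril(R, -1), 0)   # all sub-diagonal entries eliminated
# R agrees with Householder QR up to row signs.
assert np.allclose(np.abs(R[:3]), np.abs(np.linalg.qr(A, mode="r")))
```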
6. Integration in Communication-Avoiding QR (CAQR)
CAQR applies TSQR as its panel factorization primitive within a more general 2-D block-cyclic data layout. For a general $m \times n$ matrix distributed on a $P_r \times P_c$ processor grid:
- Each panel is factored using TSQR across the $P_r$ processors corresponding to that block column.
- The resulting factors (the $R$ factor and the implicit $Y$, $\tau$ representations of $Q$) are broadcast across processor rows.
- The transformation $Q^{T}$ is applied to update the trailing matrix blocks, following the TSQR tree logic.
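The panel-plus-update loop can be sketched on a single node. Here a plain `numpy.linalg.qr` on the full panel stands in for the distributed TSQR call, and the block size `b` is an arbitrary choice; the point is the structure (factor a panel, then apply $Q^{T}$ to the trailing submatrix), not a faithful parallel implementation:

```python
import numpy as np

def caqr_sketch(A, b=4):
    """Blocked QR in the spirit of CAQR: factor each width-b panel (a single
    QR call standing in for TSQR), then update the trailing submatrix with
    Q^T. Returns the global n-by-n R factor."""
    A = A.astype(float).copy()
    m, n = A.shape
    for j in range(0, n, b):
        jb = min(b, n - j)
        # Panel factorization; mode="complete" keeps the full (m-j)-square Q
        # so the trailing-matrix update preserves all remaining rows.
        Q, R = np.linalg.qr(A[j:, j:j + jb], mode="complete")
        A[j:, j:j + jb] = R
        A[j:, j + jb:] = Q.T @ A[j:, j + jb:]   # trailing-matrix update
    return np.triu(A[:n])

A = np.random.default_rng(4).normal(size=(40, 12))
R = caqr_sketch(A, b=4)
# R agrees with a direct QR factorization up to row signs.
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode="r")))
```

In real CAQR the panel QR, the broadcast of the implicit factors, and the trailing update each follow the TSQR tree across the processor grid; this sketch collapses all of that onto one memory space.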
CAQR thus achieves the optimal computational bound of $\approx \left(2mn^2 - \frac{2}{3}n^3\right)/P$ flops and matches known lower bounds for words and messages in the 2-D setting, modulo polylogarithmic factors; for a square $n \times n$ matrix on a $\sqrt{P} \times \sqrt{P}$ grid:
- Words: $O\!\left(\frac{n^2}{\sqrt{P}} \log P\right)$
- Messages: $O\!\left(\sqrt{P} \log^3 P\right)$
The optimality of CAQR directly follows from the communication and compute properties of its TSQR component (0806.2159).