Two-Level QR/Givens Factorization

Updated 1 February 2026
  • Two-Level QR/Givens Factorization is a hierarchical method that splits the QR process into local (per-block) and global reduction phases to minimize communication.
  • It features parallel and sequential implementations, using binary and flat tree structures to optimize computations across distributed and out-of-core environments.
  • The approach integrates with CAQR and leverages both Householder and Givens rotations, ensuring numerical stability while approaching lower bounds for data movement.

A two-level QR/Givens factorization refers to hierarchical algorithms for computing the QR factorization of matrices—specifically, approaches such as the Tall Skinny QR (TSQR) and its Givens-rotation variant—which minimize communication and optimize computational efficiency for both parallel and sequential platforms. These methods are constructed from sequences of orthogonal transformations and are foundational in broader communication-avoiding factorizations such as CAQR (Communication-Avoiding QR), which targets distributed and large-scale linear algebra problems (0806.2159).

1. Structural Hierarchy of Two-Level TSQR

TSQR factors an $m \times n$ matrix $A$ using two levels of orthogonal transformations. The matrix is partitioned into $P$ block-rows $A_0, \ldots, A_{P-1}$, each of size $\frac{m}{P} \times n$. The algorithm comprises:

First level (local QR): Each processor $i$ performs a Householder QR factorization on its block $A_i$:
$$A_i = Q_{i,0} R_{i,0}, \qquad Q_{i,0} = \prod_{j=1}^n \left(I - \tau^i_j v^i_j (v^i_j)^T\right)$$
where $v^i_j$ are the local Householder vectors and $\tau^i_j$ their scalars.

Second level (global reduction): The upper triangular $R_{i,0}$ from each block-row are stacked in a pairwise manner along a binary tree structure. At each level $k$, matrices $R_{\ell, k-1}$ and $R_{r, k-1}$ are concatenated,
$$C_{i,k} = \begin{pmatrix} R_{\ell, k-1} \\ R_{r, k-1} \end{pmatrix},$$
and a new QR factorization $C_{i,k} = Q_{i,k} R_{i,k}$ is computed. The process recurses until the root yields the global $R$.

The overall block notation for the two-level factorization is
$$A = \mathrm{diag}(Q_{i,0}) \cdot \mathrm{diag}(Q_{i,1}) \cdots \mathrm{diag}(Q_{i,\log P})\, R_\mathrm{root}.$$
This design explicitly exposes the hierarchical communication and computation pattern inherent in TSQR (0806.2159).
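The two-level identity above can be checked numerically. A minimal sketch for $P = 2$ block-rows, using NumPy's Householder-based qr for the local factorizations (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))

# First level: independent QR of the two block-rows A_0, A_1.
A0, A1 = A[: m // 2], A[m // 2 :]
Q00, R00 = np.linalg.qr(A0)
Q10, R10 = np.linalg.qr(A1)

# Second level: factor the stacked triangular factors.
Q1, R_root = np.linalg.qr(np.vstack([R00, R10]))

# Reassemble A = diag(Q00, Q10) @ Q1 @ R_root and compare.
Q_local = np.block([[Q00, np.zeros_like(Q00)],
                    [np.zeros_like(Q10), Q10]])
A_rebuilt = Q_local @ Q1 @ R_root
print(np.allclose(A_rebuilt, A))  # True
```

Because $\mathrm{diag}(Q_{0,0}, Q_{1,0})$ applied to the stacked $R$ factors reproduces the original block-rows, the product of the two levels recovers $A$ exactly (up to roundoff).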

2. Algorithmic Implementations and Pseudocode

Two principal implementations leverage this structure: a parallel version utilizing a binary reduction tree and a sequential “flat tree” version optimal for out-of-core settings.

Parallel TSQR: The matrix data are distributed in a 1-D block row layout, with the reduction following a binary tree across $P$ processors. The essential steps are:

  1. Local QR on each block $A_i$ to obtain $[Y_{i,0}, \tau_{i,0}, R_{i,0}]$.
  2. For $k = 1$ to $\log_2 P$:
    • If processor $i$ is first in its pair, it receives a block $R_{j,k-1}$, stacks, computes QR, and sends $R_{i,k}$ upward unless it is at the root.
    • Otherwise, it sends $R_{i,k-1}$ and exits.
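The binary-tree reduction can be simulated sequentially. A sketch assuming NumPy and a power-of-two $P$, carrying only the $R$ factors up the tree:

```python
import numpy as np

rng = np.random.default_rng(1)
P, m, n = 4, 16, 3          # illustrative sizes with m/P >= n
A = rng.standard_normal((m, n))
blocks = np.array_split(A, P)

# Level 0: local Householder QR on each block-row.
Rs = [np.linalg.qr(Ai)[1] for Ai in blocks]

# Levels 1..log2(P): pairwise stack-and-factor along the binary tree.
while len(Rs) > 1:
    Rs = [np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]))[1]
          for i in range(0, len(Rs), 2)]
R_root = Rs[0]

# R_root matches the R of a direct QR of A up to row signs.
R_ref = np.linalg.qr(A)[1]
print(np.allclose(np.abs(R_root), np.abs(R_ref)))  # True
```

Each `while` iteration corresponds to one tree level $k$; in the real parallel algorithm the pairs would be factored concurrently on different processors, with one message per level.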

Sequential TSQR: Optimized for disk-resident data, the process reads one block at a time into fast memory, factors the first block, and successively stacks and factors with the previous $R$:

  1. Read $A_0$; compute QR.
  2. For $k = 1$ to $P-1$:
    • Read $A_k$, stack with $R_{0,k-1}$, compute QR.
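A sketch of the flat-tree loop, with in-memory arrays standing in for the out-of-core block reads (each `blocks[k]` plays the role of a block read from disk; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
P, m, n = 5, 20, 4
A = rng.standard_normal((m, n))
blocks = np.array_split(A, P)   # stand-ins for blocks read from disk

# Step 1: factor the first block resident in fast memory.
R = np.linalg.qr(blocks[0])[1]

# Step 2: read each subsequent block, stack it under the current R, refactor.
for k in range(1, P):
    R = np.linalg.qr(np.vstack([R, blocks[k]]))[1]

# The final R agrees with a direct QR of A up to row signs.
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1])))  # True
```

Only one block plus one $n \times n$ triangle ever needs to reside in fast memory at a time, which is what makes the flat tree suitable for out-of-core execution.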

Both versions retain the $Y$ and $\tau$ vectors/matrices, thus storing $Q$ in a compact, implicit form.

3. Communication and Computational Complexity

Letting $\alpha$ denote latency, $\beta$ inverse bandwidth, and $\gamma$ the time per floating-point operation, TSQR minimizes data movement within the constraints of hierarchical-memory or distributed-memory settings:

Parallel TSQR (on $P$ processors, with $m/P \geq n$):

  • Floating-point operations: $\frac{2mn^2}{P} + \frac{2n^3}{3}\log_2 P$
  • Words moved: $\frac{n^2}{2}\log_2 P$
  • Messages: $\log_2 P$

Sequential TSQR (fast memory of size $W$):

  • Flops: $2mn^2 - \frac{2}{3}n^3$
  • Words: $2mn + O(n^2) + \frac{mn^2}{W} \approx 2mn$
  • Messages: $\frac{2mn}{W} + O(n) \approx \frac{2mn}{W}$

These asymptotic bounds show that TSQR and its variants approach known lower bounds for communication volume and latency, up to polylogarithmic factors (0806.2159).
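Plugging sample sizes into the parallel counts above makes the balance concrete (the values $m = 10^6$, $n = 10$, $P = 1024$ are illustrative, not from the source):

```python
from math import log2

m, n, P = 1_000_000, 10, 1024

# Parallel TSQR counts from the formulas above.
flops = 2 * m * n**2 / P + (2 * n**3 / 3) * log2(P)
words = (n**2 / 2) * log2(P)
messages = log2(P)

print(f"flops ≈ {flops:.4g}, words = {words}, messages = {messages}")
```

For such tall-skinny shapes the flop count is dominated by the local $2mn^2/P$ term, while only $O(n^2 \log P)$ words and $\log_2 P$ messages cross the network.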

4. Numerical Stability and Comparison

TSQR is composed strictly of orthogonal transformations—either Householder reflectors or Givens rotations—guaranteeing the same backward error bounds as classical Householder QR:

  • $\|I - Q^T Q\| = O(\varepsilon)$; orthogonality is preserved to machine precision.

In contrast, alternative QR algorithms have weaker stability properties:

  • CholeskyQR: $O(\varepsilon\, \kappa(A)^2)$ loss of orthogonality.
  • Modified Gram-Schmidt: $O(\varepsilon\, \kappa(A))$.
  • Classical Gram-Schmidt: arbitrarily poor for ill-conditioned matrices.

TSQR is thus especially preferable for ill-conditioned problems where loss of orthogonality is a concern (0806.2159).
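The stability gap can be observed directly. A sketch that builds a test matrix with condition number about $10^6$ (an assumed example) and compares the orthogonality loss of Householder QR against CholeskyQR:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 20

# Ill-conditioned test matrix via an SVD with logarithmically spaced
# singular values (condition number ~1e6 -- an assumption for the demo).
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -6, n)) @ V.T

# Householder QR (numpy's qr): orthogonality near machine precision.
Qh = np.linalg.qr(A)[0]
err_house = np.linalg.norm(np.eye(n) - Qh.T @ Qh)

# CholeskyQR: Q = A R^{-1} with R the Cholesky factor of A^T A;
# loses orthogonality like eps * kappa(A)^2.
R = np.linalg.cholesky(A.T @ A).T
Qc = A @ np.linalg.inv(R)
err_chol = np.linalg.norm(np.eye(n) - Qc.T @ Qc)

print(err_house, err_chol)  # err_chol is many orders of magnitude larger
```

Since TSQR is built from the same orthogonal transformations as Householder QR, its orthogonality error tracks `err_house`, not `err_chol`.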

5. Givens-Rotation Variant of Two-Level Factorization

The TSQR scheme can be implemented with Givens rotations instead of Householder reflectors. For each local block $[B; C]$, Givens rotations $G_{i,j}$ are applied sequentially to eliminate subdiagonal entries. The reduction phase replaces the local Householder QR with a sequence of Givens rotations applied to stacked pairs of $n \times n$ upper-triangular $R$ factors, with the $(i, j, c, s)$ parameters recorded for reconstructing $Q$ or $Q^T$ via forward or reverse application.

Critically, the communication profile (in terms of words moved and messages) remains unchanged in the Givens variant relative to the Householder approach; the choice affects only the local computational pattern (0806.2159).

6. Integration in Communication-Avoiding QR (CAQR)

CAQR applies TSQR as its panel factorization primitive within a more general 2D block-cyclic data layout. For a general $m \times n$ matrix distributed on a $P_r \times P_c$ processor grid:

  1. Each panel is factored using TSQR across the $P_r$ processors corresponding to that block column.
  2. The resulting $Q$ factors (the $Y$ and $\tau$ representations) are broadcast across processor rows.
  3. The $Q^T$ transformation is applied to update the trailing matrix blocks, following TSQR logic.

CAQR thus achieves the optimal computational bound $O\!\left(\frac{mn^2}{P}\right)$ and matches known lower bounds for words and messages in the 2D setting, modulo polylogarithmic factors:

  • Words: $O(\sqrt{mn^3/P})$
  • Messages: $O(\sqrt{nP/m})$

The optimality of CAQR directly follows from the communication and compute properties of its TSQR component (0806.2159).
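Steps 1-3 can be sketched as a shared-memory blocked QR in which a dense panel QR stands in for TSQR (the function name and sizes are illustrative, not from the source):

```python
import numpy as np

def blocked_qr_R(A, b):
    """R factor of A via panel-by-panel factorization with panel width b.

    Schematic only: in CAQR proper, each panel QR is a TSQR across the
    P_r processors owning that block column (step 1), and the trailing
    update (step 3) is applied blockwise from the broadcast implicit
    Y, tau factors (step 2). Here both are done with explicit factors.
    """
    A = A.astype(float).copy()
    m, n = A.shape
    for k in range(0, n, b):
        # Step 1 (stand-in for TSQR): factor the current panel.
        Q, R = np.linalg.qr(A[k:, k : k + b], mode="complete")
        A[k:, k : k + b] = R
        # Step 3: apply Q^T to the trailing matrix blocks.
        A[k:, k + b :] = Q.T @ A[k:, k + b :]
    return np.triu(A[:n])

rng = np.random.default_rng(5)
A = rng.standard_normal((12, 6))
R = blocked_qr_R(A, b=2)
R_ref = np.linalg.qr(A)[1]
print(np.allclose(np.abs(R), np.abs(R_ref)))  # True up to row signs
```

The communication savings of CAQR come entirely from replacing the panel step with TSQR and keeping $Q$ implicit; the trailing-update structure is the same as in standard blocked QR.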

References (1)

  1. arXiv:0806.2159
