Two-Level QR/Givens Factorization

Updated 1 February 2026
  • Two-Level QR/Givens Factorization is a hierarchical method that splits the QR process into local (per-block) and global reduction phases to minimize communication.
  • It features parallel and sequential implementations, using binary and flat tree structures to optimize computations across distributed and out-of-core environments.
  • The approach integrates with CAQR and leverages both Householder and Givens rotations, ensuring numerical stability while approaching lower bounds for data movement.

A two-level QR/Givens factorization refers to hierarchical algorithms for computing the QR factorization of matrices—specifically, approaches such as the Tall Skinny QR (TSQR) and its Givens-rotation variant—which minimize communication and optimize computational efficiency for both parallel and sequential platforms. These methods are constructed from sequences of orthogonal transformations and are foundational in broader communication-avoiding factorizations such as CAQR (Communication-Avoiding QR), which targets distributed and large-scale linear algebra problems (0806.2159).

1. Structural Hierarchy of Two-Level TSQR

TSQR factors an $m \times n$ matrix $A$ using two levels of orthogonal transformations. The matrix is partitioned into $P$ block-rows $A_0, \ldots, A_{P-1}$, each of size $\frac{m}{P} \times n$. The algorithm comprises:

First level (local QR): Each processor $i$ performs a Householder QR factorization on its block $A_i$:
$$A_i = Q_{i,0} R_{i,0}, \qquad Q_{i,0} = \prod_{j=1}^n \left(I - \tau^i_j v^i_j (v^i_j)^T\right)$$
where $v^i_j$ are the local Householder vectors and $\tau^i_j$ their scalars.

Second level (global reduction): The upper triangular $R_{i,0}$ from each block-row are stacked in a pairwise manner along a binary tree structure. At each level $k$, matrices $R_{\ell, k-1}$ and $R_{r, k-1}$ are concatenated,
$$C_{i,k} = \begin{pmatrix} R_{\ell, k-1} \\ R_{r, k-1} \end{pmatrix},$$
and a new QR factorization $C_{i,k} = Q_{i,k} R_{i,k}$ is computed. The process recurses until the root yields the global $R$.

The overall block notation for the two-level factorization is
$$A = \mathrm{diag}(Q_{i,0}) \cdot \mathrm{diag}(Q_{i,1}) \cdots \mathrm{diag}(Q_{i,\log P})\, R_\mathrm{root}.$$
This design explicitly exposes the hierarchical communication and computation pattern inherent in TSQR (0806.2159).
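The two-level identity above can be checked numerically. A minimal sketch for $P = 2$ block-rows, using NumPy's Householder-based qr for the local factorizations (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))

# First level: independent QR of the two block-rows A_0, A_1.
A0, A1 = A[: m // 2], A[m // 2 :]
Q00, R00 = np.linalg.qr(A0)
Q10, R10 = np.linalg.qr(A1)

# Second level: factor the stacked triangular factors.
Q1, R_root = np.linalg.qr(np.vstack([R00, R10]))

# Reassemble A = diag(Q00, Q10) @ Q1 @ R_root and compare.
Q_local = np.block([[Q00, np.zeros_like(Q00)],
                    [np.zeros_like(Q10), Q10]])
A_rebuilt = Q_local @ Q1 @ R_root
print(np.allclose(A_rebuilt, A))  # True
```

Because $\mathrm{diag}(Q_{0,0}, Q_{1,0})$ applied to the stacked $R$ factors reproduces the original block-rows, the product of the two levels recovers $A$ exactly (up to roundoff).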

2. Algorithmic Implementations and Pseudocode

Two principal implementations leverage this structure: a parallel version utilizing a binary reduction tree and a sequential “flat tree” version optimal for out-of-core settings.

Parallel TSQR: The matrix data are distributed in a 1-D block row layout, with the reduction following a binary tree across $P$ processors. The essential steps are:

  1. Local QR on each block $A_i$ to obtain $[Y_{i,0}, \tau_{i,0}, R_{i,0}]$.
  2. For $k = 1$ to $\log_2 P$:
    • If processor $i$ is first in its pair, it receives a block $R_{j,k-1}$, stacks, computes QR, and sends $R_{i,k}$ upward unless it is at the root.
    • Otherwise, it sends $R_{i,k-1}$ and exits.
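The binary-tree reduction can be simulated sequentially. A sketch assuming NumPy and a power-of-two $P$, carrying only the $R$ factors up the tree:

```python
import numpy as np

rng = np.random.default_rng(1)
P, m, n = 4, 16, 3          # illustrative sizes with m/P >= n
A = rng.standard_normal((m, n))
blocks = np.array_split(A, P)

# Level 0: local Householder QR on each block-row.
Rs = [np.linalg.qr(Ai)[1] for Ai in blocks]

# Levels 1..log2(P): pairwise stack-and-factor along the binary tree.
while len(Rs) > 1:
    Rs = [np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]))[1]
          for i in range(0, len(Rs), 2)]
R_root = Rs[0]

# R_root matches the R of a direct QR of A up to row signs.
R_ref = np.linalg.qr(A)[1]
print(np.allclose(np.abs(R_root), np.abs(R_ref)))  # True
```

Each `while` iteration corresponds to one tree level $k$; in the real parallel algorithm the pairs would be factored concurrently on different processors, with one message per level.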

Sequential TSQR: Optimized for disk-resident data, the process reads one block at a time into fast memory, factors the first block, and successively stacks and factors with the previous $R$:

  1. Read $A_0$; compute QR.
  2. For $k = 1$ to $P-1$:
    • Read $A_k$, stack with $R_{0,k-1}$, compute QR.
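A sketch of the flat-tree loop, with in-memory arrays standing in for the out-of-core block reads (each `blocks[k]` plays the role of a block read from disk; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
P, m, n = 5, 20, 4
A = rng.standard_normal((m, n))
blocks = np.array_split(A, P)   # stand-ins for blocks read from disk

# Step 1: factor the first block resident in fast memory.
R = np.linalg.qr(blocks[0])[1]

# Step 2: read each subsequent block, stack it under the current R, refactor.
for k in range(1, P):
    R = np.linalg.qr(np.vstack([R, blocks[k]]))[1]

# The final R agrees with a direct QR of A up to row signs.
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1])))  # True
```

Only one block plus one $n \times n$ triangle ever needs to reside in fast memory at a time, which is what makes the flat tree suitable for out-of-core execution.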

Both versions retain the $Y$ and $\tau$ vectors/matrices, thus storing $Q$ in a compact, implicit form.

3. Communication and Computational Complexity

Letting $\alpha$ denote latency, $\beta$ inverse bandwidth, and $\gamma$ the time per floating-point operation, TSQR minimizes data movement within the constraints of hierarchical-memory or distributed-memory settings:

Parallel TSQR (on $P$ processors, with $m/P \geq n$):

  • Floating-point operations: $\frac{2mn^2}{P} + \frac{2n^3}{3}\log_2 P$
  • Words moved: $\frac{n^2}{2}\log_2 P$
  • Messages: $\log_2 P$

Sequential TSQR (fast memory of size $W$):

  • Flops: $2mn^2 - \frac{2}{3}n^3$
  • Words: $2mn + O(n^2) + \frac{mn^2}{W} \approx 2mn$
  • Messages: $\frac{2mn}{W} + O(n) \approx \frac{2mn}{W}$

These asymptotic bounds show that TSQR and its variants approach known lower bounds for communication volume and latency, up to polylogarithmic factors (0806.2159).
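Plugging sample sizes into the parallel counts above makes the balance concrete (the values $m = 10^6$, $n = 10$, $P = 1024$ are illustrative, not from the source):

```python
from math import log2

m, n, P = 1_000_000, 10, 1024

# Parallel TSQR counts from the formulas above.
flops = 2 * m * n**2 / P + (2 * n**3 / 3) * log2(P)
words = (n**2 / 2) * log2(P)
messages = log2(P)

print(f"flops ≈ {flops:.4g}, words = {words}, messages = {messages}")
```

For such tall-skinny shapes the flop count is dominated by the local $2mn^2/P$ term, while only $O(n^2 \log P)$ words and $\log_2 P$ messages cross the network.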

4. Numerical Stability and Comparison

TSQR is composed strictly of orthogonal transformations—either Householder reflectors or Givens rotations—guaranteeing the same backward error bounds as classical Householder QR:

  • $\|I - Q^T Q\| = O(\varepsilon)$; orthogonality is preserved to machine precision.

In contrast, alternative QR algorithms have weaker stability properties:

  • CholeskyQR: $O(\varepsilon\, \kappa(A)^2)$ loss of orthogonality.
  • Modified Gram-Schmidt: $O(\varepsilon\, \kappa(A))$.
  • Classical Gram-Schmidt: arbitrarily poor for ill-conditioned matrices.

TSQR is thus especially preferable for ill-conditioned problems where loss of orthogonality is a concern (0806.2159).
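The stability gap can be observed directly. A sketch that builds a test matrix with condition number about $10^6$ (an assumed example) and compares the orthogonality loss of Householder QR against CholeskyQR:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 20

# Ill-conditioned test matrix via an SVD with logarithmically spaced
# singular values (condition number ~1e6 -- an assumption for the demo).
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -6, n)) @ V.T

# Householder QR (numpy's qr): orthogonality near machine precision.
Qh = np.linalg.qr(A)[0]
err_house = np.linalg.norm(np.eye(n) - Qh.T @ Qh)

# CholeskyQR: Q = A R^{-1} with R the Cholesky factor of A^T A;
# loses orthogonality like eps * kappa(A)^2.
R = np.linalg.cholesky(A.T @ A).T
Qc = A @ np.linalg.inv(R)
err_chol = np.linalg.norm(np.eye(n) - Qc.T @ Qc)

print(err_house, err_chol)  # err_chol is many orders of magnitude larger
```

Since TSQR is built from the same orthogonal transformations as Householder QR, its orthogonality error tracks `err_house`, not `err_chol`.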

5. Givens-Rotation Variant of Two-Level Factorization

The TSQR scheme can be implemented with Givens rotations instead of Householder reflectors. For each local block $[B; C]$, Givens rotations $G_{i,j}$ are applied sequentially to eliminate subdiagonal entries. The reduction phase replaces the local Householder QR with a sequence of Givens rotations applied to stacked pairs of $n \times n$ upper-triangular $R$ factors, with the $(i, j, c, s)$ parameters recorded for reconstructing $Q$ or $Q^T$ via forward or reverse application.

Critically, the communication profile (in terms of words moved and messages) remains unchanged in the Givens variant relative to the Householder approach; the choice affects only the local computational pattern (0806.2159).

6. Integration in Communication-Avoiding QR (CAQR)

CAQR applies TSQR as its panel factorization primitive within a more general 2D block-cyclic data layout. For a general $m \times n$ matrix distributed on a $P_r \times P_c$ processor grid:

  1. Each panel is factored using TSQR across the $P_r$ processors corresponding to that block column.
  2. The resulting $Q$ factors (the $Y$ and $\tau$ representations) are broadcast across processor rows.
  3. The $Q^T$ transformation is applied to update the trailing matrix blocks, following TSQR logic.

CAQR thus achieves the optimal computational bound $O\!\left(\frac{mn^2}{P}\right)$ and matches known lower bounds for words and messages in the 2D setting, modulo polylogarithmic factors:

  • Words: $O(\sqrt{mn^3/P})$
  • Messages: $O(\sqrt{nP/m})$

The optimality of CAQR directly follows from the communication and compute properties of its TSQR component (0806.2159).
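Steps 1-3 can be sketched as a shared-memory blocked QR in which a dense panel QR stands in for TSQR (the function name and sizes are illustrative, not from the source):

```python
import numpy as np

def blocked_qr_R(A, b):
    """R factor of A via panel-by-panel factorization with panel width b.

    Schematic only: in CAQR proper, each panel QR is a TSQR across the
    P_r processors owning that block column (step 1), and the trailing
    update (step 3) is applied blockwise from the broadcast implicit
    Y, tau factors (step 2). Here both are done with explicit factors.
    """
    A = A.astype(float).copy()
    m, n = A.shape
    for k in range(0, n, b):
        # Step 1 (stand-in for TSQR): factor the current panel.
        Q, R = np.linalg.qr(A[k:, k : k + b], mode="complete")
        A[k:, k : k + b] = R
        # Step 3: apply Q^T to the trailing matrix blocks.
        A[k:, k + b :] = Q.T @ A[k:, k + b :]
    return np.triu(A[:n])

rng = np.random.default_rng(5)
A = rng.standard_normal((12, 6))
R = blocked_qr_R(A, b=2)
R_ref = np.linalg.qr(A)[1]
print(np.allclose(np.abs(R), np.abs(R_ref)))  # True up to row signs
```

The communication savings of CAQR come entirely from replacing the panel step with TSQR and keeping $Q$ implicit; the trailing-update structure is the same as in standard blocked QR.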

References (1)

  1. arXiv:0806.2159
