Cyclops Tensor Framework (CTF)

Updated 7 February 2026
  • CTF is a high-level, modular library that provides distributed-memory tensor contractions and transpositions using advanced node-aware optimizations.
  • It uses a SUMMA-based strategy and optimized processor-grid selection to reduce communication overhead and enhance scalability on modern supercomputers.
  • CTF’s integration with quantum chemistry applications demonstrates significant speedups, memory efficiency, and improved performance over conventional methods.

The Cyclops Tensor Framework (CTF) is a modular high-level software library providing distributed-memory tensor algebra capabilities, aimed at facilitating efficient parallel execution of tensor contractions, transpositions, and operations with explicit support for tensor symmetries. CTF orchestrates contractions over multidimensional arrays distributed across a multi-dimensional processor grid, and incorporates advanced planning logic and node-aware communication optimizations to minimize communication complexity and improve scalability, particularly on modern supercomputing architectures with many-core nodes and hierarchical memory and network topologies (Irmler et al., 2023). It supports C and C++ APIs and is designed to interoperate with high-level scientific codes, notably in electronic structure and quantum chemistry computations.

1. Modular Architecture and Baseline Distributed Contraction

The CTF library employs a design in which each tensor is represented as a multidimensional array distributed across a p_1 × p_2 × ⋯ × p_d processor grid. The core functionality includes contraction routines, transposition operators, and support for tensor symmetries. Contractions are decomposed via the SUMMA paradigm into a sequence of two-dimensional matrix multiplications, reductions, and replications. When additional memory is available, CTF employs a 2.5D memory–communication trade-off to further reduce communication overhead.
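
The SUMMA decomposition can be illustrated with a minimal serial sketch: each SUMMA step processes one k-panel, whose operands would be broadcast along processor-grid fibers before a local matrix-matrix product. The function name and block-size parameter below are illustrative, not part of CTF's API.

```python
# Minimal pure-Python sketch of a SUMMA-style blocked multiplication,
# C = A @ B, showing how a contraction reduces to a sequence of panel
# "broadcasts" plus local matrix-matrix products. `bk` is the k-panel
# width; the helper name is illustrative, not CTF's actual internals.

def matmul_summa(A, B, bk):
    """Multiply dense matrices (lists of lists) in k-panels of width bk."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, bk):               # one SUMMA step per k-panel
        k1 = min(k0 + bk, k)
        # In a distributed run, the A-panel is broadcast along grid rows
        # and the B-panel along grid columns; here both are sliced locally.
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for kk in range(k0, k1):     # local matmul on the panel
                    acc += A[i][kk] * B[kk][j]
                C[i][j] += acc               # accumulate partial result
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_summa(A, B, bk=1))  # [[19.0, 22.0], [43.0, 50.0]]
```

In the distributed setting the two slicing operations become broadcasts along grid fibers, and the accumulation into C becomes a reduction of partial results across processors.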

Upon invocation, the CTF runtime selects an optimal processor grid for the contraction by evaluating a performance model that balances local floating-point operations (FLOPs), communication volume, and any required data redistribution. Input tensors are typically redistributed to the selected grid through a round of point-to-point messages. SUMMA-style contractions proceed in multiple steps: operand panels are broadcast along grid fibers, followed by local matrix-matrix multiplication and, finally, reduction of partial results across processors. Communication costs in the baseline implementation are characterized by a per-processor bandwidth cost proportional to Σ_i W_i (where W_i is the broadcast or reduction size per fiber) and a latency cost of O(log p) per SUMMA step. The default model does not distinguish between intra- and inter-node communication asymmetries (Irmler et al., 2023).
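
As a rough illustration of this baseline model, the sketch below combines the Σ_i W_i bandwidth term with a log-p latency term per SUMMA step. The constants alpha and beta and the function name are illustrative assumptions, not values from CTF.

```python
import math

# Toy version of the baseline (non-node-aware) communication cost model:
# bandwidth cost proportional to the sum of broadcast/reduction volumes W_i,
# plus O(log p) latency per SUMMA step. alpha (per-message latency) and
# beta (per-word transfer time) are made-up constants for illustration.

def baseline_cost(W, p, steps, alpha=1e-6, beta=1e-9):
    """W: per-fiber broadcast/reduction sizes; p: total MPI ranks."""
    bandwidth = beta * sum(W)                # volume term: proportional to sum W_i
    latency = alpha * steps * math.log2(p)   # log(p) latency per SUMMA step
    return bandwidth + latency

print(baseline_cost(W=[1_000_000, 500_000], p=1024, steps=16))
```

Note that the model charges all communication at one rate; the node-aware extension in the next section splits the volume term into intra- and inter-node components.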

2. Node-Aware Processor-Grid Selection and Communication Minimization

Modern clusters typically feature multi-core nodes, each hosting multiple MPI ranks. Collective communication operations such as broadcast and reduction suffer increased contention and network volume when communicators span multiple nodes. The node-aware processor-grid algorithm is therefore implemented in CTF to minimize the volume of inter-node communication by optimizing the processor grid assignments along each tensor dimension.

Given a processor-grid extent p_i along each tensor dimension and m MPI ranks per node, the grid is factorized such that each p_i is split into m_i intra-node and p_i/m_i inter-node components, with ∏_i m_i = m and p_i mod m_i = 0. The inter-node communication volume for the contraction is quantified as

V = Σ_{i=1}^{d} W_i (p_i/m_i − 1),

where W_i is the data volume per fiber. The tuple (m_1, …, m_d) that minimizes V is selected; an exhaustive search is feasible up to d ≈ 5–6 dimensions, while higher orders may require heuristics. The chosen node-aware grid is realized by a one-time data redistribution. Broadcasts and reductions are then performed on node-local sub-communicators using shared memory, with a single inter-node message per fiber. The new scheme is user-accessible through the CTF API by enabling enable_node_aware(true) on the tensor-contraction planner (Irmler et al., 2023).
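
The minimization described above lends itself to a direct exhaustive search over valid intra-node splits. The sketch below is illustrative, not CTF's planner code: it enumerates divisor tuples (m_1, …, m_d) with ∏_i m_i = m and m_i dividing p_i, and returns the split minimizing V = Σ_i W_i (p_i/m_i − 1).

```python
from itertools import product

# Exhaustive search for the intra-node split (m_1, ..., m_d) minimizing the
# inter-node volume V = sum_i W_i * (p_i/m_i - 1), subject to prod_i m_i = m
# and m_i | p_i. A sketch of the idea, not CTF's actual implementation.

def divisors(n):
    return [k for k in range(1, n + 1) if n % k == 0]

def best_intra_node_split(p, W, m):
    """p: grid extents per dimension; W: per-fiber volumes; m: ranks per node."""
    best = None
    for ms in product(*(divisors(pi) for pi in p)):
        prod = 1
        for mi in ms:
            prod *= mi
        if prod != m:
            continue                         # must use exactly m ranks per node
        V = sum(Wi * (pi // mi - 1) for Wi, pi, mi in zip(W, p, ms))
        if best is None or V < best[1]:
            best = (ms, V)
    return best  # ((m_1, ..., m_d), minimal inter-node volume V)

# 3-D grid of 8 x 8 x 4 = 256 ranks, 16 ranks per node, per-fiber volumes W:
print(best_intra_node_split(p=[8, 8, 4], W=[100, 400, 50], m=16))
# ((2, 8, 1), 450): the split keeps the heaviest fiber (W = 400) entirely
# on-node, driving its inter-node term to zero.
```

The search space is the product of the divisor counts of the p_i, which explains why exhaustive enumeration stays cheap up to roughly d ≈ 5–6 but needs heuristics beyond that.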

3. CTF Implementation Details and API Integration

The node-aware extension necessitated enhancements to the internal Grid object, introducing hierarchies for inter-node and intra-node dimensions alongside per-fiber communicator handles. The "redistributor" routines orchestrate data reblocking between old and new grid layouts via MPI_Alltoallv-like exchanges. Only minimal modifications to CTF's SUMMA driver were required—broadcasts and reductions now utilize node-local shared memory whenever possible. The extension is transparent to existing CTF tensor-contraction routines, including symmetric-packed format support, thereby preserving API compatibility. Users employ the enable_node_aware flag to activate node-aware optimization for individual contractions (Irmler et al., 2023).
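
The reblocking step can be pictured as computing, for each element a rank owns under the old layout, its owner under the new layout, which yields the per-pair send counts that an MPI_Alltoallv-style exchange consumes. The 1-D cyclic sketch below is illustrative only; CTF's actual redistributor handles multidimensional blocked-cyclic layouts.

```python
# Sketch of the "redistributor" idea: for each element of a cyclically
# distributed 1-D array, map its owner under the old grid to its owner
# under the new grid, producing Alltoallv-style send counts.
# Names and the 1-D layout are illustrative assumptions.

def alltoallv_send_counts(n, p_old, p_new):
    """Count how many elements each old-grid rank sends to each new-grid rank
    for a length-n cyclically distributed vector."""
    counts = [[0] * p_new for _ in range(p_old)]
    for idx in range(n):
        counts[idx % p_old][idx % p_new] += 1  # cyclic owner: old -> new
    return counts

print(alltoallv_send_counts(n=8, p_old=2, p_new=4))
# [[2, 0, 2, 0], [0, 2, 0, 2]]
```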

4. Performance Characterization on Modern Supercomputing Hardware

CTF's node-aware capabilities have been extensively benchmarked on several large-scale systems, notably the Raven (MPCDF, 72-core Intel IceLake) and Stampede2 (68-core Xeon Phi) clusters. The principal points of reference include the default CTF implementation (CTF-def), the node-aware variation (CTF-na), ScaLAPACK (Intel MKL), and COSMA (with both limited and unlimited memory regimes). Tests considered various contraction shapes: square, "large K," "large M," and "small K" geometries under both strong and weak scaling.

Empirical highlights include:

  • For square contractions, CTF-na yields up to 2.6× speedup relative to CTF-def at node counts above 50, and up to 5× in extreme cases.
  • In "large K" cases, strong advantages over ScaLAPACK (up to 4.1× at high node counts) are observed, with CTF-na matching or outperforming COSMA-limited below 50 nodes and requiring far less memory than COSMA-unlimited.
  • Weak scaling results demonstrate performance stability with CTF-na across increasing node counts (Irmler et al., 2023).

Contraction shape | Speedup vs CTF-def (>50 nodes) | Speedup vs ScaLAPACK (>50 nodes) | Speedup vs COSMA-lim (>50 nodes)
Square | 2.6× | 2.3× | 1.8×
Large K | 1.0× | 4.1× | 1.3×
Large M | 1.3× | 1.3× | 1.6×
Small K | 1.6× | 1.5× | 1.7×

5. Integration with Quantum-Chemical Coupled-Cluster Workloads

CTF's compatibility with electronic-structure applications is demonstrated via integration into quantum-chemical coupled-cluster codes, specifically Cc4s for CCSD (coupled-cluster singles and doubles) and drCCD amplitude updates. The CTF planner, with node-awareness enabled, selects grids to optimize for each contraction, incorporating the redistribution cost in its local performance model.

Experimental runs with realistic problem sizes (occupied dimensions 116–164, virtual dimensions 1,161–1,642, and node counts up to 128) show that:

  • For the drCCD term, switching from default to node-aware CTF yields minor improvement at small scales (~1.3%) but up to 3× higher local GFLOPS per core at ~100 nodes.
  • For CCSD (excluding on-the-fly terms), per-core GFLOPS improvements from CTF-def to CTF-na are 22% at 32 nodes, 47% at 72 nodes, and 0.5% at 128 nodes where the default allocation was already optimal.
  • Node-aware processor grids ensure consistent performance, avoiding deleterious allocations occasionally selected by default heuristics, especially when broadcast volumes W_i are balanced across contractions (Irmler et al., 2023).

6. Practical Usage Guidelines and Limitations

Node-aware grid optimization in CTF confers the greatest benefit for large-scale jobs (≫50 nodes) with many MPI ranks per node, particularly when contraction structures are balanced (e.g., square or small-K) and would otherwise incur substantial inter-node broadcast cost. The overhead comprises an extra data-redistribution step per contraction (explicitly modeled in performance selection) and negligible extra intra-node memory for the new communicators. The search for optimal intra-node factorizations of the processor grid is tractable for contraction orders up to d ≈ 5–6; beyond this, heuristic approaches may be required. For highly skewed contraction shapes ("large K" or "large M"), the default grid allocation already minimizes inter-node volume, and node-aware strategies confer little to no advantage.
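
The trade-off described above reduces to a simple comparison: the inter-node volume saved must exceed the one-time redistribution cost. A toy decision sketch, in which the function name and all numbers are illustrative assumptions rather than CTF's planner model:

```python
# Toy decision model: node-aware grids pay off only when the inter-node
# volume saved exceeds the one-time redistribution cost of moving to the
# new layout. All quantities are in arbitrary comparable units.

def node_aware_pays_off(V_default, V_node_aware, redistribution_cost):
    """Compare total modeled communication with and without the extra
    redistribution step that a node-aware grid requires."""
    return V_node_aware + redistribution_cost < V_default

# Balanced (square-like) contraction: large saving dwarfs redistribution.
print(node_aware_pays_off(V_default=1100, V_node_aware=450,
                          redistribution_cost=200))   # True
# Skewed ("large K") contraction: default grid is already near-minimal.
print(node_aware_pays_off(V_default=460, V_node_aware=450,
                          redistribution_cost=200))   # False
```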

A plausible implication is that for workflows dominated by unbalanced contraction geometries or frequent tiny contractions, node-aware redistribution costs may outweigh runtime reductions. Nevertheless, the performance modeling within CTF's planner accurately accounts for such scenarios and selects accordingly (Irmler et al., 2023).

7. Summary and Outlook

The Cyclops Tensor Framework, augmented with node-aware processor grid optimization, achieves substantial speedups in distributed tensor contraction workloads without recourse to hardware-specific MPI tuning. The introduction of hierarchical grid layouts and the explicit minimization of the inter-node communication cost model

V = Σ_{i=1}^{d} W_i (p_i/m_i − 1)

enables up to 5× acceleration on matrix–matrix multiplication at scale and 20–50% gains in strongly scaled coupled-cluster builds. CTF maintains API compatibility and requires only a single flag activation to engage these optimizations. The framework remains competitive with state-of-the-art libraries such as ScaLAPACK and COSMA, particularly when memory constraints are stringent. Continued evolution can be expected in the domain of automatic grid selection heuristics for higher-order contractions and direct support for further irreducible symmetry exploitation (Irmler et al., 2023).
