
K-Dimensional Tensor Tiling

Updated 30 December 2025
  • K-dimensional tensor tiling is a method that partitions multidimensional tensors across devices to minimize communication overhead in parallel deep learning.
  • It unifies data, model, and hybrid parallelism by formulating the tiling decision as a combinatorial optimization problem with explicit cost modeling.
  • The SoyBean system operationalizes these algorithms, automatically transforming serial dataflow graphs into parallel ones and running 1.5–4× faster than pure data parallelism in the reported 8-GPU AlexNet and VGG experiments.

K-dimensional tensor tiling is a systematic method for splitting multidimensional tensors across a set of computational devices such that the overall communication overhead is minimized. Central to parallel deep learning, this framework subsumes data parallelism, model parallelism, and their hybrids by representing parallel strategies as tilings of tensor dimensions. Rigorous treatment of the tensor tiling problem involves combinatorial search over partitionings, explicit modeling of communication costs induced by tiling decisions, and algorithmic solutions that are optimal under practical deep neural network (DNN) architectures. The SoyBean system exemplifies the operationalization of these concepts, transforming serial computational graphs into automatically parallelized forms through optimal tensor tiling (Wang et al., 2018).

1. Formal Definition and Canonical Problem Structure

Let $T$ be a $K$th-order tensor of shape $(n_1, n_2, \ldots, n_K)$. A $K$-dimensional tiling partitions each tensor mode $k$ into $p_k$ equal, disjoint shards, with $p_k \in \mathbb{N}^+$ and each tile of length $n_k/p_k$ (assuming $n_k$ is divisible by $p_k$). The full tiling is specified by the tuple $p = (p_1, p_2, \ldots, p_K)$. For $P$ devices, the partitioning must satisfy $\prod_{k=1}^{K} p_k = P$, with devices indexed by $K$-vectors $(i_1, \ldots, i_K)$, $0 \le i_k < p_k$.

The search space of tensor tilings encompasses all $K$-tuples of integers $(p_1, \ldots, p_K)$ with $\prod_{k} p_k = P$, where $P$ is the number of devices. With increasing device count, the number of possible splits grows combinatorially. Expanding the search space to $\prod_{k} p_k \le P$ is possible by replicating tiles, though the fundamental challenge remains the optimal split configuration.
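To make the size of this search space concrete, the sketch below enumerates every tiling tuple for a given tensor shape and device count. It assumes the constraint that the split factors multiply to the device count and that each factor divides its dimension evenly; the function names and the example shape are hypothetical, not taken from SoyBean.

```python
def divisors(n):
    """Return all positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_tilings(shape, num_devices):
    """Yield every tuple (p_1, ..., p_K) whose product is num_devices
    and whose entries evenly divide the matching tensor dimension."""
    def rec(k, remaining):
        if k == len(shape):
            # A tuple is valid only if all devices have been used up.
            if remaining == 1:
                yield ()
            return
        for d in divisors(remaining):
            if shape[k] % d == 0:
                for rest in rec(k + 1, remaining // d):
                    yield (d,) + rest
    yield from rec(0, num_devices)


# Hypothetical 4D activation tensor: (batch, channels, height, width).
tilings = list(enumerate_tilings((256, 64, 32, 32), 4))
print(len(tilings))
for t in tilings:
    print(t)
```

Even for one tensor the count grows quickly with the device count, and a full network multiplies these choices across tensors, which is why exhaustive search is impractical.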

2. Communication-Cost Modeling for Operator Graphs

A neural network dataflow graph $G$ consists of tensor operations (edges $e \in E$). Each operator $e$ reads input tensors, applies functions, and produces output tensors. Once a global tiling $\mathcal{T}$ is assigned, tensor partitions induce “halo” exchanges or reductions when tiles cross device boundaries.

The total communication volume under a given tiling is formulated as

$$C(\mathcal{T}) = \sum_{e \in E} w_e \, c_e(\mathcal{T})$$

where $w_e$ denotes the per-element data size (in bytes) or an operation-specific weight, and $c_e(\mathcal{T})$ computes the number of elements communicated across devices for operator $e$ under tiling $\mathcal{T}$.

Examining matrix-matrix multiplication, each possible tiling (an assignment of splits or replications to the operands and the result) leads to a distinct communication pattern. These patterns often fall into a small set of “aligned” cases: those incurring zero communication and those requiring a reduction of partial results. The cost function

$$c(t_{\text{in}} \rightarrow t_{\text{out}})$$

governs the conversion and reduction costs between tiling states.
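The two aligned cases for matrix multiplication can be checked on a tiny example. The sketch below uses plain Python lists and two simulated "devices"; the helper names and matrices are illustrative assumptions. Splitting the rows of the left operand needs no communication (results are simply concatenated), while splitting the inner dimension yields partial products that must be summed.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]          # 2 x 4
B = [[1, 0],
     [0, 1],
     [2, 1],
     [1, 2]]                # 4 x 2
full = matmul(A, B)

# Case 1: split rows of A across two devices, replicate B.
# Each device owns a disjoint slice of the result: zero communication.
row_split = matmul(A[:1], B) + matmul(A[1:], B)

# Case 2: split the inner dimension (columns of A, rows of B).
# Each device holds a partial sum of the full result: a reduction is required.
partial0 = matmul([r[:2] for r in A], B[:2])
partial1 = matmul([r[2:] for r in A], B[2:])
reduced = add(partial0, partial1)

print(full, row_split, reduced)
```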

Special parallelization cases correspond to:

  • Data parallelism: $p = (P, 1, \ldots, 1)$ for $P$ devices (batch split, other modes replicated).
  • Model parallelism: $p_k = P$ for some mode $k \ne 1$, all others $p_j = 1$ (e.g., channel split).
  • Hybrid parallelism: hierarchical multi-mode cuts, e.g., $p = (p_1, p_2, 1, \ldots, 1)$ with $p_1 p_2 = P$, which reflects splitting one mode, then another within each subgroup.
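As an illustration of how these three cases differ only in the tiling tuple, the following sketch computes the per-device shard shape and the index slices of one tile. The tensor shape, the device count of four, and the helper names are assumptions chosen for the example.

```python
from itertools import product


def shard_shape(shape, tiling):
    """Shape of one tile when mode k is cut into tiling[k] equal parts."""
    if len(shape) != len(tiling):
        raise ValueError("shape and tiling must have the same length")
    for n, p in zip(shape, tiling):
        if n % p != 0:
            raise ValueError(f"dimension {n} is not divisible by {p}")
    return tuple(n // p for n, p in zip(shape, tiling))


def shard_slices(shape, tiling, index):
    """Slices selecting the tile with index vector `index` (0 <= i_k < p_k)."""
    tile = shard_shape(shape, tiling)
    return tuple(slice(i * t, (i + 1) * t) for i, t in zip(index, tile))


shape = (256, 64, 32, 32)  # hypothetical (batch, channels, height, width)
strategies = {
    "data": (4, 1, 1, 1),    # split batch only
    "model": (1, 4, 1, 1),   # split channels only
    "hybrid": (2, 2, 1, 1),  # split batch, then channels
}
for name, tiling in strategies.items():
    # One index vector per device.
    device_indices = list(product(*(range(p) for p in tiling)))
    print(name, shard_shape(shape, tiling), len(device_indices))
```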

3. Globally Optimal Tiling Algorithms

The optimization problem requires joint selection of splits across all tensors due to dependency couplings in $G$. Independent tiling per tensor is suboptimal. The solution involves a multistage recursion:

a) One-cut (2-way) tiling:

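  • Transform $G$ to an undirected variant $G'$, unroll via BFS into levels $L_0, L_1, \ldots, L_n$.
  • Dynamic programming computes $g_\ell(\tau_\ell)$, the minimal communication up to level $\ell$ with boundary tensors in tiling $\tau_\ell$:

$$g_\ell(\tau_\ell) = \min_{\tau_{\ell-1}} \left[ g_{\ell-1}(\tau_{\ell-1}) + \mathrm{cost}_\ell(\tau_{\ell-1}, \tau_\ell) \right]$$

  • For chain-like DNNs, the DP has $O(n \cdot S^2)$ complexity (where $n$ is the number of levels and $S$, the number of candidate tilings per level, is small).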
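The level-by-level dynamic program can be sketched for a chain as follows. This is a minimal illustration, not SoyBean's implementation: the two tiling states ("row", "col"), the toy cost function, and all sizes are hypothetical. The only assumption carried over from the text is that the best cost at each level depends on the tiling chosen at the previous level.

```python
def one_cut_dp(num_levels, states, level_cost):
    """Minimize total cost over a chain of levels.

    g[t] holds the minimal cost up to the current level when that level's
    boundary tensors use tiling t. level_cost(l, s, t) is the cost of
    level l when the previous level uses s and this level uses t.
    """
    g = {t: 0 for t in states}
    back = []  # back[l-1][t] = best previous state for level l in state t
    for l in range(1, num_levels):
        new_g, choice = {}, {}
        for t in states:
            best_s = min(states, key=lambda s: g[s] + level_cost(l, s, t))
            new_g[t] = g[best_s] + level_cost(l, best_s, t)
            choice[t] = best_s
        back.append(choice)
        g = new_g
    # Recover the optimal sequence of tilings by walking the choices backward.
    last = min(states, key=lambda t: g[t])
    plan = [last]
    for choice in reversed(back):
        plan.append(choice[plan[-1]])
    plan.reverse()
    return g[last], plan


# Toy model (all numbers hypothetical): "row" pays the weight size
# (gradient all-reduce), "col" pays the activation size (output reduce),
# and switching tilings between levels pays one activation-sized conversion.
weight_size = {1: 1, 2: 1, 3: 100}
act_size = {1: 10, 2: 10, 3: 10}


def toy_cost(l, s, t):
    convert = act_size[l] if s != t else 0
    compute = weight_size[l] if t == "row" else act_size[l]
    return convert + compute


cost, plan = one_cut_dp(4, ["row", "col"], toy_cost)
print(cost, plan)
```

In this toy chain the cheap early layers stay in the "row" tiling and the final weight-heavy layer switches to "col", the kind of per-layer mix that a single global strategy cannot express.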

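b) Recursive $k$-cut for $2^k$ devices:

  • Recursively apply one-cut tiling to split into two groups, then within each group solve for $k-1$ further cuts.
  • Let $\delta_i$ denote the communication cost of the $i$-th cut within one group. Define recursively: $c_i = c_{i-1} + 2^{i-1}\delta_i$, with $c_0 = 0$.
  • The total communication cost is $c_k = \sum_{i=1}^{k} 2^{i-1}\delta_i$. The recursion is globally optimal in polynomial time due to cut commutativity (“flattening theorem”) and a greedy property of the per-cut costs. Overall, complexity is that of $k$ one-cut dynamic programs.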
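The way per-cut costs accumulate can be sketched numerically. The example assumes that the $i$-th cut is repeated inside each of the $2^{i-1}$ groups created by earlier cuts and costs the same amount in every group; the function names and the per-cut costs are hypothetical.

```python
def total_cut_cost(deltas):
    """Sum the per-cut costs, weighting cut i by the 2**(i-1) groups it runs in."""
    total = 0
    for i, delta in enumerate(deltas, start=1):
        total += 2 ** (i - 1) * delta
    return total


def total_cut_cost_recursive(deltas):
    """Same quantity, written as: first cut, plus two halves that each
    pay for the remaining cuts."""
    if not deltas:
        return 0
    return deltas[0] + 2 * total_cut_cost_recursive(deltas[1:])


deltas = [3, 5, 7]  # hypothetical costs of cuts 1..3, i.e. 8 devices
print(total_cut_cost(deltas), total_cut_cost_recursive(deltas))
```

Because later cuts are multiplied by larger group counts, cheap cuts are best placed late and expensive ones early, which is the intuition behind solving one cut at a time.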

4. Canonical Example: 4D Convolution Tiling

Consider a convolutional layer with activation tensor $X \in \mathbb{R}^{N \times C_{\text{in}} \times H \times W}$ and filter tensor $F \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times R \times S}$. The convolution output $Y$ has shape $N \times C_{\text{out}} \times H' \times W'$, with $Y_{n,o} = \sum_{c} X_{n,c} \ast F_{o,c}$ for $c = 1, \ldots, C_{\text{in}}$.

Three canonical parallelization/tiling schemes for $P$ devices:

  • Data parallel ($p = (P, 1, 1, 1)$): batch dimension split; no forward communication; backward all-reduce on $\nabla F$ (volume proportional to $|F|$).
  • Model parallel ($p = (1, P, 1, 1)$): split in-channel $C_{\text{in}}$; requires reduce-sum on $Y$ and analogous backward exchanges (volume proportional to $|Y|$).
  • Hybrid ($p = (p_1, p_2, 1, 1)$ with $p_1 p_2 = P$): first split batch, then split channels within groups; total communication is the weight-gradient term (all-reduce across batch groups) plus the output term (reduce-sum per group).

Numerical example:

  • nk/pkn_k/p_k0, nk/pkn_k/p_k1, nk/pkn_k/p_k2, nk/pkn_k/p_k3, nk/pkn_k/p_k4
  • nk/pkn_k/p_k5M, nk/pkn_k/p_k6M bytes
  • Data-parallel: nk/pkn_k/p_k7M bytes/iteration, model-parallel: nk/pkn_k/p_k8M, hybrid varies (e.g., nk/pkn_k/p_k9M nkn_k0 nkn_k1M nkn_k2 nkn_k3M for nkn_k4), with further trade-offs depending on nkn_k5, nkn_k6.

The dynamic programming/$k$-cut search automatically explores these hybrid strategies and selects the minimum cost configuration.
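The trade-off between the schemes can be illustrated with a deliberately simplified cost model. Everything below is an assumption made for illustration: the layer dimensions, the float32 element size, and the rule that a batch split costs one pass over each device's weight shard while a channel split costs one pass over each device's output shard. It is not the cost model or the data from Wang et al. (2018).

```python
BYTES_PER_ELEMENT = 4  # float32, assumed

# Hypothetical layer: batch 64, 512 -> 512 channels, 3x3 filters, 8x8 output.
N, C_IN, C_OUT, R, S, H_OUT, W_OUT = 64, 512, 512, 3, 3, 8, 8
weight_bytes = C_OUT * C_IN * R * S * BYTES_PER_ELEMENT
output_bytes = N * C_OUT * H_OUT * W_OUT * BYTES_PER_ELEMENT


def comm_bytes(p_batch, p_chan):
    """Toy per-device communication volume for a (p_batch, p_chan) tiling."""
    cost = 0
    if p_batch > 1:
        # Weight gradients are all-reduced across batch groups;
        # each device holds 1/p_chan of the weights.
        cost += weight_bytes // p_chan
    if p_chan > 1:
        # Partial outputs are reduced inside each batch group;
        # each device holds 1/p_batch of the batch.
        cost += output_bytes // p_batch
    return cost


num_devices = 8
costs = {
    (pb, num_devices // pb): comm_bytes(pb, num_devices // pb)
    for pb in (1, 2, 4, 8)
}
best = min(costs, key=costs.get)
for tiling, c in sorted(costs.items()):
    print(tiling, c)
print("best:", best)
```

With weight and output tensors of similar size, the hybrid tilings in this toy model communicate less than either pure strategy, which is the kind of configuration the search is meant to find.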

5. System Integration: SoyBean Architecture and Empirical Results

SoyBean processes a serial dataflow graph (e.g., from MXNet or TensorFlow) and performs:

  • Optimal Tiling: Executes the $k$-cut algorithm to assign tensor-specific tile vectors $p$.
  • Device Placement: Maps tile-index blocks to physical devices, prioritizing hardware hierarchy (first slow links, then internal cuts).
  • Graph Rewriting: Expands each operator into one sub-operator per shard and inserts the halo-exchange or all-reduce operations required to handle tiling conversions.
  • Execution: Dispatches the partitioned graph to the standard dataflow runtime.

Empirical speedups on 8-GPU hardware:

  • AlexNet (batch 256): SoyBean achieves a clear speedup over a single GPU, whereas data parallelism requires batch 1024 for comparable scaling.
  • VGG (batch 256): SoyBean achieves 5–6× at 8 GPUs; data parallelism peaks lower unless the batch size is raised above 256.
  • General result: Across AlexNet/VGG and batch sizes, SoyBean is 1.5–4× faster than pure data parallelism, as it identifies hybrid splits minimizing total communication cost (Wang et al., 2018).

6. Implications and Generalization of K-Dimensional Tensor Tiling

  • Expressiveness: Any parallelization choice, including data, model, and mixed strategies, can be posed as a $K$-vector $p$ of splits.
  • Optimality: For chain-structured DNN graphs, the $k$-cut algorithm is provably globally optimal in polynomial time.
  • Extendability: The tiling set $\{$partition along any mode $k$, replicate$\}$ can be expanded to support advanced splits (e.g., group convolution partitions), with the dynamic programming/$k$-cut machinery still directly applicable.
  • Systems Integration: Treating tensor tiling as a primary systems abstraction unifies hand-tuned parallelism strategies and can outperform them, and it can serve as a backend for any dataflow-based deep learning system.

7. Table: Parallelism Schemes under K-Dimensional Tensor Tiling

Parallelism Type    | Tiling tuple $p$ ($P$ devices)              | Communication Pattern
Data Parallelism    | $(P, 1, \ldots, 1)$                         | Batch split, all-reduce on weights
Model Parallelism   | $(1, \ldots, P, \ldots, 1)$                 | Split one model axis, reduce on output
Hybrid Parallelism  | $(p_1, p_2, 1, \ldots, 1)$, $p_1 p_2 = P$   | Hierarchical, mixes splits and reduces

These canonical strategies exemplify how the tensor tiling framework encapsulates parallelism choices and highlights the trade-offs in communication cost and empirical efficiency (Wang et al., 2018).
