Multicore Wavefront Diamond Blocking

Updated 23 January 2026

Multicore Wavefront Diamond Blocking (MWD) is an advanced temporal-blocking technique for stencil computations that fuses diamond tiling and synchronized wavefront updates.
It significantly reduces code balance and DRAM bandwidth usage by enhancing cache reuse and increasing arithmetic intensity, leading to substantial energy savings.
MWD employs multi-dimensional intra-tile parallelization with cooperative thread groups, enabling efficient scaling on multicore and distributed-memory architectures.

Multicore Wavefront Diamond Blocking (MWD) is an advanced temporal-blocking technique for stencil-based numerical computations targeting multicore architectures with multilevel shared caches. MWD fuses diamond tiling for spatial-temporal locality and multicore-aware synchronized wavefront updates, enabling multi-threaded execution with shared cache block residency. This method has proven effective in addressing the memory bandwidth bottleneck fundamental to low-arithmetic-intensity stencil codes, such as those encountered in finite-difference electromagnetics, Jacobi, and variable-coefficient operators. MWD achieves significant reductions in code balance, enhances computational intensity, and offers substantial energy savings through DRAM bandwidth minimization.

1. Diamond-Wavefront Tile Geometry

MWD partitions the computational domain $\Omega = \{0 \leq x < N_x\} \times \{0 \leq y < N_y\} \times \{0 \leq z < N_z\}$ into extruded diamond tiles and advances updates via synchronized wavefront sweeps. Each diamond is defined by its width $D_w$ in the $y$ – $z$ plane and wavefront depth $B_z$ ( $N_F$ in some formulations) in the $z$ (or time) direction, with wavefront width $W_w = D_w + B_z - 1$ for radius- $R=1$ stencils. The cross-section at fixed $x$ is

$T_{yz} = \{ (y, z) \mid z_0 \leq z < z_0 + B_z,\; |y-y_0| + (z-z_0) \leq D_w-1 \}$

yielding a half-diamond extruded over $B_z$ layers. The total $y$ – $z$ points in one tile is

$A_{\text{tile}} = D_w^2/2 + D_w \cdot (B_z - 1)$

(for $B_z \leq D_w$ ), so each tile covers $N_x \cdot A_{\text{tile}}$ grid-point updates.

The sweep proceeds such that dependencies from points $(y\pm1, z-1)$ and $(y, z-1)$ always reside “behind” the advancing wavefront, enabling correctness in concurrent updates. This geometric construction underpins both spatial and temporal reuse (Malas et al., 2015, Malas et al., 2014).

2. Multi-Dimensional Intra-Tile Parallelization and Thread Groups

Traditional single-thread diamond blocking assigns an individual tile to each thread, but MWD introduces thread groups (TGs), typically sized to match cache and core constraints, that cooperatively update one tile. Three orthogonal forms of intra-tile parallelism are exploited:

Division along x: $T_x$ threads handle contiguous $x$ -slabs, optimizing cacheline stride and hardware prefetch efficacy.
Division across field components: For multi-component stencils (e.g. $6\, \vec{E}$ and $6\, \vec{H}$ fields in THIIM FDFD (Malas et al., 2015)), $T_c$ threads are assigned subsets of these.
Optional division in y: Partitioning the $y$ -range among $T_y$ threads can alleviate load imbalance.

All multi-threaded updates within a tile synchronize on each $z$ -layer via barriers. The end-to-end process is managed by a global FIFO of ready tiles, with dependencies dynamically scheduled:

#pragma omp parallel
{
  int tx = omp_get_thread_num() % Tx;
  int tc = omp_get_thread_num() / Tx;
  for (int z = z0; z < z0 + Bz; ++z) {
    int y_min = y0 - (Dw - 1) + (z - z0);
    int y_max = y0 + (Dw - 1) - (z - z0);
    for (int y = y_min; y <= y_max; ++y) {
      int x_begin = x0 + tx * slab_size;
      int x_end = x0 + (tx+1)*slab_size;
      for (int x = x_begin; x < x_end; ++x) {
        update_fields_component_group(x,y,z,tc);
      }
    }
    #pragma omp barrier
  }
}

Cooperation within TGs reduces per-thread cache requirements, increases arithmetic intensity, and enables scalable wavefront concurrency (Malas et al., 2015, Malas et al., 2014).

3. Memory Traffic, Code Balance, and Cache-Footprint Models

A key metric in high-performance stencil codes is code balance $B_C$ (bytes per lattice update, LUP), governing the achievable memory-bound performance $P_{\mathrm{mem}}=b_S/B_C$ with $b_S$ as the available memory bandwidth.

MWD cache footprint models are formulated as:

Cache size for a single extruded diamond:

$C_S = N_{xb}\left[N_D \left(\frac{D_w^2}{2} + D_w (N_F-1)\right) + 2 (D_w + W_w)\right]$

for $R=1$ stencils, where $N_{xb}$ is the $x$ -tile size, $N_D$ the number of domain-sized arrays, and $W_w = D_w + N_F - 2$ (Malas et al., 2014, Malas et al., 2015).

Code balance:

$B_C = \frac{16\left[(2D_w-2)+(N_D D_w+2)\right]}{D_w^2}\quad\text{[bytes/LUP]}, \quad (R=1)$

which asymptotically decreases as $D_w$ increases, with a lower bound of $16 N_D/D_w$ .

MWD drives $B_C$ towards its theoretical minimum, attaining up to an order-of-magnitude reduction in bytes/LUP compared to spatial blocking (e.g., $B_C \sim 200$ –$400$ bytes/LUP vs. $1,216$ bytes/LUP for FDFD spatial block (Malas et al., 2015)) and similarly drastic reductions for other stencil schemes (Malas et al., 2014, Malas et al., 2014). DRAM power consumption correlates nearly linearly with code balance (Malas et al., 2014).

4. End-to-End Parallel Workflow and Distributed Implementation

MWD accommodates hybrid MPI+OpenMP parallelism by decomposing the domain along $z$ (or $t$ ), assigning sub- $z$ -slabs and halo layers to MPI ranks. Within each rank, OpenMP manages tiled sweeps:

MPI_Init...
#pragma omp parallel num_threads(P)
{
  if (omp_get_thread_num() == 0) {
    build_diamond_fifo();
  }
  #pragma omp barrier
  while (not all tiles done) {
    int tile_id = pop_fifo();
    // in-tile sweep as above
    for (int z=z0; z<z0+Bz; ++z) {
      for (int y=y_min(z); y<=y_max(z); ++y) {
        for (int x=x0; x<x0+Nx; ++x) {
          compute_H_updates(x,y,z);
          compute_E_updates(x,y,z);
        }
      }
      #pragma omp barrier
    }
    for (each dependent neighbor tile nt) {
      if (all its deps done) push_fifo(nt);
    }
  }
  #pragma omp barrier
  MPI_Isend(halo_layers,...);
  MPI_Irecv(halo_layers,...);
  MPI_Waitall(...);
}
MPI_Finalize();

Dynamic scheduling of diamonds via FIFO lists, local barriers within TGs, and distributed halo exchanges enable scalable distributed-memory MWD (Malas et al., 2015, Malas et al., 2014).

5. Practical Auto-Tuning for Parameter Selection

Efficient deployment of MWD requires judicious selection of tiling and thread group parameters. The practical auto-tuning strategy proceeds by:

Cache-fit filter: Discard $(D_w, B_z)$ combinations where the predicted $C_S$ exceeds half of L3 cache capacity.
Parameter grid search: For remaining options, benchmark a small set of TG shapes (e.g., $T=6,9,18$ ) for update rate.
Cost model-guided refinement: Finalize $(D_w, B_z, T)$ maximizing performance per memory access (minimizing $B_C$ ); prioritize larger $D_w$ where in-cache reuse can be maximized.

This three-step process identifies near-optimal configurations with only a few dozen trials, avoiding exhaustive full-space searches (Malas et al., 2015). Recommendations include maximizing $D_w$ as feasible, setting $N_F$ ( $=B_z$ ) at or above the group size for concurrency, dynamic tuning of $N_{xb}$ to structure leading dimension access, and exploring TG sizes from one up to all available cores for balanced cache and parallel efficiency (Malas et al., 2014, Malas et al., 2014).

6. Performance, Energy, and Scalability Results

Experimental studies on Intel Haswell (18-core) and Ivy Bridge (10-core) platforms provide quantitative benchmarks:

Pure spatial blocking saturates memory bandwidth at 6–8 cores ( $\approx 50\,\text{GB/s}$ , $40$ MLUP/s).
Single-threaded wavefront-diamond (SWD) shifts saturation but thrashes L3 cache and deteriorates beyond 12 cores.
MWD with shared cache-blocks never hits memory bandwidth limits, maintains code balance (200–400 bytes/LUP), and scales to all cores at $\approx75\%$ parallel efficiency.
MWD achieves $3$– $4\times$ speedup versus best spatial-blocked kernels on 18 cores; DRAM bandwidth savings of 38–80% (Malas et al., 2015).
Roofline analysis: MWD can approach cache-bound ceilings (e.g., $6.5$ GLUP/s in-L3 for 7-point stencil) when $C_S$ fits in cache; at large $D_w,N_F$ , memory traffic is no longer limiting (Malas et al., 2014).
Energy measurements show that DRAM consumption drops near linearly with code balance; optimal energy-to-solution may not coincide with peak speed (e.g., 2WD fastest execution, 10WD lowest energy on the 7-point variable-coefficient stencil (Malas et al., 2014)).

Key empirical highlights on Intel Ivy Bridge:

Variant	GLUP/s (7pt const)	GLUP/s (7pt var)	DRAM BW savings	DRAM energy reduction
Spatial Block	1.6	1.0	Baseline	Baseline
1WD	4.3	2.55	30–68%	57→22 pJ/LUP
10WD	3.8	1.15	up to 80%	77.5→70 pJ/LUP

MWD scalability has also been demonstrated in distributed-memory setups (e.g., near-ideal scaling to 16 sockets for 2WD/5WD/10WD until surface-to-volume ratio effects intervene) (Malas et al., 2014).

7. Methodological Advantages and Limitations

MWD delivers a blend of spatial and temporal blocking, exploiting shared-cache residency and multi-threaded cooperation to decouple memory bandwidth from execution performance in bandwidth-bound stencil codes. The technique markedly reduces code balance (e.g., $5$–$8$ B/LUP vs. $24$–$128$ B/LUP baseline), saves DRAM power, and accommodates fine-grained concurrency essential for strong scaling under MPI.

However, overall speedup is ultimately capped by “in-cache” performance, barrier overhead in large TGs, and cache associativity/capacity conflicts at “edge” sizes. For small domains or many-core nodes with limited diamond concurrency, TG size must be chosen accordingly. Auto-tuning remains crucial for adapting $(D_w,N_F,T)$ to specific stencil types and architectures (Malas et al., 2014, Malas et al., 2014, Malas et al., 2015).

A plausible implication is further acceleration and energy gains on future architectures with more bandwidth-starved memory subsystems, since MWD effectiveness scales with cache/memory hierarchy granularity.

References:

"Optimization of an electromagnetics code with multicore wavefront diamond blocking and multi-dimensional intra-tile parallelization" (Malas et al., 2015)
"Towards energy efficiency and maximum computational intensity for stencil algorithms using wavefront diamond temporal blocking" (Malas et al., 2014)
"Multicore-optimized wavefront diamond blocking for optimizing stencil updates" (Malas et al., 2014)