
Multicore Wavefront Diamond Blocking

Updated 23 January 2026
  • Multicore Wavefront Diamond Blocking (MWD) is an advanced temporal-blocking technique for stencil computations that fuses diamond tiling and synchronized wavefront updates.
  • It significantly reduces code balance and DRAM bandwidth usage by enhancing cache reuse and increasing arithmetic intensity, leading to substantial energy savings.
  • MWD employs multi-dimensional intra-tile parallelization with cooperative thread groups, enabling efficient scaling on multicore and distributed-memory architectures.

Multicore Wavefront Diamond Blocking (MWD) is an advanced temporal-blocking technique for stencil-based numerical computations targeting multicore architectures with multilevel shared caches. MWD fuses diamond tiling, which provides spatial-temporal locality, with multicore-aware synchronized wavefront updates, enabling multi-threaded execution with shared-cache block residency. This method has proven effective in addressing the memory bandwidth bottleneck fundamental to low-arithmetic-intensity stencil codes, such as those encountered in finite-difference electromagnetics, Jacobi, and variable-coefficient operators. MWD achieves significant reductions in code balance, enhances computational intensity, and offers substantial energy savings through DRAM bandwidth minimization.

1. Diamond-Wavefront Tile Geometry

MWD partitions the computational domain $\Omega = \{0 \leq x < N_x\} \times \{0 \leq y < N_y\} \times \{0 \leq z < N_z\}$ into extruded diamond tiles and advances updates via synchronized wavefront sweeps. Each diamond is defined by its width $D_w$ in the $y$-$z$ plane and wavefront depth $B_z$ ($N_F$ in some formulations) in the $z$ (or time) direction, with wavefront width $W_w = D_w + B_z - 1$ for radius-$R=1$ stencils. The cross-section at fixed $x$ is

$T_{yz} = \{\, (y, z) \mid z_0 \leq z < z_0 + B_z,\ |y-y_0| + (z-z_0) \leq D_w - 1 \,\}$

yielding a half-diamond extruded over $B_z$ layers. The total number of $y$-$z$ points in one tile is

$A_{\text{tile}} = D_w^2/2 + D_w \cdot (B_z - 1)$

(for $B_z \leq D_w$), so each tile covers $N_x \cdot A_{\text{tile}}$ grid-point updates.
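The tile-size bookkeeping above can be sketched as a pair of helpers (the function names are illustrative, not from the papers); they simply evaluate the $A_{\text{tile}}$ formula in the $B_z \leq D_w$ regime and extrude it over the $x$-extent:

```c
#include <assert.h>

/* Cross-sectional y-z point count of one extruded half-diamond tile:
 * A_tile = Dw^2/2 + Dw*(Bz - 1), valid for Bz <= Dw (see text). */
long tile_cross_section(long Dw, long Bz) {
    assert(Bz <= Dw);
    return Dw * Dw / 2 + Dw * (Bz - 1);
}

/* Grid-point updates covered by one tile: the cross-section
 * extruded over the full x-extent Nx. */
long tile_updates(long Nx, long Dw, long Bz) {
    return Nx * tile_cross_section(Dw, Bz);
}
```

For example, $D_w = 8$, $B_z = 4$ gives $32 + 24 = 56$ cross-section points, and $N_x = 512$ yields $28{,}672$ updates per tile.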

The sweep proceeds such that dependencies from points $(y \pm 1, z-1)$ and $(y, z-1)$ always reside "behind" the advancing wavefront, enabling correctness in concurrent updates. This geometric construction underpins both spatial and temporal reuse (Malas et al., 2015, Malas et al., 2014).

2. Multi-Dimensional Intra-Tile Parallelization and Thread Groups

Traditional single-thread diamond blocking assigns an individual tile to each thread, but MWD introduces thread groups (TGs), typically sized to match cache and core constraints, that cooperatively update one tile. Three orthogonal forms of intra-tile parallelism are exploited:

  • Division along x: $T_x$ threads handle contiguous $x$-slabs, optimizing cacheline stride and hardware prefetch efficacy.
  • Division across field components: for multi-component stencils (e.g., the $6$ $\vec{E}$ and $6$ $\vec{H}$ fields in THIIM FDFD (Malas et al., 2015)), $T_c$ threads are assigned subsets of these.
  • Optional division in y: partitioning the $y$-range among $T_y$ threads can alleviate load imbalance.

All multi-threaded updates within a tile synchronize on each $z$-layer via barriers. The end-to-end process is managed by a global FIFO of ready tiles, with dependencies dynamically scheduled:

#pragma omp parallel
{
  /* Thread-group coordinates: tx selects an x-slab, tc a field-component
     subset. Tx, Tc, the tile origin (x0, y0, z0), Dw, Bz, and slab_size
     are provided by the tile scheduler. */
  int tx = omp_get_thread_num() % Tx;
  int tc = omp_get_thread_num() / Tx;
  for (int z = z0; z < z0 + Bz; ++z) {
    /* Half-diamond cross-section: |y - y0| <= Dw - 1 - (z - z0) */
    int y_min = y0 - (Dw - 1) + (z - z0);
    int y_max = y0 + (Dw - 1) - (z - z0);
    for (int y = y_min; y <= y_max; ++y) {
      /* Contiguous x-slab per thread preserves cacheline streaming
         and hardware prefetch. */
      int x_begin = x0 + tx * slab_size;
      int x_end   = x0 + (tx + 1) * slab_size;
      for (int x = x_begin; x < x_end; ++x) {
        update_fields_component_group(x, y, z, tc);
      }
    }
    #pragma omp barrier  /* thread-group sync on every z-layer */
  }
}
Cooperation within TGs reduces per-thread cache requirements, increases arithmetic intensity, and enables scalable wavefront concurrency (Malas et al., 2015, Malas et al., 2014).

3. Memory Traffic, Code Balance, and Cache-Footprint Models

A key metric in high-performance stencil codes is code balance $B_C$ (bytes per lattice update, LUP), governing the achievable memory-bound performance $P_{\mathrm{mem}} = b_S / B_C$, with $b_S$ the available memory bandwidth.

MWD cache footprint models are formulated as:

  • Cache size for a single extruded diamond:

$C_S = N_{xb}\left[N_D \left(\frac{D_w^2}{2} + D_w (N_F-1)\right) + 2 (D_w + W_w)\right]$

for $R=1$ stencils, where $N_{xb}$ is the $x$-tile size, $N_D$ the number of domain-sized arrays, and $W_w = D_w + N_F - 2$ (Malas et al., 2014, Malas et al., 2015).

  • Code balance:

$B_C = \frac{16\left[(2D_w-2)+(N_D D_w+2)\right]}{D_w^2}\ \text{[bytes/LUP]}, \quad (R=1)$

which asymptotically decreases as $D_w$ increases, with a lower bound of $16 N_D / D_w$.
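A minimal sketch of this model (function names are illustrative) evaluates $B_C$ and the resulting memory-bound estimate $P_{\mathrm{mem}} = b_S/B_C$; note that the bracketed numerator collapses exactly to $(N_D + 2)D_w$, which makes the $1/D_w$ decay explicit:

```c
#include <assert.h>

/* Code balance for a radius-1 stencil under MWD (bytes per LUP):
 * B_C = 16[(2*Dw - 2) + (ND*Dw + 2)] / Dw^2 = 16*(ND + 2)/Dw. */
double code_balance(int Dw, int ND) {
    return 16.0 * ((2 * Dw - 2) + (ND * Dw + 2)) / ((double)Dw * Dw);
}

/* Memory-bound performance estimate P_mem = b_S / B_C (LUP/s),
 * with bS_bytes_per_s the attainable memory bandwidth in bytes/s. */
double p_mem(double bS_bytes_per_s, int Dw, int ND) {
    return bS_bytes_per_s / code_balance(Dw, ND);
}
```

With $N_D = 2$ and $D_w = 16$, for instance, $B_C = 4$ bytes/LUP, and doubling $D_w$ halves $B_C$ while it stays above the $16 N_D / D_w$ bound.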

MWD drives $B_C$ towards its theoretical minimum, attaining up to an order-of-magnitude reduction in bytes/LUP compared to spatial blocking (e.g., $B_C \approx 200$–$400$ bytes/LUP vs. $1{,}216$ bytes/LUP for the spatially blocked FDFD kernel (Malas et al., 2015)) and similarly drastic reductions for other stencil schemes (Malas et al., 2014). DRAM power consumption correlates nearly linearly with code balance (Malas et al., 2014).

4. End-to-End Parallel Workflow and Distributed Implementation

MWD accommodates hybrid MPI+OpenMP parallelism by decomposing the domain along $z$ (or $t$), assigning $z$-sub-slabs and halo layers to MPI ranks. Within each rank, OpenMP manages tiled sweeps:

MPI_Init(&argc, &argv);
/* One rank per z-sub-slab; P OpenMP threads sweep diamond tiles inside it.
   Scheduler helpers (build_diamond_fifo, pop_fifo, ...) are sketched. */
#pragma omp parallel num_threads(P)
{
  if (omp_get_thread_num() == 0) {
    build_diamond_fifo();       /* seed the FIFO with dependency-free tiles */
  }
  #pragma omp barrier
  while (!all_tiles_done()) {
    int tile_id = pop_fifo();   /* thread-safe pop of a ready tile */
    /* in-tile sweep as above */
    for (int z = z0; z < z0 + Bz; ++z) {
      for (int y = y_min(z); y <= y_max(z); ++y) {
        for (int x = x0; x < x0 + Nx; ++x) {
          compute_H_updates(x, y, z);
          compute_E_updates(x, y, z);
        }
      }
      #pragma omp barrier       /* thread-group sync per z-layer */
    }
    for (/* each dependent neighbor tile nt */ ...) {
      if (all_deps_done(nt)) push_fifo(nt);  /* release newly ready tiles */
    }
  }
}
/* Halo exchange between ranks once the local sweep completes */
MPI_Irecv(halo_layers, ...);
MPI_Isend(halo_layers, ...);
MPI_Waitall(...);
MPI_Finalize();
Dynamic scheduling of diamonds via FIFO lists, local barriers within TGs, and distributed halo exchanges enable scalable distributed-memory MWD (Malas et al., 2015, Malas et al., 2014).

5. Practical Auto-Tuning for Parameter Selection

Efficient deployment of MWD requires judicious selection of tiling and thread group parameters. The practical auto-tuning strategy proceeds by:

  1. Cache-fit filter: discard $(D_w, B_z)$ combinations where the predicted $C_S$ exceeds half of the L3 cache capacity.
  2. Parameter grid search: for the remaining options, benchmark a small set of TG shapes (e.g., $T = 6, 9, 18$) for update rate.
  3. Cost-model-guided refinement: finalize the $(D_w, B_z, T)$ maximizing performance per memory access (minimizing $B_C$); prioritize larger $D_w$ where in-cache reuse can be maximized.

This three-step process identifies near-optimal configurations with only a few dozen trials, avoiding an exhaustive full-space search (Malas et al., 2015). Recommendations include maximizing $D_w$ as far as feasible, setting $N_F$ ($= B_z$) at or above the group size for concurrency, dynamically tuning $N_{xb}$ to structure leading-dimension access, and exploring TG sizes from one up to all available cores to balance cache footprint and parallel efficiency (Malas et al., 2014).
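The cache-fit filter (step 1) can be sketched as follows, using the $C_S$ model from Section 3; the helper names, the assumed 8-byte value width, and any particular L3 size are illustrative assumptions:

```c
#include <assert.h>

/* Cache footprint (in grid values) of one extruded diamond for R = 1:
 * C_S = Nxb * [ ND*(Dw^2/2 + Dw*(NF - 1)) + 2*(Dw + Ww) ],
 * with Ww = Dw + NF - 2 (Section 3). */
long footprint_values(long Nxb, long ND, long Dw, long NF) {
    long Ww = Dw + NF - 2;
    return Nxb * (ND * (Dw * Dw / 2 + Dw * (NF - 1)) + 2 * (Dw + Ww));
}

/* Step 1 of the auto-tuner: keep only (Dw, Bz) pairs whose predicted
 * footprint fits in half the L3 cache, assuming 8 bytes per value. */
int cache_fit(long Nxb, long ND, long Dw, long Bz, long l3_bytes) {
    return footprint_values(Nxb, ND, Dw, Bz) * 8 <= l3_bytes / 2;
}
```

Only the $(D_w, B_z)$ pairs surviving this filter proceed to the small TG-shape benchmark of step 2, which keeps the total trial count low.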

6. Performance, Energy, and Scalability Results

Experimental studies on Intel Haswell (18-core) and Ivy Bridge (10-core) platforms provide quantitative benchmarks:

  • Pure spatial blocking saturates memory bandwidth at 6–8 cores ($\approx 50$ GB/s, $40$ MLUP/s).
  • Single-threaded wavefront-diamond (SWD) shifts the saturation point but thrashes the L3 cache and deteriorates beyond 12 cores.
  • MWD with shared cache blocks never hits memory-bandwidth limits, maintains code balance at 200–400 bytes/LUP, and scales to all cores at $\approx 75\%$ parallel efficiency.
  • MWD achieves $3$–$4\times$ speedup versus the best spatially blocked kernels on 18 cores, with DRAM bandwidth savings of 38–80% (Malas et al., 2015).
  • Roofline analysis: MWD can approach cache-bound ceilings (e.g., $6.5$ GLUP/s in-L3 for the 7-point stencil) when $C_S$ fits in cache; at large $D_w, N_F$, memory traffic is no longer limiting (Malas et al., 2014).
  • Energy measurements show DRAM consumption dropping nearly linearly with code balance; the optimal energy-to-solution may not coincide with peak speed (e.g., 2WD gives the fastest execution while 10WD gives the lowest energy on the 7-point variable-coefficient stencil (Malas et al., 2014)).

Key empirical highlights on Intel Ivy Bridge:

Variant         GLUP/s (7pt const)   GLUP/s (7pt var)   DRAM BW savings   DRAM energy reduction
Spatial Block   1.6                  1.0                Baseline          Baseline
1WD             4.3                  2.55               30–68%            57→22 pJ/LUP
10WD            3.8                  1.15               up to 80%         77.5→70 pJ/LUP

MWD scalability has also been demonstrated in distributed-memory setups (e.g., near-ideal scaling to 16 sockets for 2WD/5WD/10WD until surface-to-volume ratio effects intervene) (Malas et al., 2014).

7. Methodological Advantages and Limitations

MWD delivers a blend of spatial and temporal blocking, exploiting shared-cache residency and multi-threaded cooperation to decouple memory bandwidth from execution performance in bandwidth-bound stencil codes. The technique markedly reduces code balance (e.g., $5$–$8$ B/LUP vs. $24$–$128$ B/LUP baseline), saves DRAM power, and accommodates fine-grained concurrency essential for strong scaling under MPI.

However, overall speedup is ultimately capped by "in-cache" performance, barrier overhead in large TGs, and cache associativity/capacity conflicts at "edge" sizes. For small domains or many-core nodes with limited diamond concurrency, TG size must be chosen accordingly. Auto-tuning remains crucial for adapting $(D_w, N_F, T)$ to specific stencil types and architectures (Malas et al., 2014, 2015).

A plausible implication is further acceleration and energy gains on future architectures with more bandwidth-starved memory subsystems, since MWD effectiveness scales with cache/memory hierarchy granularity.


References:

  • "Optimization of an electromagnetics code with multicore wavefront diamond blocking and multi-dimensional intra-tile parallelization" (Malas et al., 2015)
  • "Towards energy efficiency and maximum computational intensity for stencil algorithms using wavefront diamond temporal blocking" (Malas et al., 2014)
  • "Multicore-optimized wavefront diamond blocking for optimizing stencil updates" (Malas et al., 2014)
