Multicore Wavefront Diamond Blocking
- Multicore Wavefront Diamond Blocking (MWD) is an advanced temporal-blocking technique for stencil computations that fuses diamond tiling and synchronized wavefront updates.
- It significantly reduces code balance and DRAM bandwidth usage by enhancing cache reuse and increasing arithmetic intensity, leading to substantial energy savings.
- MWD employs multi-dimensional intra-tile parallelization with cooperative thread groups, enabling efficient scaling on multicore and distributed-memory architectures.
Multicore Wavefront Diamond Blocking (MWD) is an advanced temporal-blocking technique for stencil-based numerical computations targeting multicore architectures with multilevel shared caches. MWD fuses diamond tiling for spatial-temporal locality and multicore-aware synchronized wavefront updates, enabling multi-threaded execution with shared cache block residency. This method has proven effective in addressing the memory bandwidth bottleneck fundamental to low-arithmetic-intensity stencil codes, such as those encountered in finite-difference electromagnetics, Jacobi, and variable-coefficient operators. MWD achieves significant reductions in code balance, enhances computational intensity, and offers substantial energy savings through DRAM bandwidth minimization.
1. Diamond-Wavefront Tile Geometry
MWD partitions the computational domain into extruded diamond tiles and advances updates via synchronized wavefront sweeps. Each diamond is defined by its width $D_w$ in the $y$–$z$ plane and a wavefront depth $B_z$ ($n_{wf}$ in some formulations) in the $z$ (or time) direction, with wavefront width $R\,B_z$ for radius-$R$ stencils. The cross-section at fixed $z$ is
$$y \in \bigl[\,y_0 - (D_w - 1) + (z - z_0),\; y_0 + (D_w - 1) - (z - z_0)\,\bigr],$$
yielding a half-diamond extruded over $N_x$ layers in $x$. The total number of $y$–$z$ points in one tile is
$$\sum_{\delta z = 0}^{B_z - 1} \bigl(2(D_w - 1 - \delta z) + 1\bigr) = B_z\,(2D_w - B_z)$$
(for $B_z \le D_w$), so each tile covers $N_x\,B_z\,(2D_w - B_z)$ grid-point updates.
The sweep proceeds such that the dependencies of each point, from neighboring $y$ locations and from previously updated $z$-layers, always reside “behind” the advancing wavefront, ensuring correctness under concurrent updates. This geometric construction underpins both spatial and temporal reuse (Malas et al., 2015, Malas et al., 2014).
2. Multi-Dimensional Intra-Tile Parallelization and Thread Groups
Traditional single-thread diamond blocking assigns an individual tile to each thread, but MWD introduces thread groups (TGs), typically sized to match cache and core constraints, that cooperatively update one tile. Three orthogonal forms of intra-tile parallelism are exploited:
- Division along $x$: threads handle contiguous $x$-slabs, optimizing cache-line stride and hardware-prefetch efficacy.
- Division across field components: for multi-component stencils (e.g., the $E$ and $H$ fields in FDFD electromagnetics (Malas et al., 2015)), threads are assigned subsets of the components.
- Optional division in $y$: partitioning the $y$-range among threads can alleviate load imbalance.
All multi-threaded updates within a tile synchronize on each $z$-layer via barriers. The end-to-end process is managed by a global FIFO of ready tiles, with dependencies scheduled dynamically:
```c
#pragma omp parallel
{
    /* Split the thread group: Tx threads along x, the rest across
     * field components. */
    int tx = omp_get_thread_num() % Tx;
    int tc = omp_get_thread_num() / Tx;

    for (int z = z0; z < z0 + Bz; ++z) {
        /* Half-diamond bounds in y shrink by one cell per z-layer. */
        int y_min = y0 - (Dw - 1) + (z - z0);
        int y_max = y0 + (Dw - 1) - (z - z0);
        for (int y = y_min; y <= y_max; ++y) {
            /* Contiguous x-slab per thread for streaming access. */
            int x_begin = x0 + tx * slab_size;
            int x_end   = x0 + (tx + 1) * slab_size;
            for (int x = x_begin; x < x_end; ++x)
                update_fields_component_group(x, y, z, tc);
        }
        #pragma omp barrier   /* synchronize the tile on each z-layer */
    }
}
```
3. Memory Traffic, Code Balance, and Cache-Footprint Models
A key metric in high-performance stencil codes is the code balance $B_c$ (bytes of memory traffic per lattice site update, LUP), governing the achievable memory-bound performance $P = b_S / B_c$, with $b_S$ as the available memory bandwidth.
MWD cache footprint models are formulated as:
- Cache size for a single extruded diamond:
$$C \approx w\,N_a\,N_x\,D_w\,(B_z + 2R)$$
for radius-$R$ stencils, where $N_x$ is the $x$-tile size, $N_a$ the number of domain-sized arrays, $w$ the word size in bytes, and $D_w$, $B_z$ the diamond width and wavefront depth (Malas et al., 2014, Malas et al., 2015).
- Code balance:
$$B_c(D_w) \approx B_c^{\min} + \frac{c}{D_w}$$
for a geometry-dependent constant $c$, which asymptotically decreases as $D_w$ increases, with a lower bound of $B_c^{\min}$, the compulsory traffic of one read and one write per lattice update.
MWD drives $B_c$ towards its theoretical minimum, attaining up to an order-of-magnitude reduction in bytes/LUP compared to spatial blocking (e.g., $200$–$400$ bytes/LUP vs. $1{,}216$ bytes/LUP for the FDFD spatial-blocked kernel (Malas et al., 2015)) and similarly drastic reductions for other stencil schemes (Malas et al., 2014, Malas et al., 2014). DRAM power consumption correlates nearly linearly with code balance (Malas et al., 2014).
4. End-to-End Parallel Workflow and Distributed Implementation
MWD accommodates hybrid MPI+OpenMP parallelism by decomposing the domain along one outer dimension (e.g., $z$), assigning sub-domain slabs and their halo layers to MPI ranks. Within each rank, OpenMP manages the tiled sweeps:
```c
MPI_Init(...);
#pragma omp parallel num_threads(P)
{
    #pragma omp single
    build_diamond_fifo();          /* seed FIFO with dependency-free tiles */

    while (/* not all tiles done */) {
        int tile_id = pop_fifo();  /* thread group claims a ready tile */
        /* In-tile sweep as above: */
        for (int z = z0; z < z0 + Bz; ++z) {
            for (int y = y_min(z); y <= y_max(z); ++y) {
                for (int x = x0; x < x0 + Nx; ++x) {
                    compute_H_updates(x, y, z);
                    compute_E_updates(x, y, z);
                }
            }
            #pragma omp barrier    /* group-wide sync per z-layer */
        }
        /* Retire the tile and release dependent neighbors. */
        for (/* each dependent neighbor tile nt */) {
            if (/* all of nt's dependencies are done */)
                push_fifo(nt);
        }
    }

    #pragma omp barrier
    #pragma omp master             /* one thread drives the halo exchange */
    {
        MPI_Isend(halo_layers, ...);
        MPI_Irecv(halo_layers, ...);
        MPI_Waitall(...);
    }
}
MPI_Finalize();
```
5. Practical Auto-Tuning for Parameter Selection
Efficient deployment of MWD requires judicious selection of tiling and thread group parameters. The practical auto-tuning strategy proceeds by:
- Cache-fit filter: discard combinations where the predicted tile cache footprint exceeds half of the L3 cache capacity.
- Parameter grid search: for the remaining options, benchmark a small set of TG shapes for update rate.
- Cost-model-guided refinement: finalize the configuration maximizing performance per memory access (minimizing $B_c$); prioritize larger $D_w$ where in-cache reuse can be maximized.
This three-step process identifies near-optimal configurations with only a few dozen trials, avoiding exhaustive full-space searches (Malas et al., 2015). Recommendations include maximizing $D_w$ as far as the cache-fit filter allows, keeping the number of concurrently available tiles at or above the number of thread groups for sufficient concurrency, tuning $N_x$ dynamically to structure leading-dimension access, and exploring TG sizes from one up to all available cores for balanced cache and parallel efficiency (Malas et al., 2014, Malas et al., 2014).
6. Performance, Energy, and Scalability Results
Experimental studies on Intel Haswell (18-core) and Ivy Bridge (10-core) platforms provide quantitative benchmarks:
- Pure spatial blocking saturates memory bandwidth at 6–8 cores (about $40$ MLUP/s).
- Single-threaded wavefront-diamond (SWD) shifts saturation but thrashes L3 cache and deteriorates beyond 12 cores.
- MWD with shared cache blocks never hits the memory bandwidth limit, maintains low code balance (200–400 bytes/LUP), and scales to all cores at high parallel efficiency.
- MWD achieves at least $3\times$ speedup versus the best spatial-blocked kernels on 18 cores, with DRAM bandwidth savings of 38–80% (Malas et al., 2015).
- Roofline analysis: MWD can approach cache-bound ceilings (e.g., $6.5$ GLUP/s in-L3 for the 7-point stencil) when the tile working set fits in cache; at large $D_w$, memory traffic is no longer the limiting factor (Malas et al., 2014).
- Energy measurements show that DRAM consumption drops near linearly with code balance; optimal energy-to-solution may not coincide with peak speed (e.g., 2WD fastest execution, 10WD lowest energy on the 7-point variable-coefficient stencil (Malas et al., 2014)).
Key empirical highlights on Intel Ivy Bridge:
| Variant | GLUP/s (7-pt const.) | GLUP/s (7-pt var.) | DRAM BW savings | DRAM energy (baseline → variant) |
|---|---|---|---|---|
| Spatial Block | 1.6 | 1.0 | Baseline | Baseline |
| 1WD | 4.3 | 2.55 | 30–68% | 57→22 pJ/LUP |
| 10WD | 3.8 | 1.15 | up to 80% | 77.5→70 pJ/LUP |
MWD scalability has also been demonstrated in distributed-memory setups (e.g., near-ideal scaling to 16 sockets for 2WD/5WD/10WD until surface-to-volume ratio effects intervene) (Malas et al., 2014).
7. Methodological Advantages and Limitations
MWD delivers a blend of spatial and temporal blocking, exploiting shared-cache residency and multi-threaded cooperation to decouple memory bandwidth from execution performance in bandwidth-bound stencil codes. The technique markedly reduces code balance (e.g., $5$–$8$ B/LUP vs. $24$–$128$ B/LUP baseline), saves DRAM power, and accommodates fine-grained concurrency essential for strong scaling under MPI.
However, overall speedup is ultimately capped by “in-cache” performance, barrier overhead in large TGs, and cache associativity/capacity conflicts at “edge” sizes. For small domains or many-core nodes with limited diamond concurrency, TG size must be chosen accordingly. Auto-tuning remains crucial for adapting to specific stencil types and architectures (Malas et al., 2014, Malas et al., 2014, Malas et al., 2015).
A plausible implication is further acceleration and energy gains on future architectures with more bandwidth-starved memory subsystems, since MWD effectiveness scales with cache/memory hierarchy granularity.
References:
- "Optimization of an electromagnetics code with multicore wavefront diamond blocking and multi-dimensional intra-tile parallelization" (Malas et al., 2015)
- "Towards energy efficiency and maximum computational intensity for stencil algorithms using wavefront diamond temporal blocking" (Malas et al., 2014)
- "Multicore-optimized wavefront diamond blocking for optimizing stencil updates" (Malas et al., 2014)