Slice-and-Sandwich Pipeline
- The slice-and-sandwich pipeline is a composite loop parallelization strategy that combines DSWP with backward slicing to expose fine-grained concurrency in complex loops.
- It partitions loops into pipeline stages and refines bottleneck stages into independent slices, optimizing execution on multicore systems.
- Empirical evaluations report up to 2.4× speedup over optimized sequential code, underlining its effectiveness for loops with irregular control flow and heavy function calls.
The slice-and-sandwich pipeline is a composite automatic parallelization strategy that enhances loop-level concurrency in programs with complex dependences by applying a two-step transformation: decoupled software pipelining (DSWP) followed by backward program slicing. This pipeline specifically addresses scenarios where classic loop parallelization techniques such as DOALL and DOACROSS fail to achieve significant speedups due to loop-carried dependences and function call overheads. By decomposing a complex loop into pipeline stages (DSWP) and then further partitioning the most time-consuming stage into finer, independent backward slices, the approach exposes additional parallelism, enabling superior utilization of multicore architectures and yielding measurable speedups over traditional DSWP alone (Alwan et al., 2015).
1. Core Components: DSWP and Backward Program Slicing
Decoupled Software Pipelining (DSWP) transforms the body of a single, potentially complex loop by partitioning its program dependence graph (PDG) into a sequence of stages using strongly connected components (SCCs). Each stage is mapped onto a separate software thread and operates in a pipelined fashion with lock-free producer-consumer queues. Data and control dependences that define the loop’s critical path remain within these stages, enabling concurrent execution without violating program semantics. DSWP is particularly suited to loops with irregular control flow, pointer-based data structures, or heavy function calls—cases where DOALL (requiring no loop-carried dependences) and DOACROSS (relying on fine-grained synchronization) are inapplicable or inefficient.
Backward Program Slicing extracts, for a given slicing criterion (p, V)—where p is a program point and V is a set of variables—the minimal set of statements affecting the value of any v ∈ V at p, under data or control dependence. Applied to a complex pipeline stage that produces several outputs, backward slicing yields independent sub-slices, each computing a portion of that stage’s outputs, forming new threads that can execute concurrently as consumers or producers in the pipeline.
2. Algorithmic Framework and Workflow
The slice-and-sandwich pipeline is implemented via the following sequence:
- Static Cost Estimation: Compute instruction and function call latencies for every loop.
- Loop Selection: Identify the loop with maximal estimated cost that contains at least one expensive function call and is not amenable to DOALL.
- PDG Construction: Build the program dependence graph (PDG) for the selected loop.
- SCC-DAG Extraction: Collapse the PDG into a directed acyclic graph of SCCs.
- Pipeline Partitioning: Partition the SCC-DAG into k stages (S_1, …, S_k), balancing cost per stage; assign one DSWP thread per stage.
- Bottleneck Identification: Locate the stage S_max with maximal cost.
- Backward Slicing: Apply static intra-procedural slicing to S_max, extracting m independent slices (sl_1, …, sl_m), each driven by different output variables.
- Pipeline Augmentation: Substitute S_max with the m new slice threads; introduce lock-free FIFO queues between pipeline segments as necessary.
- Code Generation: Emit the parallel code with appropriate queueing, thread creation, and synchronization.
The backward slicing step is formalized via recursive traversal of PDG nodes, initiated from S_max's exit point and a set of slicing variables, marking instructions and following control/data dependences. Post-processing eliminates duplicate slices and reintegrates required control predicates.
3. Performance Metrics and Speedup Analysis
The throughput of a k-stage DSWP pipeline is bounded by its slowest stage: the steady-state time per iteration is max(c_1, …, c_k), where c_i is the cost of stage i. Pipeline latency to first output is c_1 + … + c_k plus pipeline fill/drain overhead.
Key metrics include:
- T_seq: Execution time of the optimized sequential loop.
- T_DSWP: Execution time with DSWP (k stages).
- T_DSWP+slice: Execution time with the DSWP+slice pipeline (k − 1 + m threads).
- Speedup Metrics: S_DSWP = T_seq / T_DSWP and S_DSWP+slice = T_seq / T_DSWP+slice.
Empirical results show that DSWP+slice achieves a typical 1.6× speedup over DSWP alone, and up to 2.4× over the sequential baseline. This demonstrates that the additional slicing step provides up to 60% improvement beyond what DSWP would yield on its own.
4. Experimental Infrastructure and Benchmarks
The experimental evaluation employed:
- Hardware: Intel Core i7-870 (4 physical cores, 8 hardware threads), L1i/L1d 32 KB each, L2 256 KB, L3 8 MB, 4 GB RAM.
- Compiler Infrastructure: LLVM-based frontend, IR transformations, custom DSWP+slice passes, and comparisons with gcc-4.x manual code.
- Benchmark Kernels: Five benchmarks, including artificial list-of-lists traversals (linkedlist2.c, linkedlist3.c), Fast Fourier Transform (fft.c), numerical computation (pro_2.4.c), and spherical harmonics (test0697.c). Loop sizes and iteration counts varied to expose DSWP and inter-thread communication overheads.
- Measurement Methodology: Execution time of hot loop bodies (all thread operations inlined), average of 10 runs per configuration, outliers removed, lock-free queue sizes chosen empirically to prevent producer/consumer stalls (100–1000 entries).
Representative results are summarized below.
| Kernel | LLVM-seq Time | LLVM-DSWP Time | LLVM-DSWP+slice Time | DSWP Speedup | DSWP+slice Speedup |
|---|---|---|---|---|---|
| linkedlist2.c | 1.70 ms | 1.01 ms | 0.91 ms | 1.7× | 1.9× (1.1× over DSWP) |
| fft.c | 5.47 ms | 5.39 ms | 3.01 ms | 1.02× | 1.8× (1.8× over DSWP) |
| test0697.c | 0.52 ms | 0.36 ms | 0.27 ms | 1.45× | 1.9× (1.33× over DSWP) |
5. Applicability and Limitations
The slice-and-sandwich pipeline is particularly effective under the following conditions:
- The target loop cannot be parallelized via DOALL due to intrinsic loop-carried dependences.
- DSWP produces an unbalanced pipeline, typically with one stage (e.g., containing expensive function calls) dominating overall execution time.
- The dominant DSWP stage naturally yields multiple, independent outputs amenable to slicing.
However, there are notable limitations:
- The overhead of inter-thread communication using lock-free queues may outweigh performance gains if slices are too fine-grained or require high-frequency data transfer.
- Static backward slicing is conservative and may include superfluous instructions, potentially limiting achievable parallelism compared to dynamic or speculative slicing.
- Backward slicing does not address loop transformations or removal of loop-carried dependences; further speedups could potentially be realized by combining slicing with unrolling or speculative techniques.
6. Synthesis and Impact
The slice-and-sandwich pipeline advances automatic parallelization methodologies by augmenting DSWP with backward slicing, restructuring an unbalanced pipeline into a more fine-grained, load-balanced, multithreaded execution. In cases where traditional loop transformations are insufficient—specifically loops that resist DOALL and produce pipeline bottlenecks—this method provides up to 1.6× additional speedup beyond DSWP, and up to 2.4× over a highly optimized sequential implementation. Fully automatic implementations in frameworks such as LLVM demonstrate practical applicability to both synthetic and scientific kernels, effectively scaling real-world programs across multicore architectures (Alwan et al., 2015).