
Slice-and-Sandwich Pipeline

Updated 20 January 2026
  • Slice-and-sandwich pipeline is a composite loop parallelization strategy that combines DSWP and backward slicing to expose fine-grained concurrency in complex loops.
  • It partitions loops into pipeline stages and refines bottleneck stages into independent slices, optimizing execution on multicore systems.
  • Empirical evaluations report up to 2.4× speedup over optimized sequential code, underlining its effectiveness for loops with irregular control flow and heavy function calls.

The slice-and-sandwich pipeline is a composite automatic parallelization strategy that enhances loop-level concurrency in programs with complex dependences by applying a two-step transformation: decoupled software pipelining (DSWP) followed by backward program slicing. This pipeline specifically addresses scenarios where classic loop parallelization techniques such as DOALL and DOACROSS fail to achieve significant speedups due to loop-carried dependences and function call overheads. By decomposing a complex loop into pipeline stages (DSWP) and then further partitioning the most time-consuming stage into finer, independent backward slices, the approach exposes additional parallelism, enabling superior utilization of multicore architectures and yielding measurable speedups over traditional DSWP alone (Alwan et al., 2015).

1. Core Components: DSWP and Backward Program Slicing

Decoupled software pipelining (DSWP) transforms the body of a single, potentially complex loop by partitioning its program dependence graph (PDG) into a sequence of stages using strongly connected components (SCCs). Each stage is mapped onto a separate software thread and operates in a pipelined fashion with lock-free producer-consumer queues. Data and control dependences that define the loop’s critical path remain within these stages, enabling concurrent execution without violating program semantics. DSWP is particularly suited to loops with irregular control flow, pointer-based data structures, or heavy function calls—cases where DOALL (requiring no loop-carried dependences) and DOACROSS (relying on fine-grained synchronization) are inapplicable or inefficient.
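The producer-consumer structure of such a pipeline can be sketched with ordinary threads and a bounded queue. This is a minimal illustration only: `queue.Queue` stands in for the lock-free queues of a real DSWP implementation, and the squaring step stands in for an expensive function call.

```python
import threading
from queue import Queue

def dswp_pipeline(items):
    """Two-stage DSWP-style pipeline sketch: stage 1 walks the input
    (the loop-carried part), stage 2 performs the heavy per-item work."""
    q = Queue(maxsize=128)          # bounded buffer between the stages
    results = []

    def stage1():                   # producer: sequential traversal
        for x in items:
            q.put(x)
        q.put(None)                 # sentinel marks end of stream
                                    # (assumes None is not a real item)
    def stage2():                   # consumer: independent heavy work
        while (x := q.get()) is not None:
            results.append(x * x)   # stand-in for an expensive call

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

Because the single consumer drains a FIFO queue, output order matches input order, mirroring how DSWP preserves sequential semantics across stages.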

Backward Program Slicing extracts, for a given slicing criterion (n, V), where n is a program point and V is a set of variables, the minimal set of statements affecting the value of any v ∈ V at n, under data or control dependence. Applied to a complex pipeline stage that produces several outputs, backward slicing yields independent sub-slices, each computing a portion of that stage’s outputs, forming new threads that can execute concurrently as consumers or producers in the pipeline.
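The core of backward slicing is a reachability computation over the PDG. A minimal sketch, assuming the PDG is given as an adjacency map from each node to the nodes it depends on (data and control dependences merged); the statement labels s1–s5 in the example are hypothetical:

```python
def backward_slice(pdg, criterion_nodes):
    """Backward slice over a PDG given as {node: set of nodes it
    depends on}. Returns all statements that can affect the criterion."""
    marked, work = set(), list(criterion_nodes)
    while work:
        n = work.pop()
        if n in marked:
            continue
        marked.add(n)
        work.extend(pdg.get(n, ()))   # follow dependences backward
    return marked

# Hypothetical stage body:
#   s1: a = read();  s2: b = read()
#   s3: c = a + 1;   s4: d = b * 2
#   s5: use(c)       (one of several stage outputs)
pdg = {"s3": {"s1"}, "s4": {"s2"}, "s5": {"s3"}}
```

Here `backward_slice(pdg, ["s5"])` yields {s1, s3, s5} while the slice for s4 yields {s2, s4}: two disjoint slices that can run as separate threads, which is exactly the refinement the pipeline applies to a bottleneck stage.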

2. Algorithmic Framework and Workflow

The slice-and-sandwich pipeline is implemented via the following sequence:

  1. Static Cost Estimation: Compute instruction and function call latencies for every loop.
  2. Loop Selection: Identify the loop with the maximal estimated cost that contains at least one expensive function call and is not amenable to DOALL.
  3. PDG Construction: Build the program dependence graph (PDG) for the selected loop.
  4. SCC-DAG Extraction: Collapse the PDG into a directed acyclic graph of SCCs.
  5. Pipeline Partitioning: Partition the SCC-DAG into k stages (k ≤ number of cores), balancing cost per stage; assign one DSWP thread per stage.
  6. Bottleneck Identification: Locate the stage S* with maximal cost.
  7. Backward Slicing: Apply static intra-procedural slicing to S*, extracting m independent slices S_1, …, S_m, each driven by different output variables.
  8. Pipeline Augmentation: Substitute S* with the m new threads; introduce lock-free FIFO queues between pipeline segments as necessary.
  9. Code Generation: Emit the parallel code with appropriate queueing, thread creation, and synchronization.

The backward slicing step is formalized via recursive traversal of PDG nodes, initiated from S*'s exit point and a set of slicing variables, marking instructions and following control/data dependences. Post-processing eliminates duplicate slices and reintegrates required control predicates.
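Step 5 of the workflow, cost-balanced stage partitioning, can be sketched as a greedy split of a topologically ordered SCC-DAG into contiguous stages. The function and its inputs are illustrative assumptions, not the paper's exact algorithm:

```python
def partition_stages(scc_costs, k):
    """Greedily partition a topologically ordered SCC-DAG into at most
    k contiguous pipeline stages, keeping per-stage cost near total/k.
    scc_costs: list of (scc_id, cost) pairs in topological order."""
    total = sum(cost for _, cost in scc_costs)
    target = total / k                  # ideal per-stage cost
    stages, cur, cur_cost = [], [], 0.0
    for scc, cost in scc_costs:
        # Close the current stage when adding this SCC would overshoot
        # the target, unless we have already used our stage budget.
        if cur and cur_cost + cost > target and len(stages) < k - 1:
            stages.append(cur)
            cur, cur_cost = [], 0.0
        cur.append(scc)
        cur_cost += cost
    stages.append(cur)
    return stages
```

For example, `partition_stages([("A", 4), ("B", 1), ("C", 1), ("D", 4)], 2)` splits the DAG into two stages of equal cost 5, the balance condition that keeps pipeline throughput from being dominated by one stage.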

3. Performance Metrics and Speedup Analysis

The throughput of a k-stage DSWP pipeline is bounded by its slowest stage: T_p = 1 / max_i(C_i), where C_i is the cost of stage i. Pipeline latency to first output is L_p = C_1 + … + C_k plus pipeline fill/drain overhead.

Key metrics include:

  • T_seq: Execution time of the optimized sequential loop.
  • T_d: Execution time with DSWP (k stages).
  • T_s: Execution time with the DSWP+slice pipeline (k + m − 1 threads).
  • Speedup Metrics:
    • Speedup_DSWP = T_seq / T_d
    • Speedup_slice = T_seq / T_s
    • Speedup_additive = T_d / T_s = (T_seq / T_s) / (T_seq / T_d) = Speedup_slice / Speedup_DSWP

Empirical results show that DSWP+slice achieves a typical Speedup_additive ≈ 1.6 over DSWP alone, and Speedup_slice ≈ 2.4 over the sequential baseline. This demonstrates that the additional slicing stage provides up to 60% improvement beyond what DSWP would yield on its own.
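These metrics follow directly from the three measured times. A small worked example, with illustrative times chosen to reproduce the reported ≈1.6× additive and ≈2.4× overall figures (not measurements from the paper):

```python
def speedups(t_seq, t_dswp, t_slice):
    """Compute the three speedup metrics from measured loop times."""
    s_dswp = t_seq / t_dswp            # DSWP over sequential
    s_slice = t_seq / t_slice          # DSWP+slice over sequential
    s_additive = t_dswp / t_slice      # equals s_slice / s_dswp
    return s_dswp, s_slice, s_additive

# Illustrative times (ms): sequential 2.4, DSWP 1.6, DSWP+slice 1.0
s_dswp, s_slice, s_additive = speedups(2.4, 1.6, 1.0)
```

With these values, Speedup_DSWP = 1.5, Speedup_slice = 2.4, and Speedup_additive = 1.6, showing how the additive figure isolates the slicing step's contribution from DSWP's.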

4. Experimental Infrastructure and Benchmarks

The experimental evaluation employed:

  • Hardware: Intel Core i7-870 (4 physical cores, 8 hardware threads), L1i/L1d 32 KB each, L2 256 KB, L3 8 MB, 4 GB RAM.
  • Compiler Infrastructure: LLVM-based frontend, IR transformations, custom DSWP+slice passes, and comparisons with gcc-4.x manual code.
  • Benchmark Kernels: Five benchmarks, including artificial list-of-lists traversals (linkedlist2.c, linkedlist3.c), Fast Fourier Transform (fft.c), numerical computation (pro_2.4.c), and spherical harmonics (test0697.c). Loop sizes and iteration counts varied to expose DSWP and inter-thread communication overheads.
  • Measurement Methodology: Execution time of hot loop bodies (all thread operations inlined), average of 10 runs per configuration, outliers removed, lock-free queue sizes chosen empirically to prevent producer/consumer stalls (100–1000 entries).

Representative results are summarized below.

Kernel          LLVM-seq   LLVM-DSWP   LLVM-DSWP+slice   DSWP Speedup   DSWP+slice Speedup
linkedlist2.c   1.70 ms    1.01 ms     0.91 ms           1.7×           1.9× (1.1× over DSWP)
fft.c           5.47 ms    5.39 ms     3.01 ms           1.02×          1.8× (1.8× over DSWP)
test0697.c     0.52 ms    0.36 ms     0.27 ms           1.45×          1.9× (1.33× over DSWP)

5. Applicability and Limitations

The slice-and-sandwich pipeline is particularly effective under the following conditions:

  • The target loop cannot be parallelized via DOALL due to intrinsic loop-carried dependences.
  • DSWP produces an unbalanced pipeline, typically with one stage (e.g., containing expensive function calls) dominating overall execution time.
  • The dominant DSWP stage naturally yields multiple, independent outputs amenable to slicing.

However, there are notable limitations:

  • The overhead of inter-thread communication using lock-free queues may outweigh performance gains if slices are too fine-grained or require high-frequency data transfer.
  • Static backward slicing is conservative and may include superfluous instructions, potentially limiting achievable parallelism compared to dynamic or speculative slicing.
  • Backward slicing does not address loop transformations or removal of loop-carried dependences; further speedups could potentially be realized by combining slicing with unrolling or speculative techniques.

6. Synthesis and Impact

The slice-and-sandwich pipeline advances automatic parallelization methodology by augmenting DSWP with backward slicing, restructuring an unbalanced pipeline into a more fine-grained, load-balanced, multithreaded execution. In cases where traditional loop transformations are insufficient, specifically loops that resist DOALL and produce pipeline bottlenecks, this method provides up to 1.6× additional speedup beyond DSWP, and up to 2.4× over a highly optimized sequential implementation. Fully automatic implementations in frameworks such as LLVM demonstrate practical applicability to both synthetic and scientific kernels, effectively scaling real-world programs across multicore architectures (Alwan et al., 2015).
