
Sharding and FAA in Concurrent Stacks

Updated 15 January 2026
  • Sharding and fetch&increment (FAA) are complementary mechanisms in concurrent stacks: sharding partitions threads to localize contention, while FAA sequences operations to enable efficient local elimination.
  • SEC divides threads into shards where batched push/pop operations are coordinated via FAA-assigned sequence numbers for effective elimination and combining.
  • Experimental results indicate SEC achieves up to 2.5× throughput improvement by reducing costly CAS operations and optimizing parallel batch processing.

Sharding and fetch&increment (FAA) are central design mechanisms in highly efficient concurrent stack implementations, especially in the context of high-contention, many-threaded systems. The SEC (Sharded Elimination and Combining) stack exemplifies the integration of these two techniques to mitigate the bottlenecks inherent in classic concurrent stacks such as the Treiber stack, particularly the contention on the global top pointer. By partitioning contention, orchestrating operations in synchronized batches, and utilizing atomic fetch&increment in place of more costly synchronization, SEC achieves near-linear scalability and significantly outperforms previous concurrent stack algorithms in a wide variety of multi-core environments (Singh et al., 8 Jan 2026).

1. System Architecture and Operation

SEC divides the entire stack system into s independent aggregators or shards. Each aggregator exclusively serves a set of threads, determined by round-robin assignment or NUMA-aware binding (e.g., A ← agg[tid mod s]). Threads within a shard participate in forming batches of push/pop operations, coordinated by a single “freezer” thread per batch. The batch mechanism orchestrates operation synchronization, enforcing collective progression through announcement via FAA counters and batch “freezing.” Once frozen, a batch is divided into two distinct operational paths: elimination and combining.

  • Elimination: Matched numbers of push and pop requests (up to min(P, Q) for P pushes and Q pops) are paired and resolved locally within the shard, never inducing contention on the shared stack top.
  • Combining: Any surplus (unmatched) operations within a batch are executed by a “combiner” thread, which splices or removes entire sublists from the shared stack via a single CAS operation.
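
The sharded layout described above can be sketched as follows. This is a minimal illustration, not the paper's code: the struct and function names (Batch, Aggregator, shard_for) are assumptions chosen to mirror the text.

```cpp
#include <atomic>
#include <cstddef>

// Illustrative sketch of SEC's sharded layout: each aggregator (shard)
// owns a batch whose FAA counters hand out per-operation sequence numbers.
struct Batch {
    std::atomic<int> pushCount{0};   // FAA-assigned push sequence numbers
    std::atomic<int> popCount{0};    // FAA-assigned pop sequence numbers
    std::atomic<bool> frozen{false}; // set once by the single freezer thread
};

struct Aggregator {
    Batch batch; // the currently open batch for this shard
};

constexpr std::size_t kShards = 4; // s, the shard count
Aggregator agg[kShards];

// Round-robin thread-to-shard mapping: A <- agg[tid mod s].
Aggregator& shard_for(std::size_t tid) {
    return agg[tid % kShards];
}
```

With round-robin mapping, threads 0 and 4 land on shard 0, threads 1 and 5 on shard 1, and so on; a NUMA-aware variant would instead bind threads to the aggregator on their local socket.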

Through this multi-aggregator, batched architecture with FAA-based counting and maximum local elimination, SEC disperses contention and minimizes interference with the global stack, effectively bounding the scalability limitations of earlier approaches (Singh et al., 8 Jan 2026).

2. Integration and Semantics of Fetch&Increment

Fetch&increment (FAA), implemented as fetch_inc, is used in both push and pop operations to atomically acquire a unique sequence number within a batch and to coordinate the inclusion of each operation in the batch. The critical properties and usage instances are as follows:

  • Push: Each invocation calls fetch_inc(&B.pushCount), with the return value determining the operation’s batch sequence slot and position in the elimination array.
  • Pop: Analogously, fetch_inc(&B.popCount) is invoked to reserve a sequence position for the pop.

FAA eliminates the need for CAS-based admission or elimination in the common case; only two FAA per operation are necessary. The sequence numbers directly support elimination (pairwise cancellation for identical sequence indices) and establish boundaries for combining operations, as determined during freezing (B.pushCountAtFreeze and B.popCountAtFreeze). Batch singularity is maintained by atomic test-and-set on a batch flag to ensure exactly one freezer per batch (Singh et al., 8 Jan 2026).
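
The announcement and freezing steps can be sketched with standard C++ atomics. The shapes below are assumptions for illustration (the paper's fetch_inc maps to fetch_add here): one FAA per announcement, and a test-and-set race that crowns exactly one freezer, which then snapshots both counters.

```cpp
#include <atomic>

// Assumed batch shape mirroring the text's field names.
struct Batch {
    std::atomic<int> pushCount{0};
    std::atomic<int> popCount{0};
    std::atomic_flag freezerClaimed = ATOMIC_FLAG_INIT;
    int pushCountAtFreeze = 0;
    int popCountAtFreeze = 0;
};

// Push announcement: the returned sequence number is this operation's
// slot in the shard's elimination array.
int announce_push(Batch& b) {
    return b.pushCount.fetch_add(1, std::memory_order_acq_rel);
}

// Pop announcement, symmetric to push.
int announce_pop(Batch& b) {
    return b.popCount.fetch_add(1, std::memory_order_acq_rel);
}

// Freezing: the first thread to win the test-and-set becomes the freezer
// and snapshots both counters; operations with
// seq < min(pushCountAtFreeze, popCountAtFreeze) are eliminable pairs.
bool try_freeze(Batch& b) {
    if (b.freezerClaimed.test_and_set(std::memory_order_acq_rel))
        return false; // another thread is already the freezer
    b.pushCountAtFreeze = b.pushCount.load(std::memory_order_acquire);
    b.popCountAtFreeze  = b.popCount.load(std::memory_order_acquire);
    return true;
}
```

Note that both announcements and the freeze claim are single atomic RMW instructions; no CAS retry loop is involved on this path.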

3. Batch-Level Elimination and Combining

SEC’s elimination process occurs strictly within a batch. After all batch members have announced via FAA, the freezer snapshots the push and pop counters, determining the maximal number of eliminatable pairs. Each push-pop pair (seq < min(B.pushCountAtFreeze, B.popCountAtFreeze)) can remove itself from contention with the global stack. The remaining surplus of a single operation type (push or pop) is handled by the combiner—actions include bulk-adding a chain of nodes to the shared stack (for push surplus) or bulk-removal of nodes (for pop surplus), each performed with a single CAS. This mechanism, combined with sharding, ensures that:

  • Most operations are eliminated locally (70–85% in observed workloads), reducing CAS pressure and pointer traffic to the shared stack.
  • The number of operations that modify the global top pointer is sharply reduced.
  • Multiple aggregators allow for parallel combiners, maximizing throughput by overlapping bulk updates (Singh et al., 8 Jan 2026).
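
The combiner's single-CAS bulk splice can be sketched as below. The node type and function name are illustrative assumptions: the surplus pushes are pre-linked into a private chain, and only one compare-exchange on the shared top is needed regardless of the chain's length.

```cpp
#include <atomic>

// Treiber-style node; the shared stack is a singly linked list under `top`.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> top{nullptr};

// Splice the pre-linked chain [head .. tail] onto the global stack.
// Only this one CAS touches the shared top pointer, so a surplus of k
// pushes costs one global update instead of k.
void bulk_push(Node* head, Node* tail) {
    Node* old = top.load(std::memory_order_relaxed);
    do {
        tail->next = old; // hook the chain's end onto the current top
    } while (!top.compare_exchange_weak(old, head,
                                        std::memory_order_release,
                                        std::memory_order_relaxed));
}
```

A pop-surplus combiner would be symmetric: read the top, walk k nodes down the list privately, and detach the whole prefix with one CAS to the k-th successor.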

4. Performance Modeling and Experimental Results

Measured throughput in SEC scales nearly linearly with the number of shards s until the per-shard thread count n/s drops to a threshold below which batching benefits diminish. The empirical throughput model is given by:

T(n, s, c) ≈ n / (1 + α (n/s) c)

with α reflecting system-level coherence and CAS-retry penalties, and c representing contention/load factors per shard. For over-subscribed systems (e.g., 56 threads on a 2-socket, 24-core machine), optimal s is often 2–3, balancing per-shard contention and batch size. Increasing s excessively can fragment workload, diminishing batch size and the benefits of elimination and combining (Singh et al., 8 Jan 2026).
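
The model is easy to evaluate numerically. The sketch below uses placeholder values for α and c (chosen for illustration, not taken from the paper) to show how growing s shrinks the per-shard contention term α(n/s)c; note the model alone is monotone in s and does not capture the batch-fragmentation penalty that caps the useful shard count in practice.

```cpp
#include <cmath>

// T(n, s, c) = n / (1 + alpha * (n/s) * c), the empirical throughput model.
double throughput(double n, double s, double alpha, double c) {
    return n / (1.0 + alpha * (n / s) * c);
}
```

For example, with n = 56 and assumed values alpha = 0.01, c = 1.0, the model gives T ≈ 35.9 for s = 1 versus T = 43.75 for s = 2, illustrating the gain from halving the per-shard load.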

Experimental highlights include:

  • Under balanced update workloads, SEC attains up to 2.5× the throughput of elimination-backoff or flat-combining stacks.
  • Push-only scenarios yield 2× the speed of Treiber and 6× that of time-stamped stacks.
  • FAA local elimination covers 70–85% of operations; only 15–30% reach the global pointer, and those do so in bulk.
  • For 56 threads, s = 2 produces a 2.3× speed-up over s = 1.
  • Sharding beyond s = 4 reduces overall throughput due to insufficient batch formation (Singh et al., 8 Jan 2026).

5. Complexity and Linearizability Analysis

Let P = n/s denote threads per shard. Complexity components per operation are:

  • FAA announcements: O(1) amortized per thread (assuming bounded contention).
  • Freezing: O(1), as only one thread per batch performs it; others synchronize via spin-wait.
  • Elimination/combining: O(1) by design: each thread inspects simple counters; combiners loop over bounded batch segments.
  • Bulk CAS: O(1) at the combiner; other threads spin-wait.

The stack is blocking (not lock-free): progress is guaranteed unless all threads stall before combining is performed. In pathological scenarios, a thread may in theory retry indefinitely, but practical starvation is prevented under standard contention profiles (Singh et al., 8 Jan 2026).

Memory overhead scales as O(n): per shard, batch objects have O(P) state, and total system state is O(sP) = O(n). Linearizability is explicitly maintained:

  • Eliminated pairs: linearize upon pop’s completion (array read).
  • Combined operations: linearize at the successful global CAS.

6. Tuning and Deployment Considerations

Practical deployment is guided by several empirically grounded rules:

  1. Shard count (s): Should match the number of NUMA sockets, or 2 ≤ s ≤ 4 on single-socket, many-core CPUs.
  2. Thread-to-shard mapping: Round-robin assignment suffices generally; NUMA awareness can further enhance performance.
  3. FAA configuration: Requires no parameterization; standard atomic FAA suffices.
  4. Batch “freeze” timing: Optional micro-pause (pause()) prior to freezing can slightly inflate batch size and elimination rate but is nonessential for achieving best-in-class performance.
  5. Combiner parallelism: Unlike flat-combining, SEC supports one combiner per shard and batch, leveraging parallel bulk modification and further reducing contention (Singh et al., 8 Jan 2026).
| Parameter | Optimization Strategy | Impact |
|---|---|---|
| Shard count | 2 ≤ s ≤ 4, or ≈ number of NUMA sockets | Maximizes locality, minimizes contention |
| Thread mapping | Round-robin/NUMA binding | Enhances batching/elimination |
| Batch freeze | Optional short backoff | Slightly enhances batch size |
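
A hypothetical helper encoding the shard-count rules above (the thresholds come from the empirical guidance in the text; the function itself is an assumption, not part of SEC):

```cpp
#include <algorithm>

// Rule of thumb: match the NUMA socket count, but stay within the
// empirically good 2..4 range, since s > 4 starves batch formation.
int choose_shard_count(int numaSockets) {
    return std::clamp(numaSockets, 2, 4);
}
```

So a single-socket machine gets s = 2, a 3-socket machine gets s = 3, and larger socket counts are capped at 4 to keep batches large enough for elimination.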

SEC’s architectural innovations, rooted in sharding and FAA, constitute a distinct advancement in concurrent stack performance under contention, specifically by separating local coordination from global structure and leveraging parallel batch operations. These characteristics are thoroughly evaluated and detailed in (Singh et al., 8 Jan 2026).
