Sharding and FAA in Concurrent Stacks
- Sharding and fetch&increment (FAA) are mechanisms that partition operations in concurrent stacks, enabling efficient local elimination and minimizing global contention.
- SEC divides threads into shards where batched push/pop operations are coordinated via FAA-assigned sequence numbers for effective elimination and combining.
- Experimental results indicate SEC achieves up to 2.5× throughput improvement by reducing costly CAS operations and optimizing parallel batch processing.
Sharding and fetch&increment (FAA) are central design mechanisms in highly efficient concurrent stack implementations, especially in the context of high-contention, many-threaded systems. The SEC (Sharded Elimination and Combining) stack exemplifies the integration of these two techniques to mitigate the bottlenecks inherent in classic concurrent stacks such as the Treiber stack, particularly the contention on the global top pointer. By partitioning contention, orchestrating operations in synchronized batches, and utilizing atomic fetch&increment in place of more costly synchronization, SEC achieves near-linear scalability and significantly outperforms previous concurrent stack algorithms in a wide variety of multi-core environments (Singh et al., 8 Jan 2026).
1. System Architecture and Operation
SEC divides the entire stack system into s independent aggregators or shards. Each aggregator exclusively serves a set of threads, determined by round-robin assignment or NUMA-aware binding. Threads within a shard participate in forming batches of push/pop operations, coordinated by a single “freezer” thread per batch. The batch mechanism orchestrates operation synchronization, enforcing collective progression through announcement via FAA counters and batch “freezing.” Once frozen, a batch is divided into two distinct operational paths: elimination and combining.
- Elimination: Equal numbers of push and pop requests (up to the smaller of the batch’s push and pop counts) are paired and resolved locally within the shard, never inducing contention on the shared stack top.
- Combining: Any surplus (unmatched) operations within a batch are executed by a “combiner” thread, which splices or removes entire sublists from the shared stack via a single CAS operation.
Through this multi-aggregator, batched architecture with FAA-based counting and maximum local elimination, SEC disperses contention and minimizes interference with the global stack, effectively bounding the scalability limitations of earlier approaches (Singh et al., 8 Jan 2026).
2. Integration and Semantics of Fetch&Increment
Fetch&increment (FAA), implemented as fetch_inc, is used in both push and pop operations to atomically acquire a unique sequence number within a batch and to coordinate the inclusion of each operation in the batch. The critical properties and usage instances are as follows:
- Push: each invocation calls fetch_inc(&B.pushCount); the return value determines the operation’s batch sequence slot and position in the elimination array.
- Pop: analogously, fetch_inc(&B.popCount) is invoked to reserve a sequence position for the pop.
FAA eliminates the need for CAS-based admission or elimination in the common case; at most two FAAs per operation are needed. The sequence numbers directly support elimination (pairwise cancellation of matching sequence indices) and establish the boundaries for combining, as determined during freezing. Batch singularity is maintained by an atomic test-and-set on a batch flag, ensuring exactly one freezer per batch (Singh et al., 8 Jan 2026).
3. Batch-Level Elimination and Combining
SEC’s elimination process occurs strictly within a batch. After all batch members have announced via FAA, the freezer snapshots the push and pop counters, determining the maximal number of eliminatable pairs. Each push-pop pair can remove itself from contention with the global stack. The remaining surplus of a single operation type (push or pop) is handled by the combiner—actions include bulk-adding a chain of nodes to the shared stack (for push surplus) or bulk-removal of nodes (for pop surplus), each performed with a single CAS. This mechanism, combined with sharding, ensures that:
- Most operations are eliminated locally (70–85% in observed workloads), reducing CAS pressure and pointer traffic to the shared stack.
- The number of operations that modify the global top pointer is sharply reduced.
- Multiple aggregators allow for parallel combiners, maximizing throughput by overlapping bulk updates (Singh et al., 8 Jan 2026).
4. Performance Modeling and Experimental Results
Measured throughput in SEC scales nearly linearly with the number of shards until the per-shard thread count drops below a threshold at which batching benefits diminish. Empirically, throughput is bounded on one side by system-level coherence and CAS-retry penalties and on the other by per-shard contention/load factors. For over-subscribed systems (e.g., 56 threads on a 2-socket, 24-core machine), the optimal shard count is often 2–3, balancing per-shard contention against batch size. Increasing the shard count excessively fragments the workload, diminishing batch size and the benefits of elimination and combining (Singh et al., 8 Jan 2026).
Experimental highlights include:
- Under balanced update workloads, SEC attains up to 2.5× the throughput of elimination-backoff or flat-combining stacks.
- Push-only scenarios still exceed the speed of the Treiber stack and of time-stamped stacks.
- FAA local elimination covers 70–85% of operations; only 15–30% reach the global pointer, and those do so in bulk.
- For 56 threads, a shard count of 2–3 produces a speed-up over a single shard.
- Sharding beyond this point reduces overall throughput due to insufficient batch formation (Singh et al., 8 Jan 2026).
5. Complexity and Linearizability Analysis
Let n denote the number of threads per shard. Per-operation complexity components are:
- FAA announcements: O(1) amortized per thread (assuming bounded contention).
- Freezing: O(1), as only one thread per batch performs it; others synchronize via spin-wait.
- Elimination/combining: constant work per thread by design (each thread inspects simple counters), while the combiner loops over a bounded batch segment of at most n operations.
- Bulk CAS: a single CAS at the combiner; other threads spin-wait.
The stack is blocking (not lock-free): progress is guaranteed unless all threads stall before combining is performed. In pathological scenarios, a thread may retry indefinitely, yielding unbounded worst-case step complexity in theory, but practical starvation is prevented under standard contention profiles (Singh et al., 8 Jan 2026).
Memory overhead scales with the shard count: each shard maintains batch objects whose state is proportional to its thread count, so total system state grows with the product of shard count and threads per shard. Linearizability is explicitly maintained:
- Eliminated pairs: linearize upon pop’s completion (array read).
- Combined operations: linearize at the successful global CAS.
6. Tuning and Deployment Considerations
Practical deployment is guided by several empirically grounded rules:
- Shard count (s): should match the number of NUMA sockets, or remain a small constant on single-socket, many-core CPUs.
- Thread-to-shard mapping: Round-robin assignment suffices generally; NUMA awareness can further enhance performance.
- FAA configuration: Requires no parameterization; standard atomic FAA suffices.
- Batch “freeze” timing: an optional micro-pause prior to freezing can slightly inflate batch size and elimination rate but is nonessential for achieving best-in-class performance.
- Combiner parallelism: Unlike flat-combining, SEC supports one combiner per shard and batch, leveraging parallel bulk modification and further reducing contention (Singh et al., 8 Jan 2026).
| Parameter | Optimization Strategy | Impact |
|---|---|---|
| Shard count | Match NUMA socket count | Maximizes locality, minimizes contention |
| Thread mapping | Round-robin/NUMA binding | Enhances batching/elimination |
| Batch freeze | Optional short backoff | Slightly enhances batch size |
SEC’s architectural innovations, rooted in sharding and FAA, constitute a distinct advancement in concurrent stack performance under contention, specifically by separating local coordination from global structure and leveraging parallel batch operations. These characteristics are thoroughly evaluated and detailed in (Singh et al., 8 Jan 2026).