- The paper introduces the Nezha protocol, which uses deadline-ordered multicast with synchronized clocks to achieve a median throughput improvement of 5.4x over the evaluated baselines.
- It incorporates speculative execution and stateless proxies to minimize client latency and distribute processing load effectively.
- Empirical evaluations show that Nezha outperforms baselines such as Multi-Paxos, Raft, and NOPaxos in throughput, making consensus practical in tenant-driven cloud environments.
The paper introduces Nezha, a consensus protocol designed to improve both performance and deployability for cloud tenants without requiring support from cloud providers. Nezha bridges the gap between readily deployable protocols such as Multi-Paxos and Raft and high-performance protocols such as NOPaxos, which depend on network features generally unavailable to tenants. Central to Nezha's operation is a new multicast primitive, deadline-ordered multicast (DOM), which relies on high-accuracy software clock synchronization to order message delivery among replicas.
Core Components and Methodology
Nezha targets a persistent tension in consensus protocols: achieving high throughput and low latency while operating within the constraints of public cloud infrastructure, where tenants typically lack advanced network features such as programmable switches. To address this, Nezha combines three techniques:
- Deadline-Ordered Multicast (DOM): Nezha uses clock synchronization to tag each message with a deadline, and replicas process messages in deadline order, which yields a consistent message ordering across replicas. DOM addresses ordering only; it does not by itself guarantee set equality (that every replica receives the same set of messages), which the rest of the protocol handles. This provides ordering without the in-network support that typical cloud environments lack.
- Speculative Execution: Nezha decouples request execution from commitment. The leader executes requests speculatively, before they are committed, which reduces client latency. If speculation fails, only a minor rerun of the affected request is needed, which keeps the cost lower than in existing methods.
- Stateless Proxies: Proxies take on much of the consensus workload (multicasting, clock synchronization, and quorum checks), which offloads this work from clients. This approach supports scalability without sacrificing the performance benefits, which eases adoption across cloud services.
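The deadline-ordering idea behind DOM can be illustrated with a small sketch. This is not the paper's implementation: the names `Message`, `tag`, and `DomReceiver`, the microsecond integer timestamps, the sender-ID tie-breaker, and the policy of simply rejecting late messages are all assumptions made for illustration. The sketch only shows how tagging messages with deadlines in synchronized time lets receivers that see different arrival orders release messages in the same order.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Message:
    # Ordering compares (deadline_us, sender_id); the payload is ignored.
    deadline_us: int                      # deadline in synchronized time
    sender_id: int                        # assumed tie-breaker for equal deadlines
    payload: str = field(compare=False)


def tag(payload, sender_id, send_time_us, latency_bound_us):
    """Sender side: deadline = synchronized send time + a latency bound."""
    return Message(send_time_us + latency_bound_us, sender_id, payload)


class DomReceiver:
    """Hypothetical receiver that releases messages in deadline order."""

    def __init__(self):
        self._buffer = []           # min-heap of messages awaiting their deadline
        self._last_released = None  # (deadline_us, sender_id) of the last release

    def receive(self, msg):
        """Buffer a message; return False if it arrived too late to be ordered."""
        key = (msg.deadline_us, msg.sender_id)
        if self._last_released is not None and key <= self._last_released:
            # Assumption: late messages are rejected here. A real protocol
            # would need another mechanism to recover them.
            return False
        heapq.heappush(self._buffer, msg)
        return True

    def release(self, now_us):
        """Return payloads whose deadlines have passed, in deadline order."""
        released = []
        while self._buffer and self._buffer[0].deadline_us <= now_us:
            msg = heapq.heappop(self._buffer)
            self._last_released = (msg.deadline_us, msg.sender_id)
            released.append(msg.payload)
        return released
```

In this sketch, two `DomReceiver` instances that receive the same messages in different arrival orders return the same list from `release`, as long as every message arrives before its deadline passes. A message that arrives after a later-deadline message has already been released is rejected, which is why ordering alone does not ensure that all replicas end up with the same set of messages.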
Empirical Evaluation
Nezha is evaluated against several baseline protocols: Multi-Paxos, Fast Paxos, Raft, and Optimized NOPaxos, as well as the more recent Domino and TOQ-EPaxos. Across both closed-loop and open-loop workloads, Nezha outperforms these alternatives in throughput by a median of 5.4x, with improvements ranging from 1.9x to 20.9x. Nezha also achieves lower latency than the baselines in most tests; the exceptions are narrow scenarios in which it is compared against highly optimized versions of competing protocols.
Implications and Future Work
Nezha uses clock synchronization purely as a performance enhancement, independent of its correctness guarantees, an approach that future protocols could build on. Its adaptability suggests a meaningful shift toward tenant-driven control over consensus-based applications, enabling broader deployment of fault-tolerant systems without significant cloud provider intervention.
Looking forward, possible developments include integrating Nezha into existing systems such as Kubernetes for comprehensive performance evaluation and exploring the role of DOM in decentralized resource management across distributed systems. Further refinement of Nezha for seamless integration and practical deployment across varied environments and workloads also remains a promising area of research.
In conclusion, the Nezha protocol offers a balanced and practical solution to existing challenges in consensus protocol deployment for cloud tenants, making it a noteworthy addition to the landscape of distributed systems research and cloud computing.