Leveraging ASIC AI Chips for Homomorphic Encryption

Published 13 Jan 2025 in cs.CR, cs.AR, cs.CL, and cs.PL | (2501.07047v2)

Abstract: Cloud-based services are making the outsourcing of sensitive client data increasingly common. Although homomorphic encryption (HE) offers strong privacy guarantees, it requires substantially more resources than computing on plaintext, often leading to unacceptably high latencies. HE accelerators have emerged to mitigate this latency issue, but at the high cost of custom ASICs. In this paper we show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud. Adapting such accelerators for HE requires (1) supporting modular multiplication, (2) high-precision arithmetic in software, and (3) efficient mapping onto matrix engines. We introduce the CROSS compiler, which (1) adopts Barrett reduction to provide modular reduction support using multipliers and adders, (2) applies Basis Aligned Transformation (BAT) to convert high-precision multiplication into low-precision matrix-vector multiplication, and (3) applies Matrix Aligned Transformation (MAT) to convert vectorized modular operations with reduction into matrix multiplications that can be processed efficiently on a 2D spatial matrix engine. Our evaluation of CROSS on a Google TPUv4 demonstrates significant performance improvements, with up to 161x and 5x speedups over prior work on many-core CPUs and a V100 GPU, respectively. The kernel-level code is open-sourced at https://github.com/google/jaxite/tree/main/jaxite_word.

Summary

  • The paper demonstrates mapping HE primitives onto ASIC AI accelerators, achieving up to 161x speedup over many-core CPU baselines and up to 5x over a V100 GPU.
  • It employs Barrett reduction, chunk-decomposition, and matrix transformations to adapt TPUs for high-precision HE arithmetic.
  • The study highlights scalable performance improvements that enable secure and cost-effective cloud data processing for HE tasks.

Leveraging ASIC AI Chips for Homomorphic Encryption: A Critical Evaluation

The paper under review presents a method to exploit existing ASIC AI accelerators, specifically Google's Tensor Processing Units (TPUs), to accelerate Homomorphic Encryption (HE) workloads. By exploiting the wide deployment of TPUs in cloud environments, this work aims to close the performance gap between HE tasks and traditional processing platforms without requiring specialized hardware modifications.

Overview of Homomorphic Encryption and Challenges

Homomorphic Encryption allows operations on encrypted data, promising strong privacy guarantees for sensitive information processed in untrusted environments. Despite its advantages, HE incurs substantial computational overhead due to the enlarged data size after encryption and increased computational complexity—often requiring hundreds to thousands of times more resources than computation over plaintext. These overheads make it imperative to seek performance enhancements through hardware acceleration.

Custom ASICs designed for HE, like CraterLake, showcase significant performance improvements over general-purpose hardware. However, the cost and development time of a custom ASIC are prohibitive, driving the exploration of widely available AI accelerators, which already offer high parallelism and substantial on-chip memory, both key factors for HE workloads.

Proposed Approach: Mapping HE Primitives to AI Acceleration

The authors detail a methodology for adapting existing AI accelerators to process HE efficiently. Three main strategies are introduced:

  1. Barrett Reduction for Modular Operations: AI accelerators lack native support for modular arithmetic crucial in HE. The authors propose using Barrett reduction to manage these operations, converting them into simpler arithmetic operations that the hardware can execute efficiently.
  2. Chunk-Decomposition for High-Precision Arithmetic: To handle HE’s high-precision data with low-precision AI hardware, data is decomposed into "chunks," allowing high-precision operations to be conducted through multiple low-precision calculations.
  3. Basis Aligned and Matrix Aligned Transformations (BAT and MAT): These transformations recast high-precision operations into forms that exploit the parallelism of the TPU's systolic-array architecture. Specifically, BAT turns chunk-wise multiplications into dense matrix-vector products, while MAT batches vectorized modular operations into matrix-matrix products, improving throughput and utilization of the matrix engine.
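The Barrett-reduction idea in step 1 can be sketched in a few lines of plain Python. This is an illustrative reimplementation of the textbook algorithm, not code from the paper or from jaxite; the names `k` and `mu` follow conventional Barrett notation.

```python
def barrett_reduce(x: int, q: int) -> int:
    """Compute x mod q for 0 <= x < q*q using only multiplies, shifts,
    and subtractions -- the operations a matrix engine natively supports."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q     # precomputed once per modulus
    t = (x * mu) >> (2 * k)      # cheap estimate of the quotient x // q
    r = x - t * q                # candidate remainder, possibly a bit large
    while r >= q:                # at most a couple of corrective steps
        r -= q
    return r
```

Because `mu` underestimates 2^(2k)/q, the estimated quotient `t` never exceeds the true one, so `r` stays non-negative and only a bounded number of conditional subtractions are needed in place of a hardware division.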
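Steps 2 and 3 (chunk decomposition feeding a BAT-style layout) can be illustrated with NumPy: one operand's chunks are arranged into a Toeplitz-style matrix so the schoolbook product becomes a single matrix-vector multiply, the shape a systolic array executes natively. The chunk width, helper names, and layout below are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

B = 8   # chunk width in bits (assumed for illustration)
N = 4   # chunks per operand, i.e. 32-bit values

def to_chunks(x: int) -> np.ndarray:
    """Decompose x into N base-2**B chunks, least-significant first."""
    return np.array([(x >> (B * i)) & ((1 << B) - 1) for i in range(N)],
                    dtype=np.int64)

def chunk_matrix(x: int) -> np.ndarray:
    """Toeplitz-style matrix whose product with a chunk vector yields the
    2N-1 anti-diagonal sums of the schoolbook multiplication table."""
    c = to_chunks(x)
    M = np.zeros((2 * N - 1, N), dtype=np.int64)
    for i in range(N):
        for j in range(N):
            M[i + j, j] = c[i]
    return M

def wide_mul(x: int, y: int) -> int:
    diag = chunk_matrix(x) @ to_chunks(y)   # one matrix-vector multiply
    # Carry propagation: recombine the diagonal sums into one wide integer.
    return sum(int(d) << (B * k) for k, d in enumerate(diag))
```

Entry k of `diag` collects every partial product whose chunk indices sum to k; the final shift-and-add is the carry propagation that turns those low-precision sums back into the high-precision result.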
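In the same spirit, a MAT-style batching can be sketched by stacking the chunk vectors of many coefficients as columns, so a single matrix-matrix multiply processes an entire vector of wide multiplications at once. Again, the names and layout are illustrative assumptions rather than the paper's scheme.

```python
import numpy as np

B, N = 8, 4   # chunk width in bits and chunks per 32-bit value (assumed)

def to_chunks(x: int) -> list[int]:
    return [(x >> (B * i)) & ((1 << B) - 1) for i in range(N)]

def chunk_matrix(x: int) -> np.ndarray:
    M = np.zeros((2 * N - 1, N), dtype=np.int64)
    for i, ci in enumerate(to_chunks(x)):
        for j in range(N):
            M[i + j, j] = ci
    return M

# One scalar multiplier applied to a whole vector of coefficients:
scalar = 0x0BADF00D
coeffs = [0x12345678, 0x9ABCDEF0, 0x0F0F0F0F]
V = np.stack([to_chunks(c) for c in coeffs], axis=1)  # shape (N, num_coeffs)
D = chunk_matrix(scalar) @ V                          # one matmul for all
products = [sum(int(D[k, c]) << (B * k) for k in range(2 * N - 1))
            for c in range(len(coeffs))]
```

The matmul replaces a loop of independent vector operations with one dense 2D operation, which is exactly the utilization pattern the TPU's matrix engine is built for.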

Evaluation and Benchmarking

The paper presents a comprehensive evaluation, comparing the performance of the proposed method on TPUv4 against several contemporary platforms, including traditional CPUs, GPUs, and FPGAs, as well as dedicated HE ASICs. Key findings indicate up to 161x speedup over many-core CPUs and up to 5x over a V100 GPU, with significant advantages over many FPGA designs, although the approach still lags behind the best custom HE ASICs owing to the absence of dedicated modular-arithmetic units.

The evaluation also reveals vital insights about the scalability of this approach with respect to parameter changes and workload size, highlighting the compatibility of TPUs with a wide range of HE settings. This is crucial for practical deployment as it showcases the potential of leveraging existing cloud-based infrastructures to perform heavy computational tasks associated with HE.

Implications and Future Directions

The implications of this work are significant for both theoretical research and practical deployment of HE. It presents a cost-effective pathway to leverage existing AI resources for enhanced data privacy applications—an attractive proposition amid growing concerns about data security in cloud environments.

Future directions could include exploring hybrid systems that incorporate minimal hardware modifications to further bridge the performance gap with specialized HE accelerators. Another avenue is implementing the approach with emerging AI accelerators not yet integrated into current cloud offerings but potentially more amenable to optimization for cryptographic workloads.

Conclusion

This paper underscores the feasibility of adapting widespread AI hardware to advanced cryptographic tasks such as HE, providing a functional and efficient solution without costly hardware customization. It sets a precedent for future research in aligning the capabilities of AI accelerators with the rigorous demands of cryptographic computation, so that privacy-preserving processing can keep pace with rapid advances in AI hardware.
