Privacy-Preserving Machine Learning
- PPML is a research field that uses cryptographic techniques and differential privacy to enable secure machine learning on sensitive data.
- It employs methods such as Secure Multi-Party Computation, Homomorphic Encryption, and Trusted Execution Environments to balance confidentiality with computational efficiency.
- PPML frameworks support applications in healthcare, finance, and IoT while mitigating adversarial risks like membership inference and data reconstruction.
Privacy-Preserving Machine Learning (PPML) is a research field addressing the challenge of enabling machine learning on sensitive data while guaranteeing rigorous privacy protection. PPML provides protocols and system architectures that allow for collaborative, outsourced, or federated model training and inference without exposing raw data, intermediate computation, or model parameters to untrusted parties. Techniques are grounded in formal threat models and cryptographic primitives—including Differential Privacy, Secure Multi-Party Computation, Homomorphic and Hybrid Homomorphic Encryption, Trusted Execution Environments, and emerging primitives such as Functional Encryption. The design and evaluation of PPML systems involve balancing data confidentiality, computational efficiency, communication complexity, and model utility under formal notions of adversarial risk, such as honest-but-curious or malicious security.
1. Core Threat Models and Privacy Guarantees
PPML models the adversary’s knowledge and capabilities along several axes, aiming to mitigate attacks such as membership inference, attribute/model inversion, data reconstruction, and property inference (Al-Rubaie et al., 2018, Zhang et al., 2024). Such attacks may attempt to determine whether a data point was in the training set, reconstruct features from model outputs, or infer sensitive properties of individuals. PPML frameworks articulate privacy guarantees against adversaries ranging from semi-honest (honest-but-curious, non-colluding) to malicious (actively deviating from the protocol), with the specific threat model dictating protocol design and efficiency—for example, requiring robustness and fairness in honest-majority settings, or tolerating arbitrary aborts (Koti et al., 2020, Lu et al., 2024).
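As a concrete illustration, the simplest membership-inference attacks threshold the model's per-example loss, since training points tend to incur lower loss than unseen points. The following toy sketch (with made-up loss values and an arbitrary threshold, not drawn from any cited work) conveys only the idea:

```python
import numpy as np

def membership_inference_by_loss(losses, threshold):
    # Predict "member" when the model's loss on a point falls below the
    # threshold: training points typically have lower loss than held-out ones.
    return losses < threshold

# Synthetic per-example losses (illustrative values only):
member_losses = np.array([0.05, 0.10, 0.08])     # points seen during training
nonmember_losses = np.array([0.90, 1.20, 0.70])  # held-out points
preds_members = membership_inference_by_loss(member_losses, threshold=0.5)
preds_nonmembers = membership_inference_by_loss(nonmember_losses, threshold=0.5)
```

Defenses such as DP directly limit how much any single training point can shift the loss distribution, which is what makes such attacks statistically hard.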
Differential Privacy (DP) formally bounds the influence of any single record on the model’s output; the (ε, δ)-DP definition ensures that the distribution of outputs changes only negligibly with the inclusion or exclusion of any one record (Guerra-Manzanares et al., 2023, Zhang et al., 2024). Cryptographic PPML methods guarantee semantic security under standard assumptions (e.g., LWE for HE, DDH for FE, and simulation-based/UC security for MPC) and resistance to information leakage beyond the prescribed outputs (such as only f(x) in FE-based systems).
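The (ε, δ)-DP guarantee referenced above is standardly stated as follows:

```latex
\textbf{Definition ($(\varepsilon,\delta)$-DP).} A randomized mechanism $\mathcal{M}$
satisfies $(\varepsilon,\delta)$-differential privacy if, for all neighboring
datasets $D, D'$ differing in a single record and all measurable output sets $S$,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta .
\]
```

Smaller ε and δ mean the two output distributions are harder to distinguish, so the presence or absence of any one record is better hidden.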
2. Foundational Cryptographic and Systems Techniques
PPML is built from several cryptographic and systems primitives, often combined for optimal privacy-utility tradeoffs (Zeng et al., 19 Jul 2025, Al-Rubaie et al., 2018, Lee et al., 2020):
- Differential Privacy: DP mechanisms add calibrated noise at various pipeline stages—data, gradients, model parameters. DP-SGD, PATE, and output perturbation are canonical instantiations (Guerra-Manzanares et al., 2023, Xu et al., 2021).
- Secure Multi-Party Computation (MPC): Uses arithmetic secret-sharing (Beaver triples, Shamir’s scheme), garbled circuits, and oblivious transfer to jointly compute functions without exposing individual inputs (Koti et al., 2020, Suresh, 2021, Lu et al., 2024). Systems like MPCLeague implement highly optimized n-party MPC over fixed-size rings for ML workloads, supporting linear/logistic regression and deep neural networks.
- Homomorphic Encryption (HE) and Hybrid Homomorphic Encryption (HHE): Enables outsourced, non-interactive encrypted inference and training. CKKS/BFV/Paillier allow arithmetic circuit evaluation on ciphertexts; HHE schemes (e.g., PASTA+BFV) further minimize client overhead via symmetric-key encapsulation (Frimpong et al., 2024, Nguyen et al., 2024).
- Functional Encryption (FE): IPFE and QFE primitives support direct computation of inner products or quadratic forms on encrypted data, returning f(x) in the clear while hiding everything else (Panzade et al., 2022).
- Trusted Execution Environments (TEEs): Secure enclaves such as Intel SGX provide hardware-backed protection of models and data at rest and during runtime. PPML frameworks employ remote attestation, file system encryption, and hardware memory isolation to mitigate OS- or VM-level threats (Lee et al., 2020).
- Federated Learning (FL) and Secure Aggregation: Clients collaboratively train global models while keeping data local; secure aggregation protocols (e.g., double-masking, Shamir's SS, HPRG masking) ensure servers learn only model updates (Guerra-Manzanares et al., 2023, Liu et al., 2022).
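To make the secret-sharing primitive concrete, here is a minimal sketch of additive secret sharing over the ring Z_2^64 used by ring-based MPC frameworks. It shows only the local-addition property; real protocols add multiplication (e.g., via Beaver triples) and malicious-security checks that this toy omits:

```python
import secrets

RING = 2**64  # fixed-size ring, as in ring-based MPC frameworks

def share(x, n=3):
    """Split x into n additive shares that sum to x modulo 2^64."""
    shares = [secrets.randbelow(RING) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % RING)
    return shares

def reconstruct(shares):
    """Recombine shares; any subset of n-1 shares reveals nothing about x."""
    return sum(shares) % RING

# Addition is local: each party adds its shares of x and y, and the
# resulting shares reconstruct to x + y without exposing either input.
x_shares, y_shares = share(20), share(22)
z_shares = [(a + b) % RING for a, b in zip(x_shares, y_shares)]
assert reconstruct(z_shares) == 42
```

The same local-linearity is what makes dot products and linear layers cheap in MPC; nonlinear operations are where protocols diverge in cost.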
3. Representative PPML Frameworks and Protocols
Several notable PPML frameworks have been developed to realize privacy-preserving ML in practical settings:
- Locally-Learn-Then-Merge Ensemble PPML: Multiple data-holding providers train local random forests, upload only encrypted models, and merge predictions at inference via secure protocols without revealing data or model parameters. Providers obtain accuracy within 1–2% of plaintext baseline and per-query prediction times under 2 minutes on real EHR datasets (Giacomelli et al., 2018).
- SWIFT: Robust, maliciously secure 3PC/4PC SOC system, rigorously guaranteeing output delivery (GOD) in honest-majority settings via joint-message-passing and efficient ring-based secret sharing. Demonstrates 2× speedup over prior robust PPML frameworks (FLASH, BLAZE) and scalable deep-network inference (Koti et al., 2020).
- MPCLeague (ASTRA, SWIFT, Tetrad, ABY2.0): Offers fair and robust variants for 2–4-party PPML across linear, logistic, and SVM workloads. Under WAN settings, achieves up to 100× reduction in communication and 10× in runtime compared to SecureML and ABY3 (Suresh, 2021).
- Multi-party Secure Broad Learning System (MSBLS): Employs a constant-round masking protocol for secure feature mapping, supporting neural network classifiers with no accuracy loss and low overhead under semi-honest security (Cao et al., 2022).
- Hybrid Homomorphic Encryption PPML (GuardML, ecgPPML): Uses a lightweight symmetric cipher (PASTA) embedded in a BFV wrapper to achieve up to 300× reduction in bandwidth and 4× speedup in client encryption over plain HE on edge devices; demonstrates near-baseline accuracy on ECG classification (Frimpong et al., 2024, Nguyen et al., 2024).
- Agentic-PPML: Decouples intent parsing (cleartext LLM) from privacy-critical computation (vertical specialist neural models under MPC/HE), enabling sub-three-minute secure inference for ResNet-50 with orders-of-magnitude acceleration over monolithic encrypted LLM inference (Zhang et al., 30 Jul 2025).
- Dropout-Resilient Secure Aggregation: Shamir’s SS with seed-homomorphic PRGs tolerates client dropout and provides up to 6× faster aggregation than SecAgg+ under both semi-honest and active adversaries (Liu et al., 2022).
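The secure-aggregation idea underlying such protocols can be sketched with pairwise masking: each pair of clients derives a common mask from a shared seed, one adds it and the other subtracts it, so all masks cancel in the server's sum. This toy (illustrative seeds, integer updates, no dropout handling) conveys only the cancellation property; production protocols layer Shamir sharing over the seeds to tolerate dropouts:

```python
import random

RING = 2**32  # updates and masks live in a fixed-size ring

def masked_update(client_id, update, pairwise_seeds):
    """Mask an update with one pairwise mask per peer. In each pair the
    lower-id client adds the mask and the higher-id client subtracts it,
    so all masks cancel when the server sums the masked updates."""
    masked = update % RING
    for other_id, seed in pairwise_seeds.items():
        mask = random.Random(seed).randrange(RING)  # both peers derive the same mask
        masked = (masked + mask) % RING if client_id < other_id else (masked - mask) % RING
    return masked

seeds = {(0, 1): 111, (0, 2): 222, (1, 2): 333}  # illustrative shared seeds
updates = {0: 5, 1: 7, 2: 9}
masked = {}
for cid, u in updates.items():
    peers = {o: s for (a, b), s in seeds.items() if cid in (a, b) for o in (a, b) if o != cid}
    masked[cid] = masked_update(cid, u, peers)

# The server learns only the sum of the true updates:
assert sum(masked.values()) % RING == sum(updates.values()) % RING
```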
4. Algorithmic and Model-Level Innovations
Algorithmic advances optimize both linear and nonlinear layer evaluation for PPML (Zeng et al., 19 Jul 2025):
- Winograd/block-circulant convolutions and layer fusion techniques minimize multiplications and round complexity.
- Quantization (binary, mixed precision, dyadic) is used to decrease communication and computational cost while preserving model accuracy.
- Polynomial approximation of nonlinear activations (ReLU → quadratic, sigmoid → piecewise) enables homomorphic/SS-friendly evaluation.
- Cross-level co-design blends protocol improvements with model structure, leveraging compiler automation (CHET, EVA, HELayers) and GPU acceleration (TensorFHE, Cheddar, Over100x).
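The activation-approximation step above can be illustrated with a quadratic least-squares fit to ReLU on a bounded range; the range and degree here are illustrative choices, and deployed systems typically use carefully constructed minimax polynomials instead:

```python
import numpy as np

# HE and secret-sharing backends evaluate additions and multiplications
# cheaply but not max(0, x), so ReLU is replaced by a low-degree polynomial.
xs = np.linspace(-4, 4, 401)
relu = np.maximum(xs, 0.0)
coeffs = np.polyfit(xs, relu, deg=2)  # least-squares quadratic fit on [-4, 4]

def relu_poly(x):
    # Polynomial evaluation: only + and *, hence crypto-friendly.
    return np.polyval(coeffs, x)

max_err = float(np.max(np.abs(relu_poly(xs) - relu)))
# The fit is a reasonable surrogate inside [-4, 4] but diverges outside it,
# which is why inputs are normalized before encrypted evaluation.
```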
5. System-Level Implementations and Benchmarks
Integrated PPML systems have matured to support real-time ML workloads, even under WAN or resource-constrained environments:
- SGX + Graphene pipelines allow transparent encryption of files and models with zero code change for PyTorch applications; overhead is near-native except for enclave memory paging (Lee et al., 2020).
- GPU-optimized malicious secure MPC achieves 2–3× throughput gains on convolutions and dot-products for CNN inference compared to prior frameworks (VGG-16 in ~1 s per batch) (Lu et al., 2024).
- Comparative benchmarks show that robust secret-sharing, efficient dot-product, and constant-round truncation deliver up to 600× speedup over ABY3 across multiple ML algorithms (see the table in Patra et al., 2020).
| Protocol/Framework | Security Model | Online Latency Reduction | Key Feature |
|---|---|---|---|
| BLAZE | 3PC, semi/malicious | 50–600× over ABY3 | Communication-independent dot product |
| SWIFT | 3/4PC, robust | 2× over FLASH | Guaranteed output delivery (GOD) |
| GuardML | HHE, 3-party | 4× over BFV client | 300× bandwidth reduction (edge-compatible) |
| MPCLeague (ASTRA) | 2–4PC, fair/robust | 4–100× over ABY3 | Multi-input gates, minimal rounds |
6. Privacy–Utility Trade-Offs and Application Domains
Designers must consider the often sharp trade-off between privacy and utility (Zhang et al., 2024, Xu et al., 2021):
- Stronger DP (smaller ε) implies higher test loss: e.g., DP-SGD on CIFAR-10 shows 75% accuracy for ε=1 versus 88% with no DP (Zhang et al., 2024).
- HE-based inference is ~1000× slower than plaintext but supports exact computation; HHE and GPU acceleration mitigate this overhead, but scalability to billion-parameter LLMs remains an open challenge (Zhang et al., 30 Jul 2025).
- Applications include multi-institutional healthcare predictive analytics, cross-bank fraud detection, IoT anomaly detection, secure MLaaS, and collaborative vision/surveillance systems (Giacomelli et al., 2018, Cao et al., 2022, Guerra-Manzanares et al., 2023).
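The DP side of this trade-off originates in the per-example clipping and noising step of DP-SGD. The numpy sketch below (illustrative hyperparameters; no privacy accounting, which a real deployment needs in order to report ε) shows how each example's influence is bounded before noise is added:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to L2 norm <= clip_norm (bounding any
    one record's influence), sum, add calibrated Gaussian noise, and average."""
    clipped = [
        g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
        for g in per_example_grads
    ]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

rng = np.random.default_rng(0)
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # L2 norms 5.0 and 0.5
# With noise disabled, clipping scales the first gradient down to norm 1 and
# leaves the second untouched: the mean is ([0.6, 0.8] + [0.3, 0.4]) / 2.
step = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

A larger noise_multiplier yields stronger privacy under the accountant at the cost of noisier updates, which is exactly the accuracy gap reported for CIFAR-10 above.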
7. Future Directions and Open Problems
Contemporary research efforts highlight several directions:
- Scalability of cryptographic protocols to LLM-scale models, block-sparse transformer attention, and dynamic federated domains (Zeng et al., 19 Jul 2025, Zhang et al., 30 Jul 2025).
- Hybrid approaches—combining DP, HE, MPC, TEEs, and FE for layered privacy under compositional threat models (Zhang et al., 2024, Panzade et al., 2022).
- Explainability and fairness under privacy constraints, with formalized privacy-explainability trade-offs (Guerra-Manzanares et al., 2023).
- Machine unlearning and adaptive privacy budgeting to comply with GDPR/HIPAA in streaming or online learning contexts (Xu et al., 2021, Zhang et al., 2024).
- Hardware acceleration through scalable HE compilers and native GPU/PIM integration to unlock practical encrypted training/inference (Zeng et al., 19 Jul 2025, Lu et al., 2024).
PPML has transitioned from foundational proofs-of-concept to practical systems supporting diverse ML tasks. Ongoing progress in protocol and system-level optimization, cross-disciplinary co-design, and rigorous empirical benchmarking continues to shape the future of privacy-preserving machine learning.