Dual-Pass Inference Protocol
- Dual-Pass Inference Protocol is a computational framework that divides the inference task into two distinct passes across different security and efficiency domains.
- It optimizes neural network inference by delegating initial processing to secure computation (e.g., homomorphic encryption or MPC) and then executing subsequent operations in plaintext to reduce latency and bandwidth.
- The protocol also supports high-throughput LLM decoding via speculative sampling, ensuring lossless token generation while maintaining robust privacy and performance guarantees.
A dual-pass inference protocol refers to a class of algorithms and secure computation schemes in which the inference task is decomposed into two distinct computational passes, frequently spanning different entities, trust boundaries, or modes of operation. This approach aims to optimize computational efficiency, bandwidth, and privacy, typically in the context of neural network inference, secure multiparty computation, or efficient LLM decoding. Notable dual-pass paradigms include cryptography-augmented protocols for privacy-preserving machine learning and parallel speculative strategies for high-throughput sequence generation.
1. Protocol Foundations and Threat Models
Dual-pass inference protocols rely on splitting the inference process across two phases, which are assigned to different computational environments or leveraged for distinct security or efficiency benefits. Paradigmatic uses include secure client-server inference and accelerated LLM sampling.
- Secure Inference (Split HE, C²PI): Here, a model is split into segments processed under different cryptographic guarantees. Threat models are typically semi-honest (honest-but-curious), with distinct confidentiality goals for data and model weights. The client possesses private input data, while the server holds a proprietary model and its weights. The client and server exchange intermediate computation (either encrypted activations or empirically privacy-tolerant features), ensuring that sensitive data or proprietary model information is not exposed to adversarial analysis beyond declared leakage-tolerance levels (Pereteanu et al., 2022, Zhang et al., 2023).
- High-Throughput Generation (PaSS): The LLM operates in a dual-pass fashion, first drafting candidate continuations speculatively, then validating those drafts for sampling correctness. This is not motivated by privacy, but by amortizing the high memory-bandwidth cost of LLM decoding over multiple tokens (Monea et al., 2023).
2. Protocol Decomposition and Stepwise Execution
Secure Neural Inference (Split HE, C²PI)
For models partitioned into sequential layers, two or more protocol passes are assigned as follows:
- Split HE (Pereteanu et al., 2022):
- Offline Preprocessing: The server splits the model at two cut points, sending the central subnetwork (in plaintext) to the client. The client generates cryptographic keys for the CKKS homomorphic encryption (HE) scheme.
- Pass 1 (Encrypted Client-to-Server): The client encrypts its input and transmits it; the server homomorphically evaluates the first subnetwork under HE, returning an encrypted activation.
- Plaintext Local Pass: The client decrypts, then computes the central subnetwork on the plaintext intermediate result.
- Pass 2 (Encrypted Client-to-Server/Server-to-Client): The client re-encrypts the intermediate output and returns it; the server evaluates the final subnetwork homomorphically, transmitting the final ciphertext to the client, who decrypts to obtain the inference result.
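The two encrypted passes and the plaintext local pass can be sketched structurally. This is a minimal illustration in which a wrapper class stands in for a CKKS ciphertext and toy linear functions stand in for the three subnetworks; a real deployment would use an HE library such as TenSEAL, and all names here are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Ciphertext:
    """Stand-in for a CKKS ciphertext: wraps the value so the server-side
    code never reads plaintext directly (a structural mock, not real HE)."""
    value: float


def encrypt(x):
    return Ciphertext(x)


def decrypt(c):
    return c.value


def hom_eval(f, c):
    # "Homomorphic" evaluation: apply f under the ciphertext wrapper.
    return Ciphertext(f(c.value))


# Toy subnetworks: server holds f1 and f3; client runs f2 in plaintext.
f1 = lambda x: 2.0 * x
f2 = lambda x: x + 1.0
f3 = lambda x: 3.0 * x


def split_he_inference(x):
    c1 = encrypt(x)             # Pass 1: client -> server (encrypted)
    a1 = hom_eval(f1, c1)       # server evaluates f1 under HE
    mid = f2(decrypt(a1))       # client decrypts, runs f2 locally
    c2 = encrypt(mid)           # Pass 2: client re-encrypts and sends
    a2 = hom_eval(f3, c2)       # server evaluates f3 under HE
    return decrypt(a2)          # client decrypts the final result


# f3(f2(f1(x))) = 3 * (2x + 1)
print(split_he_inference(2.0))  # 15.0
```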
- C²PI (Zhang et al., 2023):
- Partitioning: The neural network is split at a boundary layer.
- Pass 1 (Crypto): The early layers are computed using secret-sharing MPC. At the boundary, the shares of the intermediate feature map are reconstructed.
- Pass 2 (Clear): The remainder of the network is executed unencrypted; the server returns the prediction to the client. The optimal boundary is determined by balancing empirical privacy leakage (measured via DINA) against resource cost.
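The crypto pass rests on additive secret sharing with reconstruction at the boundary. A minimal sketch, assuming public weights for brevity (C²PI itself keeps weights on the server and uses precomputed Beaver triples for multiplications); all names and numbers are illustrative.

```python
import random

P = 2**61 - 1  # prime modulus for additive secret sharing


def share(x):
    """Split x into two additive shares: x = (s0 + s1) mod P."""
    s0 = random.randrange(P)
    return s0, (x - s0) % P


def reconstruct(s0, s1):
    return (s0 + s1) % P


def linear_on_share(s, w, b, add_bias):
    """Evaluate y = w*x + b share-wise for a public weight w: each party
    scales its own share; exactly one party adds the bias term."""
    return (w * s + (b if add_bias else 0)) % P


x = 42
s0, s1 = share(x)
w, b = 3, 7
y0 = linear_on_share(s0, w, b, add_bias=True)   # party 0's share of y
y1 = linear_on_share(s1, w, b, add_bias=False)  # party 1's share of y
# Boundary reconstruction, as at the C2PI crypto/clear hand-off:
assert reconstruct(y0, y1) == (w * x + b) % P
```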
Parallel Speculative Sampling (PaSS) (Monea et al., 2023)
PaSS implements a two-stage procedure for LLM decoding at each generation step:
- Drafting Pass: The input sequence is extended by L learnable “look-ahead” tokens; a single forward pass produces the regular next-token logits and up to L speculative logits conditioned on the look-ahead positions. Drafted tokens are sampled from these distributions.
- Validation Pass: For each speculative token x̃_i, the acceptance ratio
  α_i = min(1, p(x̃_i | x, x̃_<i) / p̂(x̃_i | x))
  is computed, where p is the model's autoregressive distribution and p̂ the drafting distribution. Each draft token is accepted or rejected via sampling against α_i; if rejected, the next token is freshly sampled from the renormalized “residual” conditional max(0, p − p̂). In the best case, many drafts are accepted, reducing the per-token cost.
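The validation pass can be sketched as standard speculative rejection sampling. A minimal pure-Python sketch, with target and drafting distributions passed in as plain lists; `validate_drafts` and its parameters are illustrative, not the paper's API.

```python
import random


def validate_drafts(drafts, p_dists, q_dists, rng=random.random):
    """Accept/reject drafted tokens so committed tokens follow the target
    distribution p exactly (standard speculative-sampling validation).
    drafts[i]  : drafted token id at speculative position i
    p_dists[i] : target-model distribution at position i
    q_dists[i] : drafting distribution at position i
    """
    accepted = []
    for tok, p, q in zip(drafts, p_dists, q_dists):
        ratio = min(1.0, p[tok] / q[tok]) if q[tok] > 0 else 0.0
        if rng() < ratio:
            accepted.append(tok)  # draft accepted
            continue
        # Rejected: resample from the residual distribution max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        z = sum(residual)
        if z > 0:
            r, acc = rng() * z, 0.0
            for t, w in enumerate(residual):
                acc += w
                if r <= acc:
                    accepted.append(t)
                    break
        break  # stop at the first rejection
    return accepted
```

For example, with target p = [0.9, 0.1], draft distribution q = [0.1, 0.9], and drafted token 1, the acceptance ratio is 1/9, so the draft is usually rejected and token 0 is resampled from the residual.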
3. Security and Privacy Guarantees
Model and Data Confidentiality
- Split HE: Ensures data confidentiality by never exposing the client's input in the clear to the server; model confidentiality is user-configurable: the client receives only a subnetwork and cannot reconstruct or extract the remaining layers significantly beyond random chance, as validated via model extraction analysis on CIFAR-10 and MNIST.
- C²PI: Guarantees model privacy formally on crypto-protected segments; data confidentiality on the clear pass is limited by empirical leakage, which is quantified by applying trained inference data privacy attacks (IDPAs), specifically the distillation-based inverse-network attack (DINA). Leakage is upper-bounded by a user-specified SSIM threshold, and the protocol boundary is chosen accordingly (Zhang et al., 2023).
Lossless Decoding Guarantee
- PaSS: Guarantees output tokens are distributed identically to pure autoregressive decoding; speculative drafts are validated, with rejection sampling correcting any discrepancies to preserve the original sequence distribution (Monea et al., 2023).
4. Complexity, Communication, and Empirical Performance
Secure Inference Protocols
| Protocol | End-to-End Latency (a) | Communication (b) | Relative Speedup |
|---|---|---|---|
| Delphi | 12.9 s (CIFAR-10) | 1236 MB | 1× |
| Split HE | 4.76 s (CIFAR-10) | 4.3 MB | 2.7×–10× |
| C²PI (c) | 2351–4409 s (VGG16/19, LAN) | 71.9–5143 MB | 1.46×–2.9× |
(a) “Latency” refers to wall-clock inference time; (b) communication for a representative protocol run; (c) compared to the Delphi or Cheetah baseline as reported.
- Split HE reduces communication by up to 290× and is 2.5×–10× faster than prior work (Gazelle, Delphi, Concrete, Falcon/EVA).
- C²PI achieves up to 2.89× speedup over Delphi and 2.75× lower communication than Cheetah, at the cost of empirically bounded data leakage (Pereteanu et al., 2022, Zhang et al., 2023).
Parallel Speculative Sampling (PaSS)
- On a 7B-parameter LLaMA model (512 tokens, 32-token prompt): baseline autoregressive decoding takes 12.5 s; PaSS (L = 4) achieves 9.8 s (a 22% speed-up); up to 30% speed-up is observed for low-temperature sampling.
- The protocol requires only the L look-ahead embeddings as extra parameters (e.g., 16,384 parameters for L = 4).
- The predicted speed-up is upper-bounded by L + 1, attained when all drafts are accepted (Monea et al., 2023).
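Under the simplifying (hypothetical) assumption that each of the L drafts is accepted independently with probability β and commitment stops at the first rejection, the expected number of tokens committed per pass can be computed directly; β = 1 recovers the L + 1 upper bound. This is an illustrative model, not the paper's measured acceptance behavior.

```python
def expected_tokens_per_pass(beta, L):
    """Expected tokens committed per validation pass, assuming each of
    the L drafts is accepted independently with probability beta and
    validation stops at the first rejection (one token is always
    committed). Equals the geometric sum 1 + beta + ... + beta**L."""
    return sum(beta**k for k in range(L + 1))


# Upper bound: all drafts accepted (beta = 1) gives L + 1 tokens per pass.
print(expected_tokens_per_pass(1.0, 4))            # 5.0
print(round(expected_tokens_per_pass(0.8, 4), 3))  # 3.362
```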
5. Empirical Security Evaluation and Leakage Measurement
Security evaluations incorporate both theoretical and empirical leakage analyses:
- Model Extraction and Membership Inference: For Split HE, model-extraction attack fidelity drops to 38–59% as more layers are kept encrypted, much lower leakage than prior SplitNNs (e.g., 95% on MNIST with Hall et al.’s protocol) (Pereteanu et al., 2022).
- Inference Data Privacy Attacks (IDPAs): In C²PI, leakage is quantified using the DINA methodology, which attempts to reconstruct the private input from the feature map revealed at the boundary. The boundary is set such that the average SSIM leakage falls below a user-specified threshold.
6. Protocol Variants and Extensions
Extensions in Secure Inference
- Split Points Optimization: Both Split HE and C²PI optimize the choice of neural network split points according to latency, bandwidth, and privacy-leakage tradeoffs. Empirical privacy analyses using reconstruction fidelity or DINA leakage drive the optimal segmentation.
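Split-point selection can be sketched as a constrained search: minimize the cost of the crypto-protected prefix subject to an empirical leakage bound. The cost and leakage arrays below are toy numbers; in practice they would come from profiling and a DINA-style attack, and `choose_split` is an illustrative helper, not an API from either paper.

```python
def choose_split(crypto_cost, leakage, max_leakage):
    """Pick the boundary layer minimizing cumulative crypto cost subject
    to an empirical leakage bound.
    crypto_cost[i]: cost of running layers 0..i under MPC/HE
    leakage[i]    : estimated leakage (e.g., SSIM of a DINA-style
                    reconstruction) if layer i's output is revealed
    Returns the chosen layer index, or None if no layer is feasible."""
    feasible = [i for i, l in enumerate(leakage) if l <= max_leakage]
    if not feasible:
        return None
    return min(feasible, key=lambda i: crypto_cost[i])


# Toy numbers: leakage falls with depth while crypto cost rises, so the
# earliest feasible boundary is cheapest.
cost = [1, 3, 7, 15]
leak = [0.9, 0.5, 0.2, 0.1]
print(choose_split(cost, leak, max_leakage=0.3))  # 2
```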
- Cryptographic Primitives: Split HE is based on CKKS homomorphic encryption (SEAL via TenSEAL backend); C²PI exploits additive secret-sharing and precomputed Beaver triples to minimize online MPC cost for early layers (Pereteanu et al., 2022, Zhang et al., 2023).
Enhancements in Speculative Decoding
- Varied look-ahead embedding strategies (across layers or heads).
- Dynamic adjustment of the speculative window to maximize draft acceptance rates.
- Integration with key-value caching or alternative speculative methods (e.g., greedy sampling blends) (Monea et al., 2023).
7. Comparison to Related Techniques
| Protocol | Security Model | Efficiency Optimization | Application Domain |
|---|---|---|---|
| Split HE | Formal (HE) + split | Minimize HE, local compute | Private DNN inference |
| C²PI | Formal+Empirical (MPC) | Reduce MPC layers, empirical privacy | Private DNN inference |
| PaSS | Probabilistic, lossless | Amortized memory access | LLM sequence generation |
- Prior secure inference methods (e.g., Gazelle, Delphi, Concrete) encrypt and transmit all activations, which increases both latency and bandwidth requirements. Hall et al.’s protocol is limited to small networks and exhibits high model extraction leakage (Pereteanu et al., 2022, Zhang et al., 2023).
- PaSS is distinct in its focus on lossless, high-throughput decoding for LLMs and avoids the need for a second model or deep coupling with the tokenizer as in classical speculative sampling (Monea et al., 2023).
The dual-pass inference protocol thus encompasses both rigorous privacy-preserving inference for neural network models across trust boundaries and efficient high-throughput LLM decoding. These disparate research areas are unified by a common structural principle: a sequence of passes that partitions computation for favorable privacy, bandwidth, and/or efficiency trade-offs. Key implementations, including Split HE, C²PI, and PaSS, demonstrate practical utility and competitive security/efficiency trade-offs among state-of-the-art protocols (Pereteanu et al., 2022, Monea et al., 2023, Zhang et al., 2023).