Dual-Pass Inference Protocol
- Dual-Pass Inference Protocol is a computational framework that divides the inference task into two distinct passes across different security and efficiency domains.
- It optimizes neural network inference by delegating initial processing to secure computation (e.g., homomorphic encryption or MPC) and then executing subsequent operations in plaintext to reduce latency and bandwidth.
- The protocol also supports high-throughput LLM decoding via speculative sampling, ensuring lossless token generation while maintaining robust privacy and performance guarantees.
A dual-pass inference protocol refers to a class of algorithms and secure computation schemes in which the inference task is decomposed into two distinct computational passes, frequently spanning different entities, trust boundaries, or modes of operation. This approach aims to optimize computational efficiency, bandwidth, and privacy, typically in the context of neural network inference, secure multiparty computation, or efficient LLM decoding. Notable dual-pass paradigms include cryptography-augmented protocols for privacy-preserving machine learning and parallel speculative strategies for high-throughput sequence generation.
1. Protocol Foundations and Threat Models
Dual-pass inference protocols rely on splitting the inference process across two phases, which are assigned to different computational environments or leveraged for distinct security or efficiency benefits. Paradigmatic uses include secure client-server inference and accelerated LLM sampling.
- Secure Inference (Split HE, C²PI): Here, a model is split into segments processed under different cryptographic guarantees. Threat models are typically semi-honest (honest-but-curious), with distinct confidentiality goals for data and model weights. The client possesses private input data, while the server holds a proprietary model and its weights. The client and server exchange intermediate computation (either encrypted activations or empirically privacy-tolerant features), ensuring that sensitive data or proprietary model information is not exposed to adversarial analysis beyond declared leakage-tolerance levels (Pereteanu et al., 2022, Zhang et al., 2023).
- High-Throughput Generation (PaSS): The LLM operates in a dual-pass fashion, first drafting candidate continuations speculatively, then validating those drafts for sampling correctness. This is not motivated by privacy, but by amortizing the high memory-bandwidth cost of LLM decoding over multiple tokens (Monea et al., 2023).
2. Protocol Decomposition and Stepwise Execution
Secure Neural Inference (Split HE, C²PI)
For models partitioned into sequential layers, two or more protocol passes are assigned as follows:
- Split HE (Pereteanu et al., 2022):
- Offline Preprocessing: The server splits the model at two cut points, sending the central subnetwork (in plaintext) to the client. The client generates cryptographic keys for the CKKS homomorphic encryption (HE) scheme.
- Pass 1 (Encrypted Client-to-Server): The client encrypts its input and transmits it; the server homomorphically evaluates the first subnetwork under HE, returning an encrypted activation.
- Plaintext Local Pass: The client decrypts, then computes the central subnetwork on the plaintext intermediate result.
- Pass 2 (Encrypted Client-to-Server/Server-to-Client): The client re-encrypts the intermediate output and returns it; the server evaluates the final subnetwork homomorphically, transmitting the final ciphertext to the client, who decrypts to obtain the inference result.
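The two encrypted passes and the plaintext local pass can be sketched structurally. This is a minimal illustration in which a wrapper class stands in for a CKKS ciphertext and toy linear functions stand in for the three subnetworks; a real deployment would use an HE library such as TenSEAL, and all names here are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Ciphertext:
    """Stand-in for a CKKS ciphertext: wraps the value so the server-side
    code never reads plaintext directly (a structural mock, not real HE)."""
    value: float


def encrypt(x):
    return Ciphertext(x)


def decrypt(c):
    return c.value


def hom_eval(f, c):
    # "Homomorphic" evaluation: apply f under the ciphertext wrapper.
    return Ciphertext(f(c.value))


# Toy subnetworks: server holds f1 and f3; client runs f2 in plaintext.
f1 = lambda x: 2.0 * x
f2 = lambda x: x + 1.0
f3 = lambda x: 3.0 * x


def split_he_inference(x):
    c1 = encrypt(x)             # Pass 1: client -> server (encrypted)
    a1 = hom_eval(f1, c1)       # server evaluates f1 under HE
    mid = f2(decrypt(a1))       # client decrypts, runs f2 locally
    c2 = encrypt(mid)           # Pass 2: client re-encrypts and sends
    a2 = hom_eval(f3, c2)       # server evaluates f3 under HE
    return decrypt(a2)          # client decrypts the final result


# f3(f2(f1(x))) = 3 * (2x + 1)
print(split_he_inference(2.0))  # 15.0
```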
- C²PI (Zhang et al., 2023):
- Partitioning: The neural network is split at a boundary layer.
- Pass 1 (Crypto): The early layers are computed using secret-sharing MPC. At the boundary, the shares of the intermediate feature map are reconstructed.
- Pass 2 (Clear): The remainder of the network is executed unencrypted; the server returns the prediction to the client. The optimal boundary is determined by balancing empirical privacy leakage (measured via DINA) against resource cost.
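The crypto pass rests on additive secret sharing with reconstruction at the boundary. A minimal sketch, assuming public weights for brevity (C²PI itself keeps weights on the server and uses precomputed Beaver triples for multiplications); all names and numbers are illustrative.

```python
import random

P = 2**61 - 1  # prime modulus for additive secret sharing


def share(x):
    """Split x into two additive shares: x = (s0 + s1) mod P."""
    s0 = random.randrange(P)
    return s0, (x - s0) % P


def reconstruct(s0, s1):
    return (s0 + s1) % P


def linear_on_share(s, w, b, add_bias):
    """Evaluate y = w*x + b share-wise for a public weight w: each party
    scales its own share; exactly one party adds the bias term."""
    return (w * s + (b if add_bias else 0)) % P


x = 42
s0, s1 = share(x)
w, b = 3, 7
y0 = linear_on_share(s0, w, b, add_bias=True)   # party 0's share of y
y1 = linear_on_share(s1, w, b, add_bias=False)  # party 1's share of y
# Boundary reconstruction, as at the C2PI crypto/clear hand-off:
assert reconstruct(y0, y1) == (w * x + b) % P
```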
Parallel Speculative Sampling (PaSS) (Monea et al., 2023)
PaSS implements a two-stage procedure for LLM decoding at each generation step:
- Drafting Pass: The input sequence is extended by L learnable “look-ahead” tokens; a single forward pass produces the regular next-token logits and up to L speculative logits conditioned on the look-ahead positions. Drafted tokens are sampled from these distributions.
- Validation Pass: For each speculative token x̃_i, the acceptance ratio
  α_i = min(1, p(x̃_i | x, x̃_<i) / p̂(x̃_i | x))
  is computed, where p is the model's autoregressive distribution and p̂ the drafting distribution. Each draft token is accepted or rejected via sampling against α_i; if rejected, the next token is freshly sampled from the renormalized “residual” conditional max(0, p − p̂). In the best case, many drafts are accepted, reducing the per-token cost.
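The validation pass can be sketched as standard speculative rejection sampling. A minimal pure-Python sketch, with target and drafting distributions passed in as plain lists; `validate_drafts` and its parameters are illustrative, not the paper's API.

```python
import random


def validate_drafts(drafts, p_dists, q_dists, rng=random.random):
    """Accept/reject drafted tokens so committed tokens follow the target
    distribution p exactly (standard speculative-sampling validation).
    drafts[i]  : drafted token id at speculative position i
    p_dists[i] : target-model distribution at position i
    q_dists[i] : drafting distribution at position i
    """
    accepted = []
    for tok, p, q in zip(drafts, p_dists, q_dists):
        ratio = min(1.0, p[tok] / q[tok]) if q[tok] > 0 else 0.0
        if rng() < ratio:
            accepted.append(tok)  # draft accepted
            continue
        # Rejected: resample from the residual distribution max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        z = sum(residual)
        if z > 0:
            r, acc = rng() * z, 0.0
            for t, w in enumerate(residual):
                acc += w
                if r <= acc:
                    accepted.append(t)
                    break
        break  # stop at the first rejection
    return accepted
```

For example, with target p = [0.9, 0.1], draft distribution q = [0.1, 0.9], and drafted token 1, the acceptance ratio is 1/9, so the draft is usually rejected and token 0 is resampled from the residual.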
3. Security and Privacy Guarantees
Model and Data Confidentiality
- Split HE: Ensures data confidentiality by never exposing the client's input in the clear to the server; model confidentiality is user-configurable: the client receives only a subnetwork and cannot reconstruct or extract the remaining layers significantly beyond random chance, as validated via model extraction analysis on CIFAR-10 and MNIST.
- C²PI: Guarantees model privacy formally on crypto-protected segments; data confidentiality on the clear pass is limited by empirical leakage, which is quantified by applying trained inference data privacy attacks (IDPAs), specifically the distillation-based inverse-network attack (DINA). Leakage is upper-bounded by a user-specified SSIM threshold, and the protocol boundary is chosen accordingly (Zhang et al., 2023).
Lossless Decoding Guarantee
- PaSS: Guarantees output tokens are distributed identically to pure autoregressive decoding; speculative drafts are validated, with rejection sampling correcting any discrepancies to preserve the original sequence distribution (Monea et al., 2023).
4. Complexity, Communication, and Empirical Performance
Secure Inference Protocols
| Protocol | End-to-End Latency (a) | Communication (b) | Relative Speedup |
|---|---|---|---|
| Delphi | 12.9 s (CIFAR-10) | 1236 MB | 1× |
| Split HE | 4.76 s (CIFAR-10) | 4.3 MB | 2.7×–10× |
| C²PI (c) | 2351–4409 s (VGG16/19, LAN) | 71.9–5143 MB | 1.46×–2.9× |
(a) “Latency” refers to wall-clock inference time; (b) communication for a representative protocol run; (c) compared to the Delphi or Cheetah baseline as reported.
- Split HE reduces communication by up to 290× and is 2.5×–10× faster than prior work (Gazelle, Delphi, Concrete, Falcon/EVA).
- C²PI achieves up to 2.89× speedup over Delphi and 2.75× lower communication than Cheetah, at the cost of empirically bounded data leakage (Pereteanu et al., 2022, Zhang et al., 2023).
Parallel Speculative Sampling (PaSS)
- On a 7B-parameter LLaMA model (512 tokens, 32-token prompt): baseline autoregressive decoding takes 12.5 s; PaSS (L = 4) achieves 9.8 s (a 22% speed-up); up to 30% speed-up is observed for low-temperature sampling.
- The protocol requires only the L look-ahead embeddings as extra parameters (e.g., 16,384 parameters for L = 4).
- The predicted speed-up is upper-bounded by L + 1, attained when all drafts are accepted (Monea et al., 2023).
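Under the simplifying (hypothetical) assumption that each of the L drafts is accepted independently with probability β and commitment stops at the first rejection, the expected number of tokens committed per pass can be computed directly; β = 1 recovers the L + 1 upper bound. This is an illustrative model, not the paper's measured acceptance behavior.

```python
def expected_tokens_per_pass(beta, L):
    """Expected tokens committed per validation pass, assuming each of
    the L drafts is accepted independently with probability beta and
    validation stops at the first rejection (one token is always
    committed). Equals the geometric sum 1 + beta + ... + beta**L."""
    return sum(beta**k for k in range(L + 1))


# Upper bound: all drafts accepted (beta = 1) gives L + 1 tokens per pass.
print(expected_tokens_per_pass(1.0, 4))            # 5.0
print(round(expected_tokens_per_pass(0.8, 4), 3))  # 3.362
```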
5. Empirical Security Evaluation and Leakage Measurement
Security evaluations incorporate both theoretical and empirical leakage analyses:
- Model Extraction and Membership Inference: For Split HE, model-extraction attack fidelity drops to 38–59% as more layers are kept encrypted, much lower leakage than prior SplitNNs (e.g., 95% on MNIST with Hall et al.’s protocol) (Pereteanu et al., 2022).
- Inference Data Privacy Attacks (IDPAs): In C²PI, leakage is quantified using the DINA methodology, which attempts to reconstruct the private input from the feature map revealed at the boundary. The boundary is set such that the average SSIM leakage falls below a user-specified threshold.
6. Protocol Variants and Extensions
Extensions in Secure Inference
- Split Points Optimization: Both Split HE and C²PI optimize the choice of neural network split points according to latency, bandwidth, and privacy-leakage tradeoffs. Empirical privacy analyses using reconstruction fidelity or DINA leakage drive the optimal segmentation.
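Split-point selection can be sketched as a constrained search: minimize the cost of the crypto-protected prefix subject to an empirical leakage bound. The cost and leakage arrays below are toy numbers; in practice they would come from profiling and a DINA-style attack, and `choose_split` is an illustrative helper, not an API from either paper.

```python
def choose_split(crypto_cost, leakage, max_leakage):
    """Pick the boundary layer minimizing cumulative crypto cost subject
    to an empirical leakage bound.
    crypto_cost[i]: cost of running layers 0..i under MPC/HE
    leakage[i]    : estimated leakage (e.g., SSIM of a DINA-style
                    reconstruction) if layer i's output is revealed
    Returns the chosen layer index, or None if no layer is feasible."""
    feasible = [i for i, l in enumerate(leakage) if l <= max_leakage]
    if not feasible:
        return None
    return min(feasible, key=lambda i: crypto_cost[i])


# Toy numbers: leakage falls with depth while crypto cost rises, so the
# earliest feasible boundary is cheapest.
cost = [1, 3, 7, 15]
leak = [0.9, 0.5, 0.2, 0.1]
print(choose_split(cost, leak, max_leakage=0.3))  # 2
```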
- Cryptographic Primitives: Split HE is based on CKKS homomorphic encryption (SEAL via TenSEAL backend); C²PI exploits additive secret-sharing and precomputed Beaver triples to minimize online MPC cost for early layers (Pereteanu et al., 2022, Zhang et al., 2023).
Enhancements in Speculative Decoding
- Varied look-ahead embedding strategies (across layers or heads).
- Dynamic adjustment of the speculative window to maximize draft acceptance rates.
- Integration with key-value caching or alternative speculative methods (e.g., greedy sampling blends) (Monea et al., 2023).
7. Comparison to Related Techniques
| Protocol | Security Model | Efficiency Optimization | Application Domain |
|---|---|---|---|
| Split HE | Formal (HE) + split | Minimize HE, local compute | Private DNN inference |
| C²PI | Formal+Empirical (MPC) | Reduce MPC layers, empirical privacy | Private DNN inference |
| PaSS | Probabilistic, lossless | Amortized memory access | LLM sequence generation |
- Prior secure inference methods (e.g., Gazelle, Delphi, Concrete) encrypt and transmit all activations, which increases both latency and bandwidth requirements. Hall et al.’s protocol is limited to small networks and exhibits high model extraction leakage (Pereteanu et al., 2022, Zhang et al., 2023).
- PaSS is distinct in its focus on lossless, high-throughput decoding for LLMs and avoids the need for a second model or deep coupling with the tokenizer as in classical speculative sampling (Monea et al., 2023).
The dual-pass inference protocol thus encompasses both rigorous privacy-preserving inference for neural network models across trust boundaries and efficient high-throughput LLM decoding. These disparate research areas are unified by a common structural principle: a sequence of passes that partitions computation for favorable privacy, bandwidth, and/or efficiency trade-offs. Key implementations, including Split HE, C²PI, and PaSS, demonstrate practical utility and competitive security/efficiency trade-offs among state-of-the-art protocols (Pereteanu et al., 2022, Monea et al., 2023, Zhang et al., 2023).