Parallel Context-of-Experts Decoding
- Parallel Context-of-Experts Decoding is a framework that treats each retrieved document as an isolated expert to enable multi-document reasoning without retraining.
- It employs retrieval-aware contrastive decoding to calibrate expert predictions against a model prior, ensuring efficient evidence aggregation and robust performance.
- The approach synchronizes parallel token decoding and cache updates, dramatically reducing prefilling costs and latency compared to traditional long-prompt methods.
Parallel Context-of-Experts Decoding (Pced) is a training-free, decode-time framework for retrieval-augmented generation (RAG). It addresses the classical trade-off in RAG between enabling cross-document reasoning via long concatenated prompts (incurring prohibitive prefill costs) and fast single-document encoding (which prevents multi-document interaction). Pced reconceptualizes retrieved documents as isolated “experts” and achieves evidence aggregation through a novel retrieval-aware contrastive decoding rule at generation time. This approach enables cross-document reasoning capabilities while maintaining high computational efficiency and robustness in large or noisy document pools (Corallo et al., 13 Jan 2026).
1. Framework Overview
Pced operates without retraining or model modification. Its primary workflow consists of three phases: offline preparation, parallel encoding, and retrieval-aware contrastive decoding.
- Offline Preparation:
- A document datastore is constructed: for each document, its embedding is stored for retrieval, along with its precomputed key-value (KV) cache for later parallel decoding.
- At query time, the top-$N$ documents are retrieved and reranked. The retrieval score and reranker score are combined via their harmonic mean to yield a normalized relevance $r_k$ for each expert.
- In addition to contextual experts, an “amateur expert” (representing the model prior) is instantiated.
- Parallel Encoding:
- Each of the $N{+}1$ experts processes the same prompt (query prefix plus previously generated tokens) in one batched forward pass, computing logits over the vocabulary $V$.
- No cross-document token concatenation or attention is performed; experts remain isolated except for synchronization at the decoding step.
- Retrieval-Aware Contrastive Decoding:
- At each generation step, each contextual expert's predictions are contrastively calibrated against the model prior and modulated by the retrieval prior, yielding adjusted logits $\hat{h}_k(v)$.
- The next token is selected by maximizing over all experts’ adjusted logits for each vocabulary entry.
- The generated token is appended to all experts’ histories, synchronizing future context and enabling “evidence stitching” across otherwise isolated caches.
This process enables efficient multi-document reasoning and avoids the quadratic scaling and latency bottlenecks characteristic of traditional long-prompt RAG approaches.
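The score-fusion step of offline preparation can be sketched as follows. Note that the source only states that the retrieval and reranker scores are combined via a harmonic mean into a normalized relevance; the min-max normalization used here is an assumption added to make the sketch self-contained.

```python
import numpy as np

def fuse_relevance(retrieval_scores, reranker_scores, eps=1e-9):
    """Combine retrieval and reranker scores into one relevance r_k per expert.

    Both score lists are min-max normalized into (0, 1] (an assumption; the
    source only specifies the harmonic-mean combination), then fused so that
    an expert must score well under BOTH signals to receive high relevance.
    """
    def normalize(s):
        s = np.asarray(s, dtype=float)
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo + eps) + eps  # keep strictly positive

    a, b = normalize(retrieval_scores), normalize(reranker_scores)
    return 2 * a * b / (a + b)  # harmonic mean of the two signals

# Three candidate documents: dense-retriever scores vs. reranker scores.
r = fuse_relevance([0.9, 0.2, 0.5], [12.0, 3.0, 9.0])
```

The harmonic mean is a conservative fusion: a document ranked highly by only one of the two signals still receives low relevance, which is what the later retrieval-gating term $\gamma \log r_k$ relies on.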
2. Mathematical Formulation
The fundamental innovation in Pced is its retrieval-aware contrastive decoding rule. Let $V$ denote the output vocabulary. At each decoding step $t$ and for each expert $k \in \{0, 1, \dots, N\}$ (where $k = 0$ is the amateur/prior expert):
- $s_0(v)$: logits from the model prior (no context) for token $v \in V$.
- $s_k(v)$: logits from contextual expert $k \in \{1, \dots, N\}$.
- $r_k \in (0, 1]$: fused, normalized relevance of expert $k$'s retrieved document.
- $\beta_0$: contrastive strength (set dynamically on the first token and fixed thereafter, following AdaCAD).
- $\gamma$: retrieval-gating weight (empirically set to $2.5$).
For each $k \in \{1, \dots, N\}$ and $v \in V$,
$$\hat{h}_k(v) = (1 + \beta_0)\, s_k(v) \;-\; \beta_0\, s_0(v) \;+\; \gamma \log r_k.$$
The token emitted at step $t$ is then
$$y_t = v^*, \qquad (k^*, v^*) = \arg\max_{k \in \{1,\dots,N\},\; v \in V} \hat{h}_k(v).$$
This mechanism contrasts each expert's evidence against the model's uninformed prior while favoring high-relevance documents.
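The decoding rule can be exercised numerically. This toy sketch assumes raw logit arrays and fixed $\beta_0$ and $\gamma$; in the full system, the logits come from batched LLM forward passes.

```python
import numpy as np

def pced_step(s0, S, r, beta0, gamma=2.5):
    """One Pced decoding step.

    s0 : (V,)  prior logits (amateur expert, no context)
    S  : (N, V) logits from the N contextual experts
    r  : (N,)  fused relevance scores in (0, 1]
    Returns the emitted token id and the winning expert index.
    """
    # Contrastive calibration plus retrieval gating, per expert and token.
    H = (1 + beta0) * S - beta0 * s0[None, :] + gamma * np.log(r)[:, None]
    k_star, v_star = np.unravel_index(np.argmax(H), H.shape)
    return int(v_star), int(k_star)

# Toy example: expert 1 strongly prefers token 2 and is highly relevant,
# so it should win the joint argmax over (expert, token).
s0 = np.zeros(4)
S = np.array([[0.1, 0.2, 0.1, 0.0],
              [0.0, 0.1, 2.0, 0.0]])
r = np.array([0.3, 0.9])
token, expert = pced_step(s0, S, r, beta0=0.5)  # token 2 from expert 1
```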
3. Algorithmic Description
The Pced decoding procedure is summarized in the following pseudocode:
```
Inputs:
    query q
    precomputed caches {K₁,…,K_N} and relevance scores {r₁,…,r_N}
    LLM LM with access to logits
    contrastive weight schedule for β₀
    retrieval weight γ

Initialize:
    caches ← {K₀ = ∅} ∪ {K₁,…,K_N}
    prefix_tokens ← tokenize(q)
    β₀ ← undefined
    generation_history ← prefix_tokens

For t = 1 to T_max:
    # 1. Batched forward pass across experts
    For each expert k in 0…N:
        sₖ ← LM.forward(caches[k], generation_history)

    # 2. Set β₀ on the first step (t = 1) if not already set
    If t == 1:
        β₀ ← f_JS-divergence(s₀, s₁…s_N)   # AdaCAD procedure

    # 3. Contrastive + retrieval calibration
    For k = 1…N:
        For each vocabulary token v in V:
            ĥₖ(v) ← (1+β₀)·sₖ(v) − β₀·s₀(v) + γ·log(rₖ)

    # 4. Expert selection and token emission
    (k*, v*) ← argmax over k = 1…N and v ∈ V of ĥₖ(v)
    y_t ← v*

    # 5. Update history and KV caches in parallel
    Append y_t to generation_history
    For each expert k in 0…N:
        caches[k] ← LM.update_cache(caches[k], y_t)

    If y_t is end-of-sequence: break

Output: detokenize(generation_history without the query)
```
Critical points include the synchronized update of all experts’ caches and shared generation history, and amortization of computational cost via batched forward passes (Corallo et al., 13 Jan 2026).
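Step 2 of the pseudocode sets the contrastive strength from how much the context-conditioned distributions diverge from the prior. The source only states that $\beta_0$ follows AdaCAD and is computed from a Jensen-Shannon divergence on the first token; the specific aggregation over experts used below (a mean of per-expert JSDs against the prior) is an assumption for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions (natural log)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adaptive_beta(s0, expert_logits):
    """AdaCAD-style beta0 from first-token logits.

    Averages the JSD between the prior's distribution and each expert's
    (the per-expert aggregation is an assumption; the source only states
    that beta0 is set via JS divergence following AdaCAD).
    """
    p0 = softmax(s0)
    jsds = [js_divergence(p0, softmax(s)) for s in expert_logits]
    return float(np.mean(jsds))
```

Intuitively, when the retrieved contexts barely move the model's first-token distribution, $\beta_0$ stays near zero and the rule degenerates toward ordinary decoding; strongly divergent experts get a stronger contrastive correction.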
4. Computational and Latency Properties
Pced achieves substantial improvements in computational efficiency and latency relative to conventional and alternative RAG methods. Characteristics of the principal approaches are summarized below:
| Scheme | Prefill Cost / TTFT | Cross-Document Interaction | Per-Token Cost | Memory |
|---|---|---|---|---|
| Long-Prompt Attention | Quadratic; scales poorly with $N$ | Native (full context) | N/A | N/A |
| Separate-KV + Merge | Moderate (recompute) | Partial (by merging) | Requires selective recomputation | N/A |
| Pced | $O(L^2)$ per cache (amortized offline) | Via decoding, not attention | Linear in $N$ + fused logits | $O(NLd)$ for caches |

where $N$ is the number of documents, $L$ the document length, and $d$ the hidden-state dimension.
- Time-to-First-Token (TTFT): With a single batched token-forward pass for all experts, Pced avoids quadratic prefill scaling. Empirical measurements show TTFT reductions of more than two orders of magnitude ($0.14$ s vs. $25.5$ s) in the largest reported document-pool setting.
- End-to-End Latency: For 65k-token contexts and 512-token generation, Pced delivers a substantial end-to-end speedup over long-prompt decoding.
- Memory: Pced stores FP16 KV caches that grow linearly with corpus size and hidden dimension $d$ (e.g., $11$ GB for $1,200$ passages with Llama-3.1-8B).
- Suitability: It is naturally suited for read-heavy, static corpora due to amortized offline preparation.
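The linear memory footprint can be sanity-checked with a back-of-the-envelope calculator. The Llama-3.1-8B shape constants below (32 layers, grouped-query attention with 8 KV heads of dimension 128) are public model-card values; the ~75-token average passage length is an assumption chosen to illustrate how a figure on the order of the reported $11$ GB could arise.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """FP16 KV-cache size in bytes for one sequence of n_tokens tokens.

    The factor of 2 accounts for storing both keys and values at every
    layer. Defaults are Llama-3.1-8B shapes (GQA: 8 KV heads, dim 128).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# 1,200 passages at an assumed ~75 tokens each:
total_gb = kv_cache_bytes(75) * 1200 / 1e9  # ≈ 11.8 GB
```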
5. Benchmark Results and Empirical Findings
Pced exhibits strong empirical performance in multi-document reasoning and RAG contexts. Representative results (APE: Attentive Prompt Encoding, Pced Dense variant):
| Benchmark | Dataset/Task | APE | Pced-Dense |
|---|---|---|---|
| LOFT RAG | HotpotQA | 27 | 66 |
| LOFT RAG | MuSiQue | 11 | 34 |
| LOFT RAG | NQ | 38 | 81 |
| LOFT ICL | Web | 58.9 | 62.2 |
| LOFT ICL | Date | 40.0 | 57.8 |
| LongBench (Qwen3-8B) | Multi-Doc QA (Hotpot) | 56.3 | 62.6 |
| LongBench | Few-Shot (TriviaQA) | 84.0 | 88.8 |
| LongBench | Code (RepoB-P) | 51.1 | 60.1 |
Additional ablation experiments demonstrate:
- Removing the contrastive term (setting $\beta_0 = 0$) or the retrieval prior (setting $\gamma = 0$) substantially reduces performance (e.g., NQ falls from 85 to 52 without retrieval gating).
- Expert aggregation via max-selection is crucial for multi-hop reasoning; max selection yields 64 on HotpotQA versus 56 for a weighted mixture approach.
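The max-versus-mixture distinction in the ablation can be illustrated on toy adjusted logits: max-selection lets a single confident expert decide the token, whereas averaging can wash out a minority-but-correct signal. The numbers here are purely illustrative and not drawn from the paper.

```python
import numpy as np

# Three experts over a 3-token vocabulary; only expert 2 has seen the
# document that resolves the second reasoning hop (token 2).
H = np.array([[3.0, 1.0, 0.0],
              [3.0, 1.0, 0.0],
              [0.0, 1.0, 4.0]])

# Pced's rule: joint argmax over (expert, token) — the informed expert wins.
max_token = int(np.unravel_index(np.argmax(H), H.shape)[1])

# Alternative: uniform weighted mixture — the majority's signal dominates.
mixture_token = int(np.argmax(H.mean(axis=0)))
```

Here max-selection emits token 2 (the evidence-bearing expert's choice) while the mixture emits token 0, mirroring the reported multi-hop advantage of max-selection.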
6. Limitations and Prospective Extensions
Pced’s design introduces unique constraints and opportunities for further development:
- Dependence on Model Logits: Full access to expert per-token logits is required, precluding use with closed APIs that yield only sampled tokens.
- Retrieval Quality Sensitivity: If relevant documents are omitted or erroneously ranked, Pced cannot recover evidence absent from the expert pool.
- Storage Footprint: Storing FP16 KV caches becomes linearly costly with corpus and hidden-state size.
Potential avenues for enhancement include:
- End-to-end training of expert selection within the LLM to reduce reliance on external retrievers/rerankers.
- Dynamic or instance-wise optimization of the gating weight $\gamma$ and other aggregation rules.
- Hybrid methods combining limited cross-attention with decode-time fusion for richer inter-expert dependencies.
A plausible implication is that Pced provides a competitive trade-off, recovering much of the cross-document reasoning power of long-prompt approaches while achieving substantial speed and efficiency improvements, particularly in settings with large candidate pools or noisy retrieval (Corallo et al., 13 Jan 2026).