
Parallel Context-of-Experts Decoding

Updated 14 January 2026
  • Parallel Context-of-Experts Decoding is a framework that treats each retrieved document as an isolated expert to enable multi-document reasoning without retraining.
  • It employs retrieval-aware contrastive decoding to calibrate expert predictions against a model prior, ensuring efficient evidence aggregation and robust performance.
  • The approach synchronizes parallel token decoding and cache updates, dramatically reducing prefilling costs and latency compared to traditional long-prompt methods.

Parallel Context-of-Experts Decoding (Pced) is a training-free, decode-time framework for retrieval-augmented generation (RAG). It addresses the classical trade-off in RAG between enabling cross-document reasoning via long concatenated prompts (incurring prohibitive prefill costs) and fast single-document encoding (which prevents multi-document interaction). Pced reconceptualizes retrieved documents as isolated “experts” and achieves evidence aggregation through a novel retrieval-aware contrastive decoding rule at generation time. This approach enables cross-document reasoning capabilities while maintaining high computational efficiency and robustness in large or noisy document pools (Corallo et al., 13 Jan 2026).

1. Framework Overview

Pced operates without retraining or model modification. Its primary workflow consists of three phases: offline preparation, parallel encoding, and retrieval-aware contrastive decoding.

  1. Offline Preparation:
    • A document datastore $\mathcal{D}$ is constructed. For each document $d_i$, its embedding $e_i$ is stored for retrieval, along with its key-value (KV) cache $K_i$ for later parallel decoding.
    • At query time, the top-$N$ documents are retrieved and reranked. The retrieval score and reranker score are combined via their harmonic mean to yield a normalized relevance $r_k \in (0,1)$ for each expert.
    • In addition to the $N$ contextual experts, an “amateur expert” $K_0 = \emptyset$ (representing the model prior) is instantiated.
  2. Parallel Encoding:
    • Each of the $N+1$ experts processes the same prompt (query prefix $q$ and previously generated tokens) in one batched forward pass, computing logits $s_k \in \mathbb{R}^{|V|}$ over vocabulary $V$.
    • No cross-document token concatenation or attention is performed; experts remain isolated except for synchronization at the decoding step.
  3. Retrieval-Aware Contrastive Decoding:
    • At each generation step, the contextual experts’ predictions are contrastively calibrated against the model prior and modulated by the retrieval prior, yielding adjusted logits $\hat s_k$.
    • The next token $y_t$ is selected by maximizing over all experts’ adjusted logits for each vocabulary entry.
    • The generated token is appended to all experts’ histories, synchronizing future context and enabling “evidence stitching” across otherwise isolated caches.

This process enables efficient multi-document reasoning and avoids the quadratic scaling and latency bottlenecks characteristic of traditional long-prompt RAG approaches.
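The harmonic-mean fusion in the offline-preparation step can be sketched as follows. The harmonic mean is the paper's stated fusion rule; the min-max normalization into an open $(0,1)$ interval is an assumption made here so that $\log r_k$ stays finite.

```python
# Sketch of the offline relevance fusion (step 1). The harmonic mean is
# from the text; min-max normalization with an epsilon margin is an
# assumption to keep every r_k strictly inside (0, 1).

def normalize(scores):
    """Min-max normalize scores into (0, 1), avoiding the exact endpoints."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    eps = 1e-3
    return [eps + (1 - 2 * eps) * (s - lo) / span for s in scores]

def fuse_relevance(retrieval_scores, reranker_scores):
    """Combine per-document scores with a harmonic mean -> r_k in (0, 1)."""
    rs = normalize(retrieval_scores)
    qs = normalize(reranker_scores)
    return [2 * r * q / (r + q) for r, q in zip(rs, qs)]

r = fuse_relevance([0.9, 0.4, 0.1], [0.8, 0.7, 0.2])
```

The harmonic mean penalizes disagreement: a document must score well under both the retriever and the reranker to receive a high $r_k$.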

2. Mathematical Formulation

The fundamental innovation in Pced is its retrieval-aware contrastive decoding rule. Let $V$ denote the output vocabulary. At each decoding step $t$ and for each expert $k \in \{0, 1, \dots, N\}$ (where $k = 0$ is the amateur/prior expert):

  • $s_0(v)$: logits from the model prior (no context) for token $v$.
  • $s_k(v)$: logits from each contextual expert $k = 1, \dots, N$.
  • $r_k \in (0,1)$: fused, normalized relevance for expert $k$’s retrieved document.
  • $\beta_0 \geq 0$: contrastive strength (set dynamically on the first token, fixed thereafter; follows AdaCAD).
  • $\gamma > 0$: retrieval-gating weight (empirically set to $2.5$).

For each $k = 1, \dots, N$,

$\hat s_k(v) = (1+\beta_0)\, s_k(v) - \beta_0\, s_0(v) + \gamma \log r_k$

The token to emit at step $t$ is then

$y_t = \arg\max_{v \in V} \max_{k \in \{1,\dots,N\}} \hat s_k(v)$

This mechanism contrasts each expert’s evidence against the model’s uninformed prior while favoring tokens supported by high-relevance documents.
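A single decoding step under this rule can be written out directly. The logits, relevance scores, and $\beta_0$ below are illustrative toy values, not the paper's; only the formula itself is from the text.

```python
# One Pced decoding step: contrast each expert against the prior, add the
# retrieval bonus, then take the max over experts and vocabulary entries.
import math

def pced_step(s0, expert_logits, relevance, beta0=1.0, gamma=2.5):
    """Return (token_index, expert_index) maximizing
    s_hat_k(v) = (1 + beta0)*s_k(v) - beta0*s0(v) + gamma*log(r_k)."""
    best_score, best_v, best_k = -math.inf, None, None
    for k, (s_k, r_k) in enumerate(zip(expert_logits, relevance), start=1):
        bonus = gamma * math.log(r_k)  # retrieval-gating term
        for v, logit in enumerate(s_k):
            s_hat = (1 + beta0) * logit - beta0 * s0[v] + bonus
            if s_hat > best_score:
                best_score, best_v, best_k = s_hat, v, k
    return best_v, best_k

# Two experts over a toy 3-token vocabulary; expert 2 is more relevant.
s0 = [0.5, 0.5, 0.5]                          # uninformed prior logits
experts = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.1]]
v, k = pced_step(s0, experts, relevance=[0.3, 0.9])
```

Note how the $\gamma \log r_k$ term acts as a per-expert additive handicap: a low-relevance expert must be much more confident than a high-relevance one to win the max.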

3. Algorithmic Description

The Pced decoding procedure is summarized in the following pseudocode:

Inputs:
  query q
  precomputed caches {K₁, …, K_N} and relevance scores {r₁, …, r_N}
  LLM LM with access to logits
  contrastive weight schedule for β₀
  retrieval weight γ

Initialize:
  caches ← {K₀ = ∅} ∪ {K₁, …, K_N}
  prefix_tokens ← tokenize(q)
  β₀ ← undefined
  generation_history ← prefix_tokens

For t = 1 to T_max:
  # 1. Batched forward pass across experts
  For each expert k in 0 … N:
    sₖ ← LM.forward(caches[k], generation_history)
  # 2. Set β₀ on first step (t = 1) if not already set
  If t == 1:
    β₀ ← f_JS-divergence(s₀, s₁ … s_N)   # AdaCAD procedure
  # 3. Contrastive + retrieval calibration
  For k = 1 … N:
    For each vocabulary token v in V:
      ŝₖ(v) ← (1 + β₀)·sₖ(v) − β₀·s₀(v) + γ·log(rₖ)
  # 4. Expert selection and token emission
  (k*, v*) ← argmax over k = 1 … N and v ∈ V of ŝₖ(v)
  y_t ← v*
  # 5. Update history and KV caches in parallel
  Append y_t to generation_history
  For each expert k in 0 … N:
    caches[k] ← LM.update_cache(caches[k], y_t)
  If y_t is end-of-sequence: break

Output: detokenize(generation_history without the query)

Critical points include the synchronized update of all experts’ caches and shared generation history, and amortization of computational cost via batched forward passes (Corallo et al., 13 Jan 2026).
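The control flow above can be made concrete with a toy, runnable rendering. "LM" here is a stub whose logits simply count token occurrences in each expert's context, so this illustrates only the synchronization structure (batched scoring, shared history standing in for the KV-cache updates), not real model behavior; all values are invented.

```python
# Toy Pced decode loop with a stub language model. Per-expert contexts
# play the role of KV caches; appending y_t to the shared history is the
# synchronized cache update.
import math

def toy_logits(context, vocab_size):
    """Stub LM: log of (1 + occurrence count) per vocabulary token."""
    return [math.log(1.0 + context.count(v)) for v in range(vocab_size)]

def pced_decode(query, caches, relevance, vocab_size, eos,
                beta0=1.0, gamma=2.5, max_steps=8):
    history = list(query)
    for _ in range(max_steps):
        # 1. batched forward pass: caches[0] is the context-free prior
        logits = [toy_logits(c + history, vocab_size) for c in caches]
        s0 = logits[0]
        # 3-4. contrastive + retrieval calibration, then max-selection
        best, y = -math.inf, None
        for k in range(1, len(caches)):
            bonus = gamma * math.log(relevance[k - 1])
            for v in range(vocab_size):
                s_hat = (1 + beta0) * logits[k][v] - beta0 * s0[v] + bonus
                if s_hat > best:
                    best, y = s_hat, v
        # 5. synchronized update of the shared generation history
        history.append(y)
        if y == eos:
            break
    return history[len(query):]

out = pced_decode(query=[0], caches=[[], [1, 1, 2], [3, 3, 3]],
                  relevance=[0.9, 0.5], vocab_size=5, eos=4)
```

The β₀ schedule (step 2 of the pseudocode) is fixed here for brevity; in the full algorithm it is set from the Jensen-Shannon divergence among the experts' first-step distributions.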

4. Computational and Latency Properties

Pced achieves substantial improvements in computational efficiency and latency relative to conventional and alternative RAG methods. Characteristics of the principal approaches are summarized below:

| Scheme | Prefill Cost / TTFT | Interaction | Per-Token Cost | Memory |
|---|---|---|---|---|
| Long-Prompt Attention | $O((N \cdot L)^2)$ | Native (full context) | Quadratic; scales poorly with $N$ | N/A |
| Separate-KV + Merge | Moderate (recompute) | Partial (by merging) | Requires selective recomputation | N/A |
| Pced | $O(NLd)$ per cache | Via decoding, not attention | Linear in $N, L$ + fused logits | $O(NLd)$ for caches |

Here $N$ is the number of documents, $L$ the document length, and $d$ the hidden-state dimension.

  • Time-to-First-Token (TTFT): Pced, with a single batched token-forward for all experts, avoids quadratic prefill scaling. Empirical measurements show up to $180\times$ faster TTFT ($0.14$ s vs. $25.5$ s) at $K = 90$ documents.
  • End-to-End Latency: For 65k-token contexts and 512-token generation, Pced is $\sim 1.7\times$ faster.
  • Memory: Pced stores FP16 KV caches that grow linearly with corpus size and $d$ (e.g., $11$ GB for $1{,}200$ passages with Llama-3.1-8B).
  • Suitability: It is naturally suited for read-heavy, static corpora due to amortized offline preparation.
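The linear memory claim can be checked with back-of-the-envelope arithmetic. The $O(NLd)$ growth is from the text; the Llama-3.1-8B shape constants below (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and the 100-token passage length are assumptions for illustration.

```python
# FP16 KV-cache footprint estimate. Shape constants are assumed
# Llama-3.1-8B values; passage length is an illustrative choice.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_FP16 = 2

def kv_bytes(num_docs, tokens_per_doc):
    """FP16 storage for K and V across all layers, per cached corpus."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # K and V
    return num_docs * tokens_per_doc * per_token

per_token_kib = kv_bytes(1, 1) / 1024   # 128 KiB per cached token
corpus_gb = kv_bytes(1200, 100) / 1e9   # ~15.7 GB at 100 tokens/passage
```

Under these assumed shapes, each cached token costs 128 KiB, so the reported $11$ GB figure is consistent with passages averaging somewhat under 100 tokens each.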

5. Benchmark Results and Empirical Findings

Pced exhibits strong empirical performance in multi-document reasoning and RAG contexts. Representative results (APE: Attentive Prompt Encoding, Pced Dense variant):

| Benchmark | Dataset/Task | APE | Pced-Dense |
|---|---|---|---|
| LOFT RAG | HotpotQA | 27 | 66 |
| LOFT RAG | MuSiQue | 11 | 34 |
| LOFT RAG | NQ | 38 | 81 |
| LOFT ICL | Web | 58.9 | 62.2 |
| LOFT ICL | Date | 40.0 | 57.8 |
| LongBench (Qwen3-8B) | Multi-Doc QA (Hotpot) | 56.3 | 62.6 |
| LongBench | Few-Shot (TriviaQA) | 84.0 | 88.8 |
| LongBench | Code (RepoB-P) | 51.1 | 60.1 |

Additional ablation experiments demonstrate:

  • Removing the contrastive term ($\beta_0 = 0$) or the retrieval prior ($\gamma = 0$) substantially reduces performance (e.g., NQ falls from 85 to 52 without retrieval gating).
  • Expert aggregation via max-selection is crucial for multi-hop reasoning: it yields 64 on HotpotQA versus 56 for a weighted-mixture approach.
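The aggregation ablation can be illustrated with a toy comparison of the two rules. The hard max-selection is Pced's rule from Section 2; the relevance-weighted mixture baseline and all numeric values are constructed here for illustration.

```python
# Hard max-selection over experts vs. a weighted mixture of expert
# distributions. A single confident expert can win under max-selection
# but be diluted by a mixture -- the intuition behind the ablation.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def pick_max(expert_logits):
    """Hard selection: token whose best single-expert logit is highest."""
    vocab = range(len(expert_logits[0]))
    return max(vocab, key=lambda v: max(s[v] for s in expert_logits))

def pick_mixture(expert_logits, weights):
    """Soft mixture: token with highest weighted average probability."""
    probs = [softmax(s) for s in expert_logits]
    vocab = range(len(expert_logits[0]))
    return max(vocab, key=lambda v: sum(w * p[v] for w, p in zip(weights, probs)))

# Two experts mildly prefer token 0; one expert is very sure about token 2.
experts = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
hard = pick_max(experts)                       # follows the decisive expert
soft = pick_mixture(experts, [1/3, 1/3, 1/3])  # averaged away by the majority
```

In multi-hop settings the decisive evidence for a token often lives in a single document, which is exactly the case where averaging over experts washes it out.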

6. Limitations and Prospective Extensions

Pced’s design introduces unique constraints and opportunities for further development:

  • Dependence on Model Logits: Full access to expert per-token logits is required, precluding use with closed APIs that yield only sampled tokens.
  • Retrieval Quality Sensitivity: If relevant documents are omitted or erroneously ranked, Pced cannot recover evidence absent from the expert pool.
  • Storage Footprint: Storing FP16 KV caches becomes linearly costly with corpus and hidden-state size.

Potential avenues for enhancement include:

  • End-to-end training of expert selection within the LLM to reduce reliance on external retrievers/rerankers.
  • Dynamic or instance-wise optimization of γ\gamma and other aggregation rules.
  • Hybrid methods combining limited cross-attention with decode-time fusion for richer inter-expert dependencies.

A plausible implication is that Pced provides a competitive trade-off, recovering much of the cross-document reasoning power of long-prompt approaches while achieving substantial speed and efficiency improvements, particularly in settings with large candidate pools or noisy retrieval (Corallo et al., 13 Jan 2026).
