PromptReps: Representation via Repeated Prompts
- PromptReps is a prompt-driven technique that uses a single-word output to extract both dense and sparse embeddings for zero-shot document retrieval.
- It employs a hybrid retrieval algorithm combining ANN-based dense indices with inverted sparse indices to achieve superior performance on benchmarks like BEIR.
- Prompt repetition in non-reasoning LLMs doubles the input prompt to mitigate recency bias, leading to significant accuracy gains across diverse tasks.
PromptReps refers to a family of prompt-driven techniques leveraging LLM prompting and prompt repetition for representation learning and performance augmentation in LLM-based systems, most notably for zero-shot dense and sparse retrieval as well as accuracy improvement in non-reasoning LLMs. Two principal PromptReps lines have emerged: (1) prompt-based representation extraction for hybrid retrieval using large LLMs, and (2) prompt repetition for accuracy gains in non-reasoning LLM configurations. Below, these approaches are systematically dissected with emphasis on formal methodology, empirical results, and broader implications within the prompt-based system design space.
1. Prompt-Based Representation Extraction for Zero-Shot Retrieval
Prompt Design and Motivation
In the zero-shot document retrieval context, PromptReps utilizes carefully crafted prompts to elicit meaningful representations from LLMs without any training or fine-tuning. The canonical prompt template is:
- For passages:
<System> You are an AI assistant that can understand human language. <User> Passage: “[text]”. Use one most important word to represent the passage in retrieval task. Make sure your word is in lowercase. <Assistant> The word is: “___”
- For queries: Replace “Passage” with “Query”.
By constraining the model to produce a single word, PromptReps isolates the hidden state of that output token and the associated next-token logits, enabling extraction of both a dense embedding (continuous vector) and a sparse (bag-of-words) vocabulary-based vector. This design choice is motivated by two goals: (1) to maximize the semantic specificity of the representation, and (2) to obtain token-level activations for sparsification by forcing the LLM to make a hard lexical choice (Zhuang et al., 2024).
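As a concrete illustration, the template can be assembled with a small helper. The angle-bracket role markers mirror the schematic above and would in practice be mapped onto each provider's actual chat format; this is a sketch, not the authors' code:

```python
def build_promptreps_input(text: str, kind: str = "passage") -> str:
    """Build the PromptReps one-word elicitation prompt for a passage or query.

    The <System>/<User>/<Assistant> markers follow the schematic template;
    real deployments would use the chat roles of the chosen LLM API.
    """
    label = "Passage" if kind == "passage" else "Query"
    return (
        "<System> You are an AI assistant that can understand human language. "
        f'<User> {label}: "{text}". '
        f"Use one most important word to represent the {label.lower()} in retrieval task. "
        "Make sure your word is in lowercase. "
        '<Assistant> The word is: "'
    )
```

Leaving the assistant turn open at `The word is: "` is what forces the model's next token to be the single representative word whose hidden state and logits are then harvested.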
Representation Extraction Mechanics
For any input text t:
- Dense embedding: h_dense = h_last, where h_last denotes the hidden state of the last input token prior to the model's one-word prediction.
- Sparse embedding: a four-step transformation of the next-token logits l:
  - Zero out logits for tokens not present in the input and apply stopword filtering.
  - ReLU activation: s = max(0, l).
  - Log-saturation for dynamic range compression: s = log(1 + s).
  - Top-k pruning and quantization to form the final sparse vector v_sparse.
Both representations are generated with a single LLM forward pass per query/document.
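The four sparse-embedding steps can be sketched on toy logits. The dict-based representation and the `top_k` default are illustrative assumptions for readability, not values taken from the paper:

```python
import math

def sparse_from_logits(logits, input_token_ids, stopword_ids, top_k=128):
    """Sketch of the four-step sparse-vector transformation.

    logits: mapping of token id -> raw next-token logit.
    """
    # Step 1: keep only tokens present in the input, minus stopwords.
    kept = {t: v for t, v in logits.items()
            if t in input_token_ids and t not in stopword_ids}
    # Step 2: ReLU -- negative logits carry no sparse weight.
    relu = {t: max(0.0, v) for t, v in kept.items()}
    # Step 3: log-saturation compresses the dynamic range.
    sat = {t: math.log(1.0 + v) for t, v in relu.items()}
    # Step 4: top-k pruning; a real index would also quantize the weights.
    top = sorted(sat.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)
```

On real models the logits come from the same forward pass that yields h_last, which is what keeps the whole extraction to a single pass per text.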
Hybrid Retrieval Algorithm
Dense and sparse indices are constructed:
Dense: ANN index (e.g., Faiss) over the dense embeddings h_dense.
Sparse: Inverted index (e.g., Pyserini) over the sparse vectors v_sparse.
Final query–document score:

score(q, d) = α · s_dense(q, d) + (1 − α) · s_sparse(q, d),

with a fixed weight α, where s_dense and s_sparse denote min–max normalized similarities.
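The fusion step can be sketched as min–max normalization followed by fixed-weight interpolation; the α = 0.5 default here is an illustrative choice, not the paper's tuned value:

```python
def minmax_normalize(scores: dict) -> dict:
    """Min-max normalize a doc-id -> similarity mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against constant score lists
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_score(dense: dict, sparse: dict, alpha: float = 0.5) -> dict:
    """Interpolate normalized dense and sparse similarities per document."""
    dn, sn = minmax_normalize(dense), minmax_normalize(sparse)
    docs = set(dn) | set(sn)
    # A document missing from one index contributes 0 from that component.
    return {d: alpha * dn.get(d, 0.0) + (1 - alpha) * sn.get(d, 0.0)
            for d in docs}
```

Normalizing before interpolation matters because raw inner-product and impact-index scores live on incomparable scales.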
Empirical Results
On the BEIR benchmark (13 tasks), PromptReps hybrid with Llama3-70B-Instruct achieves an nDCG@10 of 50.07, outperforming both BM25 (43.70) and E5-PTlarge (44.61), which relies on contrastive pretraining (Zhuang et al., 2024).
The hybrid approach's effectiveness grows with LLM size: sparse-only outperforms dense-only at small model scales, while the hybrid consistently sets the best mark at every scale.
No additional finetuning or large pretraining corpora are required; efficient, parallelizable indexing is possible.
Significance
PromptReps demonstrates that LLMs can serve as universal retrievers through model-internal prompting alone, matching or outperforming state-of-the-art unsupervised methods with substantially lower computational and data requirements. This suggests new lines of research in retrieval-augmented LLM systems and first-stage ranking paradigms.
2. Prompt Repetition in Non-Reasoning LLMs
Definition and Scope
Prompt repetition refers to the literal duplication of a user query P (i.e., submitting P followed by P) as the model's input, in contrast to baseline single submission of P, in regimes where explicit step-by-step reasoning ("chain of thought") is absent (Leviathan et al., 17 Dec 2025). This technique is evaluated on leading LLMs (Gemini 2.0 Flash/Flash Lite, GPT-4o/mini, Claude 3 Haiku/Sonnet, Deepseek V3) using official APIs.
Prompt Repetition Protocol
- Baseline input: the prompt P alone.
- PromptReps (default): the prompt duplicated, i.e., P followed by P.
- Variations tested: verbose repetition and triple concatenation.
No modification to the output length, generation policy, or model parameters is applied.
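The protocol reduces to trivial input construction. The separator string below is an assumption for readability; the paper's exact concatenation format is not reproduced here:

```python
def repeat_prompt(prompt: str, copies: int = 2, sep: str = "\n\n") -> str:
    """Return the prompt concatenated with itself `copies` times.

    copies=2 is the default PromptReps setting; copies=3 corresponds to
    the triple-concatenation variant.
    """
    return sep.join([prompt] * copies)
```

Because only the input changes, the technique slots in front of any chat API call without touching decoding parameters or output handling.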
Experimental Evaluation
Tasks: ARC Challenge, OpenBookQA, MMLU-Pro, GSM8K, MATH, as well as synthetic long-context tasks (NameIndex, MiddleMatch).
Metric: Raw accuracy; significance assessed via McNemar's test.
Results: PromptReps (P) wins 47 of 70 model × task comparisons and never loses at the reported significance level. For instance:
- NameIndex (Gemini Flash-Lite): baseline 21.33%, P 97.33%.
- MiddleMatch (GPT-4o): baseline 45.0%, P 92.5%.
- ARC options-first: Δ accuracy +6.2% (Gemini), +7.5% (GPT-4o), +7.3% (Claude).
| Task/Model | Baseline | PromptReps (P) | Δ Accuracy (%) |
|---|---|---|---|
| NameIndex (Gemini) | 21.33 | 97.33 | +76.0 |
| MiddleMatch (GPT-4o) | 45.0 | 92.5 | +47.5 |
| ARC, opt-first (GPT-4o) | 62.5 | 70.0 | +7.5 |
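Per-comparison significance on paired test items reduces to the McNemar statistic over discordant pairs; a minimal sketch (the counts in the usage below are illustrative, not from the paper):

```python
def mcnemar_chi2(b: int, c: int) -> float:
    """McNemar chi-squared statistic on paired outcomes.

    b: items the baseline answers correctly but repetition (P) misses.
    c: items the baseline misses but repetition (P) answers correctly.
    Concordant items (both right or both wrong) cancel and are ignored.
    """
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)
```

For example, b = 10 and c = 30 discordant items give a statistic of 10.0, well above the usual chi-squared threshold for one degree of freedom.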
Resource and Latency Analysis
- Output length and average API latency remain invariant (<0.1% change) compared to baseline.
- Minor latency effects observed for extremely long prompts (Anthropic models).
- Remains cost-neutral under usual context window conditions.
Mechanistic Insights
Causal LMs only attend leftward; repetition allows cross-boundary attention so that each token in the initial prompt can "see" all others via the repeated copy, effectively densifying the attention graph. This mitigates last-token/recency bias—crucial for tasks demanding accurate position or context memory (e.g., NameIndex). Prompt repetition also potentially externalizes internal "echo" signals seen in RL-finetuned LLMs (Leviathan et al., 17 Dec 2025).
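The densification argument can be made concrete by counting which ordered pairs of original-prompt token positions become mutually visible under a causal mask once the prompt is repeated (a toy illustration, not a claim about any specific model):

```python
def visible_pairs(n_tokens: int, copies: int) -> int:
    """Count ordered pairs (a, b) of original-prompt positions such that
    some occurrence of a can attend to some occurrence of b under a
    left-to-right causal mask over the repeated sequence."""
    total = n_tokens * copies
    pairs = set()
    for i in range(total):
        for j in range(i + 1):  # causal attention: leftward only, incl. self
            pairs.add((i % n_tokens, j % n_tokens))
    return len(pairs)
```

With one copy only n(n + 1)/2 pairs are reachable; with two copies every pair of positions is reachable, matching the intuition that each token "sees" all others via the repeated copy.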
Limitations
- Reduced benefit for prompts near model context limits.
- Shrinking gains if explicit chain-of-thought reasoning is activated.
- Slight latency cost for extremely long prompts on select providers.
Directions for Future Work
- Repetition-aware fine-tuning.
- Adaptive repetition depth (number of copies) and partial/intermittent repetition.
- Efficient KV-cache management to avoid duplicated memory.
- Applying repetition strategies to other modalities (e.g., multi-modal inputs).
3. Related Methodological Developments
PromptReps is referenced and extended by PromptPRF, which uses offline, prompt-generated pseudo-relevance feedback features from top-ranked documents to enhance query embeddings for dense retrieval. This two-stage pipeline achieves near-parity with much larger dense retrievers in effectiveness, significantly closing the quality–cost gap (Li et al., 19 Mar 2025).
PromptReps also aligns with a broader trend in prompt-based representation learning and dynamic prompt structuring, as observed in Dynamic Prompting (Yang et al., 2023) and prompt learning across modalities for NLP, vision, and code (Chen et al., 2024, Jiang et al., 2023, Pham et al., 2023).
4. Summary Table: PromptReps Key Properties
| Aspect | PromptReps Retrieval (Zhuang et al., 2024) | Prompt Repetition (Leviathan et al., 17 Dec 2025) |
|---|---|---|
| Domain | Zero-shot retrieval | Non-reasoning LLMs (QA, MMLU, etc.) |
| Model requirement | Large decoder-only LLM (no training) | Any LLM (via API, no retraining) |
| Input encoding | One-word prompt, extract dense+sparse | Repeat input prompt (P) |
| Output | Dense (h) + sparse logits | Direct answer (no chain-of-thought) |
| Performance | Exceeds state-of-the-art unsupervised | Accuracy gain on all major LLMs |
| Cost | 1 LLM forward pass per input | No increase in output length/latency |
| Limitations | Sensitive to model scale for dense | Less effective with reasoning/CoT |
5. Implications and Broader Impact
PromptReps bridges generative and retrieval paradigms, revealing that LLMs can act as near-universal retrievers or classifiers with only minimal prompt engineering—either via careful output-token specification (retrieval) or input duplication (accuracy augmentation). Both variants operate with zero or negligible additional computational overhead and require no model modification, demonstrating high utility for both research and applied use-cases.
A plausible implication is that large LLMs, when combined with prompt-centric engineering strategies such as PromptReps, can obviate much of the infrastructure associated with learned retrievers or selective prompt-tuning. This suggests new best practices for LLM deployment in settings where pretraining new models or dataset-specific fine-tuning is infeasible, and motivates further exploration of externalized prompt manipulation as a primary design axis in LM-based systems (Zhuang et al., 2024, Leviathan et al., 17 Dec 2025).