PromptReps: Representation via Repeated Prompts
- PromptReps is a prompt-driven technique that uses a single-word output to extract both dense and sparse embeddings for zero-shot document retrieval.
- It employs a hybrid retrieval algorithm combining ANN-based dense indices with inverted sparse indices to achieve superior performance on benchmarks like BEIR.
- Prompt repetition in non-reasoning LLMs doubles the input prompt to mitigate recency bias, leading to significant accuracy gains across diverse tasks.
PromptReps refers to a family of prompt-driven techniques leveraging LLM prompting and prompt repetition for representation learning and performance augmentation in LLM-based systems, most notably for zero-shot dense and sparse retrieval as well as accuracy improvement in non-reasoning LLMs. Two principal PromptReps lines have emerged: (1) prompt-based representation extraction for hybrid retrieval using large LLMs, and (2) prompt repetition for accuracy gains in non-reasoning LLM configurations. Below, these approaches are systematically dissected with emphasis on formal methodology, empirical results, and broader implications within the prompt-based system design space.
1. Prompt-Based Representation Extraction for Zero-Shot Retrieval
Prompt Design and Motivation
In the zero-shot document retrieval context, PromptReps utilizes carefully crafted prompts to elicit meaningful representations from LLMs without any training or fine-tuning. The canonical prompt template is:
- For passages:
<System> You are an AI assistant that can understand human language. <User> Passage: “[text]”. Use one most important word to represent the passage in retrieval task. Make sure your word is in lowercase. <Assistant> The word is: “___”
- For queries: Replace “Passage” with “Query”.
By constraining the model to produce a single word, PromptReps isolates the hidden state of that output token and the associated next-token logits, enabling extraction of both a dense embedding (continuous vector) and a sparse (bag-of-words) vocabulary-based vector. This design choice is motivated by two goals: (1) to maximize the semantic specificity of the representation, and (2) to obtain token-level activations for sparsification by forcing the LLM to make a hard lexical choice (Zhuang et al., 2024).
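As a concrete illustration, the template can be assembled with a small helper. The angle-bracket role markers mirror the schematic above and would in practice be mapped onto each provider's actual chat format; this is a sketch, not the authors' code:

```python
def build_promptreps_input(text: str, kind: str = "passage") -> str:
    """Build the PromptReps one-word elicitation prompt for a passage or query.

    The <System>/<User>/<Assistant> markers follow the schematic template;
    real deployments would use the chat roles of the chosen LLM API.
    """
    label = "Passage" if kind == "passage" else "Query"
    return (
        "<System> You are an AI assistant that can understand human language. "
        f'<User> {label}: "{text}". '
        f"Use one most important word to represent the {label.lower()} in retrieval task. "
        "Make sure your word is in lowercase. "
        '<Assistant> The word is: "'
    )
```

Leaving the assistant turn open at `The word is: "` is what forces the model's next token to be the single representative word whose hidden state and logits are then harvested.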
Representation Extraction Mechanics
For any input text t:
- Dense embedding: h_dense = h_last, where h_last denotes the hidden state of the last input token prior to the model's one-word prediction.
- Sparse embedding: a four-step transformation of the next-token logits l:
  - Zero out logits for tokens not present in the input and apply stopword filtering.
  - ReLU activation: s = max(0, l).
  - Log-saturation for dynamic range compression: s = log(1 + s).
  - Top-k pruning and quantization to form the final sparse vector v_sparse.
Both representations are generated with a single LLM forward pass per query/document.
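The four sparse-embedding steps can be sketched on toy logits. The dict-based representation and the `top_k` default are illustrative assumptions for readability, not values taken from the paper:

```python
import math

def sparse_from_logits(logits, input_token_ids, stopword_ids, top_k=128):
    """Sketch of the four-step sparse-vector transformation.

    logits: mapping of token id -> raw next-token logit.
    """
    # Step 1: keep only tokens present in the input, minus stopwords.
    kept = {t: v for t, v in logits.items()
            if t in input_token_ids and t not in stopword_ids}
    # Step 2: ReLU -- negative logits carry no sparse weight.
    relu = {t: max(0.0, v) for t, v in kept.items()}
    # Step 3: log-saturation compresses the dynamic range.
    sat = {t: math.log(1.0 + v) for t, v in relu.items()}
    # Step 4: top-k pruning; a real index would also quantize the weights.
    top = sorted(sat.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)
```

On real models the logits come from the same forward pass that yields h_last, which is what keeps the whole extraction to a single pass per text.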
Hybrid Retrieval Algorithm
Dense and sparse indices are constructed:
Dense: ANN index (e.g., Faiss) over the dense embeddings h_dense.
Sparse: Inverted index (e.g., Pyserini) over the sparse vectors v_sparse.
Final query–document score:

score(q, d) = α · s_dense(q, d) + (1 − α) · s_sparse(q, d),

with a fixed weight α, where s_dense and s_sparse denote min–max normalized similarities.
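The fusion step can be sketched as min–max normalization followed by fixed-weight interpolation; the α = 0.5 default here is an illustrative choice, not the paper's tuned value:

```python
def minmax_normalize(scores: dict) -> dict:
    """Min-max normalize a doc-id -> similarity mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against constant score lists
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_score(dense: dict, sparse: dict, alpha: float = 0.5) -> dict:
    """Interpolate normalized dense and sparse similarities per document."""
    dn, sn = minmax_normalize(dense), minmax_normalize(sparse)
    docs = set(dn) | set(sn)
    # A document missing from one index contributes 0 from that component.
    return {d: alpha * dn.get(d, 0.0) + (1 - alpha) * sn.get(d, 0.0)
            for d in docs}
```

Normalizing before interpolation matters because raw inner-product and impact-index scores live on incomparable scales.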
Empirical Results
On the BEIR benchmark (13 tasks), PromptReps hybrid with Llama3-70B-Instruct achieves an nDCG@10 of 50.07, outperforming both BM25 (43.70) and E5-PTlarge (44.61), which relies on contrastive pretraining (Zhuang et al., 2024).
The hybrid approach's effectiveness grows with LLM size: sparse-only outperforms dense-only at small model scales, while the hybrid consistently sets the best mark at every scale.
No additional finetuning or large pretraining corpora are required; efficient, parallelizable indexing is possible.
Significance
PromptReps demonstrates that LLMs can serve as universal retrievers through model-internal prompting alone, matching or outperforming state-of-the-art unsupervised methods with substantially lower computational and data requirements. This suggests new lines of research in retrieval-augmented LLM systems and first-stage ranking paradigms.
2. Prompt Repetition in Non-Reasoning LLMs
Definition and Scope
Prompt repetition refers to the literal duplication of a user query P (i.e., submitting P followed by P) as the model's input, in contrast to baseline single submission of P, in regimes where explicit step-by-step reasoning ("chain of thought") is absent (Leviathan et al., 17 Dec 2025). This technique is evaluated on leading LLMs (Gemini 2.0 Flash/Flash Lite, GPT-4o/mini, Claude 3 Haiku/Sonnet, Deepseek V3) using official APIs.
Prompt Repetition Protocol
- Baseline input: the prompt P alone.
- PromptReps (default): the prompt duplicated, i.e., P followed by P.
- Variations tested: verbose repetition and triple concatenation.
No modification to the output length, generation policy, or model parameters is applied.
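The protocol reduces to trivial input construction. The separator string below is an assumption for readability; the paper's exact concatenation format is not reproduced here:

```python
def repeat_prompt(prompt: str, copies: int = 2, sep: str = "\n\n") -> str:
    """Return the prompt concatenated with itself `copies` times.

    copies=2 is the default PromptReps setting; copies=3 corresponds to
    the triple-concatenation variant.
    """
    return sep.join([prompt] * copies)
```

Because only the input changes, the technique slots in front of any chat API call without touching decoding parameters or output handling.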
Experimental Evaluation
Tasks: ARC Challenge, OpenBookQA, MMLU-Pro, GSM8K, MATH, as well as synthetic long-context tasks (NameIndex, MiddleMatch).
Metric: Raw accuracy; significance assessed via McNemar's test.
Results: PromptReps (P) wins 47 of 70 model × task comparisons and never loses at the reported significance level. For instance:
- NameIndex (Gemini Flash-Lite): baseline 21.33%, P 97.33%.
- MiddleMatch (GPT-4o): baseline 45.0%, P 92.5%.
- ARC options-first: Δ accuracy +6.2% (Gemini), +7.5% (GPT-4o), +7.3% (Claude).
| Task/Model | Baseline | PromptReps (P) | Δ Accuracy (%) |
|---|---|---|---|
| NameIndex (Gemini) | 21.33 | 97.33 | +76.0 |
| MiddleMatch (GPT-4o) | 45.0 | 92.5 | +47.5 |
| ARC, opt-first (GPT-4o) | 62.5 | 70.0 | +7.5 |
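Per-comparison significance on paired test items reduces to the McNemar statistic over discordant pairs; a minimal sketch (the counts in the usage below are illustrative, not from the paper):

```python
def mcnemar_chi2(b: int, c: int) -> float:
    """McNemar chi-squared statistic on paired outcomes.

    b: items the baseline answers correctly but repetition (P) misses.
    c: items the baseline misses but repetition (P) answers correctly.
    Concordant items (both right or both wrong) cancel and are ignored.
    """
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)
```

For example, b = 10 and c = 30 discordant items give a statistic of 10.0, well above the usual chi-squared threshold for one degree of freedom.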
Resource and Latency Analysis
- Output length and average API latency remain invariant (<0.1% change) compared to baseline.
- Minor latency effects observed for extremely long prompts (Anthropic models).
- Remains cost-neutral under usual context window conditions.
Mechanistic Insights
Causal LMs only attend leftward; repetition allows cross-boundary attention so that each token in the initial prompt can "see" all others via the repeated copy, effectively densifying the attention graph. This mitigates last-token/recency bias—crucial for tasks demanding accurate position or context memory (e.g., NameIndex). Prompt repetition also potentially externalizes internal "echo" signals seen in RL-finetuned LLMs (Leviathan et al., 17 Dec 2025).
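The densification argument can be made concrete by counting which ordered pairs of original-prompt token positions become mutually visible under a causal mask once the prompt is repeated (a toy illustration, not a claim about any specific model):

```python
def visible_pairs(n_tokens: int, copies: int) -> int:
    """Count ordered pairs (a, b) of original-prompt positions such that
    some occurrence of a can attend to some occurrence of b under a
    left-to-right causal mask over the repeated sequence."""
    total = n_tokens * copies
    pairs = set()
    for i in range(total):
        for j in range(i + 1):  # causal attention: leftward only, incl. self
            pairs.add((i % n_tokens, j % n_tokens))
    return len(pairs)
```

With one copy only n(n + 1)/2 pairs are reachable; with two copies every pair of positions is reachable, matching the intuition that each token "sees" all others via the repeated copy.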
Limitations
- Reduced benefit for prompts near model context limits.
- Shrinking gains if explicit chain-of-thought reasoning is activated.
- Slight latency cost for extremely long prompts on select providers.
Directions for Future Work
- Repetition-aware fine-tuning.
- Adaptive repetition depth (number of copies) and partial/intermittent repetition.
- Efficient KV-cache management to avoid duplicated memory.
- Applying repetition strategies to other modalities (e.g., multi-modal inputs).
3. Related Methodological Developments
PromptReps is referenced and extended by PromptPRF, which uses offline, prompt-generated pseudo-relevance feedback features from top-ranked documents to enhance query embeddings for dense retrieval. This two-stage pipeline achieves near-parity with much larger dense retrievers in effectiveness, significantly closing the quality–cost gap (Li et al., 19 Mar 2025).
PromptReps also aligns with a broader trend in prompt-based representation learning and dynamic prompt structuring, as observed in Dynamic Prompting (Yang et al., 2023) and prompt learning across modalities for NLP, vision, and code (Chen et al., 2024, Jiang et al., 2023, Pham et al., 2023).
4. Summary Table: PromptReps Key Properties
| Aspect | PromptReps Retrieval (Zhuang et al., 2024) | Prompt Repetition (Leviathan et al., 17 Dec 2025) |
|---|---|---|
| Domain | Zero-shot retrieval | Non-reasoning LLMs (QA, MMLU, etc.) |
| Model requirement | Large decoder-only LLM (no training) | Any LLM (via API, no retraining) |
| Input encoding | One-word prompt, extract dense+sparse | Repeat input prompt (P) |
| Output | Dense (h) + sparse logits | Direct answer (no chain-of-thought) |
| Performance | Exceeds state-of-the-art unsupervised | Accuracy gain on all major LLMs |
| Cost | 1 LLM forward pass per input | No increase in output length/latency |
| Limitations | Sensitive to model scale for dense | Less effective with reasoning/CoT |
5. Implications and Broader Impact
PromptReps bridges generative and retrieval paradigms, revealing that LLMs can act as near-universal retrievers or classifiers with only minimal prompt engineering—either via careful output-token specification (retrieval) or input duplication (accuracy augmentation). Both variants operate with zero or negligible additional computational overhead and require no model modification, demonstrating high utility for both research and applied use-cases.
A plausible implication is that large LLMs, when combined with prompt-centric engineering strategies such as PromptReps, can obviate much of the infrastructure associated with learned retrievers or selective prompt-tuning. This suggests new best practices for LLM deployment in settings where pretraining new models or dataset-specific fine-tuning is infeasible, and motivates further exploration of externalized prompt manipulation as a primary design axis in LM-based systems (Zhuang et al., 2024, Leviathan et al., 17 Dec 2025).