PaperScout: Autonomous Academic Search
- PaperScout is an autonomous agentic system that formulates academic paper search as a sequential decision-making process using large language models and reinforcement learning.
- It introduces Proximal Sequence Policy Optimization (PSPO) to address sequence-level credit assignment, optimizing retrieval in multi-turn, context-aware environments.
- The system dynamically balances exploration and exploitation by invoking search and expansion tools within a POMDP framework, leading to superior recall and relevance.
PaperScout is an autonomous agentic system for academic paper search that formulates retrieval as a sequential decision-making process. In contrast to static pipelined or workflow-driven approaches, PaperScout leverages LLMs optimized via process-aware sequence-level reinforcement learning, enabling adaptive invocation of search and expansion tools based on cumulative context and observed rewards within a Partially Observable Markov Decision Process (POMDP) framework. The core methodological advance is Proximal Sequence Policy Optimization (PSPO), which addresses the granularity mismatch of credit assignment in multi-turn, process-level retrieval environments. Empirical benchmarks demonstrate that this architecture achieves state-of-the-art recall and paper relevance compared to both traditional and agentic baselines (Pan et al., 15 Jan 2026).
1. Agentic Sequential Decision Framework
PaperScout casts the scholarly literature search task as a POMDP with states, actions, transitions, observations, and rewards aligned to the evolving context of a retrieval session. At each timestep, the agent observes a partial summary of the current paper pool—a dual-list view comprising expanded and unexpanded candidate papers, augmented by the full history of tool calls. The two atomic external tools available to the agent are:
- Search: accepts a natural-language query and returns a batch of candidate papers from a specified search backend.
- Expand: given a paper identifier (e.g., arXiv ID), retrieves its outgoing references (bibliography).
At every turn, the policy generates a response $y_t$, a sequence containing a reasoning trace and a set of tool calls. After execution, the environment merges new results into the paper pool, filters by a relevance threshold $\tau$, computes the marginal utility as the reward $r_t$, and updates the observation $o_{t+1}$. This closed-loop, context-sensitive decision process enables the agent to allocate search effort adaptively between breadth (exploration via Search) and depth (exploitation via Expand) (Pan et al., 15 Jan 2026).
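The decision loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the policy, tool, and scorer interfaces, the `Pool` structure, and the default threshold are all assumptions.

```python
# Minimal sketch of the PaperScout-style POMDP loop (hypothetical interfaces).
from dataclasses import dataclass, field

@dataclass
class Pool:
    expanded: set = field(default_factory=set)    # papers whose references were fetched
    unexpanded: set = field(default_factory=set)  # candidates not yet expanded

def run_episode(policy, search, expand, score, query, tau=0.5, max_turns=8):
    pool, history, total = Pool(), [], 0.0
    for _ in range(max_turns):
        # Dual-list observation: query, tool history, expanded/unexpanded pools.
        obs = {"query": query, "history": history,
               "expanded": sorted(pool.expanded),
               "unexpanded": sorted(pool.unexpanded)}
        calls = policy(obs)  # reasoning trace elided; we keep only the tool calls
        new_papers = []
        for tool, arg in calls:
            if tool == "search":
                new_papers += search(arg)   # breadth: fresh candidates
            elif tool == "expand":
                new_papers += expand(arg)   # depth: outgoing references
                pool.unexpanded.discard(arg)
                pool.expanded.add(arg)
        # Filter by relevance threshold; reward is the marginal utility.
        fresh = [p for p in new_papers
                 if p not in pool.expanded and p not in pool.unexpanded
                 and score(p, query) >= tau]
        pool.unexpanded.update(fresh)
        reward = len(fresh)
        history.append((calls, reward))
        total += reward
    return pool, total
```

A policy that first searches and then expands a found paper accumulates reward only for papers that are both new and above threshold, which is what drives the breadth/depth trade-off.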
2. Proximal Sequence Policy Optimization (PSPO)
Training a multi-turn agent in this regime presents a granularity alignment problem: traditional reinforcement learning (RL) approaches such as Proximal Policy Optimization (PPO) assign credit at the token level, which is ill-suited when rewards accrue per complete response sequence. PSPO directly addresses this by:
- Full-sequence credit assignment: the action at step $t$ is the entire output $y_t$; all tokens in $y_t$ share the same scalar reward $r_t$;
- Generalized Advantage Estimation (GAE): computed at the sequence level as $\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l}$, with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$;
- Importance-ratio surrogate loss: updates are performed via a clipped objective over the sequence-level log-probability ratio $\rho_t(\theta) = \exp\!\big(\log \pi_\theta(y_t \mid s_t) - \log \pi_{\theta_{\mathrm{old}}}(y_t \mid s_t)\big)$, ensuring stable policy optimization even under off-policy rollouts;
- Critic pretraining and normalization: The value function is pre-trained under a fixed (e.g., random or imitation) policy to reduce bias prior to joint actor-critic updates (Pan et al., 15 Jan 2026).
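The two core computations above — sequence-level GAE and the clipped surrogate on the full-sequence probability ratio — can be sketched as below. This is a standard-form reconstruction, not PSPO's actual implementation; hyperparameter defaults and the terminal-value convention are assumptions.

```python
# Sketch of sequence-level GAE and the clipped surrogate term (illustrative).
import math

def sequence_gae(rewards, values, gamma=0.99, lam=0.95):
    """One advantage per turn (whole sequence), not per token.
    Assumes the value after the final turn is zero."""
    T = len(rewards)
    deltas = [rewards[t]
              + gamma * (values[t + 1] if t + 1 < T else 0.0)
              - values[t]
              for t in range(T)]
    advs, gae = [0.0] * T, 0.0
    for t in reversed(range(T)):          # backward recursion over turns
        gae = deltas[t] + gamma * lam * gae
        advs[t] = gae
    return advs

def pspo_term(logp_new, logp_old, adv, eps=0.2):
    """Clipped objective for one turn; log-probs are summed over the
    whole response sequence, so the ratio is sequence-level."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * adv, clipped * adv)  # pessimistic (clipped) surrogate
```

Because both the log-probabilities and the advantage are per-sequence scalars, every token in a turn receives identical credit, which is exactly the granularity alignment the section describes.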
The reward for each step penalizes redundant tool usage and incentivizes discovery of new, high-relevance papers (as scored by an external model):
$r_t = \sum_{p \in \Delta_t} \mathbf{1}\!\left[P(\mathrm{rel} \mid p, q) \ge \tau\right] - \beta\, n_t^{\mathrm{rep}}$,
where $\Delta_t$ is the set of newly retrieved papers, $P(\mathrm{rel} \mid p, q)$ is the probability that paper $p$ is relevant to the query $q$, and $\beta$ modulates the penalty for repeated tool calls.
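A sketch of this reward shape follows. The scorer interface, the duplicate-call bookkeeping, and the default values of the threshold and penalty weight are assumptions made for illustration.

```python
# Illustrative step reward: newly found above-threshold papers minus a
# penalty for repeated tool calls (tau, beta, and interfaces are assumed).
def step_reward(new_papers, seen, relevance, query, prior_calls, calls,
                tau=0.5, beta=0.5):
    # Count papers that are both new and scored relevant by the external model.
    gained = sum(1 for p in new_papers
                 if p not in seen and relevance(p, query) >= tau)
    # Penalize tool calls that exactly repeat an earlier invocation.
    repeats = sum(1 for c in calls if c in prior_calls)
    return gained - beta * repeats
```

With this shape, re-issuing an old query yields no new papers but still pays the repeat penalty, so the agent is pushed toward diverse queries and fresh expansions.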
3. System Architecture and Context Management
The PaperScout architecture comprises the following modules:
| Module | Description | Role |
|---|---|---|
| LLM Policy (Qwen3) | Serializes context, generates analysis + tool calls (prompt → response) | Decides next actions |
| Search Executor | Calls search backend (Milvus during training, Google Search in evaluation) | Retrieves candidates |
| Expand Executor | Fetches all outgoing references for a given paper | Expands context |
| Scorer (pasa-7b) | Assigns a relevance score to candidates given the current query/context | Selects pool entries |
| Pool Manager | Tracks expanded/unexpanded/candidate papers, supports dual-list context | Maintains state |
| Reward Engine | Filters and scores new candidates, computes the step reward $r_t$ | Computes rewards |
The context exposed to the LLM policy at turn $t$ includes:
- Query specification
- History of tool invocations
- Top-N expanded and top-N unexpanded papers (with [EXP]/[NEW] tags)
- System prompt listing available tools
This ensures that action selection conditions on the growing retrieval state, with full process-awareness throughout the agent's episode (Pan et al., 15 Jan 2026).
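A hypothetical serialization of this context into the policy prompt might look as follows; the `[EXP]`/`[NEW]` tags follow the paper's description, while the field names and layout are assumptions.

```python
# Hypothetical dual-list context serialization for the LLM policy prompt.
def build_context(query, history, expanded, unexpanded, top_n=5):
    lines = [f"Query: {query}", "Tool history:"]
    # Fall back to a placeholder when no tools have been called yet.
    lines += [f"  {i}. {call}" for i, call in enumerate(history, 1)] or ["  (none)"]
    lines.append("Paper pool:")
    lines += [f"  [EXP] {p}" for p in expanded[:top_n]]    # already expanded
    lines += [f"  [NEW] {p}" for p in unexpanded[:top_n]]  # not yet expanded
    return "\n".join(lines)
```

Truncating each list to the top-N entries keeps the prompt bounded while still exposing the full tool-call history, matching the context fields listed above.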
4. Empirical Evaluation and Benchmarks
PaperScout's performance is evaluated on two classes of benchmarks:
- AutoScholarQuery: A synthetic task with 33,551 train / 1,000 dev / 1,000 test natural language queries derived from top-tier publication corpora. The filtered test set included 112 queries with ≥5 ground-truth relevant papers.
- RealScholarQuery: 50 expert-crafted real scholarly queries, each annotated with ground-truth paper sets.
Baselines include Google Search, Google Scholar, RL-based query rewriting systems (PaSa), and LLM-augmented modular workflows (SPAR).
Key findings, as reported in Table 4:
| Model | Precision | F1 | Recall | LLM-score* |
|---|---|---|---|---|
| Google Search | 0.059 | 0.074 | 0.304 | 1.116 |
| PaSa | 0.415 | 0.417 | 0.541 | 2.111 |
| SPAR | 0.412 | 0.408 | 0.496 | 2.415 |
| PaperScout (full RL) | 0.442 | 0.441 | 0.574 | 2.576 |
*LLM-score: average of human-equivalent LLM judgments (0–3). On both synthetic and real benchmarks, PaperScout achieves the highest recall and LLM-score, and tool-use ablations confirm that the RL-fine-tuned agent dominates both single-tool and non-RL LLM backbones (Pan et al., 15 Jan 2026).
5. Analysis, Ablations, and Comparative Optimization
Detailed ablation studies contrast process-level PSPO against PPO and GSPO. On RealScholarQuery:
| Optimizer | Precision | F1 | Recall | LLM-score |
|---|---|---|---|---|
| PPO | 0.405 | 0.408 | 0.537 | 2.417 |
| GSPO | 0.433 | 0.439 | 0.557 | 2.510 |
| PSPO | 0.442 | 0.441 | 0.574 | 2.576 |
PSPO yields both higher recall and improved policy stability, as evidenced by faster return convergence, steadily diminishing gradient norms, and lower critic value regression error. This suggests that sequence-level alignment is essential for stable and performant agentic information retrieval (Pan et al., 15 Jan 2026).
Qualitative examination reveals that the RL-trained PaperScout balances breadth (issuing diverse search queries) and depth (multi-hop expansion along references), breaking out of tunnel vision observed in non-agentic or non-RL approaches.
6. Limitations and Prospective Extensions
PaperScout's current instantiation is validated primarily on computer science literature with open-access corpora. Integrating additional scholarly domains (biomedicine, physics), ingestion of paywalled or proprietary databases (IEEE, ACM), and richer graph-based exploration (co-citation, bibliographic coupling, backward references) are noted as future directions. Presently, only outgoing reference expansion is instrumented, which may limit coverage in certain fields.
Further generalizability could be explored by combining PaperScout's agentic loop with the user modeling, feed generation, and semantic search infrastructures found in systems such as WisPaper (Ju et al., 7 Dec 2025) and Scholar Inbox (Flicke et al., 11 Apr 2025), or with integrity-checking modules as in Problematic Paper Screener (Cabanac et al., 2022). This integration could produce a comprehensive suite for literature discovery, management, and curation.
7. Significance within the Scholarly Search Ecosystem
PaperScout is the first academic search system that combines fully autonomous agentic operation, sequential process awareness, and RL fine-tuning at the sequence level. By re-casting literature search as a stateful, multi-step reasoning-and-action process and addressing the optimization mismatch (token vs. sequence), it establishes a new standard for flexible, high-recall scientific retrieval under open-world and complex-query conditions. Empirical superiority over strong baselines on both recall and judged relevance substantiates its methodological claims and marks a shift toward adaptive, contextually responsive retrieval agents (Pan et al., 15 Jan 2026).