PaperScout: Autonomous Academic Search
- PaperScout is an autonomous agentic system that formulates academic paper search as a sequential decision-making process using large language models and reinforcement learning.
- It introduces Proximal Sequence Policy Optimization (PSPO) to address sequence-level credit assignment, optimizing retrieval in multi-turn, context-aware environments.
- The system dynamically balances exploration and exploitation by invoking search and expansion tools within a POMDP framework, leading to superior recall and relevance.
PaperScout is an autonomous agentic system for academic paper search that formulates retrieval as a sequential decision-making process. In contrast to static pipelined or workflow-driven approaches, PaperScout leverages LLMs optimized via process-aware sequence-level reinforcement learning, enabling adaptive invocation of search and expansion tools based on cumulative context and observed rewards within a Partially Observable Markov Decision Process (POMDP) framework. The core methodological advance is Proximal Sequence Policy Optimization (PSPO), which addresses the granularity mismatch of credit assignment in multi-turn, process-level retrieval environments. Empirical benchmarks demonstrate that this architecture achieves state-of-the-art recall and paper relevance compared to both traditional and agentic baselines (Pan et al., 15 Jan 2026).
1. Agentic Sequential Decision Framework
PaperScout casts the scholarly literature search task as a POMDP with states, actions, transitions, observations, and rewards aligned to the evolving context of a retrieval session. At each timestep, the agent observes a partial summary of the current paper pool—a dual-list view comprising expanded and unexpanded candidate papers, augmented by the full history of tool calls. The two atomic external tools available to the agent are:
- Search: accepts a natural-language query and returns a batch of candidate papers from a specified search backend.
- Expand: given a paper identifier (e.g., arXiv ID), retrieves its outgoing references (bibliography).
At every turn, the policy generates a response $y_t$, a sequence containing a reasoning trace and a set of tool calls. After execution, the environment merges new results into the paper pool, filters by a relevance threshold $\tau$, computes the marginal utility as the reward $r_t$, and updates the observation $o_{t+1}$. This closed-loop, context-sensitive decision process enables the agent to allocate search effort adaptively between breadth (exploration via Search) and depth (exploitation via Expand) (Pan et al., 15 Jan 2026).
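The decision loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the policy, tool, and scorer interfaces, the `Pool` structure, and the default threshold are all assumptions.

```python
# Minimal sketch of the PaperScout-style POMDP loop (hypothetical interfaces).
from dataclasses import dataclass, field

@dataclass
class Pool:
    expanded: set = field(default_factory=set)    # papers whose references were fetched
    unexpanded: set = field(default_factory=set)  # candidates not yet expanded

def run_episode(policy, search, expand, score, query, tau=0.5, max_turns=8):
    pool, history, total = Pool(), [], 0.0
    for _ in range(max_turns):
        # Dual-list observation: query, tool history, expanded/unexpanded pools.
        obs = {"query": query, "history": history,
               "expanded": sorted(pool.expanded),
               "unexpanded": sorted(pool.unexpanded)}
        calls = policy(obs)  # reasoning trace elided; we keep only the tool calls
        new_papers = []
        for tool, arg in calls:
            if tool == "search":
                new_papers += search(arg)   # breadth: fresh candidates
            elif tool == "expand":
                new_papers += expand(arg)   # depth: outgoing references
                pool.unexpanded.discard(arg)
                pool.expanded.add(arg)
        # Filter by relevance threshold; reward is the marginal utility.
        fresh = [p for p in new_papers
                 if p not in pool.expanded and p not in pool.unexpanded
                 and score(p, query) >= tau]
        pool.unexpanded.update(fresh)
        reward = len(fresh)
        history.append((calls, reward))
        total += reward
    return pool, total
```

A policy that first searches and then expands a found paper accumulates reward only for papers that are both new and above threshold, which is what drives the breadth/depth trade-off.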
2. Proximal Sequence Policy Optimization (PSPO)
Training a multi-turn agent in this regime presents a granularity alignment problem: traditional reinforcement learning (RL) approaches such as Proximal Policy Optimization (PPO) assign credit at the token level, which is ill-suited when rewards accrue per complete response sequence. PSPO directly addresses this by:
- Full-sequence credit assignment: the action at step $t$ is the entire output $y_t$; all tokens in $y_t$ share the same scalar reward $r_t$;
- Generalized Advantage Estimation (GAE): computed at the sequence level as $\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l}$, with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$;
- Importance-ratio surrogate loss: updates are performed via a clipped objective over the sequence-level log-probability ratio $\rho_t(\theta) = \exp\!\big(\log \pi_\theta(y_t \mid s_t) - \log \pi_{\theta_{\mathrm{old}}}(y_t \mid s_t)\big)$, ensuring stable policy optimization even under off-policy rollouts;
- Critic pretraining and normalization: The value function is pre-trained under a fixed (e.g., random or imitation) policy to reduce bias prior to joint actor-critic updates (Pan et al., 15 Jan 2026).
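The two core computations above — sequence-level GAE and the clipped surrogate on the full-sequence probability ratio — can be sketched as below. This is a standard-form reconstruction, not PSPO's actual implementation; hyperparameter defaults and the terminal-value convention are assumptions.

```python
# Sketch of sequence-level GAE and the clipped surrogate term (illustrative).
import math

def sequence_gae(rewards, values, gamma=0.99, lam=0.95):
    """One advantage per turn (whole sequence), not per token.
    Assumes the value after the final turn is zero."""
    T = len(rewards)
    deltas = [rewards[t]
              + gamma * (values[t + 1] if t + 1 < T else 0.0)
              - values[t]
              for t in range(T)]
    advs, gae = [0.0] * T, 0.0
    for t in reversed(range(T)):          # backward recursion over turns
        gae = deltas[t] + gamma * lam * gae
        advs[t] = gae
    return advs

def pspo_term(logp_new, logp_old, adv, eps=0.2):
    """Clipped objective for one turn; log-probs are summed over the
    whole response sequence, so the ratio is sequence-level."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * adv, clipped * adv)  # pessimistic (clipped) surrogate
```

Because both the log-probabilities and the advantage are per-sequence scalars, every token in a turn receives identical credit, which is exactly the granularity alignment the section describes.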
The reward for each step penalizes redundant tool usage and incentivizes discovery of new, high-relevance papers (as scored by an external model):
$r_t = \sum_{p \in \Delta_t} \mathbf{1}\!\left[P(\mathrm{rel} \mid p, q) \ge \tau\right] - \beta\, n_t^{\mathrm{rep}}$,
where $\Delta_t$ is the set of newly retrieved papers, $P(\mathrm{rel} \mid p, q)$ is the probability that paper $p$ is relevant to the query $q$, and $\beta$ modulates the penalty for repeated tool calls.
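A sketch of this reward shape follows. The scorer interface, the duplicate-call bookkeeping, and the default values of the threshold and penalty weight are assumptions made for illustration.

```python
# Illustrative step reward: newly found above-threshold papers minus a
# penalty for repeated tool calls (tau, beta, and interfaces are assumed).
def step_reward(new_papers, seen, relevance, query, prior_calls, calls,
                tau=0.5, beta=0.5):
    # Count papers that are both new and scored relevant by the external model.
    gained = sum(1 for p in new_papers
                 if p not in seen and relevance(p, query) >= tau)
    # Penalize tool calls that exactly repeat an earlier invocation.
    repeats = sum(1 for c in calls if c in prior_calls)
    return gained - beta * repeats
```

With this shape, re-issuing an old query yields no new papers but still pays the repeat penalty, so the agent is pushed toward diverse queries and fresh expansions.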
3. System Architecture and Context Management
The PaperScout architecture comprises the following modules:
| Module | Description | Role |
|---|---|---|
| LLM Policy (Qwen3) | Serializes context, generates analysis + tool calls (prompt → response) | Decides next actions |
| Search Executor | Calls search backend (Milvus during training, Google Search in evaluation) | Retrieves candidates |
| Expand Executor | Fetches all outgoing references for a given paper | Expands context |
| Scorer (pasa-7b) | Assigns a relevance score to candidates given the current query/context | Selects pool entries |
| Pool Manager | Tracks expanded/unexpanded/candidate papers, supports dual-list context | Maintains state |
| Reward Engine | Filters and scores new candidates, computes the step reward $r_t$ | Computes rewards |
The context exposed to the LLM policy at turn $t$ includes:
- Query specification
- History of tool invocations
- Top-N expanded and top-N unexpanded papers (with [EXP]/[NEW] tags)
- System prompt listing available tools
This ensures that action selection conditions on the growing retrieval state, with full process-awareness throughout the agent's episode (Pan et al., 15 Jan 2026).
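A hypothetical serialization of this context into the policy prompt might look as follows; the `[EXP]`/`[NEW]` tags follow the paper's description, while the field names and layout are assumptions.

```python
# Hypothetical dual-list context serialization for the LLM policy prompt.
def build_context(query, history, expanded, unexpanded, top_n=5):
    lines = [f"Query: {query}", "Tool history:"]
    # Fall back to a placeholder when no tools have been called yet.
    lines += [f"  {i}. {call}" for i, call in enumerate(history, 1)] or ["  (none)"]
    lines.append("Paper pool:")
    lines += [f"  [EXP] {p}" for p in expanded[:top_n]]    # already expanded
    lines += [f"  [NEW] {p}" for p in unexpanded[:top_n]]  # not yet expanded
    return "\n".join(lines)
```

Truncating each list to the top-N entries keeps the prompt bounded while still exposing the full tool-call history, matching the context fields listed above.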
4. Empirical Evaluation and Benchmarks
PaperScout's performance is evaluated on two classes of benchmarks:
- AutoScholarQuery: A synthetic task with 33,551 train / 1,000 dev / 1,000 test natural language queries derived from top-tier publication corpora. The filtered test set included 112 queries with ≥5 ground-truth relevant papers.
- RealScholarQuery: 50 expert-crafted real scholarly queries, each annotated with ground-truth paper sets.
Baselines include Google Search, Google Scholar, RL-based query rewriting systems (PaSa), and LLM-augmented modular workflows (SPAR).
Key findings, as reported in Table 4:
| Model | Precision | F1 | Recall | LLM-score* |
|---|---|---|---|---|
| Google Search | 0.059 | 0.074 | 0.304 | 1.116 |
| PaSa | 0.415 | 0.417 | 0.541 | 2.111 |
| SPAR | 0.412 | 0.408 | 0.496 | 2.415 |
| PaperScout (full RL) | 0.442 | 0.441 | 0.574 | 2.576 |
*LLM-score: average of human-equivalent LLM judgments (0–3). On both synthetic and real benchmarks, PaperScout achieves the highest recall and LLM-score, and tool-use ablations confirm that the RL-fine-tuned agent dominates both single-tool and non-RL LLM backbones (Pan et al., 15 Jan 2026).
5. Analysis, Ablations, and Comparative Optimization
Detailed ablation studies contrast process-level PSPO against PPO and GSPO. On RealScholarQuery:
| Optimizer | Precision | F1 | Recall | LLM-score |
|---|---|---|---|---|
| PPO | 0.405 | 0.408 | 0.537 | 2.417 |
| GSPO | 0.433 | 0.439 | 0.557 | 2.510 |
| PSPO | 0.442 | 0.441 | 0.574 | 2.576 |
PSPO yields both higher recall and improved policy stability, as evidenced by faster return convergence, steadily diminishing gradient norms, and lower critic value regression error. This suggests that sequence-level alignment is essential for stable and performant agentic information retrieval (Pan et al., 15 Jan 2026).
Qualitative examination reveals that the RL-trained PaperScout balances breadth (issuing diverse search queries) and depth (multi-hop expansion along references), breaking out of tunnel vision observed in non-agentic or non-RL approaches.
6. Limitations and Prospective Extensions
PaperScout's current instantiation is validated primarily on computer science literature with open-access corpora. Integrating additional scholarly domains (biomedicine, physics), ingestion of paywalled or proprietary databases (IEEE, ACM), and richer graph-based exploration (co-citation, bibliographic coupling, backward references) are noted as future directions. Presently, only outgoing reference expansion is instrumented, which may limit coverage in certain fields.
Further generalizability could be explored by combining PaperScout's agentic loop with the user modeling, feed generation, and semantic search infrastructures found in systems such as WisPaper (Ju et al., 7 Dec 2025) and Scholar Inbox (Flicke et al., 11 Apr 2025), or with integrity-checking modules as in Problematic Paper Screener (Cabanac et al., 2022). This integration could produce a comprehensive suite for literature discovery, management, and curation.
7. Significance within the Scholarly Search Ecosystem
PaperScout is the first academic search system that combines fully autonomous agentic operation, sequential process awareness, and RL fine-tuning at the sequence level. By re-casting literature search as a stateful, multi-step reasoning-and-action process and addressing the optimization mismatch (token vs. sequence), it establishes a new standard for flexible, high-recall scientific retrieval under open-world and complex-query conditions. Empirical superiority over strong baselines on both recall and judged relevance substantiates its methodological claims and marks a shift toward adaptive, contextually responsive retrieval agents (Pan et al., 15 Jan 2026).