
PaSa: An LLM Agent for Comprehensive Academic Paper Search

Published 17 Jan 2025 in cs.IR and cs.LG | arXiv:2501.10120v2

Abstract: We introduce PaSa, an advanced Paper Search agent powered by LLMs. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.

Summary

  • The paper introduces PaSa, a dual-agent LLM system using reinforcement learning to autonomously search and analyze academic papers.
  • It employs a Crawler that navigates citation networks and a Selector that verifies paper relevance to optimize search performance.
  • Experimental results demonstrate significant recall improvements over Google and other baselines on both synthetic and real-world query datasets.

The paper introduces PaSa (Paper Search), an LLM-powered agent designed to enhance academic paper search. PaSa autonomously navigates the search process, making decisions such as invoking search tools and analyzing papers to provide comprehensive and accurate results for complex scholarly queries. The architecture consists of two LLM agents, the Crawler and the Selector, optimized using reinforcement learning within the AGILE framework.

The Crawler is designed to collect relevant papers by utilizing search tools and extracting citations from papers, adding them to a paper queue. The Crawler iteratively processes each paper, navigating citation networks to discover increasingly relevant papers. The Selector carefully reads each paper in the queue to determine if it meets the requirements of the user query.

Key highlights of PaSa include:

  • Autonomous Use of Search Tools: PaSa can autonomously use online search tools, read entire papers, and navigate citation networks.
  • Comprehensive Architecture: A two-agent design pairing the Crawler, which gathers candidate papers, with the Selector, which judges their relevance.
  • Reinforcement Learning Optimization: PaSa is optimized using reinforcement learning within the AGILE framework.

The paper introduces AutoScholarQuery, a synthetic dataset of academic queries and related papers. The data is sourced from the "related work" sections of papers published at ICLR 2023, ICML 2023, NeurIPS 2023, ACL 2024, and CVPR 2024. AutoScholarQuery includes 33,511/1,000/1,000 query-paper pairs in the training/development/test split. The authors prompted GPT-4o to generate scholarly queries where the answers correspond to the references cited in the related work section. The publication date of the source paper is used as the query date, ensuring that only papers published prior to this date are considered during training and testing.
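As a sketch of the date rule above: only candidate papers published strictly before the query date survive filtering (the field names here are illustrative, not the dataset's actual schema):

```python
from datetime import date

def filter_by_query_date(candidates, query_date):
    """Keep only papers published strictly before the query date.

    A sketch of the leakage-prevention rule; `candidates` is a list of
    dicts with a hypothetical "published" field, not the real schema.
    """
    return [p for p in candidates if p["published"] < query_date]
```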

Additionally, the paper introduces RealScholarQuery, a benchmark consisting of 50 real-world academic queries with annotated relevant papers, to assess PaSa performance in realistic scenarios. The queries were collected from AI researchers using a demo of PaSa, ensuring they are fine-grained and realistic. Relevant papers were gathered manually and through methods including PaSa, Google, and ChatGPT, with professional annotators reviewing and selecting the final set of papers for each query. The query date for all instances in RealScholarQuery is 2024-10-01.

The Crawler operates through a token-level Markov Decision Process (MDP), where the action space $\mathcal{A}$ corresponds to the LLM's vocabulary. The Crawler has three registered functions: [Search], [Expand], and [Stop].

  • [Search]: Generates a search query and invokes the search tool, appending all resulting papers to the paper queue.
  • [Expand]: Generates a subsection name and adds all referenced papers in the subsection to the paper queue.
  • [Stop]: Resets the context to the user query and the next paper in the paper queue.
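The three actions above can be sketched as a simple control loop; `search_tool` and `extract_subsection_refs` are hypothetical stand-ins for the agent's tools, not the paper's actual implementation:

```python
from collections import deque

def run_crawler(query, actions, search_tool, extract_subsection_refs):
    """Process (function, argument) actions emitted by the Crawler policy.

    Illustrative sketch only: the real Crawler generates actions token by
    token; here they arrive as a pre-built list.
    """
    paper_queue = deque()
    context = query
    processed = []
    for func, arg in actions:
        if func == "[Search]":
            # `arg` is a generated search query; append every hit to the queue.
            paper_queue.extend(search_tool(arg))
        elif func == "[Expand]":
            # `arg` names a subsection; append the papers it references.
            paper_queue.extend(extract_subsection_refs(context, arg))
        elif func == "[Stop]":
            # Reset context to the user query plus the next queued paper.
            if not paper_queue:
                break
            next_paper = paper_queue.popleft()
            processed.append(next_paper)
            context = (query, next_paper)
    return processed
```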

The reward function for the Crawler is defined as:

r(st,at)=α×∑i=1ntI(q,pi,t)−c(at)r(s_t, a_t) = \alpha\times\sum_{i=1}^{n_t} \mathbb{I}(q, p_i, t) - c(a_t)

  • $r(s_t, a_t)$ is the reward for executing action $a_t$ in state $s_t$
  • $\alpha$ is a reward coefficient
  • $\mathbb{I}(q, p_i, t)$ is an indicator function that equals 1 if $p_i$ matches query $q$ and is not already in the queue, and 0 otherwise
  • $c(a_t)$ is the cost of action $a_t$
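A minimal sketch of this reward, with `matches` standing in for the indicator's relevance test and illustrative values for $\alpha$ and the action cost:

```python
def crawler_reward(q, new_papers, queue, matches, alpha=1.5, cost=0.1):
    """Sketch of r(s_t, a_t) = alpha * sum_i I(q, p_i, t) - c(a_t).

    `matches(q, p)` plays the role of the indicator's relevance test;
    `alpha` and `cost` are illustrative, not the paper's tuned values.
    """
    # Count newly found papers that match the query and are not queued yet.
    gained = sum(1 for p in new_papers if matches(q, p) and p not in queue)
    return alpha * gained - cost
```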

To address the limitation that AutoScholarQuery may include only a subset of the ground-truth papers, the Selector acts as an auxiliary reward model for the Crawler. The revised definition of $\mathbb{I}(q, p_i, t)$ is:

$\mathbb{I}(q, p_i, t) = \begin{cases} 1, & \text{if } \left( \text{Selector}(q, p_i) = 1 \text{ or } p_i \in \mathcal{P} \right) \text{ and } p_i \notin \mathcal{Q}_t, \\ 0, & \text{otherwise.} \end{cases}$

  • $\text{Selector}(q, p_i) = 1$ if the Selector judges paper $p_i$ to satisfy query $q$, and $\text{Selector}(q, p_i) = 0$ otherwise

  • $\mathcal{P}$ is the set of ground-truth papers annotated for query $q$ in AutoScholarQuery
  • $\mathcal{Q}_t$ is the paper queue at time $t$
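The revised indicator can be sketched directly; `selector` is any callable standing in for the trained Selector agent:

```python
def indicator(q, p, ground_truth, queue, selector):
    """I(q, p, t): 1 if (Selector(q, p) = 1 or p in P) and p not in Q_t,
    else 0. `selector` returns 0/1 and stands in for the Selector agent.
    """
    if p in queue:
        # Papers already in the queue earn no additional reward.
        return 0
    return 1 if (selector(q, p) == 1 or p in ground_truth) else 0
```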

The Crawler is trained using a session-level Proximal Policy Optimization (PPO) algorithm to address challenges such as sparse rewards and long trajectories. A session is defined as a sub-trajectory that begins with the session's initial state and ends with the [Stop] action. The Crawler is modeled as a policy $\pi_\theta(a_t|s_t)$, and the entire trajectory $\tau$ is partitioned into a sequence of sessions.

The return in the session is estimated using Monte Carlo sampling:

$\hat{R}_t = \sum_{k=0}^{t_{i+1}-1-t}\gamma_0^k\bigg[r(s_{t+k}, a_{t+k}) + \gamma_1\sum_{j=1}^{n_{t+k}}\hat{V}_\phi(s_{q, p_j}) - \beta\cdot\log\frac{\pi_\theta(a_{t+k}|s_{t+k})}{\pi_{\rm sft}(a_{t+k}|s_{t+k})}\bigg]$

  • $\hat{R}_t$ is the return at time step $t$
  • $\gamma_0$ is the in-session discount factor
  • $\gamma_1$ is the across-session discount factor
  • $\hat{V}_\phi(\cdot)$ is the value function model approximating the state value
  • $\beta$ is the coefficient scaling the KL penalty term
  • $\pi_\theta$ is the learned policy
  • $\pi_{\rm sft}$ is the initial policy obtained through imitation learning
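A minimal sketch of the Monte Carlo return above, with illustrative discount and KL coefficients rather than the paper's tuned settings:

```python
def session_return(rewards, next_session_values, kl_terms,
                   gamma0=1.0, gamma1=0.1, beta=0.01):
    """Sketch of the session return R_hat at the session's start step.

    rewards[k]             -> r(s_{t+k}, a_{t+k})
    next_session_values[k] -> sum over j of V_hat(s_{q, p_j}) for step k
    kl_terms[k]            -> log(pi_theta / pi_sft) at step k
    gamma0, gamma1, beta are illustrative, not the paper's settings.
    """
    total = 0.0
    for k, (r, v, kl) in enumerate(zip(rewards, next_session_values, kl_terms)):
        # In-session discount on the per-step reward, across-session value
        # bonus, and KL penalty against the imitation-learned policy.
        total += gamma0 ** k * (r + gamma1 * v - beta * kl)
    return total
```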

The advantage function is approximated by:

$\hat{A}(s_t, a_t) = \hat{R}_t - \hat{V}_\phi(s_t)$

  • $\hat{A}(s_t, a_t)$ is the advantage function

The policy and value objectives are given by:

$\mathcal{L}_{\text{policy}}(\theta) = -\mathbb{E}_{\tau'\sim\pi_\theta^{\text{old}}}\Bigg[\min\bigg(\frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{\text{old}}(a_t|s_t)}\hat{A}(s_t, a_t),\ \text{clip}\Big(\frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{\text{old}}(a_t|s_t)},\ 1-\epsilon,\ 1+\epsilon\Big)\hat{A}(s_t, a_t)\bigg)\Bigg]$

$\mathcal{L}_{\text{value}}(\phi) = \mathbb{E}_{\tau'\sim\pi_\theta^{\text{old}}}\Bigg[\max\bigg(\Big(\hat{R}_t-\hat{V}_\phi(s_t)\Big)^2,\ \Big(\hat{R}_t-\hat{V}_\phi^{\text{clip}}(s_t)\Big)^2\bigg)\Bigg]$

where $\hat{V}_\phi^{\text{clip}}(s_t) = \text{clip}\Big(\hat{V}_\phi(s_t),\ V_\phi^{\text{old}}(s_t)-\epsilon,\ V_\phi^{\text{old}}(s_t)+\epsilon\Big)$

  • $\mathcal{L}_{\text{policy}}(\theta)$ is the policy loss
  • $\mathcal{L}_{\text{value}}(\phi)$ is the value loss
  • $\pi_\theta^{\text{old}}$ and $V_\phi^{\text{old}}$ are used for sampling, and $\tau'$ is a session trajectory

The unified RL loss is:

$\mathcal{L}_{\text{RL}}(\theta, \phi) = \mathcal{L}_{\text{policy}}(\theta) + \eta\cdot\mathcal{L}_{\text{value}}(\phi)$

  • $\eta$ is the coefficient of the value objective

The Selector is an LLM agent that generates a decision token ("True" or "False") and a rationale. The decision token indicates whether the paper satisfies the query, and the rationale provides supporting evidence.
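A single-sample sketch of the clipped policy and value objectives defined earlier, in plain Python with illustrative $\epsilon$ and $\eta$ (not the paper's training code):

```python
def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))

def policy_loss(ratio, advantage, eps=0.2):
    """Clipped PPO policy objective for one sample, negated for
    minimization; `ratio` = pi_theta / pi_theta_old. eps is illustrative."""
    return -min(ratio * advantage,
                clip(ratio, 1 - eps, 1 + eps) * advantage)

def value_loss(ret, v, v_old, eps=0.2):
    """Clipped value objective: max of the squared errors for the raw
    and clipped value predictions."""
    v_clip = clip(v, v_old - eps, v_old + eps)
    return max((ret - v) ** 2, (ret - v_clip) ** 2)

def rl_loss(ratio, advantage, ret, v, v_old, eta=0.5, eps=0.2):
    """Unified loss L_RL = L_policy + eta * L_value (single-sample sketch)."""
    return policy_loss(ratio, advantage, eps) + eta * value_loss(ret, v, v_old, eps)
```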
Imitation learning is used to optimize the Selector.

Experiments were conducted using Qwen2.5-7b as the base LLM for both the Selector and the Crawler. PaSa-7b surpasses all baselines. On the AutoScholarQuery test set, PaSa-7b shows a 34.05% improvement in Recall@20 and a 39.36% improvement in Recall@50 compared to Google with GPT-4o, and outperforms PaSa-GPT-4o by 11.12% in recall with similar precision. On RealScholarQuery, PaSa-7b outperforms Google with GPT-4o by 37.78% in Recall@20 and 39.90% in Recall@50, and surpasses PaSa-GPT-4o by 30.36% in recall and 4.25% in precision.

Ablation studies show that removing the [Expand] action from the Crawler leads to a 22.98% recall decrease on AutoScholarQuery and a 32.21% decrease on RealScholarQuery, while removing the Selector as an auxiliary reward model results in a 3.76% recall drop on AutoScholarQuery and a 9.63% drop on RealScholarQuery. Adjusting the reward coefficient $\alpha$ in RL training effectively steers PaSa's behavior: higher values of $\alpha$ lead to more Crawler actions and higher recall.
