- The paper presents a novel multi-stage verbalizer strategy that integrates external knowledge bases to enhance prompt-based fine-tuning in scientific text classification.
- It employs domain-adaptive filtering via an NLI model and weighted correlation scores to map masked language model outputs accurately to class labels.
- Results show that SciPrompt outperforms baseline models in few- and zero-shot scenarios, offering a robust solution for low-resource scientific datasets.
An Overview of Knowledge-Augmented Prompting in Scientific Text Classification
The paper "SciPrompt: Knowledge-Augmented Prompting for Fine-Grained Categorization of Scientific Topics" addresses a significant challenge in the field of scientific text classification: the classification of scientific abstracts into specific domains, particularly in low-resource scenarios. This study introduces a framework, SciPrompt, which harnesses domain-specific knowledge to enhance prompt-based fine-tuning methods, thereby optimizing label verbalization in models.
Methodological Insights
The principal innovation introduced by SciPrompt is its advanced verbalization strategy, designed to enrich the mapping from masked language model (MLM) predictions to class labels. Previous methodologies have been constrained by manually crafted verbalizers, which often require extensive domain expertise to build. SciPrompt circumvents these limitations using a multi-stage approach optimized for scientific literature:
- Knowledge Retrieval: The framework taps into external knowledge bases (KBs), namely Related Words and Reverse Dictionary, to obtain scientifically relevant terms or phrases. These augment the model’s ability to comprehend and classify text by integrating domain lexicon into the verbalizer.
- Domain-Adaptive Filtering: SciPrompt refines its set of retrieved terms through a fine-tuned Natural Language Inference (NLI) model, which scores the semantic relationship between each retrieved phrase and its class label. This filtering mechanism ensures that only the most relevant terms contribute to the model's tuning, thereby enhancing prediction accuracy.
- Weighted Verbalizer Approach: To map model predictions more effectively, SciPrompt employs a novel verbalization strategy. It uses correlation scores as weights when aggregating the MLM's outputs, refining the translation of mask-fill predictions into class probabilities.
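The filtering stage above can be illustrated with a minimal sketch. The `toy_nli_score` function below is a hypothetical stand-in for the paper's fine-tuned NLI model (which would return an entailment probability for a label–term pair); the threshold value is likewise an assumption for illustration, not the paper's setting.

```python
def filter_terms(label, candidates, nli_score, threshold=0.5):
    """Keep candidate terms whose NLI relevance score against the class
    label meets a threshold; return (term, score) pairs in input order."""
    scored = [(term, nli_score(label, term)) for term in candidates]
    return [(term, s) for term, s in scored if s >= threshold]

# Hypothetical stand-in for a fine-tuned NLI model's entailment score.
# In practice this would be a cross-encoder scoring label/term pairs.
def toy_nli_score(label, term):
    scores = {
        ("machine learning", "gradient descent"): 0.91,
        ("machine learning", "stock market"): 0.12,
        ("machine learning", "neural network"): 0.88,
    }
    return scores.get((label, term), 0.0)

kept = filter_terms(
    "machine learning",
    ["gradient descent", "stock market", "neural network"],
    toy_nli_score,
)
# "stock market" is filtered out; the retained (term, score) pairs can
# feed the weighted verbalizer as correlation weights.
```

The retained scores double as the correlation weights used in the next stage, which is why the filter returns pairs rather than bare terms.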
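The weighted verbalizer can be sketched as a correlation-weighted average over MLM mask-fill probabilities. The exact aggregation in the paper may differ; this shows the general idea, and the toy log-probabilities and verbalizer terms below are invented for illustration.

```python
import math

def class_score(token_logprobs, verbalizer):
    """Correlation-weighted average of MLM probabilities for one class.

    token_logprobs: dict mapping a candidate mask-fill token to its
        log-probability under the MLM.
    verbalizer: list of (term, correlation_weight) pairs for the class,
        e.g. as produced by the NLI filtering stage.
    """
    num = den = 0.0
    for term, weight in verbalizer:
        if term in token_logprobs:
            num += weight * math.exp(token_logprobs[term])
            den += weight
    return num / den if den else 0.0

# Toy MLM output for the [MASK] position (log-probabilities).
logprobs = {
    "network": math.log(0.30),
    "learning": math.log(0.20),
    "market": math.log(0.05),
}

# Per-class verbalizers: (term, correlation weight) pairs.
scores = {
    "machine_learning": class_score(logprobs, [("network", 0.9), ("learning", 0.8)]),
    "finance": class_score(logprobs, [("market", 0.7)]),
}
predicted = max(scores, key=scores.get)  # "machine_learning"
```

Because each class's score pools probability mass over many weighted terms, a class can win even when no single one of its verbalizer tokens is the MLM's top prediction.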
Results and Evaluation
SciPrompt's performance was rigorously evaluated across several scientific datasets (e.g., SDPRA 2021, arXiv, S2ORC). The results were notable:
- In few- and zero-shot settings, SciPrompt consistently outperformed baseline methods in classifying fine-grained scientific topics, indicating its effectiveness under data-scarce conditions.
- The introduction of a phrase-level verbalizer significantly enhanced classification accuracy compared to existing token-level verbalizers. This is particularly critical in domains where nuanced understanding of terminology influences classification outcomes.
- Compared to state-of-the-art models, SciPrompt demonstrated consistent improvements, especially in scenarios with little labeled data.
Implications and Future Trajectories
From a theoretical perspective, the SciPrompt framework advances our understanding of how domain-specific knowledge can be efficiently leveraged to improve machine learning model accuracy. Practically, this study provides a blueprint for integrating external data sources into NLP models, offering a pathway to overcome challenges associated with limited datasets.
Looking ahead, the strategies outlined in SciPrompt hold promise for broader applications beyond scientific text classification. Future work could involve adapting this framework to multi-label classification scenarios or expanding its capabilities to encompass other complex, hierarchical classification tasks in interdisciplinary or emerging fields of study.
In conclusion, SciPrompt marks a substantive contribution to the field of natural language processing, particularly within the specialization of scientific literature. Its ability to integrate augmented data retrieval and strategic verbalization underscores the potential for more intelligent, resourceful AI tools capable of parsing highly specialized text with minimal labeled input.