- The paper introduces a TF-IDF based sentence ranking method that selects key sentences from long documents to reduce input size and processing time.
- The approach achieves near-baseline accuracy with only a 0.33% drop by selecting roughly 40-50% of the most informative sentences.
- The method leverages fixed-length, percentage-based, and weighted ranking strategies to efficiently enhance transformer-based classification models.
Improving the Efficiency of Long Document Classification using Sentence Ranking Approach
The paper, authored by Kokate et al., presents a method for improving the efficiency of long document classification using a sentence ranking approach based on TF-IDF (Term Frequency-Inverse Document Frequency). Transformer-based architectures like BERT, although effective for short texts, face computational limitations on lengthy documents, primarily due to fixed input lengths and the quadratic complexity of self-attention. The authors propose a data-driven methodology that selects only the most informative sentences from each document, reducing input size and processing time without compromising classification accuracy.
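The quadratic attention cost mentioned above also explains why trimming the input pays off so well: halving the sequence length cuts the self-attention work by roughly a factor of four. A minimal back-of-the-envelope sketch (the 512-token limit and 768-dimensional hidden size are standard BERT-base figures, used here as assumptions, not values from the paper):

```python
def attention_cost(seq_len: int, hidden: int = 768) -> int:
    # Self-attention scales as O(n^2 * d): every token attends
    # to every other token across d hidden dimensions.
    return seq_len ** 2 * hidden

# Full BERT-base context vs. an input trimmed to half the sentences.
full = attention_cost(512)
half = attention_cost(256)
print(full / half)  # quadratic scaling: 2x fewer tokens -> 4x cheaper attention
```

This is of course an idealized count (it ignores the feed-forward layers, which scale linearly in sequence length), but it captures why a >50% input reduction translates into the large latency savings the paper reports.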
The paper's central thesis is that much of a lengthy document is redundant for classification purposes, and that strategic sentence selection can maintain, or even enhance, classification efficacy. Using BERT and MahaBERT-v2 as base models, the authors evaluate the approach on the MahaNews LDC dataset of Marathi news articles. Results indicate that the method loses only 0.33% in classification accuracy relative to the full-context baseline, while reducing input size by over 50% and inference latency by 43%.
The methodological framework involves several key strategies for sentence selection:
- Fixed-length Selection: Keeping a predefined number of top-ranked sentences (1 to 5 per document).
- Percentage-based Selection: Keeping a fixed percentage of top-ranked sentences, in 10% increments from 10% to 100%.
- Weighted Ranking: Combining normalized TF-IDF scores with normalized sentence length under different weighting factors.
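The three strategies above can be sketched in a few functions. This is an illustrative reconstruction, not the authors' code: the tokenization, the sentence-as-document IDF scheme, and the blending factor `alpha` are assumptions made for the sketch.

```python
import math
import re
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of its terms' TF-IDF weights,
    treating each sentence as a 'document' for IDF purposes."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(tokenized)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum((tf[w] / len(toks)) * math.log(n / df[w]) for w in tf) if toks else 0.0
        scores.append(score)
    return scores

def select_fixed(sentences, scores, k=3):
    """Fixed-length selection: keep the top-k sentences, in document order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

def select_percentage(sentences, scores, pct=0.5):
    """Percentage-based selection: keep the top pct of sentences."""
    k = max(1, round(len(sentences) * pct))
    return select_fixed(sentences, scores, k)

def weighted_scores(sentences, scores, alpha=0.7):
    """Weighted ranking: blend normalized TF-IDF with normalized length."""
    max_s = max(scores) or 1.0
    max_len = max(len(s.split()) for s in sentences) or 1
    return [alpha * (sc / max_s) + (1 - alpha) * (len(s.split()) / max_len)
            for s, sc in zip(sentences, scores)]
```

The selected sentences would then be concatenated, in original order, to form the shortened input passed to the BERT classifier.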
On the MahaNews dataset, these strategies clearly outperformed simpler baselines such as keeping only the first, last, or randomly chosen sentences. TF-IDF-based sentence ranking, especially when normalized and weighted for sentence length, provides a substantial advantage: scoring each sentence by its TF-IDF weight prioritizes the most pertinent parts of the document, preserving classification accuracy at a fraction of the input length.
A key quantitative result is that selecting roughly 40-50% of the top-ranked sentences achieves accuracy close to the full-context baseline, allowing a significant reduction in inference time with no tangible loss in classification fidelity. The percentage-based approach in particular improved markedly when TF-IDF scores were computed per sentence, striking a balance between comprehensive contextual coverage and computational efficiency.
The research has practical implications in NLP, particularly for applications that must process voluminous texts such as legal documents, newspapers, and academic papers. By significantly cutting processing requirements, the method paves the way for broader use of transformer models in real-world scenarios demanding fast yet reliable classification.
The paper's novelty lies in its pragmatic stance: optimizing input selection rather than altering model architectures, thereby avoiding the increased computational load and implementation complexity of architectural changes. This suggests room for further exploration of similar TF-IDF-based methods across other languages and document types, possibly in hybrids that combine data-centric and model-centric optimizations.
In conclusion, the research provides a compelling solution for long document classification challenges by adeptly applying a data-driven approach to optimize transformer model efficiency. This establishes a foundation for future developments in handling extensive text data in AI, reinforcing the scope of sentence ranking techniques in achieving scalable, reliable, and computationally efficient NLP models.