- The paper introduces a TF-IDF based sentence ranking method that selects key sentences from long documents to reduce input size and processing time.
- The approach achieves near-baseline accuracy with only a 0.33% drop by selecting roughly 40-50% of the most informative sentences.
- The method leverages fixed-length, percentage-based, and weighted ranking strategies to efficiently enhance transformer-based classification models.
Improving the Efficiency of Long Document Classification using Sentence Ranking Approach
The paper, authored by Kokate et al., presents a method for improving the efficiency of long document classification using a sentence ranking approach based on TF-IDF (Term Frequency-Inverse Document Frequency). Transformer-based architectures like BERT, although effective for short texts, face computational limitations on lengthy documents, primarily due to fixed input lengths and the quadratic complexity of self-attention. The authors propose a data-driven methodology that selects only the most informative sentences from each document, reducing input size and processing time without compromising classification accuracy.
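The quadratic attention cost mentioned above also explains why trimming the input pays off so well: halving the sequence length cuts the self-attention work by roughly a factor of four. A minimal back-of-the-envelope sketch (the 512-token limit and 768-dimensional hidden size are standard BERT-base figures, used here as assumptions, not values from the paper):

```python
def attention_cost(seq_len: int, hidden: int = 768) -> int:
    # Self-attention scales as O(n^2 * d): every token attends
    # to every other token across d hidden dimensions.
    return seq_len ** 2 * hidden

# Full BERT-base context vs. an input trimmed to half the sentences.
full = attention_cost(512)
half = attention_cost(256)
print(full / half)  # quadratic scaling: 2x fewer tokens -> 4x cheaper attention
```

This is of course an idealized count (it ignores the feed-forward layers, which scale linearly in sequence length), but it captures why a >50% input reduction translates into the large latency savings the paper reports.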
The paper's central thesis is that much of a lengthy document is redundant for classification purposes, and that strategic sentence selection can maintain, or even enhance, classification efficacy. Using BERT and MahaBERT-v2 as base models, the authors evaluate the approach on the MahaNews LDC dataset of Marathi news articles. Results indicate that the method loses only 0.33% in classification accuracy relative to the full-context baseline, while reducing input size by over 50% and inference latency by 43%.
The methodological framework involves several key strategies for sentence selection:
- Fixed-length Selection: Keeping a predefined number of top-ranked sentences (1 to 5 per document).
- Percentage-based Selection: Keeping a fixed percentage of top-ranked sentences, in 10% increments from 10% to 100%.
- Weighted Ranking: Combining normalized TF-IDF scores with normalized sentence length under different weighting factors.
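The three strategies above can be sketched in a few functions. This is an illustrative reconstruction, not the authors' code: the tokenization, the sentence-as-document IDF scheme, and the blending factor `alpha` are assumptions made for the sketch.

```python
import math
import re
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of its terms' TF-IDF weights,
    treating each sentence as a 'document' for IDF purposes."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(tokenized)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum((tf[w] / len(toks)) * math.log(n / df[w]) for w in tf) if toks else 0.0
        scores.append(score)
    return scores

def select_fixed(sentences, scores, k=3):
    """Fixed-length selection: keep the top-k sentences, in document order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

def select_percentage(sentences, scores, pct=0.5):
    """Percentage-based selection: keep the top pct of sentences."""
    k = max(1, round(len(sentences) * pct))
    return select_fixed(sentences, scores, k)

def weighted_scores(sentences, scores, alpha=0.7):
    """Weighted ranking: blend normalized TF-IDF with normalized length."""
    max_s = max(scores) or 1.0
    max_len = max(len(s.split()) for s in sentences) or 1
    return [alpha * (sc / max_s) + (1 - alpha) * (len(s.split()) / max_len)
            for s, sc in zip(sentences, scores)]
```

The selected sentences would then be concatenated, in original order, to form the shortened input passed to the BERT classifier.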
On the MahaNews dataset, these strategies clearly outperformed simpler baselines such as keeping only the first, last, or randomly chosen sentences. TF-IDF-based sentence ranking, especially when normalized and weighted for sentence length, provides a substantial advantage: scoring each sentence by its TF-IDF weight prioritizes the most pertinent parts of the document, preserving classification accuracy at a fraction of the input length.
A key quantitative result is that selecting roughly 40-50% of the top-ranked sentences achieves accuracy close to the full-context baseline, allowing a significant reduction in inference time with no tangible loss in classification fidelity. The percentage-based approach in particular improved markedly when TF-IDF scores were computed per sentence, striking a balance between comprehensive contextual coverage and computational efficiency.
The research has practical implications in NLP, particularly for applications that must process voluminous texts such as legal documents, newspapers, and academic papers. By significantly cutting processing requirements, the method paves the way for broader use of transformer models in real-world scenarios demanding fast yet reliable classification.
The paper's novelty lies in its pragmatic stance: optimizing input selection rather than altering model architectures, thereby avoiding the increased computational load and implementation complexity of architectural changes. This suggests room for further exploration of similar TF-IDF-based methods across other languages and document types, possibly in hybrids that combine data-centric and model-centric optimizations.
In conclusion, the research provides a compelling solution for long document classification challenges by adeptly applying a data-driven approach to optimize transformer model efficiency. This establishes a foundation for future developments in handling extensive text data in AI, reinforcing the scope of sentence ranking techniques in achieving scalable, reliable, and computationally efficient NLP models.