- The paper introduces a BERT-based model that sets new benchmarks on both multi-label and single-label document classification tasks.
- It applies knowledge distillation to transfer BERT's capabilities to a simpler BiLSTM model, yielding a student with up to 30x fewer parameters and 40x faster inference.
- Experimental results on Reuters-21578, AAPD, IMDB, and Yelp 2014 confirm BERT's superiority over prior neural baselines such as Hierarchical Attention Networks and XML-CNN.
Analysis of "DocBERT: BERT for Document Classification"
The paper "DocBERT: BERT for Document Classification" by Ashutosh Adhikari et al. explores the applicability of BERT, a prominent pre-trained language representation model, to the task of document classification. The study addresses challenges such as longer input lengths and the multi-label nature of several datasets, which might be perceived as barriers to fine-tuning BERT. Despite these hurdles, the paper demonstrates that fine-tuned BERT models surpass existing baselines on multiple datasets.
Core Contributions
The authors make a dual contribution with their research:
- Implementation of BERT for Document Classification: They present a model that leverages BERT for the task at hand, setting new performance benchmarks on four widely-used datasets: Reuters-21578, AAPD, IMDB, and Yelp 2014. Fine-tuning BERT in this context showcases its efficacy despite the perceived complexities and constraints associated with document classification.
- Knowledge Distillation Mechanism: To mitigate the computational overhead inherent in using large models like BERT, the authors apply knowledge distillation to transfer knowledge from the fine-tuned BERT teacher into a simpler BiLSTM student. This distilled model, labeled KD-reg, yields performance comparable to the BERT base variant with significantly fewer parameters.
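The distillation step above can be sketched with a common logit-matching objective: train the student so its raw logits match the teacher's, typically via mean squared error. The function below is a minimal plain-Python illustration, not the paper's exact implementation; the name and the list-of-floats interface are hypothetical.

```python
def mse_logit_distillation_loss(teacher_logits, student_logits):
    # Mean squared error between the teacher's and the student's raw
    # (pre-sigmoid/softmax) logits for one example; minimizing this
    # pushes the student to reproduce the teacher's outputs.
    assert len(teacher_logits) == len(student_logits)
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n
```

In practice this term is usually combined with the ordinary supervised loss on the gold labels, weighted by a tunable coefficient.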
Experimental Insights
The study examines various established document classification approaches, comparing them against BERT fine-tuned models. BERT, notably the large variant, achieves superior F-scores and accuracy across the datasets, outperforming sophisticated neural architectures such as Hierarchical Attention Networks, XML-CNN, and the simpler logistic regression and SVM baselines.
Dataset Characteristics and Model Performance:
- Reuters-21578 and AAPD: These datasets present a multi-label classification challenge where fine-tuned BERT models achieved notably higher F-scores compared to previous methodologies. The KD-reg model reached near parity with the base BERT model on these datasets, underscoring the success of the distillation process.
- IMDB and Yelp 2014: The single-label classification tasks on these datasets further validate BERT's capability, with the large variant markedly improving classification accuracy. The KD-reg model, while trailing BERT large, significantly narrows the performance gap relative to other models, reflecting how effectively it mimics its BERT-base teacher.
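The F-scores reported for the multi-label datasets are conventionally micro-averaged: true/false positives and false negatives are pooled across all documents before computing a single precision, recall, and F1. A minimal sketch (the function name and set-of-labels interface are illustrative, not taken from the paper):

```python
def micro_f1(true_label_sets, pred_label_sets):
    # Pool counts over all documents, then compute one F1 from the totals.
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(true_label_sets, pred_label_sets):
        t, p = set(true_labels), set(pred_labels)
        tp += len(t & p)   # labels correctly predicted
        fp += len(p - t)   # labels predicted but not gold
        fn += len(t - p)   # gold labels missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Micro-averaging weights frequent labels more heavily than macro-averaging, which suits skewed label distributions like Reuters-21578.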
Computational Efficiency and Implications
The paper candidly addresses the notable computational costs associated with fine-tuning BERT. Leveraging knowledge distillation, the authors reduce the parameter count roughly 30-fold and speed up inference roughly 40-fold with KD-reg, making high-performing models more accessible and feasible for deployment in resource-constrained environments.
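As a back-of-envelope check on what a 30x reduction means: BERT-base is widely cited at roughly 110M parameters, so the student would land in the low millions. The figures and function below are illustrative estimates, not numbers from the paper.

```python
def distilled_param_count(teacher_params, compression_factor):
    # Rough estimate: a student `compression_factor` times smaller
    # than its teacher.
    return teacher_params / compression_factor

BERT_BASE_PARAMS = 110_000_000  # widely cited approximate size of BERT-base
student_params = distilled_param_count(BERT_BASE_PARAMS, 30)  # ~3.7M parameters
```

A model of that size fits comfortably on commodity CPUs, which is the practical point of the 30x/40x figures.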
Future Directions
Potential research trajectories include extending knowledge distillation to diverse neural architectures beyond BiLSTM and innovating compression techniques specific to transformer-based models. Such advancements could further enhance model efficiency without compromising performance, paving the way for broader application of pre-trained models in computationally intensive environments.
In conclusion, this paper contributes significantly to the field by not only achieving high-performance document classification with BERT but also presenting a viable approach to reduce computational demands through distilled models. These findings suggest promising pathways for future work in optimizing deep learning models for practical, large-scale NLP tasks.