NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents
Abstract: While (large) language models have significantly improved over the last years, they still struggle to sensibly process long sequences found, e.g., in books, due to the quadratic scaling of the underlying attention mechanism. To address this, we propose NextLevelBERT, a Masked Language Model operating not on tokens, but on higher-level semantic representations in the form of text embeddings. We pretrain NextLevelBERT to predict the vector representation of entire masked text chunks and evaluate the effectiveness of the resulting document vectors on three types of tasks: 1) Semantic Textual Similarity via zero-shot document embeddings, 2) Long document classification, 3) Multiple-choice question answering. We find that next-level Masked Language Modeling is an effective technique to tackle long-document use cases and can outperform much larger embedding models as long as the required level of detail of semantic information is not too fine. Our models and code are publicly available online.
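To make the idea in the abstract concrete, below is a minimal PyTorch sketch of "next-level" masked language modeling over chunk embeddings: a long document is split into fixed-size chunks, each chunk is embedded by a frozen sentence encoder, a fraction of chunk vectors is replaced by a learned mask vector, and a small transformer is trained to reconstruct the masked chunk embeddings. All specifics here (384-dimensional embeddings, 15% masking rate, MSE reconstruction loss, model depth) are illustrative assumptions, not the authors' exact configuration; their released code documents the real implementation.

```python
# Minimal sketch of masked language modeling over chunk embeddings.
# Hyperparameters and the loss are assumptions for illustration only.
import torch
import torch.nn as nn

class NextLevelMLM(nn.Module):
    def __init__(self, emb_dim=384, n_layers=4, n_heads=6, max_chunks=512):
        super().__init__()
        self.mask_vec = nn.Parameter(torch.zeros(emb_dim))      # learned [MASK] chunk vector
        self.pos_emb = nn.Embedding(max_chunks, emb_dim)        # chunk-level positions
        layer = nn.TransformerEncoderLayer(
            emb_dim, n_heads, dim_feedforward=4 * emb_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(emb_dim, emb_dim)                  # predicts masked chunk vectors

    def forward(self, chunk_embs, mask):
        # chunk_embs: (batch, n_chunks, emb_dim), produced by a frozen sentence encoder
        # mask: (batch, n_chunks) bool, True where a chunk embedding is hidden
        x = torch.where(mask.unsqueeze(-1), self.mask_vec.expand_as(chunk_embs), chunk_embs)
        pos = torch.arange(chunk_embs.size(1), device=chunk_embs.device)
        x = x + self.pos_emb(pos)
        return self.head(self.encoder(x))

# Toy training step. In practice chunk_embs would come from embedding fixed-size
# text chunks of a long document with a frozen encoder (e.g., a MiniLM sentence embedder).
model = NextLevelMLM()
chunk_embs = torch.randn(2, 128, 384)                            # 2 documents, 128 chunks each
mask = torch.rand(2, 128) < 0.15                                 # mask ~15% of chunks
pred = model(chunk_embs, mask)
loss = nn.functional.mse_loss(pred[mask], chunk_embs[mask])      # reconstruct masked chunk vectors
loss.backward()
```

At evaluation time, a document vector can be obtained by pooling the model's output chunk representations (mean pooling is one plausible choice), which is how the zero-shot similarity and classification settings described in the abstract would be served.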