Switchable Decision: Dynamic Neural Generation Networks
Abstract: Auto-regressive generation models achieve competitive performance across many NLP tasks such as summarization, question answering, and classification. However, they are also known for slow inference, which makes them challenging to deploy in real-time applications. We propose a switchable decision mechanism that accelerates inference by dynamically assigning computation resources to each data instance. By automatically deciding where to skip computation and using constrained optimization to balance quality against computation cost, our dynamic neural generation networks enforce an efficient inference path and find an optimized trade-off. Experiments on question answering, summarization, and classification benchmarks show that our method reduces computation cost during inference while maintaining accuracy. Extensive experiments and ablation studies demonstrate that our method is general, effective, and beneficial for many NLP tasks.
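To make the skipping idea concrete, here is a minimal PyTorch sketch of per-instance layer skipping with a compute-cost penalty, a simple unconstrained relaxation of the constrained quality/computation trade-off described above. All names (`SkippableLayer`, `gate`, `cost_weight`) are illustrative assumptions and do not reflect the paper's actual implementation.

```python
# Minimal sketch: a gate decides per instance whether to execute or skip a layer,
# and training adds a penalty on expected computation. Names are hypothetical.
import torch
import torch.nn as nn


class SkippableLayer(nn.Module):
    """Wraps a transformer layer with a per-instance binary skip decision."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Tiny gate: pooled hidden state -> probability of executing the layer.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor):
        p_exec = self.gate(x.mean(dim=1))              # (batch, 1) execution probability
        # Hard 0/1 decision with a straight-through estimator so gradients flow.
        hard = (p_exec > 0.5).float()
        decision = hard + p_exec - p_exec.detach()
        out = self.layer(x)
        x = decision.unsqueeze(-1) * out + (1 - decision).unsqueeze(-1) * x
        return x, p_exec                               # p_exec approximates expected cost


if __name__ == "__main__":
    d_model, n_layers = 64, 4
    blocks = nn.ModuleList(SkippableLayer(d_model) for _ in range(n_layers))
    head = nn.Linear(d_model, 2)
    x = torch.randn(8, 16, d_model)                    # (batch, seq_len, d_model)
    labels = torch.randint(0, 2, (8,))

    costs = []
    for block in blocks:
        x, p_exec = block(x)
        costs.append(p_exec)
    logits = head(x.mean(dim=1))

    task_loss = nn.functional.cross_entropy(logits, labels)
    compute_cost = torch.cat(costs, dim=1).mean()      # fraction of layers executed
    cost_weight = 0.1                                  # assumed penalty coefficient
    loss = task_loss + cost_weight * compute_cost      # quality vs. computation trade-off
    loss.backward()
```

In this relaxation, the penalty coefficient plays the role of the multiplier in a constrained formulation: raising it pushes the gates toward skipping more layers, trading accuracy for lower inference cost.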