LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History
Abstract: With the recent emergence of powerful instruction-tuned LLMs, helpful conversational AI systems have been deployed across many applications. When prompted by users, these systems successfully perform a wide range of tasks within a conversation. To provide memory and context, such systems typically condition their output on the entire conversational history. Although this sensitivity to conversational history can often improve performance on subsequent tasks, we find that performance can in fact also be negatively impacted when there is a task-switch. To the best of our knowledge, our work is the first to formalize the study of such vulnerabilities and task interference in conversational LLMs caused by task-switches in the conversational history. Our experiments across 5 datasets with 15 task-switches using popular LLMs reveal that many task-switches can lead to significant performance degradation.
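The evaluation setup the abstract describes — measuring how a conversational history built from one task affects performance on a subsequent, different task — can be sketched as follows. This is an illustrative sketch only: `query_llm` is a hypothetical stand-in for any chat-completion API, and the example tasks and data are placeholders, not the paper's actual benchmarks.

```python
# Sketch of a task-switch interference measurement (assumptions: query_llm
# is any callable taking a list of {"role", "content"} messages and
# returning a string answer; tasks and data here are illustrative).

def build_history(task_a_examples):
    """Format completed task-A turns as a conversational history."""
    history = []
    for question, answer in task_a_examples:
        history.append({"role": "user", "content": question})
        history.append({"role": "assistant", "content": answer})
    return history

def evaluate_with_switch(query_llm, task_a_examples, task_b_eval):
    """Compare task-B accuracy with and without a preceding task-A history."""
    history = build_history(task_a_examples)
    scores = {"baseline": 0, "after_switch": 0}
    for prompt, gold in task_b_eval:
        turn = [{"role": "user", "content": prompt}]
        if query_llm(turn) == gold:            # no prior history
            scores["baseline"] += 1
        if query_llm(history + turn) == gold:  # after a task-switch
            scores["after_switch"] += 1
    n = len(task_b_eval)
    return {k: v / n for k, v in scores.items()}
```

The gap between the `baseline` and `after_switch` accuracies quantifies the interference caused by the task-switch; repeating this over many task pairs yields the kind of grid of task-switch results the paper reports.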