Remember This Event That Year? Assessing Temporal Information and Reasoning in Large Language Models
Abstract: LLMs are increasingly ubiquitous, yet their ability to retain and reason about temporal information remains limited, hindering their application in real-world scenarios where understanding the sequential nature of events is crucial. Our study experiments with 12 state-of-the-art models (ranging from 2B to 70B+ parameters) on a novel numerical-temporal dataset, TempUN, spanning 10,000 BCE to 2100 CE, to uncover significant limitations in temporal retention and comprehension. We propose six metrics for assessing three learning paradigms aimed at enhancing temporal knowledge acquisition. Our findings reveal that open-source models exhibit knowledge gaps more frequently, suggesting a trade-off between limited knowledge and incorrect responses. Additionally, various fine-tuning approaches significantly improved performance, reducing incorrect outputs and affecting how often models flag 'information not available' in their generations. The associated dataset and code are available at https://github.com/lingoiitgn/TempUN.
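To make the evaluation setup concrete, the sketch below shows one way a model's generations on a TempUN-style multiple-choice question could be scored into the three outcome buckets the abstract mentions (correct, incorrect, and 'information not available'). This is a minimal illustration, not the authors' released code: the field names (question, options, answer), the string-matching classifier, and the model_fn interface are all assumptions.

```python
# Minimal sketch (assumed, not the authors' evaluation harness):
# tally a model's answers on numerical-temporal MCQs into three
# buckets -- correct, incorrect, and "information not available".
from collections import Counter


def classify_response(response: str, gold: str) -> str:
    """Map a raw generation to one of three outcome buckets."""
    text = response.strip().lower()
    if "information not available" in text:
        return "not_available"
    return "correct" if gold.lower() in text else "incorrect"


def evaluate(model_fn, dataset):
    """model_fn: callable prompt -> generated text.
    dataset: list of dicts with assumed keys 'question', 'options', 'answer'.
    Returns the fraction of examples falling into each bucket."""
    counts = Counter()
    for ex in dataset:
        prompt = f"{ex['question']}\nOptions: {', '.join(ex['options'])}"
        counts[classify_response(model_fn(prompt), ex["answer"])] += 1
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}
```

For example, evaluate(lambda p: "1947", [{"question": "In which year did event X occur?", "options": ["1945", "1947"], "answer": "1947"}]) returns {"correct": 1.0}. The paper's six proposed metrics would be derived from richer statistics than this simple tally.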