
LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing

Published 27 Apr 2024 in cs.SE and cs.AI | (2404.18001v1)

Abstract: Logs are important in modern software development because they record runtime information. Log parsing, the first step in many log-based analyses, extracts structured information from unstructured log data. Traditional log parsers struggle to parse logs accurately because of the diversity of log formats, which directly impacts the performance of downstream log-analysis tasks. In this paper, we explore the potential of using LLMs for log parsing and propose LLMParser, a log parser based on generative LLMs and few-shot tuning. We use four LLMs in LLMParser: Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B. Our evaluation on 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (a 96% average parsing accuracy). We further conduct a comprehensive empirical analysis of the effect of training size, model size, and LLM pre-training on log parsing accuracy. We find that smaller LLMs may be more effective than more complex LLMs; for instance, Flan-T5-base achieves results comparable to LLaMA-7B with a shorter inference time. We also find that using LLMs pre-trained on logs from other systems does not always improve parsing accuracy: pre-trained Flan-T5-base shows an improvement in accuracy, whereas pre-trained LLaMA shows a decrease (almost 55% in group accuracy). In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations and future research directions of LLM-based log parsers.


Summary

  • The paper introduces LLMParser, a log parser built on generative LLMs and few-shot tuning, and evaluates it with Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B.
  • Across 16 open-source systems, it reaches a 96% average parsing accuracy, statistically significantly higher than state-of-the-art parsers.
  • The study highlights a trade-off between parsing flexibility and the computational resources LLMs require, and finds that a smaller model (Flan-T5-base) can match LLaMA-7B with a shorter inference time.

LLMParser: An Exploratory Study on Using LLMs for Log Parsing

Introduction

The paper "LLMParser: An Exploratory Study on Using LLMs for Log Parsing" investigates the potential of LLMs in parsing log data, which is an essential task in software engineering and system maintenance. Log parsing transforms unstructured log messages into structured data that can be leveraged for anomaly detection, performance monitoring, and root-cause analysis. Traditional log parsers rely heavily on rule-based approaches or generic pattern recognition, which can be inflexible and require extensive manual tuning.
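To make the task concrete, here is a minimal sketch of rule-based log parsing in Python. The masking rules, the `<*>` placeholder, and the HDFS-style sample message are illustrative assumptions rather than anything taken from the paper. The sketch shows how a hand-written parser splits a message into a template and its parameters, and why each new log format tends to need new rules.

```python
import re

# Hypothetical masking rules, listed in priority order. Each regex marks one
# kind of dynamic value; a real rule-based parser needs many more, tuned per system.
MASK_RULES = [
    r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b",  # IPv4 address with optional port
    r"\bblk_-?\d+\b",                          # HDFS-style block ID
    r"\b\d+\b",                                # plain integer
]

# One alternation, so the text is scanned once from left to right and
# earlier rules win when several could match at the same position.
PATTERN = re.compile("|".join(f"(?:{rule})" for rule in MASK_RULES))


def parse_log(message: str) -> tuple[str, list[str]]:
    """Return (template, parameters) for one raw log message."""
    parameters = PATTERN.findall(message)      # the dynamic values, in order
    template = PATTERN.sub("<*>", message)     # the static text with placeholders
    return template, parameters


if __name__ == "__main__":
    raw = "Received block blk_-1608999687919862906 of size 91178 from 10.250.19.102"
    template, parameters = parse_log(raw)
    print(template)
    print(parameters)
```

A message whose dynamic value matches none of the rules (a hostname, say) would be left in the template unmasked, which is the kind of manual tuning burden the paragraph above describes.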

Approach

The study introduces LLMParser, a log parser built on generative LLMs and few-shot tuning. The core hypothesis is that LLMs, which are trained to process natural language, can be adapted to parse logs by recognizing the fixed and variable parts of semi-structured log messages. The authors experiment with four LLMs (Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B), tuning each on a small number of labeled log samples so that it can handle diverse log formats and patterns.
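One way to picture the few-shot setup is as a small set of (prompt, target) pairs built from labeled logs, which a generative model is then tuned on. The sketch below only prepares such pairs. The instruction wording, the `TuningExample` type, and the sample logs are hypothetical and are not claimed to match the paper's actual prompt or data.

```python
from dataclasses import dataclass

# Hypothetical instruction text; the paper's real prompt may differ.
INSTRUCTION = "Parse the raw log to log template"


@dataclass(frozen=True)
class TuningExample:
    prompt: str   # what the model reads
    target: str   # the template the model should generate


def build_tuning_examples(labeled_logs: list[tuple[str, str]]) -> list[TuningExample]:
    """Turn (raw_log, template) pairs into prompt/target pairs for tuning."""
    return [
        TuningExample(prompt=f"{INSTRUCTION}: '{raw}'", target=template)
        for raw, template in labeled_logs
    ]


if __name__ == "__main__":
    # Illustrative labeled samples, not drawn from the paper's datasets.
    few_shot = [
        ("Connection from 10.0.0.7 closed", "Connection from <*> closed"),
        ("Deleted 42 temporary files", "Deleted <*> temporary files"),
    ]
    for example in build_tuning_examples(few_shot):
        print(example.prompt, "->", example.target)
```

The resulting pairs could be fed to any sequence-to-sequence or causal language model fine-tuning loop; that step is omitted here because it depends on the model and training library chosen.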

Experimental Setup and Results

The authors evaluate LLMParser on logs from 16 open-source systems, including LogHub datasets such as HDFS, BGL, and HPC, and compare it with existing log parsers like Drain, Spell, and LogSig. They report metrics including parsing accuracy, group accuracy, and inference time. LLMParser reaches a 96% average parsing accuracy, which the paper reports as statistically significantly higher than state-of-the-art parsers. The authors also analyze how training size, model size, and pre-training on logs from other systems affect parsing accuracy.
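The abstract refers to both parsing accuracy and group accuracy. A minimal sketch of how these two metrics are commonly defined in the log parsing literature follows; the exact definitions used in the paper may differ, and the function names and sample templates are assumptions for illustration. Parsing accuracy counts messages whose predicted template matches the ground truth exactly, while group accuracy counts messages whose predicted group contains exactly the same messages as their ground-truth group.

```python
from collections import defaultdict


def parsing_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Fraction of messages whose predicted template equals the true template."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def _group_by_template(templates: list[str]) -> dict[str, set[int]]:
    """Map each template to the set of message indices that carry it."""
    groups: dict[str, set[int]] = defaultdict(set)
    for index, template in enumerate(templates):
        groups[template].add(index)
    return groups


def group_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Fraction of messages placed in a group identical to their true group."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    true_groups = _group_by_template(truth)
    correct = 0
    for members in _group_by_template(predicted).values():
        # A predicted group counts only if it matches one true group exactly.
        any_member = next(iter(members))
        if true_groups[truth[any_member]] == members:
            correct += len(members)
    return correct / len(truth)


if __name__ == "__main__":
    truth = ["A <*>", "A <*>", "B <*>", "C"]
    predicted = ["A <*>", "A <*>", "B 5", "C"]
    print(parsing_accuracy(predicted, truth))
    print(group_accuracy(predicted, truth))
```

In the sample above the third template is wrong as text but still groups the message correctly, so the two metrics can disagree; this is why reporting both gives a fuller picture than either alone.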

Discussion

The study raises several practical considerations. LLMParser shows that LLMs can make log parsing more adaptable to new formats. However, the study acknowledges that LLMs require significant computational resources for both model adaptation and inference, so the flexibility gained over conventional parsers comes at a computational cost. Model size alone does not determine accuracy: Flan-T5-base achieves results comparable to LLaMA-7B with a shorter inference time, which suggests that smaller LLMs can make such techniques more accessible and less resource-intensive. Pre-training on logs from other systems is not a reliable gain either: it improves accuracy for Flan-T5-base but reduces group accuracy by almost 55% for LLaMA.

Implications and Future Work

The paper highlights significant implications for the intersection of NLP and software engineering. Integrating LLMs into log parsing tools can enhance autonomous log analysis systems, reduce manual intervention, and potentially improve anomaly detection frameworks. Moving forward, research could focus on real-time log parsing capabilities, refining LLM fine-tuning to reduce resource consumption, and extending this approach to more complex anomaly detection systems. Additionally, further exploration into hybrid models combining rule-based logic with LLM insights could offer balanced solutions in terms of accuracy and computational efficiency.
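As a rough illustration of the hybrid idea, the sketch below keeps a cache of known templates, matches incoming messages against it with regular expressions, and calls a model only on a cache miss. This design is an assumption made for illustration and is not described in the paper. `HybridParser`, `stub_model_parse`, and the `<*>` placeholder are hypothetical names, and the stub stands in for a real LLM call.

```python
import re
from typing import Callable


class HybridParser:
    """Rule-based template cache with a model fallback (illustrative only)."""

    def __init__(self, model_parse: Callable[[str], str]) -> None:
        self._model_parse = model_parse
        self._cache = []          # list of (compiled regex, template) pairs
        self.model_calls = 0      # how often the expensive path was taken

    @staticmethod
    def _template_to_regex(template: str) -> "re.Pattern":
        # Escape the static text and let each <*> match one whitespace-free token.
        static_parts = [re.escape(part) for part in template.split("<*>")]
        return re.compile(r"(\S+)".join(static_parts))

    def parse(self, message: str) -> str:
        # Cheap path: reuse a template that already matches the whole message.
        for pattern, template in self._cache:
            if pattern.fullmatch(message):
                return template
        # Expensive path: ask the model, then remember its answer.
        template = self._model_parse(message)
        self.model_calls += 1
        self._cache.append((self._template_to_regex(template), template))
        return template


def stub_model_parse(message: str) -> str:
    """Stand-in for an LLM call: masks every run of digits."""
    return re.sub(r"\d+", "<*>", message)


if __name__ == "__main__":
    parser = HybridParser(stub_model_parse)
    print(parser.parse("Connected to node 12"))
    print(parser.parse("Connected to node 99"))
    print(parser.model_calls)
```

Because repeated log formats are served from the cache, the model cost grows with the number of distinct templates rather than the number of messages, which is the kind of balance between accuracy and efficiency the paragraph above points to.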

Conclusion

"LLMParser: An Exploratory Study on Using LLMs for Log Parsing" demonstrates the potential of modern LLMs for parsing logs, reaching a 96% average parsing accuracy across 16 open-source systems and outperforming state-of-the-art parsers. The study provides a foundation for future research on adapting LLMs for log analysis, emphasizing a balance between computational requirements and parsing accuracy. By framing LLMs as core components of log parsing workflows, this research sets a direction for automated log analysis, with implications for both academia and industry.
