LLMHoney: A Real-Time SSH Honeypot with Large Language Model-Driven Dynamic Response Generation

Published 1 Sep 2025 in cs.CR | (2509.01463v1)

Abstract: Cybersecurity honeypots are deception tools for engaging attackers and gathering intelligence, but traditional low- or medium-interaction honeypots often rely on static, pre-scripted interactions that can be easily identified by skilled adversaries. This report presents LLMHoney, an SSH honeypot that leverages LLMs to generate realistic, dynamic command outputs in real time. LLMHoney integrates a dictionary-based virtual file system to handle common commands with low latency while using LLMs for novel inputs, achieving a balance between authenticity and performance. We implemented LLMHoney using open-source LLMs and evaluated it on a testbed with 138 representative Linux commands. We report comprehensive metrics including accuracy (exact-match, Cosine Similarity, Jaro-Winkler Similarity, Levenshtein Similarity and BLEU score), response latency and memory overhead. We evaluate LLMHoney using multiple LLM backends ranging from 0.36B to 3.8B parameters, including both open-source models and a proprietary model (Gemini). Our experiments compare 13 different LLM variants; results show that Gemini-2.0 and the moderately sized models Qwen2.5:1.5B and Phi3:3.8B provide the most reliable and accurate responses, with mean latencies around 3 seconds, whereas smaller models often produce incorrect or out-of-character outputs. We also discuss how LLM integration improves honeypot realism and adaptability compared to traditional honeypots, as well as challenges such as occasional hallucinated outputs and increased resource usage. Our findings demonstrate that LLM-driven honeypots are a promising approach to enhance attacker engagement and collect richer threat intelligence.


Summary

The paper introduces an SSH honeypot that uses LLM-driven dynamic response generation to return realistic command outputs to attackers.
The paper benchmarks 13 LLM variants using metrics like cosine similarity and BLEU score to evaluate trade-offs in accuracy, latency, and memory overhead.
The findings highlight that LLM-based dynamic honeypots can capture richer threat intelligence while facing challenges such as computational overhead and output hallucination.


Introduction

The paper introduces LLMHoney, an SSH honeypot that uses LLMs to generate dynamic, contextually appropriate responses to attacker commands. Traditional low- and medium-interaction honeypots often rely on static, pre-scripted responses that skilled adversaries can easily identify, which limits their value against sophisticated attackers. By placing an LLM in the response path, LLMHoney aims to offer a high-interaction experience without the risks of exposing a real operating system. It adapts to unexpected commands while keeping the simulated system consistent through a stateful, dictionary-based virtual filesystem.

System Architecture

LLMHoney's architecture comprises several components that together accept SSH connections, process attacker commands, and record session data.

Figure 1: High-level architecture of LLMHoney.

The core components include configuration settings, a network listener to manage SSH connections, an authentication manager for handling user credentials, a session handler integrated with an LLM engine for dynamic command processing, and logging mechanisms for session data storage. Common commands are answered from the dictionary-based virtual filesystem with low latency, while novel inputs are passed to the LLM. Placing the LLM in the command-processing loop enables real-time response generation for commands that a traditional, script-based honeypot would not recognize.
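The split between the virtual filesystem and the LLM can be illustrated with a minimal sketch. This is not the authors' implementation: the class `HoneypotSession`, the function `stub_llm`, the prompt text, and the tiny directory table are all hypothetical, and only absolute paths are handled to keep the example short.

```python
from typing import Callable, Dict, List, Tuple


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    command = prompt.rsplit("$ ", 1)[-1]
    return f"bash: {command.split()[0]}: command not found"


class HoneypotSession:
    """Answers common commands from a dictionary; sends the rest to an LLM."""

    def __init__(self, llm: Callable[[str], str] = stub_llm) -> None:
        self.llm = llm
        self.cwd = "/root"
        # Hypothetical dictionary-based virtual filesystem: directory -> entries.
        self.files: Dict[str, List[str]] = {
            "/root": ["notes.txt"],
            "/etc": ["hostname", "passwd"],
        }
        # Session log of (command, source, output) tuples.
        self.log: List[Tuple[str, str, str]] = []

    def handle(self, command: str) -> str:
        parts = command.split()
        source = "static"
        if not parts:
            output = ""
        elif parts[0] == "pwd":
            output = self.cwd
        elif parts[0] == "cd":
            target = parts[1] if len(parts) > 1 else "/root"
            if target in self.files:
                self.cwd = target  # state persists across commands
                output = ""
            else:
                output = f"bash: cd: {target}: No such file or directory"
        elif parts[0] == "ls":
            target = parts[1] if len(parts) > 1 else self.cwd
            if target in self.files:
                output = "  ".join(sorted(self.files[target]))
            else:
                output = f"ls: cannot access '{target}': No such file or directory"
        else:
            # Novel input: fall back to the LLM, passing session state as context.
            prompt = f"You are a Linux server. cwd={self.cwd}\n$ {command}"
            output = self.llm(prompt)
            source = "llm"
        self.log.append((command, source, output))
        return output


session = HoneypotSession()
session.handle("cd /etc")
print(session.handle("ls"))
print(session.handle("nmap -sV 10.0.0.1"))
```

Keeping the working directory and file table in the session object is what lets the static path and the LLM path agree on system state; a real deployment would swap `stub_llm` for an actual model client.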

Evaluation Methodology

The paper outlines a comprehensive evaluation of LLMHoney, detailing the methodology used to assess performance across various metrics including accuracy, latency, and memory overhead.

Figure 2: Average similarity metrics for each model: Cosine Similarity, Jaro-Winkler, Token Accuracy, and BLEU Score.

Using a testbed of 138 Linux commands, LLMHoney was benchmarked with 13 LLM variants ranging from 0.36B to 3.8B parameters. The evaluation pipeline compared each generated output against a ground-truth output using exact string match, token-level accuracy, cosine similarity, Jaro-Winkler similarity, Levenshtein similarity, and BLEU score. It also recorded round-trip response times and memory usage for each model to expose the trade-off between command realism and computational cost.
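To make the string-comparison metrics concrete, here is a minimal sketch of three of them. The summary does not give the paper's exact definitions, so the choices below are assumptions: exact match ignores surrounding whitespace, cosine similarity uses whitespace-token counts, and Levenshtein similarity is one minus the edit distance divided by the longer string's length. Jaro-Winkler and BLEU are left out for brevity.

```python
import math
from collections import Counter


def exact_match(generated: str, expected: str) -> bool:
    """True when the outputs are identical apart from surrounding whitespace."""
    return generated.strip() == expected.strip()


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between whitespace-token count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


expected = "uid=0(root) gid=0(root) groups=0(root)"
generated = "uid=0(root) gid=0(root) groups=0(root),27(sudo)"
print(exact_match(generated, expected))
print(round(levenshtein_similarity(generated, expected), 3))
print(round(cosine_similarity(generated, expected), 3))
```

Reporting several metrics together is useful because they fail differently: exact match penalises a single extra character, while token-level cosine similarity ignores token order entirely.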

Figure 3: Mean response latency (red bars, left axis) and absolute memory overhead (blue line, right axis) per model.
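Mean response latency of the kind plotted here could be collected with a small timing harness. The sketch below is an assumption about how such a measurement might look, not the paper's tooling: `measure_latency` and its return keys are hypothetical names, and memory overhead (which would require process-level monitoring of the model runtime) is not covered.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def measure_latency(respond: Callable[[str], str],
                    commands: Iterable[str]) -> Dict[str, float]:
    """Time each call to `respond` and summarise the round-trip durations."""
    samples = []
    for command in commands:
        start = time.perf_counter()  # monotonic, high-resolution clock
        respond(command)
        samples.append(time.perf_counter() - start)
    if not samples:
        raise ValueError("at least one command is required")
    return {
        "n": float(len(samples)),
        "mean_s": statistics.mean(samples),
        "max_s": max(samples),
    }


# Usage with a trivial stand-in for the honeypot's response function.
stats = measure_latency(lambda cmd: cmd.upper(), ["ls", "pwd", "whoami"])
print(stats["n"], stats["mean_s"] <= stats["max_s"])
```

In practice `respond` would wrap the full path from received command to returned output, so that the measurement includes model inference time rather than only local processing.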

Results and Analysis

The comparative analysis reveals significant variance in model performance, highlighting the trade-offs between response accuracy and resource consumption.

Figure 4: Trade-off between average response time and Cosine Similarity.

Gemini-2.0, Qwen2.5:1.5B, and Phi3:3.8B stood out for reliable, accurate responses, with mean latencies of around 3 seconds and moderate memory overhead. Smaller models responded faster but more often produced incorrect or out-of-character outputs. Overall, the findings indicate that LLM-driven honeypots can engage attackers with more dynamic and believable system interactions than static, script-based honeypots.

Practical and Theoretical Implications

LLMHoney represents a promising advancement in honeypot technology, offering a more adaptive solution for cybersecurity defense. Its ability to generate contextually appropriate outputs enhances the realism of attacker engagements, potentially leading to richer threat intelligence. However, the paper also notes the challenges associated with increased computational demands and the risks of output hallucination, which could undermine the deception if not carefully managed.

Future Directions

The paper concludes with a discussion on future research avenues, including optimizing LLM-driven honeypots through techniques like automated hallucination detection and expanding the virtual filesystem for deeper simulation. The integration of advanced adaptive strategies to counteract sophisticated attacker detection methods is also suggested. These developments are expected to improve system efficiency and effectiveness, making AI-driven honeypots viable for broader deployment in cybersecurity frameworks.

Conclusion

LLMHoney illustrates the potential of leveraging AI and LLMs within honeypot systems to create more engaging and realistic decoys that gather valuable intelligence from adversarial interactions. The findings highlight the benefits of dynamic response generation balanced with the computational and consistency challenges inherent in such a system. As AI technologies continue to evolve, LLMHoney represents an important step towards more autonomous and intelligent cybersecurity defenses.
