Holmes: LLM-based DDoS Detective
- Holmes (DDoS Detective) is an LLM-based agent that transforms cloud telemetry into compact, auditable evidence reports for transparent DDoS investigation.
- It employs a hierarchical workflow combining continuous monitoring, sFlow-based triage, and on-demand PCAP collection to optimize cost and latency.
- Its structured evidence abstraction and protocol-driven reasoning yield high attribution accuracy (86%) and complete audit trails for incident analysis.
Holmes (DDoS Detective) is an LLM-based agent designed for evidence-grounded, auditable Distributed Denial-of-Service (DDoS) investigation in cloud environments. Unlike conventional rule-based or supervised-learning classifiers, Holmes reframes DDoS defense as an evidence-driven investigative process, where the model assumes the role of a virtual Site Reliability Engineering (SRE) investigator. It integrates a hierarchical detection workflow, semantic evidence abstraction, and strict protocol-driven reasoning to generate machine-consumable, auditable incident reports, effectively bridging the gap between wire-speed monitoring and high-fidelity root-cause attribution (Chen et al., 21 Jan 2026).
1. Design Principles and Motivation
Holmes addresses operational challenges in cloud-native environments posed by fast-evolving, multi-vector DDoS attacks that target large, centralized resource pools and exploit extensive attack surfaces. Traditional rule-based defenses achieve wire-speed detection but lack transparency and root-cause traceability. Supervised ML/DL techniques often act as opaque black-boxes susceptible to misclassifying zero-day attacks and require large labeled datasets that are rarely available at cloud scale.
Key system goals include:
- Wire-Speed Telemetry, On-Demand Reasoning: Separating inexpensive, continuous monitoring from selective, cost-intensive LLM-based analysis.
- Semantic Evidence Abstraction: Transforming binary packet data into compact, structured Evidence Packs that capture high-signal, interpretable fingerprints.
- Evidence-Grounded Reasoning: Institutionalizing a structure-first investigative protocol and a "Quote Rule," ensuring every analytical claim directly references precise substrings from the original Evidence Pack.
2. Hierarchical Workflow Architecture
Holmes's operational architecture is structured as a funnel-like pipeline consisting of three layers:
- L1 – Continuous Telemetry: Collects interface-level counters (bytes/s, packets/s, queue drops) at fixed intervals (Δt, e.g., 1 s). Anomalies are flagged when a metric such as bytes/s or packets/s exceeds its adaptive threshold (θ_bytes or θ_pkts).
- L2 – Lightweight Triage: Upon L1 anomaly, samples sFlow at high rates (e.g., 1:10,000). Determines dominant Layer 4 protocol and identifies the likely victim IP. Guides the workflow to appropriate evidence branches (UDP or TCP) without invoking the LLM.
- L3 – On-Demand Investigation: Triggers budgeted PCAP collection (e.g., a 20s window). The Evidence Pack is extracted and the LLM is invoked under prompt contracts for structured analysis. Cooldown and deduplication logic suppress repeat investigations on the same event window.
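The L1 trigger can be sketched as an adaptive threshold over a running estimate of each counter. The paper does not specify the estimator, so the EWMA mean/variance and the k·σ margin below are illustrative assumptions:

```python
class L1Trigger:
    """EWMA-based adaptive anomaly trigger for one interface counter.

    The paper specifies adaptive thresholds (theta_bytes, theta_pkts) but
    not the estimator; the EWMA and the k-sigma margin are assumptions.
    """

    def __init__(self, alpha: float = 0.1, k: float = 4.0):
        self.alpha, self.k = alpha, k
        self.mean = None   # EWMA of the counter (e.g., bytes/s)
        self.var = 0.0     # EWMA of the squared deviation

    def update(self, x: float) -> bool:
        """Feed one Δt sample; return True if it breaches the threshold."""
        if self.mean is None:          # bootstrap on the first sample
            self.mean = x
            return False
        threshold = self.mean + self.k * (self.var ** 0.5)
        anomalous = self.var > 0 and x > threshold
        # Update the baseline only on benign samples, so an ongoing
        # attack does not inflate its own threshold.
        if not anomalous:
            d = x - self.mean
            self.mean += self.alpha * d
            self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return anomalous
```

One trigger instance would be kept per monitored counter, with a breach on any of them escalating the window to L2 triage.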
A high-level control flow is as follows:
```
for each Δt window:
    if L1.trigger(Δt):
        S ← sFlow.sample(window=Δt)
        (proto*, victim) ← triage(S)
        PCAP ← collect_pcap(window=Δt)
        E ← extract_evidence(PCAP, proto*)
        report ← LLM_investigate(E, proto*, victim)
        log(report, E)
```
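The cooldown and deduplication step can be made concrete with a small gate in front of the L3 stage. The paper only states that such logic exists; the (victim, proto) dedup key and the cooldown length here are assumptions:

```python
import time

class InvestigationGate:
    """Suppress repeat L3 investigations of the same event window.

    The 60 s cooldown and the (victim, proto) deduplication key are
    illustrative; the paper does not specify them.
    """

    def __init__(self, cooldown_s: float = 60.0):
        self.cooldown_s = cooldown_s
        self._last = {}  # (victim, proto) -> timestamp of last investigation

    def allow(self, victim: str, proto: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        key = (victim, proto)
        last = self._last.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False        # still cooling down: skip PCAP + LLM
        self._last[key] = now
        return True
```

In the pipeline above, `LLM_investigate` would run only when `gate.allow(victim, proto)` returns True.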
3. Evidence Pack Abstraction and Statistical Summaries
The Evidence Pack is a compact, structured JSON representation capturing salient features of the incident investigation window. Its schema includes:
- `incident_window`: time interval of the capture
- `proto`: `"UDP"` or `"TCP"`
- `victim`: victim IP address
- `primary_samples`: list of packet summaries, each reporting `length`, `printable_ratio`, `entropy`, `ascii_excerpt`, and `hexdump`
- `flag_stats` (TCP only): e.g., `syn_only_ratio`, `ack_only_ratio`
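A minimal Evidence Pack instance following this schema might look as follows; the field values are invented for illustration and are not from the paper's dataset:

```python
# Illustrative Evidence Pack; all values are invented for demonstration.
evidence_pack = {
    "incident_window": {"start": "2026-01-21T10:00:00Z",
                        "end": "2026-01-21T10:00:20Z"},
    "proto": "UDP",
    "victim": "203.0.113.7",
    "primary_samples": [
        {
            "length": 3020,
            "printable_ratio": 0.41,
            "entropy": 5.8,
            "ascii_excerpt": "...supportedLDAPVersion...",
            "hexdump": "3082 0bc4 0201 01 ...",
        }
    ],
    # "flag_stats" would be present only when proto == "TCP"
}
```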
Relevant computed statistics:
- Printable-Byte Ratio: the fraction of printable ASCII bytes in an $n$-byte payload, $r_{\text{print}} = \frac{|\{i : b_i \text{ printable}\}|}{n}$
- Shannon Entropy: over the empirical byte-value distribution $p_v$, $H = -\sum_{v=0}^{255} p_v \log_2 p_v$
- TCP-Flag Ratios: e.g., $\text{syn\_only\_ratio} = \frac{\#\{\text{SYN-only packets}\}}{\#\{\text{TCP packets}\}}$
This abstraction enables reproducible chains of evidence, offering both human-readability and strict anchorability for post hoc audit.
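The printable-byte ratio and Shannon entropy are standard payload statistics; a direct implementation (independent of the paper's code) is:

```python
import math
import string

# Printable ASCII bytes, including whitespace, per Python's string.printable.
PRINTABLE = set(string.printable.encode("ascii"))

def printable_ratio(payload: bytes) -> float:
    """Fraction of payload bytes that are printable ASCII."""
    if not payload:
        return 0.0
    return sum(b in PRINTABLE for b in payload) / len(payload)

def shannon_entropy(payload: bytes) -> float:
    """Byte-level Shannon entropy in bits (0 for constant, 8 for uniform)."""
    if not payload:
        return 0.0
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    n = len(payload)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)
```

High entropy with a low printable ratio suggests `random_noise`; low entropy with a high printable ratio suggests text-like styles such as `http_like` or `kv_semicolon_list`.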
4. Investigation Protocol and Output Constraints
Holmes enforces a structure-first, strictly-auditable investigation protocol implemented through its prompt contract:
- Payload Style Inference: Classifies primary samples as one of six canonical forms: `random_noise`, `http_like`, `asn1_oid_like`, `kv_semicolon_list`, `text_banner_like`, `mixed_unclear`.
- Attack Family Categorization: Selects among {Reflection/Amplification, Direct Flood, Mixed, Unknown}.
- Attack Type Determination: If `proto=TCP`, restricted to {SYN Flood, ACK Flood, HTTP/2 Rapid Reset, Unknown}; if `proto=UDP`, only reflection types, and only when evidenced (e.g., ASN.1/OID anchors denote LDAP Reflection).
All analytic claims must cite direct substrings (the "Quote Rule") from the Evidence Pack. No port-based assumptions are permitted. Output is a strict JSON including fields: verdict, attack_family, attack_type, analysis_trace, key_evidence, reasoning, recommended_actions, and confidence.
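The Quote Rule lends itself to mechanical verification: every `key_evidence` entry must be a literal substring of the serialized Evidence Pack. A sketch of such a check (not the paper's implementation):

```python
import json

def check_quote_rule(report: dict, evidence_pack: dict) -> list:
    """Return the key_evidence entries that are NOT literal substrings
    of the serialized Evidence Pack, i.e., Quote Rule violations."""
    haystack = json.dumps(evidence_pack, ensure_ascii=False)
    violations = []
    for quote in report.get("key_evidence", []):
        # Tolerate the backtick quoting used in the report format.
        if quote.strip("`") not in haystack:
            violations.append(quote)
    return violations
```

A non-empty return value would flag the report as ungrounded before it enters the audit log.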
Sample output:
```json
{
  "verdict": "Confirmed DDoS – LDAP Reflection",
  "attack_family": "Reflection/Amplification",
  "attack_type": "LDAP/CLDAP Reflection",
  "analysis_trace": {
    "payload_style": "asn1_oid_like",
    "samples_checked": 3,
    "decision": "ASN.1/OID patterns correspond to LDAP"
  },
  "key_evidence": [
    "`1.2.840.113556`",
    "`supportedLDAPVersion`"
  ],
  "reasoning": "Evidence Pack shows OID-like strings (`1.2.840.113556`) …",
  "recommended_actions": [
    "Block CLDAP (UDP/389) to victim",
    "Alert security team",
    "Enable deeper packet capture for post-mortem"
  ],
  "confidence": 0.95
}
```
5. LLM Integration, Prompt Contract, and Cost Analysis
The Holmes LLM backend (OpenPangu-7B via OpenAI-compatible API) operates in a contract-constrained environment:
- Prompt Engineering: Temperature set to zero for deterministic outputs, disabling chain-of-thought, providing a fixed outline (Route-1) that the LLM must fill.
- Context Window: Contains Evidence Pack (≤2kB), prompt contract, and up to three precedent examples, capped at ≤8k tokens.
- Cost and Latency Optimization: LLM inference is invoked only on anomaly-triggered windows (~5–10%), with evidence extraction capped at 20 packets and 50 hexdump lines (<100ms extraction). LLM reasoning latency averages 500ms, with end-to-end incident analysis ≈650ms, aligning with CSP SLOs for near real-time investigation.
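The published evidence caps (≤20 packets, ≤50 hexdump lines, ≤2 kB pack) can be enforced before prompting. The limits below come from the paper, but the trimming order (drop trailing samples first) is an assumption:

```python
import json

def trim_evidence_pack(pack: dict,
                       max_samples: int = 20,
                       max_hexdump_lines: int = 50,
                       max_bytes: int = 2048) -> dict:
    """Enforce the Evidence Pack budgets before the LLM call.

    Note: mutates the sample dicts in place for brevity.
    """
    samples = pack.get("primary_samples", [])[:max_samples]
    for s in samples:
        lines = s.get("hexdump", "").splitlines()
        s["hexdump"] = "\n".join(lines[:max_hexdump_lines])
    trimmed = {**pack, "primary_samples": samples}
    # If still over the 2 kB budget, drop samples from the tail until it fits.
    while samples and len(json.dumps(trimmed).encode()) > max_bytes:
        samples = samples[:-1]
        trimmed["primary_samples"] = samples
    return trimmed
```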
6. Empirical Evaluation and Comparative Analysis
Holmes's performance has been evaluated using the CICDDoS2019 reflection/amplification datasets (DNS, NetBIOS, SNMP, LDAP/CLDAP, MSSQL, SSDP) and synthetic UDP/SYN/ACK flooding scenarios. Experiments utilized interface counters, sFlow (1:10,000), and 20s PCAP slices within a replayed environment.
| Metric | Holmes | Rule-Based Baseline | Supervised RF |
|---|---|---|---|
| Attribution Accuracy | 0.86 | 0.75 | 0.80 |
| False Positive Rate | 0.05 | 0.12 | 0.10 |
| False Negative Rate | 0.08 | 0.10 | 0.07 |
| Avg. Latency (ms) | 650 | 50 | 80 |
| Audit-Log Completeness | 100% | N/A | 0% |
Holmes demonstrates 86% end-to-end attribution accuracy across 100 replayed incidents, outperforming both signature rule sets (75%) and a supervised Random Forest (80%) on the same data. False positives (5%) and false negatives (8%) remain low, and every incident report is accompanied by a full audit trail including the Evidence Pack and strict JSON output.
Rule-based approaches are brittle to new variants and offer no evidence trail. Supervised random forests provide no audit chains beyond feature importance scores. Holmes's combination of structured evidence, protocol-constrained LLM inference, and comprehensive audit logging contributes to its improved attribution and transparency.
7. Discussion, Limitations, and Future Directions
Holmes exposes traceable failure modes. In mixed-signal (hybrid) attacks (e.g., TCP SYN floods carrying HTTP-like fragments), the LLM may overweight application-layer cues and misclassify the incident. Because the chain of evidence and rationale is logged for every error, however, operators can rapidly distinguish failures rooted in inadequate observability (e.g., incomplete L2 triage) from reasoning errors in the LLM component.
Identified future directions include:
- Adaptive Threshold Learning: Continually adjusting the L1 trigger thresholds (θ_bytes, θ_pkts) in response to operator feedback to reduce false or noisy triggers.
- Multi-Model Ensemble Voting: Integrating multiple LLMs under a common contract to improve robustness and reduce reasoning bias.
- Real-Time Mitigation Integration: Feeding JSON verdicts from Holmes directly into downstream firewalls or scrubbing systems to automate DDoS countermeasures (Chen et al., 21 Jan 2026).
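As a sketch of the third direction, a JSON verdict can be translated mechanically into a blocking rule. The rule table and confidence gate below are illustrative; the paper proposes the integration but defines no rule mapping:

```python
# Illustrative mapping from confirmed attack types to mitigation actions;
# not defined in the paper.
MITIGATIONS = {
    "LDAP/CLDAP Reflection": "drop udp src-port 389 dst {victim}",
    "SYN Flood": "enable syn-cookies dst {victim}",
    "ACK Flood": "rate-limit tcp ack dst {victim}",
}

def verdict_to_rule(report: dict, victim: str, min_confidence: float = 0.9):
    """Emit a mitigation rule only for confident, confirmed verdicts."""
    if report.get("confidence", 0.0) < min_confidence:
        return None
    if not report.get("verdict", "").startswith("Confirmed"):
        return None
    template = MITIGATIONS.get(report.get("attack_type", ""))
    return template.format(victim=victim) if template else None
```

Gating on both the `verdict` prefix and a confidence floor keeps low-confidence or `Unknown` reports from triggering automated blocking.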
Holmes demonstrates that with a contract-constrained workflow, semantic evidence abstraction, and a structure-first reasoning mandate, LLM-based agents can provide cost-effective, transparent, and auditable DDoS investigation in modern cloud networks.