Democratizing ML for Enterprise Security: A Self-Sustained Attack Detection Framework

Published 9 Dec 2025 in cs.CR and cs.AI | (2512.08802v1)

Abstract: Despite advancements in machine learning for security, rule-based detection remains prevalent in Security Operations Centers due to the resource intensiveness and skill gap associated with ML solutions. While traditional rule-based methods offer efficiency, their rigidity leads to high false positives or negatives and requires continuous manual maintenance. This paper proposes a novel, two-stage hybrid framework to democratize ML-based threat detection. The first stage employs intentionally loose YARA rules for coarse-grained filtering, optimized for high recall. The second stage utilizes an ML classifier to filter out false positives from the first stage's output. To overcome data scarcity, the system leverages Simula, a seedless synthetic data generation framework, enabling security analysts to create high-quality training datasets without extensive data science expertise or pre-labeled examples. A continuous feedback loop incorporates real-time investigation results to adaptively tune the ML model, preventing rule degradation. This proposed model with active learning has been rigorously tested for a prolonged time in a production environment spanning tens of thousands of systems. The system handles initial raw log volumes often reaching 250 billion events per day, significantly reducing them through filtering and ML inference to a handful of daily tickets for human investigation. Live experiments over an extended timeline demonstrate a general improvement in the model's precision over time due to the active learning feature. This approach offers a self-sustained, low-overhead, and low-maintenance solution, allowing security professionals to guide model learning as expert ``teachers''.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a self-sustained two-stage framework that integrates high-recall YARA rules with lightweight ML to reduce false positives and support continuous analyst feedback.
It proposes the novel Simula method for seedless synthetic data generation, creating balanced datasets without relying on historical labeled attacks.
Active learning and simple lexical models enable near-real-time threat detection over petabyte-scale logs, ensuring operational scalability and improved precision.

Democratizing ML for Enterprise Security: A Self-Sustained Attack Detection Framework

Hybrid Approach: Motivations and Architecture

The persistent predominance of rule-based detections in enterprise security, despite considerable advances in ML, stems from operational efficiency, data scarcity, and the skill gap separating security engineering and data science. Rule-based systems such as those implemented with YARA offer fast matching and a familiar operational paradigm but suffer high rates of both false positives and false negatives due to pattern rigidity and unsustainable manual tuning. Conversely, ML-driven detectors, while capable of adaptive learning and distribution shift resilience, are hindered by computation demands and the practical challenge of training robust models without large volumes of accurately labeled attack data.

To resolve these dual limitations, the paper proposes a two-stage hybrid detection pipeline. The first stage employs intentionally broad, high-recall YARA rules, ensuring comprehensive threat coverage but admitting a flood of noisy positives. The second stage introduces a downstream ML classifier, tasked with suppression of false positives and calibrated to the investigation budget and operational constraints.

This system is underpinned by an active feedback loop: results from analyst investigations are continuously incorporated into model retraining cycles to mitigate rule degradation and adapt to evolving threats and network contexts.

Synthetic Data Generation via Simula

A core innovation of the framework is its use of Simula, a seedless synthetic data generation methodology. Simula utilizes taxonomic reasoning and agentic scenario synthesis to create balanced, high-diversity training sets without requiring either historical labeled data or in-depth ML expertise. Security analysts define strategies and taxonomies, which Simula transforms into expansive sets of realistic positive and negative samples, subject to automated quality assessment by a critic LLM. This democratizes ML training, enabling domain experts to serve as guiding "teachers" without direct engagement with model internals.

Lightweight Model Selection and Scalability

Empirical evaluation within the framework discounts the necessity of heavyweight transformer or embedding models for command-line threat classification tasks. Most notably, simple lexical models—e.g., n-gram tokenization—are shown to match or surpass the performance of embedding-based models such as CmdCaliper and RETSim for discriminating terminal command line usage (Figure 1).

Figure 1: Precision-Recall curve comparing model performances across lexical (N-gram), CMDCaliper, and RETSim models for three enterprise detection scenarios.

The computational efficiency of lexical modeling ensures operational scalability, supporting near-real-time inference across petabyte-scale daily log volumes while maintaining feasible alert volumes for human analysts.

Active Learning and Analyst Efficiency

The framework incorporates active learning principles for feedback ingestion, assigning time-decaying weights to analyst-labeled data points and adjusting for class imbalance—critical in environments with a ratio of benign-to-malicious instances orders of magnitude apart. Feedback-driven retraining enables rapid adaptation to novel threats and network changes, avoiding the stagnation associated with static rule sets.

Extensive live deployment and A/B testing demonstrate the active learning pipeline’s superiority. The active model not only preserves coverage on known incidents, but consistently identifies additional true positives and reduces false positive tickets relative to a fixed, non-updatable baseline, quantifying a net gain in both threat detection and analyst efficiency.

Quantitative Validation

The system has been evaluated across multiple Google-internal security scenarios, including reverse shell, hacking tools, and living-off-the-land (LOTL) attacks. It processes over 250 billion daily raw events, reducing them through staged filtering and ML inference to fewer than one ticket per detection scenario per day—a reduction factor of nine orders of magnitude. Simula-generated synthetic datasets with 20,000 balanced samples in 20 minutes enable models to achieve a precision-recall F1 score of 0.95, outperforming labor-intensive manually curated datasets by a significant margin.

Theoretical and Practical Implications

This framework demonstrates that self-sustaining, analyst-guided threat detection pipelines are operationally feasible and can maintain precision and adaptability at hyperscale. The use of reasoning-centric synthetic data generation and lightweight ML offers clear advantages: reduced deployment and maintenance overhead, robust performance against distributional drift, and new avenues for democratized ML application across domains with severe data scarcity or privacy risk.

These results suggest a paradigm shift in applied enterprise ML security, privileging automation, reasoning-based synthetic data, and continuous analyst-in-the-loop adaptation over static, heuristic-heavy practices.

Prospective Directions

Further research opportunities arise in exploring chunking approaches and hybrid lexical-semantic models for improved handling of long, complex command lines, as well as the generalization of seedless synthetic data generation to additional security and operational domains. The established feedback-driven continuous adaptation pipeline can be extended to support broader classes of security events, anomaly types, and federated organizational deployments.

Conclusion

The proposed framework effectively integrates rule-based filtering, lightweight ML classification, reasoning-driven synthetic data generation, and active learning to deliver scalable, precise, and maintainable enterprise threat detection. By enabling security analysts to guide model refinement without deep ML expertise and rigorously validating efficiency and adaptability at production scale, the system offers a compelling blueprint for the democratization of ML in security operations.