Provenance Intrusion Detection Systems

Updated 6 February 2026
  • Provenance-based intrusion detection systems are security architectures that analyze detailed causal graphs of system events to identify anomalies.
  • They leverage kernel-level event capture, real-time graph processing, and anomaly scoring to detect advanced attacks with high true positive rates and low false positives.
  • Implementation challenges include managing rapidly growing graphs, computational expense of subgraph matching, and fine-tuning alert thresholds for optimal performance.

A provenance-based intrusion detection system (PIDS) is a security architecture that detects intrusions by analyzing provenance: a detailed, structured history of all interactions between digital objects within a computing system. A PIDS builds directed acyclic provenance graphs that represent the full causal context of system execution and analyzes them in real time to identify anomalous or malicious behavior, including advanced attacks that evade traditional signature-based or event-sequence detectors.

1. Formal Provenance Models and Data Acquisition

A whole-system provenance graph is mathematically defined as a directed acyclic, attributed graph $G = (V, E, \ell_V, \ell_E)$, where:

  • $V$ is the set of vertices, partitioned into entity nodes (files, sockets, pipes), activity nodes (process executions, thread spawns), and agent nodes (users, roles).
  • $E \subseteq V \times V$ consists of directed edges encoding causal (information-flow) relationships.
  • $\ell_V: V \rightarrow \mathcal{A}_V$ maps each vertex to its attributes (e.g., file path, process id, timestamp).
  • $\ell_E: E \rightarrow \mathcal{A}_E$ assigns each edge relevant attributes (system call type, byte count, offset).

Provenance is inherently append-only: $G$ grows monotonically in time and admits a topological ordering consistent with system event causality.
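The definition above can be made concrete with a small in-memory structure. The following is a minimal sketch, not the storage model of any particular PIDS: the class name `ProvenanceGraph`, the vertex id strings, and the cause-to-effect edge direction are all assumptions made for illustration. It enforces the two structural properties stated above: vertices and edges can only be appended, and an edge that would close a cycle is rejected.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Append-only attributed DAG G = (V, E, l_V, l_E) (illustrative sketch)."""

    KINDS = {"entity", "activity", "agent"}

    def __init__(self):
        self.vertex_attrs = {}                # l_V: vertex id -> attributes
        self.edge_attrs = {}                  # l_E: (src, dst) -> attributes
        self.successors = defaultdict(list)   # adjacency, cause -> effects

    def add_vertex(self, vid, kind, **attrs):
        if kind not in self.KINDS:
            raise ValueError(f"unknown vertex kind: {kind}")
        if vid in self.vertex_attrs:
            raise ValueError(f"vertex {vid!r} already exists (append-only)")
        self.vertex_attrs[vid] = {"kind": kind, **attrs}

    def _reaches(self, start, target):
        """Depth-first search: is `target` reachable from `start`?"""
        stack, seen = [start], set()
        while stack:
            v = stack.pop()
            if v == target:
                return True
            if v in seen:
                continue
            seen.add(v)
            stack.extend(self.successors[v])
        return False

    def add_edge(self, src, dst, **attrs):
        if src not in self.vertex_attrs or dst not in self.vertex_attrs:
            raise KeyError("both endpoints must exist before adding an edge")
        if (src, dst) in self.edge_attrs:
            raise ValueError("edge already exists (append-only)")
        if self._reaches(dst, src):
            raise ValueError("edge would create a cycle")
        self.successors[src].append(dst)
        self.edge_attrs[(src, dst)] = attrs


g = ProvenanceGraph()
g.add_vertex("user:1000", "agent", uid=1000)
g.add_vertex("proc:42", "activity", pid=42)
g.add_vertex("file:/etc/passwd#v1", "entity", path="/etc/passwd")
g.add_edge("user:1000", "proc:42", op="controls")
g.add_edge("file:/etc/passwd#v1", "proc:42", syscall="read", nbytes=4096)
```

The reachability check costs time linear in the graph size per edge, which is acceptable for a sketch only. Capture systems typically avoid cycles by construction, for example by creating a new version of an object whenever its state changes.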

For real-time capture, modern systems deploy Linux Security Module (LSM) hooks to intercept all security-relevant kernel events, including process executions, file reads/writes, and socket operations. Each event triggers the creation of new vertices and edges in the provenance graph, with associated labels capturing fine-grained context (e.g., syscall name and return code, timestamp). The resultant stream forms the basis for anomaly analysis, with the graph stored in an append-only, queryable database (Han et al., 2018).

2. PIDS Architectural Paradigm and Analytics Workflow

A typical PIDS comprises the following pipeline components:

  1. Provenance Capture Layer: Kernel-level LSM hooks record all relevant accesses, emitting structured events per subject, object, operation, and metadata.
  2. Collector & Message Bus: Aggregates and batch-processes per-host event streams into time-ordered provenance tuples, which are transmitted to downstream processors.
  3. Provenance Graph Store: Maintains the evolving $G$ in a graph database, supporting efficient neighborhood and path queries.
  4. Graph Processing Engine: Embodies a vertex-centric streaming framework applying anomaly or attack-detection algorithms to subgraphs as they arrive, maintaining both sliding-window and historical summaries.
  5. Alerting & Dashboard: Consumes anomaly scores, applies thresholds, correlates alarms, and presents attack chains in a form suitable for triage and forensics.

The full PIDS event lifecycle is as follows:

  • System event triggers an LSM hook and provenance tuple emission.
  • The collector normalizes and forwards the tuple.
  • The graph store appends a new edge and potentially new vertices, labeled with operation type, timestamp, and user-ID.
  • The processing engine updates the affected subgraph.
  • Detection modules compute a normalized feature vector $f(\cdot)$ and score; alerts are generated when the score surpasses a threshold.
  • Analytical dashboards render the provenance chain for investigation (Han et al., 2018).
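The lifecycle above can be traced end to end in a few lines. This is a toy sketch under stated assumptions: the record fields (`subject`, `object`, `op`, `ts`, `uid`), the `Pipeline` class, and the single "previously unseen operation" feature are hypothetical and stand in for the real capture format and detection modules.

```python
from collections import defaultdict


def normalize(raw):
    """Collector step: coerce a raw hook record into a provenance tuple."""
    return {
        "subject": str(raw["subject"]),
        "object": str(raw["object"]),
        "op": raw["op"].lower(),
        "ts": float(raw["ts"]),
        "uid": int(raw["uid"]),
    }


class Pipeline:
    def __init__(self, threshold=0.5):
        self.vertices = set()
        self.edges = []                     # append-only edge log
        self.ops_seen = defaultdict(set)    # subject -> operations seen so far
        self.threshold = threshold
        self.alerts = []

    def score(self, event):
        """Toy feature: 1.0 if the subject never used this operation before."""
        return 0.0 if event["op"] in self.ops_seen[event["subject"]] else 1.0

    def ingest(self, raw, learning=False):
        event = normalize(raw)                                      # collector
        self.vertices.update((event["subject"], event["object"]))   # graph store
        self.edges.append(event)
        s = self.score(event)                                       # detection
        self.ops_seen[event["subject"]].add(event["op"])            # update state
        if not learning and s > self.threshold:                     # alerting
            self.alerts.append((event, s))
        return s


p = Pipeline()
read = {"subject": "proc:42", "object": "file:/etc/hosts",
        "op": "READ", "ts": 1.0, "uid": 1000}
p.ingest(read, learning=True)       # build the baseline without alerting
p.ingest(dict(read, ts=2.0))        # known behavior, scores 0.0
p.ingest({"subject": "proc:42", "object": "sock:10.0.0.5:443",
          "op": "connect", "ts": 3.0, "uid": 1000})   # new operation, alerts
```

The `learning` flag is a design shortcut: without a warm-up phase, the first event from every subject would trigger an alert.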

3. Detection Algorithms and Graph-Driven Analytics

PIDSs deploy a variety of structural and statistical techniques for anomaly detection:

Subgraph-Matching and Structural Anomaly Detection:

Given a pattern $P$ (e.g., fork $\to$ exec $\to$ socket-send), the system locates all subgraphs $S \cong P$ whose node and edge attributes match under $\ell_V$ and $\ell_E$. Because subgraph isomorphism is NP-complete, implementations typically use high-selectivity attributes to prune the search. Restricting the search to a window around each newly added edge localizes detection.
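For chain-shaped patterns such as the fork, exec, socket-send example, matching reduces to finding paths whose edge labels follow a given sequence. The sketch below handles only this special case, not general subgraph isomorphism. The function name `find_chains`, the `(src, dst, op)` edge triples, and the vertex names are assumptions for illustration.

```python
from collections import defaultdict


def find_chains(edges, pattern):
    """Return every vertex path whose consecutive edge labels equal `pattern`.

    `edges` is an iterable of (src, dst, op) triples; `pattern` is a
    non-empty list of operation labels.
    """
    if not pattern:
        return []
    edges = list(edges)
    by_src = defaultdict(list)
    for src, dst, op in edges:
        by_src[src].append((dst, op))

    matches = []

    def extend(path, depth):
        if depth == len(pattern):
            matches.append(tuple(path))
            return
        for dst, op in by_src[path[-1]]:
            if op == pattern[depth]:
                extend(path + [dst], depth + 1)

    # Seed only from edges that carry the first label: a cheap selectivity filter.
    for src, dst, op in edges:
        if op == pattern[0]:
            extend([src, dst], 1)
    return matches


edges = [
    ("bash", "child", "fork"),
    ("child", "payload", "exec"),
    ("payload", "sock:203.0.113.7", "socket-send"),
    ("bash", "ls", "fork"),
    ("ls", "/bin/ls", "exec"),
]
hits = find_chains(edges, ["fork", "exec", "socket-send"])
# hits == [("bash", "child", "payload", "sock:203.0.113.7")]
```

Because the provenance graph is acyclic, the recursion always terminates. In a streaming setting the seed loop would run only over edges inside the current window rather than the full edge list.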

Graph Kernel and Similarity Metrics:

Time-windowed provenance subgraphs $G_t$ are mapped by a feature map $\phi$ to feature vectors (motif counts, degree distribution, walk features). Similarity between $G_i$ and $G_j$ is $k(G_i,G_j) = \langle\phi(G_i),\phi(G_j)\rangle$. The anomaly score of a window is $1 - \max_j k(G_t,G_j)$, where $G_j$ ranges over a reference set of known-benign subgraphs; windows that deviate from the reference set trigger alerts.
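A minimal sketch of this scoring scheme follows. It assumes the simplest possible feature map, an L2-normalized histogram of edge labels, so that $k$ lies in $[0, 1]$ and the score $1 - \max_j k$ is well behaved. Real systems use richer features; the function names and sample windows are illustrative only.

```python
import math
from collections import Counter


def phi(edges):
    """Feature map: L2-normalized edge-label histogram of a window."""
    counts = Counter(op for _, _, op in edges)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {op: c / norm for op, c in counts.items()} if norm else {}


def kernel(window_a, window_b):
    """k(G_i, G_j) = <phi(G_i), phi(G_j)>, computed over sparse dicts."""
    fa, fb = phi(window_a), phi(window_b)
    return sum(w * fb.get(op, 0.0) for op, w in fa.items())


def anomaly_score(window, reference):
    """1 - max_j k(G_t, G_j) over a non-empty reference set of benign windows."""
    return 1.0 - max(kernel(window, ref) for ref in reference)


benign = [
    [("p1", "f1", "read"), ("p1", "f2", "read"), ("p1", "f3", "write")],
    [("p2", "f1", "read"), ("p2", "s1", "connect")],
]
same_shape = [("p9", "f7", "read"), ("p9", "f8", "read"), ("p9", "f9", "write")]
unusual = [("p9", "p10", "ptrace"), ("p10", "k1", "module-load")]

low = anomaly_score(same_shape, benign)    # close to 0: matches a benign window
high = anomaly_score(unusual, benign)      # 1.0: no labels in common
```

With normalized features the kernel is the cosine similarity of the two histograms, so a window with the same label proportions as some reference window scores near zero regardless of its size.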

Flow-Similarity in Causal Chains:

Paths $\pi=(v_0 \to v_1 \to \cdots \to v_k)$ are compared using dynamic time-warping distance $d_{DTW}(\pi,\pi')$ over edge label sequences. Significant deviations from known benign flows indicate potential attacks.
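The comparison can be illustrated with the textbook dynamic-programming formulation of dynamic time warping. The 0/1 mismatch cost between labels is an assumption; a deployed system might instead use a graded cost between related system calls.

```python
def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Classic O(len(a) * len(b)) dynamic time-warping distance.

    `a` and `b` are sequences of edge labels along two causal paths.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = cost(a[i - 1], b[j - 1]) + min(
                d[i - 1][j],      # a[i-1] repeats against b[j-1]
                d[i][j - 1],      # b[j-1] repeats against a[i-1]
                d[i - 1][j - 1],  # both sequences advance
            )
    return d[n][m]


benign_flow = ["open", "read", "close"]
longer_read = ["open", "read", "read", "close"]
exfiltration = ["open", "write", "connect"]

d_same = dtw_distance(longer_read, benign_flow)    # 0.0
d_diff = dtw_distance(benign_flow, exfiltration)   # 2.0
```

Warping is what makes the measure useful here: a flow that performs the same operations with a different number of repetitions stays at distance zero, while a flow with different operations does not.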

Per-Node and Subgraph Anomaly Scoring:

Scores are linear functions:

$$\text{score}(v) = \sum_{n=1}^m w_n \cdot f_n(v)$$

where $f_n(v)$ are normalized features (degree change rate, new label occurrence, clustering coefficient shift), and $w_n$ are tuned weights. Global thresholds for alerting are selected by ROC curve analysis (Han et al., 2018).
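The linear score and one possible threshold-selection rule can be sketched as follows. The source states only that thresholds come from ROC analysis; choosing the point that maximizes Youden's J statistic (TPR minus FPR) is an assumption made here, as are the alert rule `score >= threshold`, the weights, and the sample data.

```python
def linear_score(features, weights):
    """score(v) = sum_n w_n * f_n(v) for already-normalized features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have equal length")
    return sum(w * f for w, f in zip(weights, features))


def pick_threshold(scores, labels):
    """Scan the ROC operating points and return (threshold, J) maximizing
    Youden's J = TPR - FPR, with alerts raised when score >= threshold.

    `labels` holds 1 for malicious and 0 for benign samples.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one sample of each class")
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical weights for: degree change rate, new label, clustering shift.
weights = [0.5, 0.3, 0.2]
s = linear_score([1.0, 0.0, 0.5], weights)   # 0.5*1.0 + 0.3*0.0 + 0.2*0.5

# Hypothetical labeled calibration scores.
scores = [0.1, 0.2, 0.3, 0.8, 0.9]
labels = [0, 0, 0, 1, 1]
threshold, j = pick_threshold(scores, labels)   # (0.8, 1.0)
```

Other operating points are equally defensible, for example the highest TPR subject to a fixed FPR budget, when analyst time is the binding constraint.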

4. Evaluation, Performance, and Limitations

Performance assessment incorporates:

  • True Positive Rate (TPR): $TP/(TP+FN)$
  • False Positive Rate (FPR): $FP/(FP+TN)$
  • Detection Latency ($L$): $t_{\text{alert}} - t_{\text{intrusion\_start}}$
  • Memory Footprint ($M$): $O(|V|+|E|)$ plus indexing
  • Throughput ($\lambda$): events per second processed
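These definitions translate directly into code. The helper names and the confusion-matrix counts below are hypothetical, chosen only to show the arithmetic; they are not measurements from any evaluation.

```python
def detection_rates(tp, fp, tn, fn):
    """Return (TPR, FPR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


def detection_latency(t_alert, t_intrusion_start):
    """L = t_alert - t_intrusion_start, in the unit of the timestamps."""
    return t_alert - t_intrusion_start


def throughput(num_events, elapsed_seconds):
    """Lambda: events processed per second."""
    return num_events / elapsed_seconds


# Hypothetical run: 100 attack windows and 1000 benign windows.
tpr, fpr = detection_rates(tp=95, fp=20, tn=980, fn=5)
latency = detection_latency(t_alert=12.45, t_intrusion_start=12.25)
rate = throughput(num_events=3_000_000, elapsed_seconds=60.0)
```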

Empirical results report:

  • Real-time detection at $\lambda \approx 50{,}000$ events/s
  • Latency $L \approx 200$ ms for anomaly checks
  • Memory $\approx$ 1 GB per 10 million edges
  • TPR $\approx 95\%$, FPR $\approx 2\%$ on synthetic APT-style attacks
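Taken together, the throughput and memory figures imply a rate of graph growth, which a short calculation makes explicit. Two assumptions are made here that the reported numbers do not state: each event contributes exactly one edge, and 1 GB means $10^9$ bytes.

```python
THROUGHPUT_EPS = 50_000             # reported: about 50,000 events/s
BYTES_PER_EDGE = 1e9 / 10_000_000   # reported: about 1 GB per 10 million edges
EDGES_PER_EVENT = 1                 # assumption: one new edge per event


def storage_growth_bytes(seconds):
    """Bytes of graph storage added after `seconds` at sustained peak load."""
    return THROUGHPUT_EPS * EDGES_PER_EVENT * BYTES_PER_EDGE * seconds


bytes_per_second = storage_growth_bytes(1)       # 5,000,000 bytes, i.e. 5 MB/s
seconds_per_gb = 1e9 / bytes_per_second          # 200 s to add 1 GB
gb_per_day = storage_growth_bytes(86_400) / 1e9  # 432 GB per day
```

Under these assumptions a single host at sustained peak load adds roughly 1 GB every 200 seconds, which gives a sense of why an unsummarized graph cannot be retained in memory for long.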

Strengths:

  • Attack-vector agnosticism: all causal flows are observed, preventing evasion via novel attack vectors.
  • Long-range causal reasoning: chain-based provenance connects entry-point exploits to effects.
  • Robustness: to evade detection, an attack must reproduce legitimate provenance patterns, which is a stringent requirement.

Limitations:

  • Graph growth and summarization: the DAG becomes unmanageably large; summarization (via grammars or sketches) is necessary.
  • Online efficiency: subgraph matching and kernel calculations are computationally expensive; incremental algorithms are often needed.
  • Alert tuning: insufficiently discriminative feature sets cause excessive false positives on rare but benign behaviors.

Open Challenges:

  • Selection of analysis window size to balance coverage of long-running attacks against temporal resolution.
  • Memory management via provenance summarization grammars.
  • Incorporating additional features and temporal priors to suppress spurious alerts (Han et al., 2018).

5. Implementation Guidelines and Deployment Practices

Best practices for deploying a PIDS include:

  • Employing high-fidelity capture via LSM-based, whole-system mechanisms (e.g., CamFlow, SPADE) for completeness of observation.
  • Precisely documenting kernel event-to-graph mapping to avoid semantic mismatches and blind spots.
  • Incremental model building: maintain feature statistics and motif counts online to avoid costly recomputation.
  • ROC-curve calibration using diverse benchmarks and controlled intrusion injection for effective threshold selection.
  • Resource-aware deployment: localize the graph store on low-latency storage, provision memory for streaming, and rate-limit event ingestion to avoid overloads.
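The incremental model-building practice above can be illustrated with Welford's online algorithm, which maintains a feature's mean and variance in constant memory as events arrive. This is one standard way to keep such statistics; the class name `RunningStat` and the z-score helper are illustrative, not taken from any specific PIDS.

```python
import math


class RunningStat:
    """Welford's online mean and variance for one streaming feature."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    def zscore(self, x):
        """Normalize a new observation against the statistics seen so far."""
        std = math.sqrt(self.variance)
        return (x - self.mean) / std if std > 0 else 0.0


# Hypothetical per-window out-degree observations for one process.
stat = RunningStat()
for degree in [2, 4, 4, 4, 5, 5, 7, 9]:
    stat.update(degree)
# stat.mean == 5.0 and stat.variance == 32 / 7
```

Each update is O(1), so the normalized features used for scoring can be refreshed per event without rescanning the stored graph.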

6. Emerging Directions and Research Outlook

Several research vectors are active in the PIDS domain:

  • Graph-Embedding and Deep Learning: GNNs and node2vec-style embeddings are being integrated for richer, scalable anomaly scoring.
  • Distributed and Parallel Processing: There is a trend toward scaling to multi-host and datacenter settings using platforms like Flink/Gelly, while preserving a global partial order over events.
  • Grammar Induction: Regular grammars are induced automatically over provenance DAGs to yield compact summaries for tracking persistent attacks.
  • Cross-Host Correlation: Per-host graphs are fused into a global model to detect coordinated, multi-machine attacks.

By integrating formal graph models, streaming capture and analysis, hybrid pattern/kernel detectors, and robust scoring, a contemporary PIDS achieves both generality and causal depth. Continued progress relies on advances in graph summarization, adaptive analytics, and rigorous, workload-adaptive calibration (Han et al., 2018).
