DynaMark: A Reinforcement Learning Framework for Dynamic Watermarking in Industrial Machine Tool Controllers
Published 29 Aug 2025 in eess.SY, cs.AI, cs.CR, cs.LG, and stat.AP | (2508.21797v1)
Abstract: Industry 4.0's highly networked Machine Tool Controllers (MTCs) are prime targets for replay attacks that use outdated sensor data to manipulate actuators. Dynamic watermarking can reveal such tampering, but current schemes assume linear-Gaussian dynamics and use constant watermark statistics, making them vulnerable to the time-varying, partly proprietary behavior of MTCs. We close this gap with DynaMark, a reinforcement learning framework that models dynamic watermarking as a Markov decision process (MDP). It learns an adaptive policy online that dynamically adapts the covariance of a zero-mean Gaussian watermark using available measurements and detector feedback, without needing system knowledge. DynaMark maximizes a unique reward function balancing control performance, energy consumption, and detection confidence dynamically. We develop a Bayesian belief updating mechanism for real-time detection confidence in linear systems. This approach, independent of specific system assumptions, underpins the MDP for systems with linear dynamics. On a Siemens Sinumerik 828D controller digital twin, DynaMark achieves a reduction in watermark energy by 70% while preserving the nominal trajectory, compared to constant variance baselines. It also maintains an average detection delay equivalent to one sampling interval. A physical stepper-motor testbed validates these findings, rapidly triggering alarms with less control performance decline and exceeding existing benchmarks.
The paper introduces an RL-based dynamic watermarking method that overcomes the limitations of static, LTI-based schemes.
It formulates watermarking as an MDP and employs a DDPG agent to balance control performance, energy consumption, and detection confidence.
Experimental evaluations on digital twins and a physical testbed demonstrate improved detection speed and reduced energy overhead.
DynaMark: Reinforcement Learning for Dynamic Watermarking in Industrial Machine Tool Controllers
Introduction and Motivation
The proliferation of networked Machine Tool Controllers (MTCs) in Industry 4.0 environments has exposed manufacturing systems to sophisticated cyber-physical threats, notably replay attacks that exploit outdated sensor data to manipulate actuators. Traditional watermarking-based detection schemes, which superimpose constant-variance Gaussian signals onto control inputs, are fundamentally limited by their reliance on linear time-invariant (LTI) and Gaussian assumptions. These static approaches are ill-suited for the time-varying, proprietary, and often nonlinear dynamics of modern MTCs, resulting in suboptimal trade-offs between detection accuracy and control performance.
DynaMark addresses these limitations by formulating dynamic watermarking as a Markov Decision Process (MDP) and leveraging reinforcement learning (RL) to adaptively select watermark covariance in real time. This framework enables the system to balance control performance, energy consumption, and detection confidence, without requiring explicit system identification or prior knowledge of plant dynamics.
Figure 1: Flowchart of the interaction between machine tools, sensors, controllers, and the detector for real-time monitoring and control.
Problem Formulation and Theoretical Foundations
System and Attack Models
The MTC is modeled as a stochastic linear dynamic system:
$$y_{t+1} = A y_t + B u_t + w_t$$
where $y_t$ is the sensor measurement, $u_t$ is the control input, and $w_t$ is i.i.d. Gaussian noise. Watermarking is implemented by injecting a zero-mean Gaussian signal $\phi_t$ with covariance $U_t$ into the control input:
$$u_t' = u_t + \phi_t$$
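As a rough illustration of the model above, the following sketch simulates the linear plant with a watermark added to the control input. The function name `simulate_watermarked_plant`, the `controller` and `U_schedule` callables, and all numerical values are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def simulate_watermarked_plant(A, B, controller, U_schedule, W, y0, steps, seed=0):
    """Simulate y_{t+1} = A y_t + B (u_t + phi_t) + w_t with phi_t ~ N(0, U_t).

    controller(y) returns the nominal input u_t (hypothetical interface).
    U_schedule(t) returns the watermark covariance U_t (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    n, m = B.shape
    y = np.zeros((steps + 1, n))
    phi = np.zeros((steps, m))
    y[0] = y0
    for t in range(steps):
        u = controller(y[t])                                          # nominal input u_t
        phi[t] = rng.multivariate_normal(np.zeros(m), U_schedule(t))  # watermark phi_t
        w = rng.multivariate_normal(np.zeros(n), W)                   # process noise w_t
        y[t + 1] = A @ y[t] + B @ (u + phi[t]) + w                    # watermarked update
    return y, phi

# Illustrative 2-state, 1-input plant with a constant watermark variance of 0.04.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
y, phi = simulate_watermarked_plant(
    A, B,
    controller=lambda y_t: np.zeros(1),      # no nominal control in this toy run
    U_schedule=lambda t: 0.04 * np.eye(1),   # constant U_t; an adaptive policy would vary it
    W=0.01 * np.eye(2),
    y0=np.zeros(2),
    steps=5000,
)
```

Passing a time-dependent `U_schedule` is where an adaptive policy would plug in. The constant schedule here corresponds to the static baselines the summary describes.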
Replay attacks are modeled by replacing true sensor measurements with previously recorded data, while flip and injection attacks manipulate control signals and sensor readings, respectively. The residuals $r_t = y_t - \hat{y}_t$, where $\hat{y}_t$ is the predicted measurement, are monitored by a $\chi^2$ detector, which raises an alarm when its test statistic exceeds a statistical threshold.
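The detection idea can be sketched in a few lines, under simplifying assumptions that are not taken from the paper: the state is measured directly, the prediction is the one-step model output using the applied (watermarked) input, and the residual has two components, so the $\chi^2$ threshold has the closed form $-2\ln\alpha$. The helper names `chi2_alarms` and `replay` and all numbers are hypothetical.

```python
import numpy as np

def chi2_alarms(y_reported, u_applied, A, B, W, alpha=0.01):
    """Return the statistic g_t = r_t^T W^{-1} r_t and alarm flags for t = 1..T."""
    assert A.shape[0] == 2, "closed-form threshold below holds for 2 degrees of freedom only"
    threshold = -2.0 * np.log(alpha)                  # chi-square(2) quantile at 1 - alpha
    y_hat = y_reported[:-1] @ A.T + u_applied @ B.T   # one-step predictions of y_1..y_T
    r = y_reported[1:] - y_hat                        # residuals
    g = np.einsum("ti,ij,tj->t", r, np.linalg.inv(W), r)
    return g, g > threshold

def replay(y_true, onset, lag):
    """Replace measurements from `onset` onward with data recorded `lag` steps earlier."""
    y_rep = y_true.copy()
    y_rep[onset:] = y_true[onset - lag:len(y_true) - lag]
    return y_rep

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
W = 0.01 * np.eye(2)
T, onset, lag = 400, 200, 100

y = np.zeros((T + 1, 2))
u = rng.normal(0.0, 1.0, size=(T, 1))   # watermark-only input with unit variance
for t in range(T):
    y[t + 1] = A @ y[t] + B @ u[t] + rng.multivariate_normal(np.zeros(2), W)

g, alarms = chi2_alarms(replay(y, onset, lag), u, A, B, W)
false_alarm_rate = alarms[:onset - 1].mean()   # residuals computed from genuine data only
detection_rate = alarms[onset:].mean()         # residuals computed from replayed data only
```

Replayed measurements carry the old watermark rather than the one currently being injected, so the residuals inflate and the statistic crosses the threshold far more often than under normal operation.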
Residual Analysis and Detection Power
The paper provides rigorous analysis of residual distributions under normal operation and various attack scenarios. Under replay attacks, the test statistic $g_{t\mid\tau}$ follows a generalized $\chi^2$ distribution, whose parameters depend on the watermark covariance and system matrices. Detection power is $1-\beta_t$, where the Type-II error $\beta_t$ is computed using the cumulative distribution function of the generalized $\chi^2$ statistic.
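The generalized $\chi^2$ CDF has no simple closed form, so one way to get a feel for the Type-II error is a Monte Carlo estimate. The sketch below assumes a simplified direct-measurement model in which the residual covariance under replay is `W + 2 * var * B @ B.T`. That expression, the function name `type2_error_mc`, and the numbers are assumptions for illustration, not the paper's derivation.

```python
import numpy as np

def type2_error_mc(Sigma_attack, Sigma_nominal, alpha=0.01, samples=200_000, seed=0):
    """Monte Carlo estimate of beta = P(g <= threshold | attack) for a 2-D residual."""
    n = Sigma_nominal.shape[0]
    assert n == 2, "closed-form chi-square(2) threshold used below"
    threshold = -2.0 * np.log(alpha)     # chi-square(2) quantile at 1 - alpha
    rng = np.random.default_rng(seed)
    r = rng.multivariate_normal(np.zeros(n), Sigma_attack, size=samples)
    # Quadratic form normalized by the nominal covariance: a generalized chi-square
    # variable whenever Sigma_attack differs from Sigma_nominal.
    g = np.einsum("ti,ij,tj->t", r, np.linalg.inv(Sigma_nominal), r)
    return float(np.mean(g <= threshold))

B = np.array([[0.0], [1.0]])
W = 0.01 * np.eye(2)
for var in (0.0, 0.01, 0.1, 1.0):
    Sigma_a = W + 2.0 * var * (B @ B.T)   # assumed residual covariance under replay
    print(f"watermark variance {var}: estimated beta = {type2_error_mc(Sigma_a, W):.3f}")
```

With zero watermark variance the attacked and nominal residuals are statistically identical in this toy model, so the miss probability stays near $1-\alpha$. Raising the variance lowers it, which is the trade-off between energy and detection power that the framework manages.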
DynaMark Framework and RL-Based Policy Optimization
MDP Formulation
DynaMark models the watermarking problem as an MDP with state $s_t = (y_t, d_t)$, where $d_t$ is the detector's Bayesian belief in the presence of an attack. The action space consists of positive semidefinite matrices $U_t$ representing the watermark covariance. The reward function is designed to penalize energy consumption and control deviation, while incentivizing high detection confidence.
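The summary does not reproduce the paper's belief update or reward expression, so the following is only a generic sketch of the two ingredients. The belief update is a standard two-hypothesis recursive Bayes rule with a $\chi^2(2)$ likelihood under normal operation and a scaled $\chi^2(2)$ stand-in under attack. The reward is a hypothetical weighted sum. Every name and parameter (`attack_scale`, `hazard`, `w_energy`, `w_control`, `w_detect`) is an assumption.

```python
import numpy as np

def update_belief(d_prev, g, attack_scale=20.0, hazard=0.01):
    """Recursive Bayes update of d_t = P(attack | statistics so far).

    Normal hypothesis: g ~ chi-square(2). Attack hypothesis: g / attack_scale ~ chi-square(2),
    a crude stand-in for the generalized chi-square law. `hazard` is the assumed
    per-step probability that an attack begins.
    """
    prior = d_prev + (1.0 - d_prev) * hazard
    like_normal = 0.5 * np.exp(-0.5 * g)
    like_attack = 0.5 / attack_scale * np.exp(-0.5 * g / attack_scale)
    num = like_attack * prior
    den = num + like_normal * (1.0 - prior)
    return float(num / den) if den > 0.0 else float(d_prev)

def reward(y, y_ref, U, d, w_energy=1.0, w_control=1.0, w_detect=1.0):
    """Hypothetical reward: detection confidence minus energy and tracking penalties."""
    energy = float(np.trace(U))                                      # watermark energy proxy
    deviation = float(np.sum((np.asarray(y) - np.asarray(y_ref)) ** 2))
    return w_detect * d - w_energy * energy - w_control * deviation

d = 0.0
for g in (0.5, 1.2, 50.0):     # two ordinary statistics, then one large outlier
    d = update_belief(d, g)
```

The weights set the balance between the three objectives. How the paper shapes or normalizes these terms is not stated in this summary.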
A Deep Deterministic Policy Gradient (DDPG) agent is trained to optimize the watermarking policy. The actor network outputs the watermark covariance, while the critic estimates the Q-value. The RL agent adapts $U_t$ online based on the observed system state and detector feedback, enabling dynamic trade-off management.
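Since the action must be a positive semidefinite matrix, the raw actor output has to be mapped to a valid covariance. The summary does not say how the paper does this. One common option, sketched here as an assumption, is to treat the output as the entries of a lower-triangular factor `L` and form `L @ L.T`. The function name and the optional `max_trace` cap are hypothetical.

```python
import numpy as np

def covariance_from_action(raw, m, max_trace=None):
    """Map a raw action vector of length m(m+1)/2 to an m x m PSD matrix U = L L^T."""
    raw = np.asarray(raw, dtype=float)
    assert raw.size == m * (m + 1) // 2, "need one entry per lower-triangular element"
    L = np.zeros((m, m))
    L[np.tril_indices(m)] = raw       # fill the lower triangle row by row
    U = L @ L.T                       # positive semidefinite by construction
    if max_trace is not None:
        tr = np.trace(U)
        if tr > max_trace:            # optionally cap total watermark energy
            U = U * (max_trace / tr)
    return U

U = covariance_from_action([1.0, -0.5, 2.0], m=2)
U_capped = covariance_from_action([1.0, -0.5, 2.0], m=2, max_trace=1.0)
```

Capping the trace keeps exploration noise from producing arbitrarily energetic watermarks during training. Whether DynaMark bounds the action this way is not stated in the source.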
Figure 2: DynaMark framework.
Experimental Evaluation
Digital Twin of Siemens Sinumerik 828D
A high-fidelity digital twin (DT) of the Siemens Sinumerik 828D controller is used to evaluate DynaMark. The DT replicates 2-axis motion control and supports replay attack scenarios. Under normal operation, DynaMark maintains low watermark energy and nominal trajectory tracking. Upon attack onset, the detector's belief $d_t$ rapidly saturates, and the watermark variance $U_t$ is adaptively increased to maximize detection power.
Figure 4: DynaMark under normal operation. (a) Detector belief $d_t$ oscillates early and then falls to 0. (b) Watermark variance $U_t$ rises while uncertainty is high, then levels off. (c) Resulting trajectory $y_t$ tracks the no-watermark baseline.
Figure 5: DynaMark under replay attack starting at $\tau = 200$. (a) Belief $d_t$ jumps to 1 almost immediately after the attack onset. (b) $U_t$ is boosted by two orders of magnitude and held high. (c) Physical trajectory departs sharply from the baseline once the attack starts.
Benchmarking Against Constant-Variance Watermarks
DynaMark is compared to fixed-variance baselines. Under normal conditions, DynaMark achieves 70% lower watermark energy than high-variance schemes, with negligible control degradation. During replay attacks, DynaMark matches the fastest detection delay ($\mathrm{ARL}_1 = 1$ sample) while maintaining a superior energy-performance trade-off.
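For readers unfamiliar with the metric, the average run length under attack can be computed from per-trial alarm sequences as sketched below. The convention that an alarm at the onset sample counts as a delay of 1, and the function names, are assumptions chosen to match the "1 sample" figure quoted above.

```python
import numpy as np

def detection_delay(alarms, onset):
    """Samples from attack onset to the first alarm (alarm at onset counts as 1)."""
    alarms = np.asarray(alarms, dtype=bool)
    hits = np.flatnonzero(alarms[onset:])
    return int(hits[0]) + 1 if hits.size else None    # None: attack never detected

def arl1(trials, onset):
    """Mean detection delay over the trials in which the attack was detected."""
    delays = [detection_delay(a, onset) for a in trials]
    detected = [d for d in delays if d is not None]
    return float(np.mean(detected)) if detected else float("nan")

# Two illustrative trials with attack onset at sample 5.
trial_a = np.zeros(10, dtype=bool); trial_a[5] = True   # alarm at the onset sample
trial_b = np.zeros(10, dtype=bool); trial_b[7] = True   # alarm two samples later
average_delay = arl1([trial_a, trial_b], onset=5)
```

Missed detections are excluded from the mean in this sketch. A stricter evaluation would report the miss rate alongside the delay.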
Figure 6: Benchmarking DynaMark against two constant-variance watermarks: (a) energy consumption and control performance under normal operation, (c) detection delay ($\mathrm{ARL}_1$) and (d) detector belief $d_t$ for one representative trial under a replay attack. Results indicate DynaMark's favorable security-performance trade-off.
Figure 7: Trade-off between detection belief and control performance degradation as functions of constant watermark variance $U_t$. Stars mark DynaMark.
Physical Stepper-Motor Testbed
A closed-loop stepper-motor testbed is implemented to validate DynaMark on real hardware. The RL policy is transferred to the physical system via ONNX Runtime. Under replay attacks, the detector's belief $d_t$ rises to 1 within five samples, and DynaMark dynamically adjusts $U_t$ to maintain detection power while minimizing energy overhead.
Figure 9: The stepper-motor position under normal conditions: (a) continuous, no watermark, (b) discretized and under DynaMark's DWM, and (c) on its DT and under DynaMark's DWM. Panels (b) and (c) show that control performance is maintained across the entire motion profile.
Figure 10: The stepper-motor's response on DT to a replay attack, showing divergence of true position and rapid rise in detector belief.
Figure 11: The stepper-motor response to a replay attack with onset at decision epoch 7, showing rapid alarm and adaptive watermarking.
Comparative Analysis with Optimization-Based Baselines
Constant-variance watermarks derived from LTI approximations and LQG optimization are compared to DynaMark. On the time-varying stepper-motor plant, DynaMark achieves lower median energy and control degradation, with tighter inter-alarm intervals, demonstrating the inadequacy of static designs for non-LTI systems.
Figure 12: Comparison results between DynaMark and five constant watermarks obtained by LTI approximation and solving the optimization problem at different LQG-cost budgets.
Implementation Considerations
Computational Requirements: DynaMark's RL policy inference is decoupled from real-time control via ONNX Runtime, enabling deployment on resource-constrained hardware.
Scalability: The framework is agnostic to system order and can be extended to multi-input multi-output (MIMO) plants.
Adaptability: DynaMark does not require explicit system identification, making it suitable for proprietary or closed-architecture MTCs.
Limitations: The current implementation assumes zero-mean independent Gaussian watermarks; future work should consider state- and frequency-shaped distributions for enhanced stealth and efficiency.
Figure 13: Multi-strobe Online Decision-making Pipeline for DynaMark.
Implications and Future Directions
DynaMark demonstrates that RL-based dynamic watermarking can robustly detect replay attacks in industrial controllers, outperforming static and optimization-based schemes, especially in non-LTI and time-varying environments. The framework's adaptability and model-free operation are critical for deployment in proprietary industrial systems. Future research should explore safe RL constraints, online watermark recovery for autonomous system restart, and advanced watermark shaping to counter adaptive adversaries.
Conclusion
DynaMark provides a principled, RL-driven approach to dynamic watermarking for industrial MTCs, achieving efficient replay attack detection with minimal control performance degradation and energy overhead. Its model-free, adaptive design overcomes the limitations of static and LTI-dependent schemes, offering a practical solution for securing cyber-physical manufacturing systems against evolving threats.