Green Resilience of Cyber-Physical Systems: Doctoral Dissertation

Published 20 Nov 2025 in cs.SE, cs.AI, cs.CV, and cs.RO | (2511.16593v1)

Abstract: Cyber-physical systems (CPS) combine computational and physical components. An Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.


Summary

The paper introduces a resilience model using the Autonomous Classification Ratio to detect performance degradation and optimize recovery while minimizing CO₂ emissions.
It compares decision-making policies—including multi-objective optimization, game-theoretic, and reinforcement learning approaches—to balance rapid recovery with energy costs.
Empirical tests on a collaborative robot show that containerization cuts CO₂ emissions by about 50% and that RL-agent policies recover up to 20% faster, at the cost of higher CO₂ emissions.

Green Resilience of Cyber-Physical Systems: Detailed Expert Summary

Introduction and Research Context

This dissertation addresses resilience and greenness in Online Collaborative Artificial Intelligence Systems (OL-CAIS), a human-centric subclass of Cyber-Physical Systems (CPS) that learn tasks online through human interaction. The work is situated within the paradigm of Industry 4.0, leveraging continual and online learning to maximize system autonomy and minimize reliance on human intervention. The focus is on how OL-CAIS respond to exogenous disruptive events—environmental changes that degrade system performance—and the dual challenge of restoring performance (resilience) without incurring unacceptable energy costs or CO₂ emissions (greenness).

Figure 1: Key research challenges facing industrial OL-CAIS—online learning for steady performance, performance degradation after disruption, and the required balance between resilience and greenness during recovery.

The thesis systematically models OL-CAIS runtime behavior, develops intelligent decision-making frameworks that operationalize the trade-off between resilience and greenness, and confronts the issue of catastrophic forgetting in online learning. The practical context is a collaborative robot (CORAL), used as an experimental testbed for both real-world and simulated evaluations.

Theoretical Frameworks and Methodology

Resilience Modeling

Central to the work is a resilience model that tracks the Autonomous Classification Ratio (ACR), the fraction of actions performed autonomously, over sliding windows of system iterations. Three operational states are defined: steady (autonomous operation), disruptive (decreased ACR and performance), and final (post-disruption, with possible memory degradation). The model is designed for runtime detection of performance degradation and automated classification of system states.
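
To make the sliding-window idea concrete, here is a minimal sketch of how an ACR series and a state label per iteration could be computed. The window size, the 0.8 threshold, and the rule that any above-threshold window after a disruption counts as the final state are illustrative assumptions, not the dissertation's actual parameters or detection logic.

```python
from collections import deque


def sliding_acr(autonomous_flags, window=5):
    """Return the ACR after each iteration.

    autonomous_flags: iterable of bools, True when the action was
    performed autonomously. `window` is a hypothetical window size.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` flags
    ratios = []
    for flag in autonomous_flags:
        recent.append(1 if flag else 0)
        ratios.append(sum(recent) / len(recent))
    return ratios


def classify_states(ratios, threshold=0.8):
    """Label each iteration as steady, disruptive, or final.

    Assumed rule: ACR below `threshold` means disruptive; an
    above-threshold ACR after any disruption means final.
    """
    states = []
    disrupted_before = False
    for acr in ratios:
        if acr < threshold:
            disrupted_before = True
            states.append("disruptive")
        elif disrupted_before:
            states.append("final")
        else:
            states.append("steady")
    return states
```

For example, five autonomous iterations followed by three human-assisted ones and five more autonomous ones would pass through all three labels under these assumed settings.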

Figure 2: The resilience model, showing performance drops and recoveries as the system responds to disruptive events.

Decision-Making Policies

Recovery policies are structured within the GResilience framework, encompassing three agent-based approaches:

One-agent (Multi-objective Optimization): Weighted Sum Model (WSM) ranks feasible actions, balancing normalized metrics for time-to-recover (resilience) and CO₂/human interaction costs (greenness).
Two-agent (Game-theoretic): Non-cooperative game with greenness and resilience as players; uses payoff matrices to identify Pure or Mixed Strategy Nash Equilibria for optimal action selection.
RL-agent (Reinforcement Learning): Q-learning agent whose state incorporates accumulated greenness and resilience metrics, optimizing for rapid, autonomous recovery under delayed rewards.

Figure 3: GResilience framework components illustrating iterative action evaluation and policy selection during OL-CAIS recovery.
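
The one-agent and two-agent policies above can be sketched in a few lines. The code below is an illustration under stated assumptions: scores are taken to be already normalized to [0, 1] with higher meaning better, the weights and payoff matrices are hypothetical inputs, and only pure-strategy equilibria are searched (mixed strategies are omitted). It is not the GResilience implementation.

```python
def wsm_rank(actions, w_resilience=0.5, w_greenness=0.5):
    """Rank actions with a Weighted Sum Model.

    actions: dict mapping a (hypothetical) action name to a pair
    (resilience_score, greenness_score), both assumed normalized
    to [0, 1] with higher = better. Returns names, best first.
    """
    scored = {
        name: w_resilience * r + w_greenness * g
        for name, (r, g) in actions.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


def pure_nash(payoff_res, payoff_green):
    """Find pure-strategy Nash equilibria of a two-player game.

    payoff_res[i][j]: payoff of the resilience player (chooses row i).
    payoff_green[i][j]: payoff of the greenness player (chooses column j).
    A cell (i, j) is an equilibrium when neither player gains by
    deviating unilaterally.
    """
    rows, cols = len(payoff_res), len(payoff_res[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            row_best = all(
                payoff_res[i][j] >= payoff_res[k][j] for k in range(rows)
            )
            col_best = all(
                payoff_green[i][j] >= payoff_green[i][k] for k in range(cols)
            )
            if row_best and col_best:
                equilibria.append((i, j))
    return equilibria
```

Shifting the weights toward `w_resilience` favors actions that recover quickly, while shifting them toward `w_greenness` favors low-emission, low-intervention actions. When `pure_nash` returns an empty list, a mixed-strategy equilibrium would be needed, which this sketch does not compute.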

Measurement and Evaluation

A custom measurement framework quantifies:

Resilience: Recovery speed (disruptive-to-recovered transition time ratio), performance steadiness.
Greenness: Mean CO₂ emissions, degree of human intervention (autonomy metric).
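
As a rough illustration of how such metrics could be computed from a run log, the sketch below defines one possible formula for each. All three definitions (share of iterations spent in the disruptive state, share of threshold crossings as a fluctuation measure, and autonomy as one minus the human-interaction share) are assumptions made for this example. The dissertation's exact formulas may differ.

```python
from statistics import mean


def recovery_time_ratio(states):
    """Share of iterations spent in the disruptive state.

    Assumed proxy for recovery speed: lower means faster recovery.
    """
    return states.count("disruptive") / len(states)


def fluctuation_ratio(acr_values, threshold=0.8):
    """Share of consecutive iterations where ACR crosses `threshold`.

    Assumed proxy for steadiness: lower means steadier performance.
    """
    crossings = sum(
        (a >= threshold) != (b >= threshold)
        for a, b in zip(acr_values, acr_values[1:])
    )
    return crossings / (len(acr_values) - 1)


def greenness_summary(co2_per_iteration, human_interactions, total_iterations):
    """Mean CO2 per iteration and an autonomy ratio (assumed definitions)."""
    return {
        "mean_co2": mean(co2_per_iteration),
        "autonomy": 1 - human_interactions / total_iterations,
    }
```

Keeping the resilience and greenness metrics as separate functions mirrors the two-sided trade-off: a policy can then be compared on each axis independently before any weighting is applied.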

The theoretical constructs are implemented in CAIS-DMA: a modular decision-making assistant supporting simulation, monitoring, and online deployment.

Experimental Validation

Real-world and simulated experiments are run with the CORAL robot for two representative disruptive events: hardware failure (a lighting outage) and adversarial attacks (histogram manipulation of color images). All agent-based policies are benchmarked against the system's baseline internal policies. The empirical findings are:

All agent-based policies accelerate green recovery over internal policies—substantially reducing time-to-recovery, performance fluctuations, and human dependency.
RL-agent policies achieve the fastest recovery, the highest steadiness, and improved autonomy, but increase energy use and CO₂ emissions because of the extra computation.
Containerization of system components yields dramatic greenness improvements, halving CO₂ emissions compared to bare-metal deployments (with negligible negative impact on autonomy).

Figure 4: Containerization methodology for OL-CAIS, enabling resource consolidation and orchestration for green, resilient operation.

Catastrophic Forgetting and State Dynamics

The thesis shows that disruptive events cause OL-CAIS classifiers to forget the original environmental conditions, which degrades performance after recovery (catastrophic forgetting). Recovery in the final state therefore requires renewed human interaction and learning. Continuous support from intelligent policies is necessary to maintain steady autonomous performance in environments with frequent or repeated disruptions.

Key Numerical Results and Contradictory Findings

Recovery duration under RL policies is reduced by up to 20% compared to internal and optimization/game-theoretic approaches, and performance fluctuation ratios decrease by 50–85% compared to baselines.
Autonomy ratios in recovery states increase by 15–45% under agent policies, with RL-agent consistently highest.
Containerization reduced measured CO₂e by approximately 50% over bare-metal (0.099 vs. 0.198 kg CO₂e for two-hour continuous classification).
RL-agent policies achieved this superior resilience at a statistically significant (p < 0.01) cost in mean CO₂ emissions (up to 70% greater than other policy approaches).
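
The containerization figure can be checked directly from the two reported values. Taking the bare-metal measurement as the baseline, the relative reduction works out as follows (the symbol names are introduced here for illustration only):

```latex
\[
\text{reduction}
  = \frac{E_{\text{bare-metal}} - E_{\text{container}}}{E_{\text{bare-metal}}}
  = \frac{0.198 - 0.099}{0.198}
  = \frac{0.099}{0.198}
  = 0.5
\]
```

That is a 50% reduction in CO₂e for the two-hour classification run, consistent with the "approximately 50%" stated above.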

Practical and Theoretical Implications

The theoretical frameworks and empirical tools (ACR-based resilience modeling, GResilience, CAIS-DMA) operationalize the trade-off between greenness and resilience for human-centric CPS. Practically, the methodology and containerization yield actionable energy and sustainability improvements for industrial robots and OL-CAIS. The findings clarify that aggressive resilience, even with state-of-the-art RL policies, can conflict with greenness goals because of heavier computation. This insight informs future system design: decision-makers must explicitly model and optimize for carbon-aware recovery, not merely rapid restoration.

On the theoretical front, performance evolution modeling and equilibrium-based policies can generalize to broader CPS domains: energy grids, healthcare, and any AI-driven system with adaptive autonomy. Future work could incorporate multi-model state machines for robust forgetting mitigation, and runtime-aware policy containers for continual state re-balancing.

Conclusion

This dissertation provides methodological, empirical, and architectural advances for environmentally responsible, resilient OL-CAIS. Decision-makers are equipped with metrics, models, and agent-based policies that enable informed, runtime-optimal action selection. The work demonstrates that resilience and greenness are often competing objectives; achieving sustainable, resilient CPS requires explicit, agent-based trade-off modeling, containerized resource management, and continual adaptation to disruptive events. The frameworks herein support scalable deployment and evolution of green-resilient industrial CPS.
