
Where and How to Attack? A Causality-Inspired Recipe for Generating Counterfactual Adversarial Examples

Published 21 Dec 2023 in cs.LG (arXiv:2312.13628v2)

Abstract: Deep neural networks (DNNs) have been demonstrated to be vulnerable to well-crafted adversarial examples, which are generated through either well-conceived $\mathcal{L}_p$-norm restricted or unrestricted attacks. Nevertheless, the majority of those approaches assume that adversaries can modify any features as they wish, and neglect the causal generating process of the data, which is unreasonable and impractical. For instance, a modification in income would inevitably impact features like the debt-to-income ratio within a banking system. By considering the underappreciated causal generating process, first, we pinpoint the source of the vulnerability of DNNs via the lens of causality, then give theoretical results to answer where to attack. Second, considering the consequences of the attack interventions on the current state of the examples to generate more realistic adversarial examples, we propose CADE, a framework that can generate Counterfactual ADversarial Examples to answer how to attack. The empirical results demonstrate CADE's effectiveness, as evidenced by its competitive performance across diverse attack scenarios, including white-box, transfer-based, and random intervention attacks.


Summary

  • The paper proposes CADE, a method that leverages causal processes to generate more realistic adversarial examples.
  • It employs counterfactual reasoning to alter latent variables, ensuring modifications align with the data's causal structure.
  • Empirical results demonstrate CADE's effectiveness against DNNs in white-box, transfer-based, and random intervention attack scenarios.

Introduction

Advances in deep learning have led to significant improvements across various tasks and industries. Deep Neural Networks (DNNs) are at the heart of these advancements, but they are not without vulnerabilities. One major concern is their susceptibility to adversarial examples: inputs deliberately crafted to cause a DNN to make a mistake. Most adversarial attacks assume attackers can modify any feature of the data. However, this overlooks the fact that real-world data often adheres to a causal generating process. Ignoring this process can lead to unrealistic and impractical adversarial examples.
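
To make the causal constraint concrete, here is a minimal Python sketch of the income example from the abstract. The variable names and the toy mechanism (a debt-to-income ratio computed from income and debt) are illustrative assumptions, not taken from the paper:

```python
# Toy causal process: income and debt are root causes; the
# debt-to-income ratio (dti) is a deterministic child of both.
income, debt = 60_000.0, 18_000.0
dti = debt / income

# Naive attack: perturb income but leave dti untouched. The record now
# violates dti = debt / income, so it is easy to flag as unrealistic.
naive_record = (income + 10_000.0, debt, dti)

# Causally consistent attack: intervene on income, then propagate
# the consequence to its causal descendant.
adv_income = income + 10_000.0
adv_record = (adv_income, debt, debt / adv_income)

print(naive_record)  # (70000.0, 18000.0, 0.3)      <- inconsistent
print(adv_record)    # (70000.0, 18000.0, 0.2571..) <- consistent
```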

Theoretical Foundations

Addressing the gap between traditional adversarial attacks and realistic scenarios, the paper presents a methodology centered on the causal process that generates the data. The authors dissect DNN vulnerabilities through a causal lens and offer theoretical results on where to direct an attack. They distinguish between attacking observable variables, such as the features of an image, and latent variables, the underlying causes that shape observable characteristics. The theory shows that intervening on variables causally related to the prediction target is an effective attack strategy. In contrast to methods that adjust variables indiscriminately, the authors propose attacking the children and co-parents of the prediction target within the causal model, which perturbs the evidence the model relies on while leaving the target itself unchanged.
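
As a sketch of how this selection rule could be implemented, the following snippet identifies the children and co-parents of a target node in a causal DAG. The graph, node names, and helper function are illustrative assumptions rather than the paper's implementation:

```python
import networkx as nx

def attackable_variables(dag: nx.DiGraph, target: str) -> set:
    """Children and co-parents of `target`: variables that can be
    intervened on without changing the target itself."""
    children = set(dag.successors(target))
    # Co-parents: other direct causes of the target's children.
    co_parents = {p for c in children for p in dag.predecessors(c)}
    return (children | co_parents) - {target}

# Toy causal graph: Y is the prediction target. X1 causes Y,
# X2 is a child of Y, and X3 shares the child X2 with Y.
g = nx.DiGraph([("X1", "Y"), ("Y", "X2"), ("X3", "X2")])
print(attackable_variables(g, "Y"))  # {'X2', 'X3'}
```

Note that the target's parents (here X1) are excluded: intervening on a cause of the target could change the target's own value, which would defeat the purpose of the attack.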

Methodology Development

Based on these concepts, the paper introduces CADE, a framework for generating Counterfactual ADversarial Examples. The approach formulates realistic attacks through counterfactual reasoning: asking what an example would look like if the world were different in a specific way. CADE generates adversarial examples by intervening on the causes of data features and propagating the consequences of those interventions, so that the result stays consistent with the causal model. Its applications range from attacking image classifiers, where the causal process links latent attributes to pixels, to financial models predicting creditworthiness. Unlike existing attacks that perturb pixel values directly, CADE modifies the latent variables that give rise to those pixel values, keeping the perturbation aligned with the causal relationships inherent in the data; a minimal counterfactual computation is sketched below.
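
Counterfactual computation in a structural causal model typically follows Pearl's three steps: abduction, action, and prediction. The sketch below runs those steps on a hand-written linear SCM over two latents; the mechanism, coefficient, and function name are illustrative assumptions, not CADE's learned generator:

```python
# Linear SCM over latents: z2 := A * z1 + u2, with exogenous noise u2.
A = 0.8

def counterfactual(z1_obs, z2_obs, z1_new):
    # 1) Abduction: recover the exogenous noise from the observation.
    u2 = z2_obs - A * z1_obs
    # 2) Action: intervene on the cause, do(z1 = z1_new).
    z1_cf = z1_new
    # 3) Prediction: push the intervention through the mechanism,
    #    reusing the same noise so the example stays individualized.
    z2_cf = A * z1_cf + u2
    return z1_cf, z2_cf

print(counterfactual(z1_obs=1.0, z2_obs=1.1, z1_new=2.0))  # (2.0, 1.9)
```

In an image setting, the counterfactual latents would then be decoded back to pixel space by a generative model, yielding an adversarial example whose features remain mutually consistent.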

Empirical Validation

The framework is evaluated in diverse attack scenarios, including white-box settings, where the model internals are known, and transfer-based attacks, where adversarial examples crafted against one model are applied to others. The empirical results demonstrate the robustness and efficacy of CADE in generating adversarial examples that successfully fool DNNs. Notably, the strategy retains its effectiveness even when interventions are applied randomly, without specific knowledge of the model's vulnerabilities. These results underscore the value of integrating a causal perspective into the generation of adversarial examples.
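
For reference, the white-box and transfer-based protocols differ only in which model scores the adversarial examples. The sketch below uses hypothetical `source_model` and `target_model` placeholders; the metric itself is the standard attack success rate:

```python
import torch

@torch.no_grad()
def attack_success_rate(model, x_adv, y_true):
    """Fraction of adversarial inputs the classifier gets wrong."""
    preds = model(x_adv).argmax(dim=1)
    return (preds != y_true).float().mean().item()

# White-box: score x_adv on the model the attack was crafted against.
# asr_white = attack_success_rate(source_model, x_adv, y_true)
# Transfer-based: score the same x_adv on an unseen target model.
# asr_transfer = attack_success_rate(target_model, x_adv, y_true)
```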

Conclusion

In conclusion, CADE represents a stride toward realistic and practical adversarial attacks. It encourages further research into incorporating causality in adversarial learning and points toward new avenues for both attack and defense strategies. While CADE advances the generation of adversarial examples, challenges remain, such as the limited availability of causal knowledge in real-world domains. Future work is directed toward methods that can operate with partial causal information and toward extending the framework to physical-world scenarios.
