Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion
Abstract: Adversarial attacks against large language models (LLMs) are a significant concern. In particular, adversarial samples exploit a model's sensitivity to small input changes: while the changes appear to leave the semantics of the input intact, they cause a significant drop in model performance. In this paper, we propose Targeted Paraphrasing via RL (TPRL), an approach that automatically learns a policy to generate challenging samples that are most likely to improve the model's performance. TPRL leverages FLAN-T5, an LLM, as a generator and employs a self-learned policy, trained with proximal policy optimization, to generate adversarial examples automatically. TPRL's reward is based on the confusion induced in the classifier, while a Mutual Implication score ensures the meaning of the original text is preserved. We demonstrate and evaluate TPRL's effectiveness in discovering natural adversarial attacks and improving model performance through extensive experiments on four diverse NLP classification tasks, using both automatic and human evaluation. TPRL outperforms strong baselines, generalizes across classifiers and datasets, and combines the strengths of language modeling and reinforcement learning to generate diverse and influential adversarial examples.
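The reward signal described above combines two ingredients: the confusion the paraphrase induces in the classifier and a Mutual Implication score that guards meaning preservation. The abstract does not give the exact composition, so the sketch below is an illustrative assumption: the function name `tprl_reward`, the multiplicative combination, and the implication threshold are all hypothetical, not the paper's actual formula.

```python
def tprl_reward(p_true_orig: float, p_true_para: float,
                mi_score: float, mi_threshold: float = 0.5) -> float:
    """Hypothetical TPRL-style reward.

    p_true_orig:  classifier probability of the true label on the original text
    p_true_para:  classifier probability of the true label on the paraphrase
    mi_score:     mutual implication score between original and paraphrase (0..1)
    """
    # Confusion: how much the paraphrase lowered confidence in the true label.
    confusion = max(0.0, p_true_orig - p_true_para)
    # Gate on meaning preservation: a paraphrase that drifts semantically
    # earns no reward, however confusing it is to the classifier.
    if mi_score < mi_threshold:
        return 0.0
    # Scale confusion by how faithfully the meaning was preserved.
    return confusion * mi_score
```

In this sketch a paraphrase is rewarded only when it both fools the classifier and still implies (and is implied by) the original sentence, which mirrors the trade-off the abstract describes.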