From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
Abstract: To date, toxicity mitigation in LLMs has focused almost entirely on single-language settings. As LLMs embrace multilingual capabilities, it is crucial that our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning-based mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and of cross-lingual transfer on toxicity mitigation. We further explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study spans a broad array of linguistic families and levels of resource availability, ranging from high- to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering guidance for future research in this increasingly important field. Code and data are available at https://github.com/for-ai/goodtriever.
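The retrieval-augmented mitigation the abstract contrasts with finetuning can be illustrated with a minimal sketch. The idea, in the style of kNN-LM ensembling with separate "toxic" and "non-toxic" datastores, is to retrieve neighbors of the current context from each store, turn them into next-token distributions, and combine them with the base model as a product of experts. Everything below is a toy illustration with made-up embeddings and a 4-token vocabulary, not the paper's implementation; the function names, the L2-distance retrieval, and the `alpha` weighting are all assumptions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def knn_next_token_probs(query, datastore, vocab_size, temperature=1.0):
    """Build a kNN next-token distribution from a datastore of
    (context_embedding, next_token_id) pairs: each stored pair is
    weighted by the softmax of its negative L2 distance to the query,
    and weights are accumulated per vocabulary token."""
    keys = np.stack([k for k, _ in datastore])
    dists = np.linalg.norm(keys - query, axis=1)
    weights = softmax(-dists / temperature)
    probs = np.zeros(vocab_size)
    for w, (_, tok) in zip(weights, datastore):
        probs[tok] += w
    return probs + 1e-9  # floor to avoid log(0) in the ensemble

def ensemble_probs(base_probs, pos_probs, neg_probs, alpha=1.0):
    """Product-of-experts combination: boost tokens favored by the
    non-toxic datastore, penalize tokens favored by the toxic one."""
    logits = np.log(base_probs) + alpha * (np.log(pos_probs) - np.log(neg_probs))
    return softmax(logits)

# Toy setup: 2-d "embeddings", vocab of 4 tokens, token 3 is the
# continuation the toxic datastore has seen in similar contexts.
query = np.array([1.0, 0.0])
nontoxic_store = [(np.array([1.0, 0.1]), 0), (np.array([0.9, -0.1]), 1)]
toxic_store = [(np.array([1.1, 0.0]), 3), (np.array([0.8, 0.1]), 3)]

base = np.array([0.25, 0.25, 0.25, 0.25])  # uniform base LM for the demo
pos = knn_next_token_probs(query, nontoxic_store, vocab_size=4)
neg = knn_next_token_probs(query, toxic_store, vocab_size=4)
final = ensemble_probs(base, pos, neg)
assert final[3] < base[3]  # the toxic token is suppressed
```

One reason this family of methods suits the continual-mitigation scenario the abstract mentions is visible in the sketch: adapting to new toxic content only requires appending (embedding, token) pairs to a datastore, with no gradient updates to the base model.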