Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages

Published 12 Aug 2021 in cs.CL and cs.CY | (2108.05927v1)

Abstract: With the growth of social media, the spread of hate speech is also increasing rapidly. Social media are widely used in many countries, and hate speech is spreading in these countries as well. This creates a need for multilingual hate speech detection algorithms. Much of the research in this area is currently dedicated to English. The HASOC track intends to provide a platform to develop and optimize hate speech detection algorithms for Hindi, German and English. The dataset is collected from a Twitter archive and pre-classified by a machine learning system. HASOC has two sub-tasks for all three languages: Task A is a binary classification problem (Hate and Not Offensive), while Task B is a fine-grained classification problem with three classes: hate speech (HATE), OFFENSIVE and PROFANITY. Overall, 252 runs were submitted by 40 teams. The best classification algorithms for Task A achieved F1 measures of 0.51, 0.53 and 0.52 for English, Hindi, and German, respectively. For Task B, the best classification algorithms achieved F1 measures of 0.26, 0.33 and 0.29 for English, Hindi, and German, respectively. This article presents the tasks and the data development as well as the results. The best performing algorithms were mainly variants of the transformer architecture BERT. However, other systems were also applied with good success.




Summary

The paper introduces a multilingual framework for hate speech and offensive content detection using binary and ternary classification tasks.
The submitted systems apply supervised machine learning, chiefly transformer models such as BERT but also a BiLSTM with fastText embeddings; the best systems achieve F1 scores of about 0.52 for coarse (binary) classification.
The study highlights challenges in addressing linguistic nuances and sampling bias, calling for future research on multimodal analysis and improved hate speech detection.

Analysis of HASOC Track at FIRE 2020: Multilingual Hate Speech Detection

The HASOC track at FIRE 2020 addresses the critical challenge of detecting hate speech and offensive content across Indo-European languages, specifically Hindi, German, and English. The proliferation of social media has exacerbated the dissemination of hate speech, necessitating robust, multilingual algorithms for effective identification. While significant research has concentrated on English, this track emphasizes the development of detection systems for languages with less resource availability, thus contributing to the broader field of hate speech identification.

Methodology and Tasks

The HASOC track operationalized its objectives through two tasks for each language. Task A was a coarse binary classification that distinguishes hate-offensive content (HOF) from non-hate-offensive content (NOT). Task B was a more granular classification that partitions content into three categories: hate speech (HATE), offensive speech (OFFN), and profanity (PRFN).
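The relationship between the two label sets can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name coarse_label is hypothetical, and representing a non-offensive post as None is an assumption made for this example, not the encoding used in the track's data files.

```python
from typing import Optional

# Label names as given in the task description above.
TASK_A_LABELS = ("HOF", "NOT")
TASK_B_LABELS = ("HATE", "OFFN", "PRFN")


def coarse_label(fine_label: Optional[str]) -> str:
    """Map a Task B label to the corresponding Task A label.

    Assumption for this sketch: a post without any fine-grained
    label (None) is treated as non-hate-offensive (NOT).
    """
    if fine_label is None:
        return "NOT"
    if fine_label not in TASK_B_LABELS:
        raise ValueError(f"unknown Task B label: {fine_label!r}")
    # Every fine-grained category is a kind of hate-offensive content.
    return "HOF"
```

Under these assumptions, coarse_label("PRFN") yields "HOF" and coarse_label(None) yields "NOT", which reflects that Task B refines only the hate-offensive side of Task A.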

The dataset, sourced from a Twitter archive, was pre-classified using a supervised machine learning approach. The creation of this dataset is noteworthy for its attempted reduction of sampling bias, employing a strategy that eschews reliance on predetermined keywords, which has historically introduced notable bias into hate speech datasets.

Results and Observations

A total of 252 runs from 40 teams were submitted and evaluated using the F1 measure. For Task A, the best systems reached F1 scores of 0.51 (English), 0.53 (Hindi), and 0.52 (German). For Task B, the best scores were lower: 0.26 (English), 0.33 (Hindi), and 0.29 (German). Most of the best-performing submissions were built on transformer architectures, chiefly BERT and derivatives such as ALBERT and DistilBERT, reflecting their standing as the current standard in natural language processing for such tasks. Notably, a BiLSTM model using fastText embeddings also delivered competitive results.
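To make the evaluation measure concrete, the following minimal sketch computes per-class F1 and its unweighted mean over classes (macro-averaged F1). The text above only says "F1 measure", so the choice of macro averaging is an assumption of this example, and the function names and toy labels are hypothetical.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


# Toy Task A example (invented labels, not track data).
gold = ["HOF", "HOF", "NOT", "NOT"]
pred = ["HOF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred, ["HOF", "NOT"])
```

In the toy example, the HOF class has precision 1.0 and recall 0.5 (F1 of about 0.67), the NOT class has precision of about 0.67 and recall 1.0 (F1 of 0.8), so the macro average is about 0.73. Because every class counts equally, a system that ignores a rare class is penalized even if its overall accuracy is high.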

Implications and Future Directions

The results underscore the complexity of accurately identifying hate speech, particularly in multilingual contexts. The relatively low F1-scores indicate the inherent challenges in distinguishing subtle linguistic nuances and biases associated with hate speech. This calls for further refinement and innovation in algorithm development.

Practically, effective hate speech detection in multiple languages can aid platforms in curbing harmful content, thus fostering healthier online environments. Theoretically, expanding research to encompass less studied languages provides valuable insights into comparative linguistic structures and social dynamics reflected in online discourse.

Future research could benefit from integrating multimodal analysis, since visual elements often accompany textual hate speech online. Exploring the intersection of hate speech and misinformation is also becoming more important, as malicious actors frequently propagate false information under the guise of offensive rhetoric.

The HASOC track serves as a pivotal initiative contributing to the ongoing evolution of hate speech detection systems, advocating for balanced approaches that respect free speech while maintaining societal decorum. The continued expansion of benchmarking efforts and exploration of novel machine learning techniques remain crucial to enhancing detection efficacy.
