Evaluate deduplication impact on the Parody dataset

Demonstrate the effect of removing duplicate and near-duplicate posts on political parody classification performance for the "Analyzing Political Parody in Social Media" Twitter dataset (Parody). Train models under three training-set configurations: (i) Original, a random 80/20 train–test split with standard preprocessing; (ii) w/o Duplicates, removing from the training set all posts identical to test posts; and (iii) w/o Near-Duplicates, further removing from the training set posts that are near-duplicates of test posts, as determined by Levenshtein distance with a threshold of 20. This fills a gap: the original study could not run these experiments because the copy of the dataset the authors obtained was incomplete.
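The near-duplicate filtering step in configuration (iii) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Levenshtein distance is computed on raw post strings and that a training post is dropped when its distance to any test post is at or below the threshold (which also subsumes exact duplicates, distance 0).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def remove_near_duplicates(train_posts, test_posts, threshold=20):
    """Keep only training posts whose distance to every test post exceeds threshold."""
    return [p for p in train_posts
            if all(levenshtein(p, t) > threshold for t in test_posts)]
```

In practice an optimized library (e.g., rapidfuzz with an early-exit score cutoff) would be preferable, since the naive all-pairs comparison is quadratic in the number of posts.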

Background

The paper conducts a meta-analysis of 20 social media datasets across multiple Computational Social Science tasks to assess data quality issues arising from duplicate and near-duplicate posts and evaluates how deduplication affects model performance.

For most datasets, the authors report results under three configurations (Original, w/o Duplicates, w/o Near-Duplicates) to quantify label leakage and performance overestimation. However, they were unable to run these experiments for the Parody dataset because the copy they obtained was incomplete, leaving the impact of duplication on this dataset unreported.
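The label leakage that these configurations are designed to quantify can be illustrated with a minimal sketch (a hypothetical helper, not from the paper) that measures the fraction of test posts appearing verbatim in the training set; the paper's pipeline may additionally normalize posts (e.g., lowercasing, URL stripping) before comparison.

```python
def leakage_rate(train_posts, test_posts):
    """Fraction of test posts that also occur verbatim in the training set.

    A non-zero rate means the model can score those test posts by
    memorization, inflating reported performance.
    """
    train_set = set(train_posts)
    return sum(t in train_set for t in test_posts) / len(test_posts)
```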

Given that the dataset specifications table shows that the Parody dataset contains near-duplicates (with a non-trivial reduction after near-duplicate removal), completing this evaluation would close a gap in the comparative analysis and clarify the deduplication effects for political parody detection.

References

"We were unable to conduct experiments on the Parody dataset due to the incomplete dataset we obtained."

Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research (2410.03545 - Mu et al., 2024), Table 3 caption and Section 5 (Impact of Duplicate and Near-duplicate Samples)