Evaluate deduplication impact on the Parody dataset
Demonstrate how removing duplicate and near-duplicate posts affects political parody classification performance on the "Analyzing Political Parody in Social Media" Twitter dataset (Parody). Train models under three training-set configurations: (i) Original, a random 80/20 train/test split with standard preprocessing; (ii) w/o Duplicates, which removes from the training set every post identical to a test post; and (iii) w/o Near-Duplicates, which additionally removes from the training set every post that is a near-duplicate of a test post, as determined by Levenshtein distance with a threshold of 20. These experiments were not conducted previously because the dataset obtained at the time was incomplete.
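The three configurations can be sketched in Python as follows. This is a minimal illustration, not the original study's implementation. The function names, the representation of posts as (text, label) pairs, the fixed random seed, and the reading of "threshold 20" as "edit distance of at most 20" are all assumptions made for this sketch; the source does not say whether the comparison is strict.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def split_original(posts, test_fraction=0.2, seed=0):
    """(i) Original: random train/test split (80/20 by default)."""
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def drop_duplicates(train, test):
    """(ii) w/o Duplicates: drop training posts identical to a test post."""
    test_texts = {text for text, _ in test}
    return [(text, label) for text, label in train if text not in test_texts]


def drop_near_duplicates(train, test, threshold=20):
    """(iii) w/o Near-Duplicates: also drop training posts whose edit
    distance to any test post is <= threshold (assumed inclusive)."""
    test_texts = [text for text, _ in test]
    kept = []
    for text, label in train:
        if any(levenshtein(text, other) <= threshold for other in test_texts):
            continue
        kept.append((text, label))
    return kept
```

Because identical posts have distance 0, configuration (iii) subsumes configuration (ii). The pairwise comparison is quadratic in the number of posts, so a full-size run would likely need a faster edit-distance library or a length-based prefilter. Note also that with an absolute threshold of 20, any two posts that are both at most 20 characters long count as near-duplicates regardless of content.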
References
We were unable to conduct experiments on the Parody dataset because the copy of the dataset we obtained was incomplete.