Contrastive News and Social Media Linking using BERT for Articles and Tweets across Dual Platforms
Abstract: X (formerly Twitter) has evolved into a contemporary agora, offering a platform for individuals to express opinions and viewpoints on current events. Most topics discussed on Twitter relate directly to ongoing events, making it an important source for monitoring public discourse. However, linking tweets to specific news articles is challenging because of their concise and informal nature. Previous approaches, including topic models, graph-based models, and supervised classifiers, have fallen short of capturing the distinctive characteristics of tweets and articles. Inspired by the success of the CLIP model in computer vision, which uses contrastive learning to model similarities between images and captions, this paper introduces a contrastive learning approach for training a representation space in which linked articles and tweets lie close together. We present CATBERT (Contrastive Articles Tweets BERT), a contrastive learning approach that leverages pre-trained BERT models. The model is trained and tested on a dataset of manually labeled English and Polish tweets and articles related to the Russian-Ukrainian war. We evaluate CATBERT against traditional approaches such as LDA, as well as a novel method based on OpenAI embeddings that has not previously been applied to this task. Our findings indicate that CATBERT performs best at associating tweets with relevant news articles. Furthermore, we evaluate the models on finding the main topic (represented by an article) of an entire cascade of tweets; in this new task, we report each model's performance as a function of cascade size.
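The CLIP-style objective the abstract refers to can be illustrated as a symmetric cross-entropy over a batch similarity matrix, where row i of the tweet batch is paired with row i of the article batch. The sketch below is only illustrative: the function name, the temperature value, and the use of plain NumPy are our assumptions, not details from the paper.

```python
import numpy as np

def clip_style_loss(tweet_emb, article_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    tweet_emb, article_emb: (batch, dim) arrays; row i of each
    array corresponds to a linked tweet/article pair.
    """
    # L2-normalise so the dot product is cosine similarity
    t = tweet_emb / np.linalg.norm(tweet_emb, axis=1, keepdims=True)
    a = article_emb / np.linalg.norm(article_emb, axis=1, keepdims=True)
    logits = (t @ a.T) / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        # Cross-entropy with the diagonal (the true pairs) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the tweet-to-article and article-to-tweet directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimising this loss pulls each tweet embedding toward its linked article and pushes it away from the other articles in the batch, which is what makes nearest-neighbour retrieval of articles for tweets (or whole cascades) possible at inference time.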
- Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. 2020. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 1644–1650. https://doi.org/10.18653/v1/2020.findings-emnlp.148
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv:2004.05150 [cs.CL]
- Michał Brzozowski and Marek Wachnicki. 2023. BELT (BERT For Longer Texts). Retrieved 2023-10-13 from https://github.com/mim-solutions/bert_for_longer_texts
- Justine Calma. 2023. Twitter just closed the book on academic research. Retrieved 2023-10-09 from https://www.theverge.com/2023/5/31/23739084/twitter-elon-musk-api-policy-chilling-academic-research
- Sławomir Dadas. 2022. Polish Longformer. Retrieved 2023-10-13 from https://huggingface.co/sdadas/polish-longformer-base-4096/tree/main
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
- Hugging Face. 2023. MTEB Leaderboard. Retrieved 2023-10-11 from https://huggingface.co/spaces/mteb/leaderboard
- Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems, C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.), Vol. 26. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2013/file/7cce53cf90577442771720a370c3c723-Paper.pdf
- OpenAI. 2022. New and improved embedding model. Retrieved 2023-10-11 from https://openai.com/blog/new-and-improved-embedding-model
- Weiwei Guo, Hao Li, Heng Ji, and Mona Diab. 2013. Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, 239–249. https://aclanthology.org/P13-1024
- Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. 2015. Learning Visual Features from Large Weakly Supervised Data. arXiv:1511.02251 [cs.CV]
- Tweet-Recommender: Finding Relevant Tweets for News Articles. The Web Conference (2015). https://doi.org/10.1145/2740908.2742716
- Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a Social Network or a News Media?. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, North Carolina, USA) (WWW '10). Association for Computing Machinery, New York, NY, USA, 591–600. https://doi.org/10.1145/1772690.1772751
- J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 1 (March 1977), 159–174. https://doi.org/10.2307/2529310
- Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. 2022. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. Retrieved 2023-10-13 from https://openreview.net/forum?id=S7Evzt9uit3
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
- Event mining and timeliness analysis from heterogeneous news streams. Information Processing & Management 56, 3 (2019), 969–993. https://doi.org/10.1016/j.ipm.2019.02.003
- Linking Tweets with Monolingual and Cross-Lingual News using Transformed Word Embeddings. CoRR abs/1710.09137 (2017). arXiv:1710.09137 http://arxiv.org/abs/1710.09137
- Robert Mroczkowski, Piotr Rybak, Alina Wróblewska, and Ireneusz Gawlik. 2021. HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Kiyv, Ukraine, 1–10. https://www.aclweb.org/anthology/2021.bsnlp-1.1
- Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1532–1543. https://doi.org/10.3115/v1/D14-1162
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
- Richard Socher and Li Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 966–973. https://doi.org/10.1109/CVPR.2010.5540112
- Andreas Spitz and Michael Gertz. 2018. Exploring Entity-centric Networks in Entangled News Streams. In Companion Proceedings of The Web Conference 2018 (WWW '18). ACM Press, Lyon, France, 555–563. https://doi.org/10.1145/3184558.3188726
- A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles. In Advances in Information Retrieval, Gabriella Pasi, Benjamin Piwowarski, Leif Azzopardi, and Allan Hanbury (Eds.). Vol. 10772. Springer International Publishing, Cham, 780–786. https://doi.org/10.1007/978-3-319-76941-7_76 Series Title: Lecture Notes in Computer Science.
- Unified Representation of Twitter and Online News Using Graph and Entities. Frontiers in Big Data (2021). https://doi.org/10.3389/fdata.2021.699070
- TrelBERT: A pre-trained encoder for Polish Twitter. In Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Association for Computational Linguistics, Dubrovnik, Croatia, 17–24. https://aclanthology.org/2023.bsnlp-1.3
- Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. 2020. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747 [cs.CV]
- Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011. Comparing Twitter and Traditional Media Using Topic Models. In Advances in Information Retrieval, Paul Clough, Colum Foley, Cathal Gurrin, Gareth J. F. Jones, Wessel Kraaij, Hyowon Lee, and Vanessa Murdock (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 338–349.
- Is Self-Supervised Contrastive Learning More Robust Than Supervised Learning? https://openreview.net/forum?id=FPdDFUVYVPl