Contrastive News and Social Media Linking using BERT for Articles and Tweets across Dual Platforms

Published 11 Dec 2023 in cs.CL and cs.LG | arXiv:2312.07599v1

Abstract: X (formerly Twitter) has evolved into a contemporary agora, offering a platform for individuals to express opinions and viewpoints on current events. The majority of topics discussed on Twitter relate directly to ongoing events, making it an important source for monitoring public discourse. However, linking tweets to specific news articles presents a significant challenge due to their concise and informal nature. Previous approaches, including topic models, graph-based models, and supervised classifiers, have fallen short of effectively capturing the distinct characteristics of tweets and articles. Inspired by the success of the CLIP model in computer vision, which employs contrastive learning to model similarities between images and captions, this paper introduces CATBERT (Contrastive Articles Tweets BERT), a contrastive learning approach that leverages pre-trained BERT models to train a representation space in which linked articles and tweets lie close together. The model is trained and tested on a dataset of manually labeled English and Polish tweets and articles related to the Russian-Ukrainian war. We evaluate CATBERT against traditional approaches such as LDA, as well as a novel method based on OpenAI embeddings that has not previously been applied to this task. Our findings indicate that CATBERT performs best at associating tweets with relevant news articles. Furthermore, we evaluate the models on a new task: identifying the main topic, represented by an article, of an entire cascade of tweets, and report how their performance varies with cascade size.
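The training signal the abstract describes, pulling linked tweet/article pairs together in a shared representation space while pushing unlinked pairs apart, is the CLIP-style symmetric contrastive objective. The sketch below illustrates that objective only; it is not the paper's implementation. It assumes batched, fixed-size embeddings (e.g. pooled BERT outputs for tweets and articles) and a hypothetical temperature of 0.07, and uses NumPy in place of a deep-learning framework.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(tweet_emb, article_emb, temperature=0.07):
    """CLIP-style symmetric cross-entropy over an in-batch similarity matrix.

    tweet_emb, article_emb: (N, d) arrays where row i of each side forms a
    linked tweet/article pair. For row i, column i is the positive; the other
    N - 1 columns act as in-batch negatives.
    """
    t = l2_normalize(tweet_emb)
    a = l2_normalize(article_emb)
    logits = t @ a.T / temperature  # (N, N) scaled cosine similarities

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    loss_t2a = -np.diag(log_softmax(logits, axis=1)).mean()  # tweet -> article
    loss_a2t = -np.diag(log_softmax(logits, axis=0)).mean()  # article -> tweet
    return (loss_t2a + loss_a2t) / 2
```

In actual training the two embedding matrices would come from the tweet and article encoders and the loss would be minimized by gradient descent; the NumPy version only makes the objective concrete. Minimizing it drives each tweet's embedding toward its linked article and away from the other articles in the batch, which is what later allows nearest-neighbor retrieval of the relevant article.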

References
  1. Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  2. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 1644–1650. https://doi.org/10.18653/v1/2020.findings-emnlp.148
  3. Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv:2004.05150 [cs.CL]
  4. Michał Brzozowski and Marek Wachnicki. 2023. BELT (BERT For Longer Texts). Retrieved 2023-10-13 from https://github.com/mim-solutions/bert_for_longer_texts
  5. Justine Calma. 2023. Twitter just closed the book on academic research. Retrieved 2023-10-09 from https://www.theverge.com/2023/5/31/23739084/twitter-elon-musk-api-policy-chilling-academic-research
  6. Sławomir Dadas. 2022. Polish Longformer. Retrieved 2023-10-13 from https://huggingface.co/sdadas/polish-longformer-base-4096/tree/main
  7. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
  8. Hugging Face. 2023. MTEB Leaderboard. Retrieved 2023-10-11 from https://huggingface.co/spaces/mteb/leaderboard
  9. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems, C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.), Vol. 26. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2013/file/7cce53cf90577442771720a370c3c723-Paper.pdf
  10. OpenAI. 2022. New and improved embedding model. Retrieved 2023-10-11 from https://openai.com/blog/new-and-improved-embedding-model
  11. Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, 239–249. https://aclanthology.org/P13-1024
  12. Learning Visual Features from Large Weakly Supervised Data. arXiv:1511.02251 [cs.CV]
  13. Tweet-Recommender: Finding Relevant Tweets for News Articles. The Web Conference (2015). https://doi.org/10.1145/2740908.2742716
  14. What is Twitter, a Social Network or a News Media?. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, North Carolina, USA) (WWW ’10). Association for Computing Machinery, New York, NY, USA, 591–600. https://doi.org/10.1145/1772690.1772751
  15. J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 1 (mar 1977), 159. https://doi.org/10.2307/2529310
  16. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. Retrieved 2023-10-13 from https://openreview.net/forum?id=S7Evzt9uit3
  17. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
  18. Event mining and timeliness analysis from heterogeneous news streams. Information Processing & Management 56, 3 (2019), 969–993. https://doi.org/10.1016/j.ipm.2019.02.003
  19. Linking Tweets with Monolingual and Cross-Lingual News using Transformed Word Embeddings. CoRR abs/1710.09137 (2017). arXiv:1710.09137 http://arxiv.org/abs/1710.09137
  20. HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Kiyv, Ukraine, 1–10. https://www.aclweb.org/anthology/2021.bsnlp-1.1
  21. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  22. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1532–1543. https://doi.org/10.3115/v1/D14-1162
  23. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
  24. Richard Socher and Li Fei-Fei. 2010. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 966–973. https://doi.org/10.1109/CVPR.2010.5540112
  25. Andreas Spitz and Michael Gertz. 2018. Exploring Entity-centric Networks in Entangled News Streams. In Companion Proceedings of The Web Conference 2018 (WWW ’18). ACM Press, Lyon, France, 555–563. https://doi.org/10.1145/3184558.3188726
  26. A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles. In Advances in Information Retrieval, Gabriella Pasi, Benjamin Piwowarski, Leif Azzopardi, and Allan Hanbury (Eds.). Vol. 10772. Springer International Publishing, Cham, 780–786. https://doi.org/10.1007/978-3-319-76941-7_76 Series Title: Lecture Notes in Computer Science.
  28. Unified Representation of Twitter and Online News Using Graph and Entities. Frontiers in Big Data (2021). https://doi.org/10.3389/fdata.2021.699070
  29. TrelBERT: A pre-trained encoder for Polish Twitter. In Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Association for Computational Linguistics, Dubrovnik, Croatia, 17–24. https://aclanthology.org/2023.bsnlp-1.3
  30. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747 [cs.CV]
  31. Comparing Twitter and Traditional Media Using Topic Models. In Advances in Information Retrieval, Paul Clough, Colum Foley, Cathal Gurrin, Gareth J. F. Jones, Wessel Kraaij, Hyowon Lee, and Vanessa Mudoch (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 338–349.
  32. Is Self-Supervised Contrastive Learning More Robust Than Supervised Learning? https://openreview.net/forum?id=FPdDFUVYVPl
