Multimodal Coreference Resolution for Chinese Social Media Dialogues: Dataset and Benchmark Approach

Published 19 Apr 2025 in cs.CL | (2504.14321v2)

Abstract: Multimodal coreference resolution (MCR) aims to identify mentions referring to the same entity across different modalities, such as text and visuals, and is essential for understanding multimodal content. In the era of rapidly growing mutimodal content and social media, MCR is particularly crucial for interpreting user interactions and bridging text-visual references to improve communication and personalization. However, MCR research for real-world dialogues remains unexplored due to the lack of sufficient data resources. To address this gap, we introduce TikTalkCoref, the first Chinese multimodal coreference dataset for social media in real-world scenarios, derived from the popular Douyin short-video platform. This dataset pairs short videos with corresponding textual dialogues from user comments and includes manually annotated coreference clusters for both person mentions in the text and the coreferential person head regions in the corresponding video frames. We also present an effective benchmark approach for MCR, focusing on the celebrity domain, and conduct extensive experiments on our dataset, providing reliable benchmark results for this newly constructed dataset. We will release the TikTalkCoref dataset to facilitate future research on MCR for real-world social media dialogues.

Abstract PDF Upgrade to Chat

Summary

The paper introduces TikTalkCoref, the first Chinese multimodal coreference resolution dataset tailored for Douyin dialogues.
It presents a robust three-part pipeline integrating textual coreference, visual tracking, and cross-modal alignment through contrastive learning.
Experiments demonstrate enhanced accuracy in entity linking and effective handling of singleton clusters in dynamic social media contexts.

The paper presents TikTalkCoref, the first Chinese multimodal coreference resolution dataset tailored for real-world social media dialogues, specifically from the Douyin platform. It addresses critical gaps in multimodal CR research for naturally occurring dialogues, providing valuable resources and a robust benchmark method for effective evaluation. The study includes detailed analysis on the necessity of multimedia integration in understanding dialogic interactions and introduces a sophisticated pipeline architecture for MCR tasks.

Introduction

Multimodal Coreference Resolution (MCR) expands the traditional coreference resolution domain by integrating visual information with textual context, enabling more comprehensive entity identification across modalities. The advent of short video platforms, such as Douyin, amplifies the need for MCR to improve content personalization and interaction understanding. TikTalkCoref significantly advances this field by offering a dataset with meticulously annotated coreference clusters, linking textual mentions to corresponding visual identifiers within video frames. Consequently, this enhances the fidelity of CR models to real-world social media dialogues.

Figure 1: An example of TikTalkCoref.

TikTalkCoref's annotations, demonstrated in Figure 1, underscore its focus on person entities within texts and visuals, establishing an innovative mapping between dialogues and video representations.

Previous MCR endeavors, primarily in English, centered on human-machine dialogue or media descriptions, lacking coverage for dynamic social interactions. Datasets like MPII-MD and VisPro illustrate multimodal CR by aligning textual narratives with images or videos, but fail to capture the nuanced interactions present in social dialogues. TikTalkCoref fills this void, presenting the first Chinese dataset to address this complexity in naturally occurring social media contexts.

Figure 2: The overview of our model architecture.

Methodology

The proposed approach for MCR involves a three-part pipeline: textual coreference resolution, visual person tracking, and cross-modal alignment, as depicted in Figure 2.

Textual Coreference Resolution

The textual resolution framework utilizes a DeBERTa-based Maverick model to extract mentions and cluster them, employing a coarse-to-fine antecedent scoring approach. This is instrumental in handling the high variance in mention types and densities typical to social media dialogues.

Visual Person Tracking

Employing YOLOv5 and DeepSORT algorithms, the model achieves robust tracking of head regions across video frames. The integration of MTCNN and MobileFaceNet ensures accuracy in face recognition and clustering, adapting to the challenges posed by disjointed video segments in social media posts.

Textual and Visual Coreference Alignment

Using Chinese CLIP, the method achieves alignment across modalities via contrastive learning, enhancing retrieval accuracy of coreferential entities in text and visuals. This process aligns coreferential textual clusters with respective visual entities, crucial for establishing seamless multimodal relationships.

Experiments and Results

Experiments demonstrate significant improvements in coreference resolution accuracy using TikTalkCoref. Maverick outperforms existing models significantly, particularly in handling singleton clusters due to its specialized mention clustering technique. The bilingual coreference alignment system achieves high retrieval accuracy, benefiting extensively from domain-specific fine-tuning.

Figure 3: The distribution of gender, age and occupation of persons in the TikTalkCoref.

Empirical analysis reveals insights into gender, age, and occupation distributions, as shown in Figure 3, highlighting demographics predominant in social media contexts, thereby guiding future enhancements in model training paradigms.

Conclusion

TikTalkCoref establishes a crucial foundation for future research aimed at enhancing MCR in dialogic exchanges on social media. The dataset's release facilitates exploration of nuanced, multimodal interactions, advancing both theoretical and practical understanding of CR applications in real-world settings. Future work will explore scaling the dataset and employing semi-supervised methodologies for expanded applicability across diverse multimedia landscapes.