- The paper introduces the Short-video Propagation Influence Rating (SPIR) task, the XS-Video dataset, and the NetGPT model, which integrates graph neural networks (GNNs) with large language models (LLMs) to predict the long-term influence of short videos.
- It draws on data from five major Chinese platforms, capturing multi-dimensional interaction metrics over a two-week observation period after each video is posted.
- Experimental results highlight NetGPT's superior accuracy and reduced error metrics, emphasizing the merit of combining graph structures with language reasoning.
Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model
Introduction
The paper presents a new task, Short-video Propagation Influence Rating (SPIR), together with a dataset, XS-Video, designed for studying how short videos propagate across multiple online platforms. The growth of short-video platforms has produced large propagation networks, but existing research focuses mainly on single popularity metrics such as views or likes. SPIR instead aims to predict the long-term influence of a newly released short video from several dimensions of interaction, including shares, collects, and comments. This makes the SPIR rating a more holistic measure of a video's impact than any single count.
Figure 1: Short-video Propagation Influence Rating (SPIR): Predicting the influence level of a newly posted short-video that can be achieved in a long period.
XS-Video Dataset
XS-Video sets itself apart by incorporating data from five major Chinese platforms (Douyin, Kuaishou, Xigua, Toutiao, and Bilibili), offering breadth that single-platform datasets lack. It includes 117,720 videos and 381,926 samples, with interactions tracked for two weeks after posting. These interaction records are used to annotate each video with a propagation influence level from 0 to 9, giving a richer picture of the factors that drive video propagation across platforms.
Figure 2: An example of short-video states/samples collected in our XS-Video dataset. The text is translated into English.
The dataset is constructed through daily updates of newly posted videos and of the interaction metrics of videos already collected, so that annotations reflect a comprehensive view of each video's influence. The broad coverage of interaction indicators (views, likes, shares, collects, fans, and comments) lets researchers study video propagation dynamics along several dimensions at once.
Figure 3: Brief construction procedure of our XS-Video dataset: (1) Daily update of new short-videos and the interactive information of already collected short-videos; (2) Alignment of multi-dimensional interactive indicators (collected 2 weeks later than the publication of videos) for annotating the video propagation influence levels.
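The annotation step in (2) can be illustrated with a small sketch. This summary does not state how the aligned indicators are combined into a level, so everything below is an assumption made for illustration: the `Indicators` record, the equal weighting of indicators, and the log-scale binning into levels 0 to 9 are hypothetical and are not the authors' annotation rule.

```python
import math
from dataclasses import dataclass


@dataclass
class Indicators:
    """Interaction counts aligned two weeks after publication (hypothetical record)."""
    views: int
    likes: int
    shares: int
    collects: int
    comments: int
    fans: int


def influence_level(ind: Indicators, max_level: int = 9) -> int:
    """Map aligned indicators to an integer level in 0..max_level.

    Assumption: average log10(count + 1) over all indicators, round down,
    and clamp. The real XS-Video annotation rule may differ.
    """
    counts = [ind.views, ind.likes, ind.shares, ind.collects, ind.comments, ind.fans]
    score = sum(math.log10(c + 1) for c in counts) / len(counts)
    return max(0, min(max_level, int(score)))
```

The point of the sketch is only that a level is derived from several indicators jointly rather than from views or likes alone; any monotone scoring rule could be substituted for the log-average used here.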
Proposed Model: NetGPT
SPIR requires a model that can exploit both the structure of large propagation graphs and the content of the videos themselves. The authors introduce NetGPT, a Large Graph Model (LGM) that integrates Graph Neural Networks (GNNs) with large language models (LLMs) such as Qwen2-VL. NetGPT is trained in three stages: heterogeneous graph pretraining, supervised language fine-tuning, and task-oriented predictor fine-tuning. This design lets NetGPT bridge the structural information in graph data and the reasoning capabilities of LLMs.
Figure 4: Framework of our proposed NetGPT model: (1) Pretrain a heterogeneous GNN to obtain the features of the video nodes; (2) Train a graph projector to bridge the GNN feature space and the LLM embedding space by supervised instruction fine-tuning; (3) Fine-tune the model with an additional predictor to obtain the final influence level of the short-videos.
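Stage (2) hinges on mapping GNN node features into the LLM's embedding space. A minimal sketch of that idea follows, assuming the projector is a single linear layer and that the projected feature is prepended to the text embeddings as one extra "graph token". Neither choice is specified in this summary; the `GraphProjector` class, the `build_llm_inputs` helper, the dimensions, and the random initialisation are hypothetical, and training is omitted.

```python
import random
from typing import List, Sequence


class GraphProjector:
    """Hypothetical linear map from a GNN feature vector to an LLM-sized embedding."""

    def __init__(self, gnn_dim: int, llm_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Weight matrix of shape (llm_dim, gnn_dim) with a zero bias; values are untrained.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(gnn_dim)] for _ in range(llm_dim)
        ]
        self.bias = [0.0] * llm_dim

    def __call__(self, node_feature: Sequence[float]) -> List[float]:
        if len(node_feature) != len(self.weight[0]):
            raise ValueError("node feature has the wrong dimension")
        # One dot product per output dimension, plus the bias term.
        return [
            sum(w * x for w, x in zip(row, node_feature)) + b
            for row, b in zip(self.weight, self.bias)
        ]


def build_llm_inputs(
    text_embeddings: Sequence[Sequence[float]],
    node_feature: Sequence[float],
    projector: GraphProjector,
) -> List[List[float]]:
    """Prepend the projected video-node feature to the text token embeddings."""
    return [projector(node_feature)] + [list(e) for e in text_embeddings]
```

For example, `build_llm_inputs(tokens, feature, GraphProjector(4, 8))` returns a sequence one element longer than `tokens`, with every element of length 8. In a real system the projector would be a trainable layer in a deep-learning framework rather than plain Python lists.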
Experimental Results
Experiments on XS-Video show that NetGPT outperforms existing approaches. The model outperforms GNNs, LLMs, and multimodal LLMs on the SPIR task by capturing the interactions encoded in video propagation graphs (Table 1). Its higher accuracy and lower error metrics underscore the value of integrating graph-structured data with LLMs for video influence analysis.
Table 1: The results of Short-video Propagation Influence Rating (SPIR) on the XS-Video dataset. ↑ denotes the higher the better and ↓ denotes the lower the better.
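To make the ↑/↓ convention concrete, the sketch below computes one higher-is-better metric and two lower-is-better metrics for predicted influence levels. It assumes the metrics are classification accuracy, mean absolute error (MAE), and mean squared error (MSE); this summary only mentions "accuracy" and "error metrics", so the exact set in Table 1 may differ, and the `spir_metrics` name is hypothetical.

```python
from typing import Dict, Sequence


def spir_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return accuracy (higher is better) and MAE/MSE (lower is better) for levels 0..9."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    return {
        "acc": sum(t == p for t, p in pairs) / n,          # exact-level matches
        "mae": sum(abs(t - p) for t, p in pairs) / n,      # average level distance
        "mse": sum((t - p) ** 2 for t, p in pairs) / n,    # penalises large misses more
    }
```

Because the levels are ordinal, the error metrics add information that accuracy alone hides: predicting level 8 for a level-9 video counts as a miss for accuracy but contributes only 1 to the absolute error.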
Furthermore, ablation studies suggest that both the video content features and an intact graph structure contribute substantially to prediction quality. Evaluating predictions across different observation periods shows that NetGPT's performance improves consistently as the observation period lengthens.


Figure 5: Results of long-, median-, and short-term prediction with the observation times of ≤3 days, ≤7 days, and >7 days.
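The grouping behind this comparison can be sketched as follows. The sketch assumes each evaluation sample carries its observation time in days, and that "≤7 days" means more than 3 and at most 7 days; both are assumptions, as are the `bucket` and `accuracy_by_bucket` names. A short observation time corresponds to long-term prediction, since more of the two-week window is still unseen.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket(observation_days: float) -> str:
    """Assign a sample to an observation-time bucket (boundaries assumed)."""
    if observation_days <= 3:
        return "<=3 days"   # long-term prediction
    if observation_days <= 7:
        return "<=7 days"   # medium-term prediction
    return ">7 days"        # short-term prediction


def accuracy_by_bucket(
    samples: Iterable[Tuple[float, int, int]]
) -> Dict[str, float]:
    """samples: (observation_days, true_level, predicted_level) triples."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for days, true_level, pred_level in samples:
        key = bucket(days)
        totals[key] += 1
        hits[key] += int(true_level == pred_level)
    return {key: hits[key] / totals[key] for key in totals}
```

Only buckets that actually contain samples appear in the result, so an empty bucket never causes a division by zero.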
Conclusion
The introduction of XS-Video alongside the SPIR task is a significant step toward understanding short-video propagation dynamics. By combining GNNs and LLMs, NetGPT shows how graph-structured propagation data and LLM reasoning can jointly improve predictions of video influence in complex propagation networks. This work sets the stage for further exploration of cross-platform video analysis and its applications in areas such as advertising, content recommendation, and social network analysis. The open availability of the dataset and code should make these findings easier to build on and encourage continued research in this domain.