VK-LSVD: A Large-Scale Industrial Dataset for Short-Video Recommendation

Published 4 Feb 2026 in cs.IR and cs.CY | (2602.04567v2)

Abstract: Short-video recommendation presents unique challenges, such as modeling rapid user interest shifts from implicit feedback, but progress is constrained by a lack of large-scale open datasets that reflect real-world platform dynamics. To bridge this gap, we introduce the VK Large Short-Video Dataset (VK-LSVD), the largest publicly available industrial dataset of its kind. VK-LSVD offers an unprecedented scale of over 40 billion interactions from 10 million users and almost 20 million videos over six months, alongside rich features including content embeddings, diverse feedback signals, and contextual metadata. Our analysis supports the dataset's quality and diversity. The dataset's immediate impact is confirmed by its central role in the live VK RecSys Challenge 2025. VK-LSVD provides a vital, open dataset to use in building realistic benchmarks to accelerate research in sequential recommendation, cold-start scenarios, and next-generation recommender systems.

Abstract PDF Upgrade to Chat

Summary

The paper presents VK-LSVD, a dataset with over 40 billion interactions and diverse multi-modal metadata for realistic short-video recommendation studies.
It employs latent factor analysis to reveal heavy-tailed distributions and demographic structuring that influence user preferences.
Benchmark experiments using methods like iALS highlight challenges in sparsity and long-tail effects, setting new standards for evaluation.

VK-LSVD: A Large-Scale Industrial Dataset for Advancing Short-Video Recommendation

Motivation and Context

Short-video platforms present inherent challenges for recommender systems: rapid context shifts, a predominant reliance on implicit signals (e.g., watch time), multi-modal content, and evolving user preferences. Academic progress in the field is hampered by the lack of public datasets with sufficient interaction scale, temporal granularity, and realistic, multi-faceted feedback that collectively characterize industrial short-video ecosystems. Compared to established datasets such as Netflix Prize and MovieLens, which are limited in side information and sequential structure, and Kuaishou-derived releases that are user- or temporally restricted, there remains a significant gap for comprehensive datasets facilitating realistic experimentation. VK-LSVD is designed to directly address this vacuum and provide a robust public foundation for recommender systems research.

Dataset Characteristics and Technical Details

VK-LSVD comprises over 40 billion user-item interactions, sourced from 10 million users and nearly 20 million videos over a 6-month window. Crucially, the dataset captures rich, multi-modal content-side embeddings (compressed via SVD), user socio-demographics, a spectrum of implicit (watch time), explicit (like/dislike), viral (shares), and deep engagement (e.g., comment opens, bookmarks) feedback channels, and comprehensive contextual metadata for each interaction (including platform, consumption context, and client agent). Anonymization is enforced across all categorical fields, fully removing PII and complying with privacy regulations.

The dataset’s structure adheres to best engineering practices:

Chronologically ordered events partitioned by week (train/valid/test) with an enforced global temporal split for real-world sequential evaluation.
Static user and item metadata files supporting activity- or popularity-based subset sampling for diverse experimental setups.
Item embeddings in 64-dimensional float16 arrays, supporting scalable and efficient heterogeneous modeling.

Every user-item pair records parameters for the first exposure; watch time aggregates successive replays up to a bounded cap. The dataset’s density is 0.0208%, emphasizing the sparsity characteristic of industrial recommender settings.

Statistical Properties and Latent Factor Analysis

VK-LSVD systematically demonstrates the behavioral and statistical regularities of real-world platforms. User activity and item popularity distributions are shown to be heavy-tailed, indicative of a small fraction of highly active users and viral content drivers, with the majority of users and items exhibiting low to moderate engagement.

Figure 1: Key user and item statistics reveal characteristic heavy-tailed distributions, with the majority of users and items responsible for a small fraction of interactions but a small subset dominating overall engagement.

Latent factor analysis via iALS (with positive signals defined as watch time $>$ 10s and positive feedback, negative signals as dislikes) reveals:

User latent similarities are most structured by demographic attributes, particularly gender and age cohorts.
Item latent vectors show high cosine similarity for items from the same author, and even higher for items sharing full metadata (content clusters, duration, author).
Figure 2: Cosine similarity matrices of iALS latent factors as a function of user and item attributes, supporting strong demographic and content cluster structuring in the user-item embedding space.

These analyses confirm the suitability of VK-LSVD for probing both traditional collaborative filtering paradigms and advanced, context-aware, multi-modal recommendation tasks. The diversity and completeness of attributes facilitate the study of user preference evolution, session dynamics, and content cold-start.

Benchmarking and Evaluation Protocols

VK-LSVD provides subsets (e.g., 1% most active users/items) and utility sampling scripts, allowing researchers to adapt the data to realistic computational constraints while preserving ecosystem characteristics. Baseline experiments (Random, Global Popularity, Conversion, iALS) establish initial performance profiles for the dataset, with iALS achieving an NDCG@20 of 0.06554 on random splits, underscoring the challenge inherent in the data’s extreme sparsity, long-tail structure, and feedback diversity.

The live VK RecSys Challenge 2025 leverages VK-LSVD as its core dataset, with a distinct cold-start user ranking task evaluated via NDCG@100 per video and an enforced limit on recommendations per user, reflecting practical production constraints. With nearly 800 teams participating, anticipated results will provide a direct benchmark for evaluating sequential and content-based models under real-world constraints.

Practical and Theoretical Implications

VK-LSVD sets a new experimental standard for large-scale, multi-modal, and context-rich short-video recommendation research. Its design enables:

Benchmarking session-based, sequential, context-aware, and hybrid recommender models at scale.
Systematic analysis of cold-start and long-tail scenarios, including item and user-side cold-start, addressing limitations in earlier datasets.
Investigations of bias and fairness, given the inclusion of demographic fields and the potential for algorithmic disparate impact.

From a theoretical perspective, VK-LSVD facilitates research on compositional representations (via content embeddings), structural dynamics (demographic and content-driven latent spaces), and robust evaluation protocols leveraging temporally-disjoint splits.

Future Directions

VK-LSVD unlocks substantive opportunities in algorithmic development, fairness analysis, and application to novel tasks such as real-time engagement prediction, trend detection, and fair, privacy-preserving personalization. Anticipated directions include leveraging graph-based and transformer architectures for cross-modal sequence modeling, unbiased evaluation using realistic negative sampling, and domain-adaptive transfer scenarios.

VK-LSVD also motivates comparative studies with domain-specialized datasets (e.g., Yambda-5B for music, T-ECD for synthetic cross-domain tasks), investigating domain transfer and generalization in recommender systems with highly heterogeneous feedback and content encoding.

Conclusion

VK-LSVD provides a foundational infrastructure for short-video recommendation research, supporting scalable, realistic evaluations for next-generation personalized systems. Its combination of scale, feedback diversity, content embeddings, and contextual richness enables the community to address open challenges in sequential modeling, cold-start intervention, and fairness, while bridging the gap between academic research and industrial deployment (2602.04567).

Markdown Report Issue