KuaiRec: A 99.6% Dense Short‑Video Recommendation Dataset for Unbiased and Interactive Recommendation Research
This article introduces KuaiRec, a densely observed short‑video recommendation dataset with 99.6% density that covers 1,411 users and 3,327 videos. It describes the dataset's structure, its advantages over sparse public datasets, and its use in unbiased, interactive, conversational, and reinforcement‑learning‑based recommendation research.
This week we share a resource paper from Kuaishou and USTC that releases KuaiRec, a nearly fully observed dense dataset containing interactions of 1,411 users with 3,327 short videos, achieving an extraordinary 99.6% density (most public recommendation datasets are below 1%). The dataset can be used for offline A/B testing as well as for unbiased recommendation, interactive/conversational recommendation, and reinforcement‑learning‑based recommendation research.
Paper: https://arxiv.org/abs/2202.10842
Data: https://rec.ustc.edu.cn/share/598635c0-9585-11ec-8259-414ede1f8d4f
Code: http://m6z.cn/5U6xyQ
Most offline recommendation datasets are highly sparse (density often below 1%) and carry various biases, both of which can severely distort offline evaluation. Existing remedies rely on datasets that include a set of randomly sampled interactions (e.g., Yahoo, Coat), but these remain sparse and therefore still suffer from sparsity‑induced bias. KuaiRec, collected from Kuaishou’s short‑video platform, is the first dataset with a density above 99%.
The dataset provides two scales: a Small matrix with 99.6% density for trustworthy evaluation, and a Big matrix with 13.4% density for model training. The two matrices have no overlap in users or items.
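Density here means the fraction of user‑item cells that contain an observed interaction. The following is a minimal sketch of that calculation on a toy interaction log; the function name `matrix_density` and the toy data are illustrative assumptions, not part of the KuaiRec release.

```python
def matrix_density(interactions, n_users=None, n_items=None):
    """Fraction of user-item cells that hold an observed interaction.

    `interactions` is an iterable of (user, item) pairs. If the user or
    item count is not given, it is inferred from the pairs themselves.
    """
    observed = set(interactions)  # de-duplicate repeated (user, item) pairs
    if n_users is None:
        n_users = len({u for u, _ in observed})
    if n_items is None:
        n_items = len({i for _, i in observed})
    if n_users == 0 or n_items == 0:
        return 0.0
    return len(observed) / (n_users * n_items)


# Toy log (made-up data): 3 users x 4 items, 9 of the 12 cells observed.
toy_log = [
    (0, 0), (0, 1), (0, 2), (0, 3),
    (1, 0), (1, 1), (1, 3),
    (2, 2), (2, 3),
]
print(matrix_density(toy_log))  # 0.75

# Number of cells in the Small matrix grid, from the counts quoted above.
print(1411 * 3327)  # 4694397
```

At 99.6% density, nearly all of the roughly 4.69 million cells in the Small matrix grid carry an observed interaction.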
Statistical information of the dataset is shown in the table below (see image). The Big matrix also includes rich side information such as user social networks and item features.
Because the Small matrix covers almost every user‑item pair, there is almost no missing feedback to impute, so a model's recommendations can be scored directly against observed user behavior. This makes the dataset suitable for unbiased recommendation, interactive recommendation, and conversational recommendation, and enables efficient offline A/B testing.
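To illustrate why a fully observed matrix makes offline A/B testing straightforward, here is a minimal sketch that scores two recommendation policies directly against ground‑truth feedback. The policies, the toy values, and the helper `mean_feedback_at_k` are hypothetical and only show the idea; they are not the paper's evaluation protocol.

```python
def mean_feedback_at_k(policy, truth, users, k):
    """Average ground-truth feedback over each user's top-k recommendations.

    `truth[user][item]` is assumed to exist for every pair, which is what a
    fully observed matrix provides; no imputation step is needed.
    """
    total, count = 0.0, 0
    for user in users:
        for item in policy(user)[:k]:
            total += truth[user][item]
            count += 1
    return total / count if count else 0.0


# Made-up dense feedback values for 2 users x 3 videos.
truth = {
    "u1": {"v1": 0.2, "v2": 1.5, "v3": 2.4},
    "u2": {"v1": 1.0, "v2": 0.4, "v3": 0.6},
}


def policy_a(user):
    """Hypothetical policy A: same fixed ranking for everyone."""
    return ["v1", "v2", "v3"]


def policy_b(user):
    """Hypothetical policy B: the reverse ranking."""
    return ["v3", "v2", "v1"]


users = list(truth)
print(round(mean_feedback_at_k(policy_a, truth, users, k=2), 3))  # 0.775
print(round(mean_feedback_at_k(policy_b, truth, users, k=2), 3))  # 1.225
```

With a sparse dataset, many of the `truth[user][item]` lookups would be missing, and the comparison would depend on how those gaps were filled.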
Experiments that compare partially observed subsets (sampled from the Small matrix) with the fully observed data reveal two key findings: (1) bias in how the observed interactions are collected strongly affects both measured model performance and the relative ranking of models; (2) changing the data density also leads to inconsistent results.
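The general idea of deriving partially observed data from a dense matrix can be sketched as follows. This is an illustrative toy, not the paper's sampling procedure: the matrix is synthetic, and `sample_observed`, `item_means`, and the exposure weights are assumptions made up for the example.

```python
import random


def sample_observed(dense, density, item_weight=None, seed=0):
    """Keep each cell with probability density * item_weight[item], capped at 1."""
    rng = random.Random(seed)
    observed = {}
    for (user, item), value in dense.items():
        weight = item_weight[item] if item_weight else 1.0
        if rng.random() < min(1.0, density * weight):
            observed[(user, item)] = value
    return observed


def item_means(cells):
    """Average feedback value per item over the given cells."""
    sums, counts = {}, {}
    for (_, item), value in cells.items():
        sums[item] = sums.get(item, 0.0) + value
        counts[item] = counts.get(item, 0) + 1
    return {item: sums[item] / counts[item] for item in sums}


# Synthetic fully observed matrix: 50 users x 20 items (made-up values).
dense = {(u, i): float((u * 7 + i * 3) % 5) for u in range(50) for i in range(20)}

# Uniform sampling at roughly 20% density.
uniform = sample_observed(dense, density=0.2)
# Hypothetical exposure bias: the first five items are shown four times as often.
weights = {i: 2.0 if i < 5 else 0.5 for i in range(20)}
biased = sample_observed(dense, density=0.2, item_weight=weights)

full_means = item_means(dense)
for name, subset in [("uniform", uniform), ("biased", biased)]:
    est = item_means(subset)
    # Mean absolute gap between subset-based and full-matrix item averages.
    gap = sum(abs(est[i] - full_means[i]) for i in est) / max(len(est), 1)
    print(name, len(subset), round(gap, 3))
```

Because the full matrix is available, any statistic or model score computed on a sampled subset can be checked against its fully observed counterpart.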
The original dataset is treated as explicit feedback, with each interaction recording how long the user watched the video. To convert it to implicit feedback for ranking tasks, the authors recommend treating a view whose watch time exceeds twice the video's length (i.e., the user watched the whole video at least twice) as a positive sample.
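The thresholding rule above can be sketched in a few lines. The field names `play_duration` and `video_duration`, the function `to_implicit`, and the sample records are assumptions for illustration; check the released files for the actual schema.

```python
def to_implicit(records, threshold=2.0):
    """Label a view as positive when watch time exceeds threshold x video length.

    Each record is a dict with hypothetical keys `user_id`, `video_id`,
    `play_duration`, and `video_duration` (same time unit for both durations).
    """
    labeled = []
    for rec in records:
        duration = rec["video_duration"]
        if duration <= 0:
            continue  # skip malformed rows instead of dividing by zero
        watch_ratio = rec["play_duration"] / duration
        labeled.append({
            "user_id": rec["user_id"],
            "video_id": rec["video_id"],
            "watch_ratio": watch_ratio,
            "label": 1 if watch_ratio > threshold else 0,
        })
    return labeled


# Made-up views, durations in milliseconds.
views = [
    {"user_id": 1, "video_id": 10, "play_duration": 25000, "video_duration": 10000},
    {"user_id": 1, "video_id": 11, "play_duration": 20000, "video_duration": 10000},
    {"user_id": 2, "video_id": 10, "play_duration": 5000, "video_duration": 10000},
]
print([row["label"] for row in to_implicit(views)])  # [1, 0, 0]
```

The sketch uses a strict comparison, so a view that lasts exactly twice the video's length is labeled negative; whether the boundary case should count as positive is a choice to confirm against the paper.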
The paper further evaluates a range of algorithms on KuaiRec in a conversational recommendation scenario; readers are encouraged to consult the paper for the detailed experimental settings.
Finally, the authors hope KuaiRec becomes a testing platform for future research. Potential uses include building trustworthy user simulators from partially observed data, and employing the Small matrix as a benchmark for studies on bias, interactive recommendation, and evaluation. By releasing a fully observed dataset, they aim to motivate the community to collect richer, more complete data to advance recommendation research.
DataFunSummit