KuaiRec: A 99.6% Dense Short‑Video Recommendation Dataset for Unbiased and Interactive Recommendation Research

The article introduces KuaiRec, a densely observed short‑video recommendation dataset with 99.6% density covering 1,411 users and 3,327 videos, discusses its structure, advantages over sparse public datasets, and its applicability to unbiased, interactive, conversational and reinforcement‑learning based recommendation studies.

DataFunSummit

Mar 15, 2022

This week we share a resource paper from Kuaishou and USTC that releases KuaiRec, a nearly fully observed dense dataset containing interactions of 1,411 users with 3,327 short videos, achieving an extraordinary 99.6% density (most public recommendation datasets are below 1%). The dataset can be used for offline A/B testing as well as for unbiased recommendation, interactive/conversational recommendation, and reinforcement‑learning‑based recommendation research.

Paper: https://arxiv.org/abs/2202.10842
Data: https://rec.ustc.edu.cn/share/598635c0-9585-11ec-8259-414ede1f8d4f
Code: http://m6z.cn/5U6xyQ

Most offline recommendation datasets are highly sparse (density often below 1%) and carry various biases, both of which undermine the reliability of offline evaluation. Existing mitigations rely on small sets of randomly sampled interactions (e.g., Yahoo, Coat), but these sets are themselves sparse and so still inherit sparsity‑induced bias. KuaiRec, collected from Kuaishou’s short‑video platform, is the first dataset to reach a density above 99% (99.6%).

The dataset provides two scales: a Small matrix with 99.6% density for trustworthy evaluation, and a Big matrix with 13.4% density for model training. The two matrices have no overlap in users or items.
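Density here means the share of the user-item grid that has an observed interaction. The following is a minimal sketch of that calculation on a hypothetical toy log; the function name `density` and the toy data are illustrative assumptions, not part of the KuaiRec release.

```python
def density(interactions, n_users, n_items):
    """Fraction of the user-item grid covered by at least one interaction."""
    observed_pairs = set(interactions)  # repeated views of a pair count once
    return len(observed_pairs) / (n_users * n_items)


# Hypothetical toy log of (user, item) pairs: 3 users x 4 items.
toy_log = [
    (0, 0), (0, 1), (0, 2),
    (1, 0), (1, 1), (1, 3),
    (2, 0), (2, 2), (2, 3),
    (0, 0),  # duplicate view of an already observed pair
]
toy_density = density(toy_log, n_users=3, n_items=4)
print(toy_density)  # 0.75 (9 distinct pairs out of 12 cells)

# Grid size of the Small matrix, from the user and video counts given earlier.
small_cells = 1411 * 3327
print(small_cells)  # 4694397
```

At 99.6% density, roughly 4.68 million of those 4.69 million cells are observed.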

Dataset statistics are summarized in a table in the paper. The Big matrix also includes rich side information such as user social networks and item features.

Because the Small matrix covers almost every user‑item pair, the true feedback for a recommended item can be looked up directly rather than imputed, so missing‑value handling is largely unnecessary. This makes the dataset suitable for unbiased, interactive, and conversational recommendation research, and enables efficient offline A/B testing.
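To make the offline A/B testing idea concrete, here is a minimal sketch in which two recommendation policies are scored against a small, fully observed feedback matrix. The matrix values, the two policies, and the helper `evaluate_policy` are hypothetical and only illustrate the lookup; they are not taken from the paper.

```python
def evaluate_policy(policy, feedback, k=2):
    """Average ground-truth feedback over each user's top-k recommendations."""
    total, count = 0.0, 0
    for user, row in enumerate(feedback):
        for item in policy(user, len(row))[:k]:
            total += row[item]  # every cell is observed, so no imputation
            count += 1
    return total / count


# Hypothetical fully observed feedback (e.g. watch ratios): 3 users x 4 items.
feedback = [
    [0.2, 1.5, 2.3, 0.1],
    [0.9, 0.4, 2.1, 1.0],
    [2.5, 0.3, 0.6, 1.2],
]


def policy_a(user, n_items):
    """Hypothetical policy A: always rank items in index order."""
    return list(range(n_items))


def policy_b(user, n_items):
    """Hypothetical policy B: always rank items in reverse index order."""
    return list(range(n_items - 1, -1, -1))


score_a = evaluate_policy(policy_a, feedback)
score_b = evaluate_policy(policy_b, feedback)
print(score_a < score_b)  # True: policy B earns more feedback on this toy data
```

With a sparse matrix, most of the `row[item]` lookups would hit unobserved cells, which is why such a comparison normally requires imputation or an online test.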

Experiments that compare a partially observed subset (derived from the Small matrix) with the fully observed dataset reveal two key findings: (1) bias dramatically influences model performance and ranking; (2) different data densities still lead to inconsistent results.
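The first finding can be illustrated with a toy example. The sketch below is not the paper's experimental protocol: it assumes a hypothetical 4 x 4 relevance matrix, an exposure pattern in which only two popular items were ever shown, and the common (assumed) practice of counting unobserved pairs as negatives.

```python
def precision_at_1(recs, labels, observed=None):
    """Share of users whose single recommended item is labeled relevant.

    If `observed` is given, pairs outside it are counted as negatives,
    mimicking evaluation on a partially observed dataset.
    """
    hits = 0
    for user, item in enumerate(recs):
        seen = observed is None or (user, item) in observed
        hits += labels[user][item] if seen else 0
    return hits / len(recs)


# Hypothetical ground truth: labels[u][i] == 1 means user u likes item i.
labels = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

# Biased exposure: only the two popular items (0 and 1) were ever shown.
observed = {(u, i) for u in range(4) for i in (0, 1)}

popular_recs = [0, 0, 0, 0]       # recommend the popular item to everyone
personal_recs = [3, 2, 2, 3]      # recommend an item each user truly likes

full_pop = precision_at_1(popular_recs, labels)
full_pers = precision_at_1(personal_recs, labels)
part_pop = precision_at_1(popular_recs, labels, observed)
part_pers = precision_at_1(personal_recs, labels, observed)

print(full_pop, full_pers)  # 0.5 1.0 -> personalized model wins on full data
print(part_pop, part_pers)  # 0.5 0.0 -> ranking flips on the biased subset
```

The personalized model is penalized only because its recommendations fall outside what was exposed, not because they are wrong.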

The original dataset is explicit feedback. To convert it to implicit feedback for ranking tasks, the authors recommend treating an interaction as a positive sample when the watch time is at least twice the video's length (i.e., the user effectively watched the video through twice).
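A minimal sketch of that conversion follows. The field names `play_duration` and `video_duration`, the sample rows, and the choice of `>=` at exactly twice the length are assumptions made for illustration; check the released files for the actual schema.

```python
def to_implicit(play_duration, video_duration, threshold=2.0):
    """Return 1 if watch time is at least `threshold` times the video length."""
    if video_duration <= 0:
        return 0  # guard against malformed rows; treated as negative here
    watch_ratio = play_duration / video_duration
    return 1 if watch_ratio >= threshold else 0


# Hypothetical interaction rows (durations in milliseconds).
rows = [
    {"user_id": 1, "video_id": 10, "play_duration": 4000, "video_duration": 6000},
    {"user_id": 1, "video_id": 11, "play_duration": 21000, "video_duration": 10000},
    {"user_id": 2, "video_id": 10, "play_duration": 12000, "video_duration": 6000},
]

labels = [to_implicit(r["play_duration"], r["video_duration"]) for r in rows]
print(labels)  # [0, 1, 1]
```

The threshold is exposed as a parameter so that other cut-offs can be compared against the recommended value of 2.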

The paper further evaluates a range of algorithms on KuaiRec in a conversational recommendation scenario; readers are encouraged to consult the paper for the detailed experimental settings.

Finally, the authors hope KuaiRec becomes a testing platform for future research. Potential uses include building trustworthy user simulators from partially observed data, and employing the Small matrix as a benchmark for studies on bias, interactive recommendation, and evaluation. By releasing a fully observed dataset, they aim to motivate the community to collect richer, more complete data to advance recommendation research.
