How Kuaishou Achieved High‑Precision, Low‑Latency Danmu Blocking with AI
To keep dense on‑screen comments from obscuring key video content, Kuaishou’s audio‑video team built a high‑precision, low‑latency intelligent danmu‑blocking system. It combines image‑segmentation masks, temporal‑stability enhancements, SSIM‑based scene‑cut detection, and a large‑scale annotated dataset to deliver robust, real‑time protection across diverse video scenarios.
In the era of bullet‑screen (danmu) videos, dense comments often cover important scenes, degrading user experience. Kuaishou’s audio‑video team developed a high‑precision, low‑latency intelligent danmu‑blocking solution that automatically detects user‑interesting regions and routes danmu around them, enabling immersive viewing and interactive commenting simultaneously.
Background
Traditional adaptive danmu‑blocking methods rely on person‑masking, which can suffer from mis‑detections and latency, leading to visual artifacts such as mask flickering and incorrect blocking.
Improving Mask Precision
The team designed a high‑precision mask‑generation algorithm based on the U2‑Net image‑segmentation network [1]. To enhance temporal stability, they incorporated a non‑local module [2] that aggregates features from previous frames.
Temporal Stability
By extracting features from the current frame and the preceding T‑1 frames, feeding them into the non‑local module, and using the first column of the output as the refined feature map, the mask becomes temporally consistent across consecutive frames.
Additionally, the previous frame’s mask is used as guidance to further strengthen stability.
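The aggregation described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not Kuaishou’s implementation: the learned query/key/value projections of a real non‑local block are replaced by identity maps, and the prior‑mask guidance channel is omitted. Current‑frame tokens attend over tokens from all T frames, which corresponds to taking the current frame’s slice (the "first column") of the non‑local output as the refined feature map.

```python
import numpy as np

def nonlocal_temporal_refine(feats):
    """Refine the current frame's features by attending over T frames.

    feats: array of shape (T, C, H, W); index 0 is the current frame,
    indices 1..T-1 are the preceding frames. Sketch only: learned
    projections are omitted (identity Q/K/V).
    """
    T, C, H, W = feats.shape
    # Flatten every frame into HW tokens of dimension C.
    tokens = feats.transpose(0, 2, 3, 1).reshape(T * H * W, C)
    q = tokens[: H * W]                        # current-frame queries
    # Scaled dot-product similarity against tokens from all T frames.
    attn = q @ tokens.T / np.sqrt(C)           # (HW, T*HW)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)    # row-wise softmax
    refined = attn @ tokens                    # (HW, C)
    return refined.reshape(H, W, C).transpose(2, 0, 1)  # (C, H, W)
```

Because each output token is a similarity‑weighted average over all frames, regions that are stable across frames reinforce each other, which is what suppresses frame‑to‑frame mask flicker.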
Transition Stability
During scene transitions, relying solely on temporal information can cause mask lag. The team introduced an SSIM‑based switch: if the structural similarity between consecutive frames is high, temporal information is retained; otherwise, it is discarded, eliminating mask delay. The SSIM computation is optimized to run within 1 ms.
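A simplified version of this switch can be written as below. The production SSIM is windowed and heavily optimized to run within 1 ms; this sketch uses a single global window over grayscale frames, and the `threshold` value is an assumption for illustration, not Kuaishou’s tuned setting.

```python
import numpy as np

def ssim_global(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM over two whole grayscale frames (simplified)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def use_temporal_features(prev_frame, cur_frame, threshold=0.5):
    """Scene-cut switch: keep temporal context only if frames are similar.

    High SSIM -> same scene, retain temporal information.
    Low SSIM  -> scene cut, discard temporal information to avoid mask lag.
    """
    return ssim_global(prev_frame, cur_frame) >= threshold
```

The key design point is that SSIM compares structure rather than raw pixel differences, so it stays high under small motions and lighting changes but drops sharply at a hard cut.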
Scene Robustness
To cover the wide variety of user‑generated video scenarios, a comprehensive data‑annotation pipeline was built, encompassing data collection, filtering, multi‑model labeling, and quality assessment.
Targeted Scene Optimization
Human‑mask robustness was improved by training on a million‑scale dataset covering diverse scenes such as mukbang, street interviews, and movies. Background mis‑detections were reduced by collecting extensive samples of animals, plants, and natural landscapes and fine‑tuning the model accordingly.
Mask Delay Optimization
Two main causes of mask delay were identified:

1. Inconsistent video transcoding results across different bitrate streams.
2. Renderer lag, where the mask generated for frame T−1 is applied to frame T.
To resolve these, transcoding parameters were aligned to ensure identical timestamps across bitrate variants, and the player rendering pipeline was synchronized so that mask rendering keeps pace with video playback.
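The renderer-side fix amounts to selecting masks by presentation timestamp instead of blindly applying the most recently decoded mask. The sketch below is a hypothetical illustration of that alignment (the function name, dictionary layout, and tolerance are assumptions, not Kuaishou’s actual renderer API):

```python
def select_mask(masks, frame_pts, tolerance_ms=10):
    """Return the mask whose pts matches the video frame being rendered.

    masks: dict mapping mask pts (ms) -> mask payload.
    Using the nearest-pts mask, rather than the latest one, prevents the
    one-frame lag where frame T is rendered with frame T-1's mask.
    Returns None when no mask is close enough (e.g. masks not yet decoded).
    """
    best_pts = min(masks, key=lambda pts: abs(pts - frame_pts))
    if abs(best_pts - frame_pts) <= tolerance_ms:
        return masks[best_pts]
    return None
```

With transcoding parameters aligned so that every bitrate variant carries identical timestamps, the same pts lookup works regardless of which stream the player is currently on.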
Results
Extensive testing across varied content (films, food, live interviews, multi‑person scenes, rapid cuts, large motions) showed a subjective accuracy exceeding 95% for danmu blocking, confirming the effectiveness of the proposed enhancements.
References
[1] Qin X, Zhang Z, Huang C, et al. U2‑Net: Going deeper with nested U‑structure for salient object detection. Pattern Recognition, 2020, 106: 107404.
[2] Wang X, Girshick R, Gupta A, et al. Non‑local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7794‑7803.
Kuaishou Audio & Video Technology
Explore the stories behind Kuaishou's audio and video technology.