How Kuaishou Achieved High‑Precision, Low‑Latency Danmu Blocking with AI
To keep dense on‑screen comments from obscuring key video content, Kuaishou’s audio‑video team built a high‑precision, low‑latency intelligent danmu‑blocking system. It combines image‑segmentation masks, temporal‑stability enhancements, SSIM‑based scene‑cut detection, and a large‑scale annotated dataset to deliver robust, real‑time protection across diverse video scenarios.
In the era of bullet‑screen (danmu) videos, dense comments often cover important scenes, degrading user experience. Kuaishou’s audio‑video team developed a high‑precision, low‑latency intelligent danmu‑blocking solution that automatically detects user‑interesting regions and routes danmu around them, enabling immersive viewing and interactive commenting simultaneously.
Background
Traditional adaptive danmu‑blocking methods rely on person‑masking, which can suffer from mis‑detections and latency, leading to visual artifacts such as mask flickering and incorrect blocking.
Improving Mask Precision
The team designed a high‑precision mask‑generation algorithm based on the U2‑Net image‑segmentation network [1]. To enhance temporal stability, they incorporated a non‑local module [2] that aggregates features from previous frames.
Temporal Stability
By extracting features from the current frame and the preceding T‑1 frames, feeding them into the non‑local module, and using the first column of the output as the refined feature map, the mask becomes temporally consistent across consecutive frames.
Additionally, the previous frame’s mask is used as guidance to further strengthen stability.
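The aggregation described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not Kuaishou’s implementation: the learned query/key/value projections of a real non‑local block are replaced by identity maps, and the prior‑mask guidance channel is omitted. Current‑frame tokens attend over tokens from all T frames, which corresponds to taking the current frame’s slice (the "first column") of the non‑local output as the refined feature map.

```python
import numpy as np

def nonlocal_temporal_refine(feats):
    """Refine the current frame's features by attending over T frames.

    feats: array of shape (T, C, H, W); index 0 is the current frame,
    indices 1..T-1 are the preceding frames. Sketch only: learned
    projections are omitted (identity Q/K/V).
    """
    T, C, H, W = feats.shape
    # Flatten every frame into HW tokens of dimension C.
    tokens = feats.transpose(0, 2, 3, 1).reshape(T * H * W, C)
    q = tokens[: H * W]                        # current-frame queries
    # Scaled dot-product similarity against tokens from all T frames.
    attn = q @ tokens.T / np.sqrt(C)           # (HW, T*HW)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)    # row-wise softmax
    refined = attn @ tokens                    # (HW, C)
    return refined.reshape(H, W, C).transpose(2, 0, 1)  # (C, H, W)
```

Because each output token is a similarity‑weighted average over all frames, regions that are stable across frames reinforce each other, which is what suppresses frame‑to‑frame mask flicker.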
Transition Stability
During scene transitions, relying solely on temporal information can cause mask lag. The team introduced an SSIM‑based switch: if the structural similarity between consecutive frames is high, temporal information is retained; otherwise, it is discarded, eliminating mask delay. The SSIM computation is optimized to run within 1 ms.
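A simplified version of this switch can be written as below. The production SSIM is windowed and heavily optimized to run within 1 ms; this sketch uses a single global window over grayscale frames, and the `threshold` value is an assumption for illustration, not Kuaishou’s tuned setting.

```python
import numpy as np

def ssim_global(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM over two whole grayscale frames (simplified)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def use_temporal_features(prev_frame, cur_frame, threshold=0.5):
    """Scene-cut switch: keep temporal context only if frames are similar.

    High SSIM -> same scene, retain temporal information.
    Low SSIM  -> scene cut, discard temporal information to avoid mask lag.
    """
    return ssim_global(prev_frame, cur_frame) >= threshold
```

The key design point is that SSIM compares structure rather than raw pixel differences, so it stays high under small motions and lighting changes but drops sharply at a hard cut.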
Scene Robustness
To cover the wide variety of user‑generated video scenarios, a comprehensive data‑annotation pipeline was built, encompassing data collection, filtering, multi‑model labeling, and quality assessment.
Targeted Scene Optimization
Human‑mask robustness was improved by training on a million‑scale dataset covering diverse scenes such as mukbang, street interviews, and movies. Background mis‑detections were reduced by collecting extensive samples of animals, plants, and natural landscapes and fine‑tuning the model accordingly.
Mask Delay Optimization
Two main causes of mask delay were identified:

1. Inconsistent video transcoding results across different bitrate streams.
2. Renderer lag, where the mask generated for frame T−1 is applied to frame T.
To resolve these, transcoding parameters were aligned to ensure identical timestamps across bitrate variants, and the player rendering pipeline was synchronized so that mask rendering keeps pace with video playback.
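The renderer-side fix amounts to selecting masks by presentation timestamp instead of blindly applying the most recently decoded mask. The sketch below is a hypothetical illustration of that alignment (the function name, dictionary layout, and tolerance are assumptions, not Kuaishou’s actual renderer API):

```python
def select_mask(masks, frame_pts, tolerance_ms=10):
    """Return the mask whose pts matches the video frame being rendered.

    masks: dict mapping mask pts (ms) -> mask payload.
    Using the nearest-pts mask, rather than the latest one, prevents the
    one-frame lag where frame T is rendered with frame T-1's mask.
    Returns None when no mask is close enough (e.g. masks not yet decoded).
    """
    best_pts = min(masks, key=lambda pts: abs(pts - frame_pts))
    if abs(best_pts - frame_pts) <= tolerance_ms:
        return masks[best_pts]
    return None
```

With transcoding parameters aligned so that every bitrate variant carries identical timestamps, the same pts lookup works regardless of which stream the player is currently on.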
Results
Extensive testing across varied content (films, food, live interviews, multi‑person scenes, rapid cuts, large motions) showed a subjective accuracy exceeding 95% for danmu blocking, confirming the effectiveness of the proposed enhancements.
References
[1] Qin X, Zhang Z, Huang C, et al. U2‑Net: Going deeper with nested U‑structure for salient object detection. Pattern Recognition, 2020, 106: 107404.
[2] Wang X, Girshick R, Gupta A, et al. Non‑local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7794‑7803.
Kuaishou Audio & Video Technology
Explore the stories behind Kuaishou's audio and video technology.