
iQIYI AI Bullet‑Screen Masking: Semantic Segmentation System and Engineering Insights

iQIYI’s bullet‑screen masking is a two‑class semantic segmentation pipeline built on DeepLabv3+: a close‑up detector decides which frames to segment, and morphological post‑processing refines the resulting mask. Fine‑tuning on a custom annotated dataset raised IoU from 87.6 % to 93.6 %, and with multi‑GPU chunking a 90‑minute video is processed in about 40 minutes. Upgrades to instance and panoptic segmentation are planned for finer‑grained masking.

iQIYI Technical Product Team

Editors and AI enthusiasts often claim they can tell "artificial intelligence" from "artificial stupidity" at a glance. Yet when shown a screenshot from the iQIYI app of a TV show in which the on‑screen comments (bullet screens) flow around the host’s face, many admitted they could not say with confidence whether the effect was hand‑tuned or machine‑generated.

The phenomenon of "bullet screens covering faces" is common in popular videos, but the iQIYI example demonstrates a sophisticated algorithm that automatically generates a mask to keep comments away from the presenter’s face.

In academia, image segmentation still lags behind object detection, even though many research teams report human‑level performance on detection tasks. Google’s DeepLabv3+ model, pre‑trained on 300 million internal images, achieved state‑of‑the‑art results on PASCAL VOC (IoU 89 %) and Cityscapes (IoU 82.1 %).

Given this research level, the question arises: is the iQIYI "bullet‑screen mask" based on AI or manual rules? The answer is that it is a semantic segmentation system built on DeepLabv3+.

The system performs a two‑class semantic segmentation (foreground vs. background) for each pixel, producing a mask file. The pipeline includes a scene‑type classifier that first determines whether a frame is a close‑up (near‑shot) or a wide‑shot; only close‑ups are passed to the segmentation model, preventing mask jitter on distant shots.
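The gating logic described above can be sketched as follows. This is a minimal illustration, not iQIYI's code: `classify_shot` and `segment_foreground` are hypothetical stand‑ins (a brightness heuristic in place of the trained shot classifier and the DeepLabv3+ model), but the control flow, where wide shots short‑circuit to an empty mask so distant frames never jitter, mirrors the pipeline as described.

```python
import numpy as np

def classify_shot(frame: np.ndarray) -> str:
    """Hypothetical stand-in for the shot-type classifier.

    Here we crudely call a frame a close-up when bright
    (foreground-like) pixels cover a large share of the frame;
    the production system uses a trained classifier instead.
    """
    return "close-up" if (frame > 128).mean() > 0.25 else "wide"

def segment_foreground(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the DeepLabv3+ two-class model: returns a
    binary mask (1 = foreground person, 0 = background)."""
    return (frame > 128).astype(np.uint8)

def mask_for_frame(frame: np.ndarray) -> np.ndarray:
    """Gate segmentation behind the shot classifier: wide shots
    get an all-zero mask, so bullet screens are never rerouted
    (and never jitter) on distant shots."""
    if classify_shot(frame) == "close-up":
        return segment_foreground(frame)
    return np.zeros(frame.shape, dtype=np.uint8)
```

Running the segmentation model only on close‑ups also saves GPU time, since wide shots need no mask at all.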

After segmentation, morphological operations such as erosion and dilation refine the foreground region, and small foreground blobs are removed according to application needs before the mask is compressed and stored.
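A minimal sketch of this refinement step, in pure NumPy so it is self‑contained: an opening (erosion followed by dilation) smooths the mask boundary, then small 4‑connected blobs below an area threshold are dropped. The structuring element, iteration counts, and `min_area` threshold are illustrative assumptions; the production system would tune these per application.

```python
import numpy as np

def dilate(mask, it=1):
    """Binary dilation with a 3x3 square structuring element."""
    m = mask.astype(bool)
    h, w = m.shape
    for _ in range(it):
        p = np.pad(m, 1)  # pad with background
        out = np.zeros_like(m)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out |= p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        m = out
    return m.astype(np.uint8)

def erode(mask, it=1):
    """Erosion is dilation of the complement."""
    return 1 - dilate(1 - mask.astype(np.uint8), it)

def drop_small_blobs(mask, min_area):
    """Zero out 4-connected foreground blobs smaller than min_area."""
    m = mask.astype(np.uint8).copy()
    seen = np.zeros_like(m, dtype=bool)
    h, w = m.shape
    for y in range(h):
        for x in range(w):
            if m[y, x] and not seen[y, x]:
                stack, blob = [(y, x)], []
                seen[y, x] = True
                while stack:  # flood fill one component
                    cy, cx = stack.pop()
                    blob.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and m[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(blob) < min_area:
                    for cy, cx in blob:
                        m[cy, cx] = 0
    return m

def refine(mask, min_area=16):
    """Opening (erode then dilate) smooths edges, then speckle
    blobs are dropped before the mask is compressed and stored."""
    return drop_small_blobs(dilate(erode(mask)), min_area)
```

In practice a library routine (e.g. OpenCV's `cv2.erode`/`cv2.dilate` and `cv2.connectedComponents`) would replace these hand‑rolled loops; the sketch only shows what the post‑processing accomplishes.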

Training data: several tens of thousands of manually annotated frames were collected from iQIYI’s own variety shows (e.g., "China’s New Rap" Season 1 and "Hot Blood Street Dance Team"). General‑purpose datasets like MS‑COCO are insufficient for this domain, so a dedicated dataset was built, leading to an IoU increase from 87.6 % to 93.6 % after fine‑tuning.

Inference speed: on a single GPU, a one‑minute video segment takes a few minutes to process. In production, multiple GPUs and video chunking allow a 90‑minute video to be processed in about 40 minutes, meeting tight broadcast deadlines.
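The chunking arithmetic can be made concrete with a small scheduling estimate. The numbers below are illustrative assumptions consistent with the article's figures (a per‑minute cost of "a few minutes" on one GPU), not measured values; the chunk length and GPU count are hypothetical.

```python
import math

def wall_clock_minutes(video_min, per_min_cost, n_gpus, chunk_min=5):
    """Estimate end-to-end processing time when a video is split
    into fixed-length chunks processed in parallel across GPUs.

    per_min_cost: GPU-minutes of compute per minute of video.
    """
    n_chunks = math.ceil(video_min / chunk_min)
    chunk_cost = chunk_min * per_min_cost   # minutes to process one chunk
    waves = math.ceil(n_chunks / n_gpus)    # sequential rounds of parallel work
    return waves * chunk_cost
```

With, say, 3 GPU‑minutes per video‑minute, 5‑minute chunks, and 8 GPUs, a 90‑minute video takes `wall_clock_minutes(90, 3, 8)` = 45 minutes of wall‑clock time, in the same ballpark as the roughly 40 minutes the team reports, versus 270 minutes on a single GPU.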

Future upgrades include moving from semantic to instance or panoptic segmentation (e.g., fan‑specific masks using Mask‑RCNN and face recognition), refining segmentation granularity (foreground vs. background vs. out‑of‑focus regions), and extending the technology to product logo masking, object extraction, and mobile‑device acceleration.

Overall, the iQIYI bullet‑screen masking system demonstrates a pragmatic, engineering‑driven approach: using a robust but not perfect segmentation model, limiting its application to scenarios where coarse masks suffice, and supplementing it with classification and post‑processing to achieve high‑quality user experience.

Tags: AI, Deep Learning, video processing, semantic segmentation, iQIYI, bullet screen masking