Meituan's AI-Powered Image Intelligent Review System: Watermark Detection, Celebrity Face Recognition, Pornography Detection, and Scene Classification

This article describes Meituan's large‑scale AI‑driven image moderation platform, detailing deep‑learning based watermark detection, celebrity face recognition, pornographic image detection, and scene classification techniques, along with system architecture, data preparation, model evaluation, and deployment considerations.

Qunar Tech Salon

Meituan leverages AI across many of its services, and its image intelligent review system tackles the massive daily volume of user‑uploaded pictures by automatically filtering illegal or non‑compliant content.

Background

Manual review is costly and inconsistent; therefore, a machine‑learning solution is needed to achieve high accuracy and automation rates.

Image intelligent review uses image processing and machine learning to classify pictures as either negative (violating) or positive (acceptable), with uncertain cases sent to human reviewers.

The system runs a negative‑example filter first, followed by a positive‑example filter. Images that either filter can classify confidently are handled automatically, so only the remaining uncertain cases reach manual review.
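The two-filter cascade can be sketched as a simple routing function. This is a minimal illustration, not Meituan's implementation: the score inputs, the threshold values, and the names `route` and `automation_rate` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    image_id: str
    verdict: str  # "reject", "accept", or "manual"


def route(image_id: str, neg_score: float, pos_score: float,
          neg_threshold: float = 0.9, pos_threshold: float = 0.9) -> Decision:
    """Apply the negative filter first, then the positive filter.

    neg_score / pos_score are hypothetical model confidences in [0, 1];
    the thresholds are illustrative, not values from the article.
    """
    if neg_score >= neg_threshold:
        return Decision(image_id, "reject")   # confidently violating
    if pos_score >= pos_threshold:
        return Decision(image_id, "accept")   # confidently acceptable
    return Decision(image_id, "manual")       # uncertain: human review


def automation_rate(decisions: List[Decision]) -> float:
    """Fraction of images that never reach a human reviewer."""
    auto = sum(d.verdict != "manual" for d in decisions)
    return auto / len(decisions)
```

Raising either threshold trades automation rate for accuracy: fewer images are decided automatically, but those decisions are more reliable.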

Both filtering modules involve detection, classification, and recognition, with deep learning as the preferred technology.

Deep‑Learning Based Watermark Detection

Watermarks vary in style, location, size, and background complexity, making detection challenging.

Traditional sliding‑window methods are inefficient because a classifier must be evaluated at every window position and scale. Modern approaches use either region‑proposal‑based detectors (the R‑CNN family) or single‑shot detectors (SSD, YOLO) to reduce computation.
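To see why exhaustive sliding windows are costly, it helps to count the windows a classifier would have to score. The image size, window sizes, and stride below are illustrative assumptions, not figures from the article.

```python
from typing import Iterable, Tuple


def count_windows(img_w: int, img_h: int, win_w: int, win_h: int, stride: int) -> int:
    """Number of window positions for one window size at a fixed stride."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


def total_windows(img_w: int, img_h: int,
                  sizes: Iterable[Tuple[int, int]], stride: int) -> int:
    """Total classifier evaluations across all window sizes."""
    return sum(count_windows(img_w, img_h, w, h, stride) for w, h in sizes)


# A 640x480 image scanned with three window sizes at stride 8
print(total_windows(640, 480, [(64, 32), (128, 64), (256, 128)], 8))  # 9811
```

Even this small configuration needs close to ten thousand classifier evaluations per image, whereas a single-shot detector produces all of its candidate boxes in one forward pass.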

Meituan adopted an SSD framework with a ResNet backbone, training on 25 watermark categories (15,000 images) augmented by random cropping and background synthesis.
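Background synthesis can be illustrated by blending a watermark patch onto a clean background at a random position and recording the resulting bounding box as the training label. The sketch below uses plain nested lists as grayscale images; the function name `synthesize`, the alpha-blending rule, and the box format are assumptions for illustration, since the article does not describe the exact procedure.

```python
import random
from typing import List, Tuple

Image = List[List[float]]          # grayscale pixels in [0, 1]
Box = Tuple[int, int, int, int]    # (x1, y1, x2, y2), x2/y2 exclusive


def synthesize(background: Image, watermark: Image,
               alpha: float, rng: random.Random) -> Tuple[Image, Box]:
    """Blend the watermark onto the background at a random location.

    Returns the new image and the bounding box of the pasted watermark.
    """
    bg_h, bg_w = len(background), len(background[0])
    wm_h, wm_w = len(watermark), len(watermark[0])
    if wm_h > bg_h or wm_w > bg_w:
        raise ValueError("watermark is larger than the background")

    top = rng.randint(0, bg_h - wm_h)    # randint is inclusive on both ends
    left = rng.randint(0, bg_w - wm_w)

    out = [row[:] for row in background]  # copy so the input is untouched
    for y in range(wm_h):
        for x in range(wm_w):
            bg_pixel = out[top + y][left + x]
            out[top + y][left + x] = (1 - alpha) * bg_pixel + alpha * watermark[y][x]

    return out, (left, top, left + wm_w, top + wm_h)
```

Because the paste position is known, every synthesized image comes with an exact label at no annotation cost, which is useful for rare watermark categories.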

Evaluation on 3,197 online images showed that the SSD model outperformed traditional handcrafted‑feature methods in both recall and precision, especially on rare watermarks.
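Precision and recall for a detector are usually computed by matching predicted boxes to ground-truth boxes. The sketch below assumes greedy matching at an IoU threshold of 0.5 and predictions already sorted by descending confidence; the article does not state which matching criterion was used, so both are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds: List[Box], truths: List[Box],
                     iou_thr: float = 0.5) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground truth."""
    matched = set()
    true_positives = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in matched:
                continue
            overlap = iou(p, t)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall
```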

Celebrity Face Recognition

The goal is to detect celebrity faces to avoid infringement; the pipeline includes face detection, landmark detection, alignment, feature extraction, and similarity comparison.
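The final similarity-comparison step can be sketched as a nearest-neighbour lookup over stored celebrity embeddings. Cosine similarity, the 0.6 threshold, and the names `cosine` and `best_match` are assumptions for illustration; the article does not specify the metric or threshold, and real embeddings would come from the feature-extraction network.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: List[float], gallery: Dict[str, List[float]],
               threshold: float = 0.6) -> Tuple[Optional[str], float]:
    """Return the most similar gallery identity, or None if below threshold."""
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score
```

A query face that matches no stored identity above the threshold is treated as a non-celebrity face.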

For face detection, Meituan uses Faster R‑CNN with hard‑negative mining, multi‑scale training, and context fusion. For recognition, it trains Inception‑v3 with a two‑stage strategy (Softmax + CenterLoss) and fine‑tunes the model on a proprietary dataset of 5,200 celebrity IDs.
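The Softmax + CenterLoss combination can be illustrated with the joint objective commonly used in the center-loss literature: cross-entropy plus a weighted penalty on the distance between each feature and its class center. The sketch below is one plausible reading, not the article's training code. The weight `lam` is a hypothetical hyperparameter, the center term is averaged over the batch (some formulations sum instead), and a two-stage schedule could correspond to training with `lam = 0` first and enabling the center term afterwards.

```python
import math
from typing import Dict, List


def softmax_cross_entropy(logits: List[float], label: int) -> float:
    """Numerically stable cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def center_loss(features: List[List[float]], labels: List[int],
                centers: Dict[int, List[float]]) -> float:
    """Mean of 0.5 * squared distance between each feature and its class center."""
    total = 0.0
    for feat, y in zip(features, labels):
        total += 0.5 * sum((f - c) ** 2 for f, c in zip(feat, centers[y]))
    return total / len(features)


def joint_loss(logits_batch: List[List[float]], features: List[List[float]],
               labels: List[int], centers: Dict[int, List[float]],
               lam: float = 0.01) -> float:
    """Cross-entropy plus lam times the center loss (lam is an assumed weight)."""
    ce = sum(softmax_cross_entropy(l, y)
             for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + lam * center_loss(features, labels, centers)
```

The cross-entropy term separates identities, while the center term pulls features of the same identity together, which helps when recognition is done by comparing feature distances.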

Ensemble learning with ten region‑specific models further improves accuracy, achieving competitive results on the LFW benchmark.
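The article does not say how the outputs of the region-specific models are combined. One simple option, shown here purely as an assumption, is a weighted average of the per-region similarity scores for a pair of faces.

```python
from typing import List, Optional


def ensemble_score(region_scores: List[float],
                   weights: Optional[List[float]] = None) -> float:
    """Weighted average of similarity scores from region-specific models.

    region_scores holds one score per face-region model (for example eyes,
    nose, mouth); equal weights are used when none are given.
    """
    if not region_scores:
        raise ValueError("at least one region score is required")
    if weights is None:
        weights = [1.0] * len(region_scores)
    if len(weights) != len(region_scores):
        raise ValueError("weights and region_scores must have the same length")
    return sum(w * s for w, s in zip(weights, region_scores)) / sum(weights)
```

Averaging makes the final score less sensitive to a single region being occluded or badly aligned.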

Pornographic Image Detection

Using a refined multi‑class model (porn, sexy, normal person, other), Meituan improves recall over the Yahoo NSFW baseline.

The system classifies images into "definite porn", "definite non‑porn", and "suspect"; suspects are ranked by confidence and sent for human review, achieving >99% precision for the first two categories while only 3% of images require manual verification.
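The three-way split can be sketched on top of the four-class probabilities. The thresholds, the rule that treats "normal person" plus "other" as the non-porn evidence, and the choice to rank suspects by the porn probability are assumptions for illustration; the article only states that suspects are ranked by confidence.

```python
from typing import Dict, List

Probs = Dict[str, float]  # keys: "porn", "sexy", "normal_person", "other"


def triage(probs: Probs, porn_thr: float = 0.95, clean_thr: float = 0.95) -> str:
    """Map four-class probabilities to one of three review buckets."""
    if probs["porn"] >= porn_thr:
        return "definite_porn"
    if probs["normal_person"] + probs["other"] >= clean_thr:
        return "definite_non_porn"
    return "suspect"


def review_queue(batch: Dict[str, Probs]) -> List[str]:
    """Image IDs needing human review, most likely violations first."""
    suspects = [(image_id, p["porn"]) for image_id, p in batch.items()
                if triage(p) == "suspect"]
    suspects.sort(key=lambda item: item[1], reverse=True)
    return [image_id for image_id, _ in suspects]
```

Tightening the two thresholds raises precision on the automatic buckets at the cost of a longer review queue.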

Video moderation is handled by extracting key frames and applying the same image model.
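A minimal sketch of this idea follows. The article does not say how key frames are chosen or how per-frame results are combined, so the frame-difference rule, the `diff_thr` value, and taking the maximum frame score are all assumptions. Frames are represented as flat lists of pixel values, and `score_frame` stands in for the image model.

```python
from typing import Callable, List

Frame = List[float]  # flattened grayscale pixels in [0, 1]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_key_frames(frames: List[Frame], diff_thr: float = 0.2) -> List[int]:
    """Keep a frame when it differs enough from the last kept frame."""
    keys: List[int] = []
    last_key = None
    for i, frame in enumerate(frames):
        if last_key is None or mean_abs_diff(frame, last_key) > diff_thr:
            keys.append(i)
            last_key = frame
    return keys


def video_score(frames: List[Frame], score_frame: Callable[[Frame], float],
                diff_thr: float = 0.2) -> float:
    """Score a video as the worst (highest) score among its key frames."""
    keys = select_key_frames(frames, diff_thr)
    return max((score_frame(frames[i]) for i in keys), default=0.0)
```

Taking the maximum means a single violating key frame is enough to flag the whole video.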

Scene Classification

Meituan categorizes images across its diverse business verticals (food, travel, etc.) to align with merchant categories and improve presentation.

Transfer learning fine‑tunes deep CNNs (e.g., ResNet) on limited labeled data, freezing shallow layers and retraining deeper layers for specific scene categories such as cuisine types and hotel room styles.
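The freezing decision can be expressed as a rule over parameter names. The sketch below is framework-free; the example names follow the naming used by torchvision's ResNet models (`conv1`, `layer1` to `layer4`, `fc`), and the choice to retrain only `layer4` and `fc` is an assumption, since the article does not say where the freeze boundary lies.

```python
from typing import Dict, Iterable, Tuple


def trainable_flags(param_names: Iterable[str],
                    trainable_prefixes: Tuple[str, ...] = ("layer4.", "fc.")
                    ) -> Dict[str, bool]:
    """Mark a parameter trainable only if it belongs to a deep block.

    Everything else (the shallow layers) is frozen and keeps its
    pretrained weights.
    """
    prefixes = tuple(trainable_prefixes)
    return {name: name.startswith(prefixes) for name in param_names}


# In PyTorch, the flags could be applied roughly like this (not run here):
#   flags = trainable_flags(name for name, _ in model.named_parameters())
#   for name, param in model.named_parameters():
#       param.requires_grad = flags[name]
```

Freezing the shallow layers keeps generic low-level features such as edges and textures intact and leaves far fewer parameters to fit on the small labeled set, which reduces overfitting.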

Experiments on food and hotel scene datasets with tens of thousands of images achieve high accuracy, as shown in the evaluation tables.

Overall, deep learning‑based detection and classification have replaced traditional methods in Meituan's image intelligent review pipeline, enabling large‑scale, high‑precision moderation across multiple business scenarios.

