
DeepHash: Large-Scale Multimedia Content Analysis and Retrieval for Short Video Platforms

DeepHash is Meitu's large-scale short-video analysis and retrieval system. It converts deep-learned visual features into compact binary hash codes using a MobileNet-based CNN trained with a triplet loss, enabling fast, robust similarity search across billions of videos with sub-second latency and minimal storage.

Meitu Technology

Meitu, a company with massive multimedia data, faces the challenge of efficiently analyzing and extracting useful information from this data. Using the Meipai short‑video business as an example, the article introduces the exploration and practice of large‑scale short‑video content analysis and retrieval.

Multimedia similarity retrieval is described as extracting feature representations from different media and then searching and ranking them in the corresponding feature space. Two kinds of feature representations are discussed: traditional visual features (e.g., key‑point descriptors, color histograms) and deep‑learning‑based semantic (deep) features. DeepHash, a large‑scale multimedia retrieval system based on deep hashing, consists of two major modules: algorithms and services.

The article critiques tag‑based video description, pointing out its limited information, discreteness, and inability to capture fine‑grained details. Human description relies on rich visual features, which are continuous and more expressive.

Feature‑based video hashing offers three key properties: diversity (richer, multi‑dimensional information), robustness (similar videos yield similar features), and computable distance (feature similarity can be measured directly).

Two common feature encoding formats are presented: floating‑point and binary. Binary features provide significant advantages in storage efficiency and retrieval speed (using Hamming distance). Consequently, the system adopts binary hash codes.
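The speed advantage of binary codes comes from the Hamming distance, which reduces similarity comparison to an XOR and a popcount. A minimal sketch (the 8-bit codes here are illustrative, not real DeepHash output):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary hash codes stored as ints:
    XOR marks the differing bits, then count the set bits."""
    return bin(a ^ b).count("1")

# Two hypothetical 8-bit codes differing in exactly two bit positions.
code_a = 0b10110010
code_b = 0b10010011
print(hamming_distance(code_a, code_b))  # → 2
```

On modern hardware this maps to a single XOR plus a popcount instruction per machine word, which is why binary codes retrieve orders of magnitude faster than floating-point features.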

The hash‑feature extraction pipeline is as follows: a convolutional neural network extracts video features, maps them to a fixed‑length floating‑point vector, passes the vector through a sigmoid layer to obtain values in [0,1], and then thresholds them to produce binary codes.
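The last two steps of that pipeline (sigmoid squashing, then thresholding) can be sketched as follows; the input vector stands in for a CNN embedding and its values are illustrative:

```python
import math

def binarize(embedding):
    """Map a float feature vector to a binary hash code:
    a sigmoid squashes each value into [0, 1], then a 0.5
    threshold turns each value into one bit."""
    return [1 if 1.0 / (1.0 + math.exp(-x)) > 0.5 else 0
            for x in embedding]

# Hypothetical fixed-length CNN output (values are made up).
features = [2.3, -0.7, 0.1, -3.2]
print(binarize(features))  # → [1, 0, 1, 0]
```

Since the sigmoid exceeds 0.5 exactly when its input is positive, this is equivalent to taking the sign of the raw embedding; the sigmoid matters during training, where it keeps the pre-threshold values differentiable.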

Training employs supervised learning with a Triplet loss to enhance feature expressiveness. The network uses a cascaded architecture with shared features, leveraging MobileNet as the backbone. Five frames per video are sampled, achieving up to 100 videos per second inference on a Titan X GPU.
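The triplet loss pulls an anchor toward a positive sample and pushes it away from a negative by at least a margin. A minimal sketch on toy 2-D embeddings (the margin value is an assumption, not a figure from the article):

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances:
    penalize unless the anchor is closer to the positive than to
    the negative by at least `margin`."""
    d = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    return max(d(anchor, positive) - d(anchor, negative) + margin, 0.0)

# Anchor already much closer to the positive: no loss.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))  # → 0.0
```

Training on relative comparisons like this is what makes the learned features rank similar videos close together, rather than merely classifying them.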

To reduce labeling cost, an automatic labeling strategy is used: a small manually labeled set trains a classifier, confidence thresholds (>99%) determine reliable predictions, and uncertain samples are marked with a placeholder.
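That filtering step can be sketched as below; the toy classifier and the "UNKNOWN" placeholder name are assumptions standing in for the small manually-trained model and whatever marker the real system uses:

```python
def auto_label(samples, classifier, threshold=0.99):
    """Keep classifier predictions only above a confidence threshold;
    uncertain samples get a placeholder label for later handling."""
    labeled = []
    for x in samples:
        label, confidence = classifier(x)
        labeled.append(label if confidence > threshold else "UNKNOWN")
    return labeled

# Toy classifier: confident on positive inputs, uncertain otherwise.
toy = lambda x: ("cat", 0.995) if x > 0 else ("dog", 0.60)
print(auto_label([1, -1], toy))  # → ['cat', 'UNKNOWN']
```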

Multi‑label joint training and shared‑feature techniques are applied to handle the high dimensionality of the multi‑level label system while keeping model complexity manageable.

During training, Triplet loss requires careful selection of positive and negative samples. Positive samples are obtained by extracting interval frames from the same video, while negatives come from different categories, enabling convergence without additional annotation.
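That sampling strategy can be sketched as follows, assuming a mapping from category to lists of per-video frame lists (the data layout is an illustration, not the article's actual format):

```python
import random

def sample_triplet(videos_by_category):
    """Build one training triplet without extra annotation:
    anchor and positive are two interval frames from the same video;
    the negative is a frame from a video in a different category."""
    cat_a, cat_b = random.sample(list(videos_by_category), 2)
    frames = random.choice(videos_by_category[cat_a])
    anchor, positive = random.sample(frames, 2)  # same video, different frames
    negative = random.choice(random.choice(videos_by_category[cat_b]))
    return anchor, positive, negative
```

Because frames of one video are near-duplicates by construction, this yields free positive pairs, which is what lets the triplet loss converge without manual similarity labels.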

For prediction, a category mask is applied to the hash code to hide low‑contribution bits and retain important ones, improving retrieval accuracy.
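Applying such a mask amounts to zeroing out the hidden bits before comparison. A sketch with illustrative 8-bit values (real DeepHash codes are longer, and the mask would come from per-category bit statistics):

```python
def masked_hamming(a: int, b: int, mask: int) -> int:
    """Hamming distance over only the bits a category mask keeps:
    low-contribution bits are zeroed out before counting differences."""
    return bin((a ^ b) & mask).count("1")

# Hypothetical mask keeping only the high four bits of an 8-bit code.
mask = 0b11110000
print(masked_hamming(0b10111010, 0b10100101, mask))  # → 1
```

Without the mask the two codes above differ in five bits; restricting the comparison to the important bits reports a distance of one, which is how masking can improve retrieval accuracy for a given category.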

The service side consists of offline tasks (model training and bulk hash generation) and online tasks (real‑time query handling). Offline tasks periodically retrain the model to accommodate the timeliness of user‑generated content, then regenerate hash codes for historical videos. Online tasks first check if a query video’s hash exists in the feature library; if not, the model predicts a new hash, which is then used for retrieval and added to the library.
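The online path can be sketched as a cache-or-predict lookup followed by a Hamming-distance ranking. All names here (`feature_library` as a dict, `model` as a callable) are assumptions for illustration, not the production interfaces:

```python
def query(video_id, frames, feature_library, model):
    """Online path: reuse the stored hash if the video is already in
    the feature library; otherwise predict one, add it to the library,
    and rank the library by Hamming distance to the query code."""
    code = feature_library.get(video_id)
    if code is None:
        code = model(frames)              # predict a new hash code
        feature_library[video_id] = code  # persist it for future queries
    return sorted(
        (vid for vid in feature_library if vid != video_id),
        key=lambda vid: bin(feature_library[vid] ^ code).count("1"),
    )
```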

System evolution is described across three versions: V1.0 (single‑node, up to millions of features), V2.0 (support for audio features, unified indexing, asynchronous frame extraction), and V3.0 (containerized cluster supporting billions of features with a clustered retrieval architecture).

Performance figures show that a 128‑bit hash can represent a video, requiring only 1.5 GB to store 100 million videos. Retrieval latency is 0.35 s for 10 million videos using 8 instances, and 3 s for 300 million videos using 50 instances.
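The storage figure checks out arithmetically: a 128-bit code is 16 bytes, so 100 million codes occupy about 1.5 GiB:

```python
# 128-bit hash code = 16 bytes per video.
bits_per_code = 128
videos = 100_000_000

gib = videos * (bits_per_code // 8) / 2**30
print(f"{gib:.2f} GiB")  # → 1.49 GiB, matching the quoted ~1.5 GB
```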

The outlook highlights that DeepHash is a generic multimedia retrieval system; future work includes adding image and text modalities and extending to content moderation.

Author: Liu Xu, Meitu Cloud Computer Vision R&D Engineer, MSc from the University of Edinburgh, former researcher at the National Meteorological Satellite Center, focusing on object detection, image/video hashing, and retrieval technologies.

Tags: deep learning · feature extraction · large scale · multimedia retrieval · triplet loss · video hashing
Written by Meitu Technology

Curating Meitu's technical expertise, valuable case studies, and innovation insights. We deliver quality technical content to foster knowledge sharing between Meitu's tech team and outstanding developers worldwide.