Short Video Content Understanding and Generation Practices at Meituan
Meituan leverages computer‑vision techniques to tag, analyze, and automatically generate short videos across consumer and merchant scenarios, detailing hierarchical tag design, self‑supervised representation learning, fine‑grained food recognition, intelligent cover creation, and pixel‑level editing to enhance content discovery and presentation.
Background
Meituan has accumulated massive video data from local life service e‑commerce scenarios. Short videos provide richer information than text or images, enabling diverse content presentation for users and merchants.
Short Video Content Understanding
Video Tagging
Goal: summarize the key concepts in a video so that downstream applications no longer have to treat it as a “black box”. Tags take two forms: explicit textual tags produced by classification models, and implicit vector embeddings used for recommendation and search.
Explicit tags are organized into a hierarchical taxonomy (theme → scene → fine‑grained entity) to support operations such as high‑value content selection.
Example: a food‑exploration video uses three‑level tags—theme “food exploration”, scene “indoor”/“outdoor”, entity “Kung Pao chicken”. Defining useful tags requires joint input from product, operations and algorithm teams and poses visual modeling challenges.
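The three‑level taxonomy can be pictured as a small data structure. The following is a minimal sketch, not Meituan's implementation: the `VideoTags` class, the `select_by_theme` helper, and the sample video ids are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoTags:
    """Three-level explicit tags for one video: theme -> scene -> entity."""
    theme: str
    scene: str
    entities: tuple[str, ...]


def select_by_theme(videos: dict[str, VideoTags], theme: str) -> list[str]:
    """Return the ids of videos whose top-level theme matches, sorted by id."""
    return sorted(vid for vid, tags in videos.items() if tags.theme == theme)


# Hypothetical sample data mirroring the example in the text.
videos = {
    "v1": VideoTags("food exploration", "indoor", ("Kung Pao chicken",)),
    "v2": VideoTags("travel", "outdoor", ()),
    "v3": VideoTags("food exploration", "outdoor", ("hotpot",)),
}
selected = select_by_theme(videos, "food exploration")
```

Filtering on the top level first keeps operations such as content selection cheap; the scene and entity levels can then refine the result.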
Base Representation Learning
A two‑pronged solution improves (1) generic base representations and (2) label‑specific classification performance. Self‑supervised pre‑training on Meituan video data yields features better aligned with business distribution than models pre‑trained on public datasets.
Weakly supervised signals are mined from user reviews: a review that mentions “grilled meat” automatically labels the associated video as “grilled meat”. A teacher model then generates pseudo‑labels, which are filtered by confidence and combined with incremental data to train a student model. Iterative data updates provide larger gains than architecture changes.
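The weak‑labeling and pseudo‑label filtering steps can be sketched as follows. This is a simplified illustration under stated assumptions: keyword matching stands in for review mining, the teacher's outputs are given as plain probability dictionaries, and the 0.9 threshold, function names, and sample data are hypothetical.

```python
def weak_labels_from_review(review: str, vocabulary: list[str]) -> list[str]:
    """Return every vocabulary tag mentioned in the review (case-insensitive)."""
    text = review.lower()
    return [tag for tag in vocabulary if tag in text]


def filter_pseudo_labels(teacher_probs: dict[str, dict[str, float]],
                         threshold: float = 0.9) -> list[tuple[str, str]]:
    """Keep (sample_id, label) pairs whose top teacher probability >= threshold."""
    kept = []
    for sample_id, probs in teacher_probs.items():
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            kept.append((sample_id, label))
    return kept


# Hypothetical teacher outputs for three unlabeled videos.
teacher_probs = {
    "video_a": {"grilled meat": 0.97, "hotpot": 0.03},
    "video_b": {"grilled meat": 0.55, "hotpot": 0.45},
    "video_c": {"grilled meat": 0.08, "hotpot": 0.92},
}
weak = weak_labels_from_review("The grilled meat here is great",
                               ["grilled meat", "hotpot"])
pseudo_labeled = filter_pseudo_labels(teacher_probs, threshold=0.9)
```

Only `video_a` and `video_c` pass the filter; the ambiguous `video_b` is left out of the student's training set rather than risk injecting label noise.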
Model Iteration
For a target tag (e.g., “food exploration”), a small set of positive samples is manually annotated, then the base model is fine‑tuned. Offline pre‑filtering reduces the number of videos an annotator must examine from hundreds to a few, dramatically improving labeling efficiency.
Online, high‑confidence predictions are auto‑recycled into training data; low‑confidence or noisy predictions are filtered by confidence‑learning or sent to human reviewers in an active‑learning loop.
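One way to picture this loop is a triage function that auto‑labels confident predictions and queues the most uncertain ones for annotators. The sketch below assumes a single binary tag, a hypothetical confidence threshold of 0.95, and plain uncertainty sampling (distance from 0.5); it does not implement the confidence‑learning noise filter mentioned above.

```python
def triage_predictions(scores: dict[str, float],
                       high: float = 0.95,
                       review_budget: int = 2):
    """Split predictions into auto-labeled samples and a human review queue.

    scores maps a video id to the predicted probability of the target tag.
    """
    auto_labels = {}
    uncertain = []
    for video_id, p in scores.items():
        if p >= high:
            auto_labels[video_id] = True       # confident positive
        elif p <= 1.0 - high:
            auto_labels[video_id] = False      # confident negative
        else:
            uncertain.append(video_id)
    # Uncertainty sampling: predictions closest to 0.5 are reviewed first.
    uncertain.sort(key=lambda video_id: abs(scores[video_id] - 0.5))
    return auto_labels, uncertain[:review_budget]


# Hypothetical online predictions for the tag "food exploration".
scores = {"v1": 0.99, "v2": 0.52, "v3": 0.02, "v4": 0.70, "v5": 0.45}
auto_labels, review_queue = triage_predictions(scores)
```

The review budget caps annotator workload per round; samples that are neither confident nor selected simply wait for a later iteration.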
Applications of Theme Tags
In the Dianping app, videos tagged “food exploration” are selected for the “达人探店” (influencer store visits) tab, giving users immersive previews and merchants additional exposure.
Fine‑Grained Food Recognition
A stacked global‑local attention network distinguishes visually similar dishes, achieving significant improvements. Cross‑domain differences between static food images and video frames are addressed with nuclear‑norm maximization and knowledge distillation. The same technology supported the Large‑Scale Fine‑Grained Food Analysis competition (ICCV 2021), which covered 1,500 Chinese dish classes. Competition details: https://foodai-workshop.meituan.com/foodai2021.html#index. The method was published at ACM MM (ISIA Food‑500): https://dl.acm.org/doi/10.1145/3394171.3414031.
Fine‑Grained Tag‑Driven Cover Selection
When a user searches “hotpot”, the system extracts key frames, filters by quality, and uses fine‑grained dish tags to choose a cover that matches the query, improving search experience compared with generic person‑centric covers.
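The selection logic can be sketched as a filter followed by a ranking. This is an illustrative sketch only: the `KeyFrame` fields, the 0.6 quality cutoff, and the fallback to the best generic frame are assumptions, and the per‑frame quality scores and dish tags are taken as given.

```python
from dataclasses import dataclass, field


@dataclass
class KeyFrame:
    index: int
    quality: float  # hypothetical quality score in [0, 1]
    dish_tags: dict[str, float] = field(default_factory=dict)  # tag -> confidence


def choose_cover(frames: list[KeyFrame], query: str, min_quality: float = 0.6):
    """Pick the frame that best matches the query among good-quality frames."""
    candidates = [f for f in frames if f.quality >= min_quality]
    if not candidates:
        return None
    matching = [f for f in candidates if query in f.dish_tags]
    if matching:
        # Prefer the most confident tag match, then the higher quality.
        return max(matching, key=lambda f: (f.dish_tags[query], f.quality))
    # No frame matches the query: fall back to the best generic frame.
    return max(candidates, key=lambda f: f.quality)


frames = [
    KeyFrame(0, 0.90),                     # sharp, but shows only a person
    KeyFrame(1, 0.40, {"hotpot": 0.99}),   # matches, but too blurry
    KeyFrame(2, 0.80, {"hotpot": 0.85}),
    KeyFrame(3, 0.70, {"hotpot": 0.60}),
]
cover = choose_cover(frames, "hotpot")
```

Here frame 2 is chosen for the query "hotpot": frame 1 has the strongest tag but fails the quality filter, and frame 0 carries no matching dish tag.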
Richer Video Segment Tag Mining
Joint modeling of visual and textual modalities from user notes discovers additional segment tags, revealing long‑tail distributions and fine‑grained concepts such as “silk‑painting scarf”.
Short Video Content Generation
Image‑to‑Video (Restaurant)
Given an image album, the pipeline removes low‑quality pictures, performs content and quality analysis, crops images based on aesthetic scores, and applies Ken‑Burns and transition effects to produce a polished food video.
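The Ken Burns effect amounts to animating a crop window over a still image. A minimal sketch of that idea, assuming linear interpolation between a start and an end crop box given as `(x, y, width, height)`; the function name and the sample coordinates are hypothetical, and rendering the crops to video frames is left out.

```python
def ken_burns_crops(start, end, num_frames):
    """Linearly interpolate crop boxes (x, y, w, h) from start to end."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    crops = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        crops.append(tuple(round(s + (e - s) * t) for s, e in zip(start, end)))
    return crops


# Slow zoom from the full 1920x1080 image into its central quarter.
crops = ken_burns_crops((0, 0, 1920, 1080), (480, 270, 960, 540), num_frames=5)
```

Each crop would then be resized to the output resolution; keeping the start and end boxes at the same aspect ratio avoids distortion during the zoom.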
Image‑to‑Video (Hotel)
Hotel albums are converted into preview videos by following designer‑provided script templates; the algorithm selects images that match template slots and adds audio and transitions.
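Matching images to template slots can be sketched as a greedy assignment. The example below is an assumption‑laden illustration: the slot names, the `(image_id, category, quality)` tuples, and the rule "best unused image per slot" are hypothetical, and the image categories and quality scores are taken as given.

```python
def fill_template(slots, images):
    """Assign the best unused image of the right category to each slot.

    slots:  ordered list of category names from the script template.
    images: list of (image_id, category, quality) tuples.
    Returns a list of (slot, image_id or None) in template order.
    """
    used = set()
    plan = []
    for slot in slots:
        candidates = [img for img in images
                      if img[1] == slot and img[0] not in used]
        if not candidates:
            plan.append((slot, None))  # no suitable image for this slot
            continue
        best = max(candidates, key=lambda img: img[2])
        used.add(best[0])
        plan.append((slot, best[0]))
    return plan


template = ["exterior", "lobby", "room", "room"]
album = [
    ("a", "room", 0.90),
    ("b", "exterior", 0.70),
    ("c", "room", 0.60),
    ("d", "pool", 0.95),
    ("e", "exterior", 0.80),
]
plan = fill_template(template, album)
```

Unfilled slots are reported as `None` so that a caller can skip them or fall back to another template.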
Video‑to‑Video (Clip Extraction)
Long videos are segmented into candidate clips using temporal segmentation, then ranked by (1) generic quality (clarity, aesthetic score) and (2) semantic relevance (e.g., dish presentation). The top clips become smart covers or short highlights.
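The two‑factor ranking can be sketched as a weighted score over candidate clips. The weights, the `quality` and `relevance` fields, and the sample clips below are hypothetical; the source does not say how the two criteria are combined, so a linear combination is assumed here purely for illustration.

```python
def rank_clips(clips, quality_weight=0.4, relevance_weight=0.6, top_k=2):
    """Return the time spans of the top_k clips by weighted score."""
    def score(clip):
        return (quality_weight * clip["quality"]
                + relevance_weight * clip["relevance"])

    ranked = sorted(clips, key=score, reverse=True)
    return [clip["span"] for clip in ranked[:top_k]]


# Candidate clips as (start, end) spans in seconds with precomputed scores.
clips = [
    {"span": (0, 5), "quality": 0.90, "relevance": 0.20},
    {"span": (5, 12), "quality": 0.70, "relevance": 0.90},
    {"span": (12, 20), "quality": 0.50, "relevance": 0.80},
    {"span": (20, 26), "quality": 0.95, "relevance": 0.10},
]
highlights = rank_clips(clips)
```

With these weights a sharp but off‑topic clip such as `(20, 26)` loses to a slightly softer clip that actually shows the dish.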
Pixel‑Level Editing
Semantic segmentation underlies pixel‑level effects. Meituan improved BiSeNet in three ways: adding detail guidance at Stage 3, supervised by detail ground truth derived from the segmentation labels with a Laplacian operator; training that detail branch with a combined Dice + BCE loss; and adopting a lightweight STDCNet backbone. Experiments show better preservation of high‑frequency details.
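The two ingredients of the detail supervision can be illustrated in plain Python. This is a single‑scale simplification under stated assumptions: a binary mask, a 3x3 Laplacian kernel with border clamping, positive responses kept as detail pixels, and a Dice term with a smoothing constant of 1. The function names are hypothetical, and the real training code would operate on tensors.

```python
import math

LAPLACIAN = ((-1, -1, -1), (-1, 8, -1), (-1, -1, -1))


def detail_ground_truth(mask, threshold=0):
    """Mark pixels whose Laplacian response on the label mask is positive."""
    h, w = len(mask), len(mask[0])
    detail = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            response = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp at the border
                    xx = min(max(x + dx, 0), w - 1)
                    response += LAPLACIAN[dy + 1][dx + 1] * mask[yy][xx]
            detail[y][x] = 1 if response > threshold else 0
    return detail


def dice_bce_loss(pred, target, smooth=1.0):
    """Sum of binary cross-entropy and Dice loss over all pixels."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    clipped = [min(max(v, 1e-7), 1 - 1e-7) for v in p]  # avoid log(0)
    bce = -sum(ti * math.log(pi) + (1 - ti) * math.log(1 - pi)
               for pi, ti in zip(clipped, t)) / len(p)
    intersection = sum(pi * ti for pi, ti in zip(p, t))
    dice = 1 - (2 * intersection + smooth) / (sum(p) + sum(t) + smooth)
    return bce + dice


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
detail = detail_ground_truth(mask)
perfect = dice_bce_loss([[float(v) for v in row] for row in detail], detail)
inverted = dice_bce_loss([[1.0 - v for v in row] for row in detail], detail)
```

Detail pixels are a small fraction of the image, which is the usual motivation for pairing BCE with Dice: the Dice term is insensitive to the large number of easy background pixels.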
Conclusion and Outlook
The techniques—video tagging, fine‑grained recognition, intelligent cover generation, and pixel‑level editing—demonstrate how AI can enrich information display for consumers and merchants. Future work will explore multimodal self‑supervised training to reduce annotation dependence and improve generalization across diverse business scenarios.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
