Short Video Content Understanding and Generation Practices at Meituan

Meituan uses computer-vision techniques to tag, analyze, and automatically generate short videos across consumer and merchant scenarios. This article covers hierarchical tag design, self-supervised representation learning, fine-grained food recognition, intelligent cover creation, and pixel-level editing, all aimed at improving content discovery and presentation.

Meituan Technology Team

Apr 14, 2022

Background

Meituan has accumulated massive amounts of video data from its local lifestyle-services e-commerce scenarios. Short videos carry richer information than text or images, enabling more diverse content presentation for users and merchants.

Short Video Content Understanding

Video Tagging

Goal: summarize the key concepts in a video so that its content is no longer a “black box” to downstream applications. Tags take two forms: explicit textual tags produced by classification models, and implicit vector embeddings consumed by recommendation or search.

Explicit tags are organized into a hierarchical taxonomy (theme → scene → fine‑grained entity) to support operations such as high‑value content selection.

Example: a food-exploration video carries three levels of tags: theme “food exploration”, scene “indoor” or “outdoor”, and entity “Kung Pao chicken”. Defining useful tags requires joint input from the product, operations, and algorithm teams, and it poses visual modeling challenges.
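The three-level taxonomy can be pictured as a small record per video. The sketch below is only an illustration of that structure: the `VideoTags` class, its field names, and the `select_by_theme` helper are hypothetical and are not taken from Meituan's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoTags:
    """Hypothetical container for the theme -> scene -> entity hierarchy."""
    theme: str
    scenes: tuple
    entities: tuple


def select_by_theme(videos, theme):
    """Return the ids of videos whose top-level theme matches `theme`."""
    return [video_id for video_id, tags in videos.items() if tags.theme == theme]


# Toy catalogue with made-up video ids.
videos = {
    "v1": VideoTags("food exploration", ("indoor",), ("Kung Pao chicken",)),
    "v2": VideoTags("travel", ("outdoor",), ()),
}
selected = select_by_theme(videos, "food exploration")
```

Keeping the theme as a single top-level field makes operations such as high-value content selection a simple filter, while scenes and entities remain available for finer matching.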

Base Representation Learning

A two-pronged solution improves (1) generic base representations and (2) label-specific classification performance. Self-supervised pre-training on Meituan video data yields features that match the distribution of Meituan's business data better than features from models pre-trained on public datasets.

Weakly supervised signals are mined from user reviews: a review that mentions “grilled meat” automatically labels the associated video as “grilled meat”. A teacher model generates pseudo-labels, which are filtered by confidence and combined with incremental data to train a student model. Iterative data updates provide larger gains than architecture changes.
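The confidence-filtering step of the teacher-student loop can be sketched as follows. This is a minimal illustration under assumptions: the 0.9 threshold, the function names, and the dictionary layout of the teacher scores are all hypothetical, since the article does not specify them.

```python
def filter_pseudo_labels(teacher_scores, threshold=0.9):
    """Keep (video_id, label) pairs whose top teacher score meets the threshold.

    `teacher_scores` maps a video id to a {label: probability} dict.
    The threshold value is an assumption, not a figure from the article.
    """
    kept = []
    for video_id, scores in teacher_scores.items():
        label, confidence = max(scores.items(), key=lambda item: item[1])
        if confidence >= threshold:
            kept.append((video_id, label))
    return kept


def build_student_training_set(labeled, teacher_scores, threshold=0.9):
    """Combine existing labels with the confident pseudo-labels."""
    return list(labeled) + filter_pseudo_labels(teacher_scores, threshold)


# Toy teacher outputs for two unlabeled videos.
teacher_scores = {
    "a": {"grilled meat": 0.95, "hotpot": 0.05},
    "b": {"grilled meat": 0.55, "hotpot": 0.45},
}
pseudo = filter_pseudo_labels(teacher_scores)
training_set = build_student_training_set([("c", "hotpot")], teacher_scores)
```

Only video "a" survives the filter here; the ambiguous video "b" is left out of the student's training data rather than risking a noisy label.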

Model Iteration

For a target tag (e.g., “food exploration”), a small set of positive samples is manually annotated, then the base model is fine‑tuned. Offline pre‑filtering reduces the number of videos an annotator must examine from hundreds to a few, dramatically improving labeling efficiency.

Online, high‑confidence predictions are auto‑recycled into training data; low‑confidence or noisy predictions are filtered by confidence‑learning or sent to human reviewers in an active‑learning loop.
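One way to picture this routing is a split between auto-accepted predictions and a review queue ordered by uncertainty. The sketch below assumes binary tag probabilities, an acceptance threshold of 0.95, and a fixed review budget; none of these values come from the article, and the confidence-learning filter it mentions is not modeled here.

```python
def split_predictions(predictions, accept_at=0.95, review_budget=2):
    """Split (video_id, probability) pairs into auto-recycled and review sets.

    Predictions at or above `accept_at` are recycled into training data.
    The rest are ranked by closeness to 0.5 (most uncertain first), and the
    top `review_budget` items are sent to human reviewers.
    """
    auto_recycled = [(vid, p) for vid, p in predictions if p >= accept_at]
    uncertain = [(vid, p) for vid, p in predictions if p < accept_at]
    uncertain.sort(key=lambda item: abs(item[1] - 0.5))
    return auto_recycled, uncertain[:review_budget]


# Toy online predictions for the tag "food exploration".
predictions = [("v1", 0.99), ("v2", 0.52), ("v3", 0.10), ("v4", 0.70)]
auto_recycled, review_queue = split_predictions(predictions)
```

Ranking by distance from 0.5 is plain uncertainty sampling, chosen here only because it is the simplest active-learning criterion to show.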

Applications of Theme Tags

In the Dianping app, videos tagged with “food exploration” are selected for the “Influencer Store Visits” (达人探店) tab, providing immersive previews for users and exposure for merchants.

Fine‑Grained Food Recognition

A stacked global-local attention network distinguishes visually similar dishes, achieving significant improvements in recognition. Cross-domain differences between static food images and video frames are addressed with nuclear-norm maximization and knowledge distillation. The same technology supported the Large-Scale Fine-Grained Food Analysis competition (ICCV 2021), which covered 1,500 Chinese dish classes. Competition details: https://foodai-workshop.meituan.com/foodai2021.html#index. The method was published at ACM MM (ISIA Food-500): https://dl.acm.org/doi/10.1145/3394171.3414031.
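The article does not say which distillation objective is used. As a rough illustration of the idea, the sketch below implements the widely used temperature-softened KL-divergence loss between teacher and student logits; the temperature of 2.0 and the function names are assumptions, and the domain-adaptation regularizer is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return kl * temperature ** 2


# Teacher logits from an image-domain model, student logits on a video frame.
teacher_logits = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher_logits)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, which is what lets an image-trained model guide a model applied to video frames.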

Fine‑Grained Tag‑Driven Cover Selection

When a user searches “hotpot”, the system extracts key frames, filters by quality, and uses fine‑grained dish tags to choose a cover that matches the query, improving search experience compared with generic person‑centric covers.
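The selection logic described above can be sketched in a few lines. The frame records, the `quality` score in [0, 1], the 0.5 quality cutoff, and the fallback to the best generic frame are illustrative assumptions rather than details from the article.

```python
def select_cover(frames, query, min_quality=0.5):
    """Pick a cover frame for a search query.

    Frames below `min_quality` are dropped. Among the rest, frames whose
    fine-grained tags contain the query are preferred; if none match, the
    highest-quality remaining frame is used as a fallback.
    """
    candidates = [f for f in frames if f["quality"] >= min_quality]
    if not candidates:
        return None
    matching = [f for f in candidates if query in f["tags"]]
    pool = matching or candidates
    return max(pool, key=lambda f: f["quality"])


# Toy key frames extracted from one video.
frames = [
    {"id": 0, "quality": 0.9, "tags": {"person"}},
    {"id": 1, "quality": 0.7, "tags": {"hotpot"}},
    {"id": 2, "quality": 0.3, "tags": {"hotpot"}},
]
cover = select_cover(frames, "hotpot")
fallback = select_cover(frames, "sushi")
```

For the query "hotpot" the sharper person-centric frame loses to the lower-scoring frame that actually shows the dish, which is the behavior the section describes.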

Richer Video Segment Tag Mining

Joint modeling of visual and textual modalities from user notes discovers additional segment tags, revealing long‑tail distributions and fine‑grained concepts such as “silk‑painting scarf”.

Short Video Content Generation

Image‑to‑Video (Restaurant)

Given an image album, the pipeline removes low‑quality pictures, performs content and quality analysis, crops images based on aesthetic scores, and applies Ken‑Burns and transition effects to produce a polished food video.
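A Ken Burns effect is a slow pan and zoom over a still image, which amounts to interpolating a crop window over time. The sketch below shows only that interpolation, assuming linear motion and crop rectangles given as (x, y, width, height); the function name and the example rectangles are hypothetical.

```python
def ken_burns_crops(start, end, num_frames):
    """Linearly interpolate a crop rectangle from `start` to `end`.

    Each rectangle is (x, y, width, height) in pixels. Returns one
    rectangle per output frame, including both endpoints.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    crops = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        crops.append(tuple(a + (b - a) * t for a, b in zip(start, end)))
    return crops


# Zoom from the full 1920x1080 image into its central quarter.
crops = ken_burns_crops((0, 0, 1920, 1080), (480, 270, 960, 540), 3)
```

Each crop would then be resized to the output resolution to form one video frame; an easing curve could replace the linear `t` for smoother motion.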

Image‑to‑Video (Hotel)

Hotel albums are converted into preview videos by following designer‑provided script templates; the algorithm selects images that match template slots and adds audio and transitions.
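Template-driven selection can be thought of as filling ordered slots with the best unused image of the required category. The sketch below assumes each image already has a predicted `category` and a quality `score`; those fields, the slot names, and the greedy strategy are illustrative assumptions.

```python
def fill_template(slots, images):
    """Greedily assign the best unused image to each template slot.

    `slots` is an ordered list of required categories. Returns a list of
    (slot, image_id) pairs, with None when no image fits a slot.
    """
    used = set()
    plan = []
    for slot in slots:
        candidates = [
            img for img in images
            if img["category"] == slot and img["id"] not in used
        ]
        if not candidates:
            plan.append((slot, None))
            continue
        best = max(candidates, key=lambda img: img["score"])
        used.add(best["id"])
        plan.append((slot, best["id"]))
    return plan


# Hypothetical script template and a small hotel album.
slots = ["exterior", "lobby", "room", "room"]
images = [
    {"id": "e1", "category": "exterior", "score": 0.8},
    {"id": "r1", "category": "room", "score": 0.9},
    {"id": "r2", "category": "room", "score": 0.6},
]
plan = fill_template(slots, images)
```

An unfilled slot (the lobby here) is reported explicitly so a caller can drop the slot or choose a different template.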

Video‑to‑Video (Clip Extraction)

Long videos are segmented into candidate clips using temporal segmentation, then ranked by (1) generic quality (clarity, aesthetic score) and (2) semantic relevance (e.g., dish presentation). The top clips become smart covers or short highlights.
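The two stages can be sketched as a cut-point detector followed by a weighted ranking. Both parts are simplified assumptions: the article does not describe its temporal segmentation method, so a plain threshold on frame-to-frame dissimilarity stands in for it, and the equal weighting of quality and relevance is arbitrary.

```python
def segment_by_change(frame_diffs, threshold):
    """Split a video into (first_frame, last_frame) spans at large changes.

    `frame_diffs[i]` is the dissimilarity between frame i and frame i + 1,
    so a video with n frames has n - 1 entries.
    """
    segments = []
    start = 0
    for i, diff in enumerate(frame_diffs):
        if diff > threshold:
            segments.append((start, i))
            start = i + 1
    segments.append((start, len(frame_diffs)))
    return segments


def rank_clips(clips, quality_weight=0.5, top_k=1):
    """Rank clips by a weighted sum of generic quality and semantic relevance."""
    def score(clip):
        return (quality_weight * clip["quality"]
                + (1.0 - quality_weight) * clip["relevance"])
    return sorted(clips, key=score, reverse=True)[:top_k]


# Seven frames with two abrupt changes, then toy scores per segment.
segments = segment_by_change([0.1, 0.9, 0.2, 0.1, 0.8, 0.1], threshold=0.5)
clips = [
    {"span": segments[0], "quality": 0.9, "relevance": 0.2},
    {"span": segments[1], "quality": 0.7, "relevance": 0.9},
    {"span": segments[2], "quality": 0.4, "relevance": 0.5},
]
best = rank_clips(clips)
```

The sharpest clip does not win here because it shows little of the dish; combining both scores is what keeps the highlight relevant.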

Pixel‑Level Editing

Semantic segmentation underlies pixel-level effects. Meituan improved BiSeNet in three ways: it added detail guidance at Stage 3, supervised by detail ground truth derived with a Laplacian operator; it trained that branch with a combined Dice + BCE loss; and it adopted a lightweight STDCNet backbone. Experiments show better preservation of high-frequency details.
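To make the detail supervision concrete, the sketch below derives a boundary map from a label mask with a 3x3 Laplacian kernel and scores a prediction with Dice plus binary cross-entropy. It is a single-scale simplification in pure Python under assumptions: the replicated border, the non-zero test for boundaries, and the smoothing constant are illustrative choices, not the published configuration.

```python
import math

LAPLACIAN = ((-1, -1, -1), (-1, 8, -1), (-1, -1, -1))


def detail_ground_truth(mask):
    """Mark pixels where the Laplacian response of the label mask is non-zero."""
    h, w = len(mask), len(mask[0])
    detail = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            response = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # replicate the border
                    xx = min(max(x + dx, 0), w - 1)
                    response += LAPLACIAN[dy + 1][dx + 1] * mask[yy][xx]
            detail[y][x] = 1 if response != 0 else 0
    return detail


def dice_bce_loss(pred, target, eps=1.0, clamp=1e-7):
    """Sum of Dice loss and mean binary cross-entropy over all pixels."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(pi * ti for pi, ti in zip(p, t))
    denom = sum(pi * pi for pi in p) + sum(ti * ti for ti in t) + eps
    dice = 1.0 - (2.0 * intersection + eps) / denom
    bce = 0.0
    for pi, ti in zip(p, t):
        pi = min(max(pi, clamp), 1.0 - clamp)  # avoid log(0)
        bce -= ti * math.log(pi) + (1 - ti) * math.log(1.0 - pi)
    return dice + bce / len(p)


# A 4x4 mask whose right half is foreground: the boundary is the middle columns.
mask = [[0, 0, 1, 1] for _ in range(4)]
detail = detail_ground_truth(mask)
perfect = dice_bce_loss([[float(v) for v in row] for row in detail], detail)
inverted = dice_bce_loss([[1.0 - v for v in row] for row in detail], detail)
```

Boundary pixels are a small fraction of an image, so a region-overlap term such as Dice complements BCE, which on its own is dominated by the many easy non-boundary pixels.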

Conclusion and Outlook

The techniques—video tagging, fine‑grained recognition, intelligent cover generation, and pixel‑level editing—demonstrate how AI can enrich information display for consumers and merchants. Future work will explore multimodal self‑supervised training to reduce annotation dependence and improve generalization across diverse business scenarios.
