
Short Video Analysis in Local Life Scenarios: Techniques and Practices at Meituan

This article presents Meituan's AI-driven short video analysis workflow, covering industry trends, multi‑label video classification, intelligent cover selection, and video generation techniques, while discussing challenges, model building, label expansion, continuous data iteration, and future outlook for video AI in local services.

DataFunTalk

Background – The rapid growth of video content, driven by advances in hardware and software, has produced massive volumes of information on both the user and content sides. AI algorithms can add value across the creation, review, editing, and distribution of videos, especially in the local-life scenarios where Meituan operates.

Video Analysis Practices at Meituan

1. Multi‑label Video Classification – Traditional metadata‑only approaches are insufficient; Meituan builds a tag‑based understanding of video content to support operations, user profiling, search, recommendation, and advertising. Challenges include constructing a robust label taxonomy, ensuring accuracy and coverage, and enabling incremental learning. The initial model leverages the public YouTube‑8M dataset with aggregation‑style architectures, followed by a semi‑supervised teacher‑student pipeline to adapt to Meituan’s domain.
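The core modeling idea here — aggregate frame-level features into a clip representation, then score every label independently so one video can carry several tags — can be sketched as below. The mean-pooling aggregation and logistic per-label head are a generic illustration of the multi-label setup, not Meituan's actual architecture (the talk only describes it as "aggregation-style" on YouTube-8M features):

```python
import numpy as np

def classify_video(frame_features: np.ndarray,
                   weights: np.ndarray,
                   bias: np.ndarray,
                   threshold: float = 0.5) -> np.ndarray:
    """Multi-label video tagging sketch.

    frame_features: (T, D) per-frame features from a visual backbone.
    weights: (D, L) one classifier column per label; bias: (L,).
    Returns a boolean mask of active labels.
    """
    clip = frame_features.mean(axis=0)      # (D,) mean-pooled clip feature
    logits = clip @ weights + bias          # (L,) one logit per label
    probs = 1.0 / (1.0 + np.exp(-logits))   # independent sigmoids, NOT softmax:
    return probs >= threshold               # several labels may fire at once
```

The key design choice is the independent sigmoid per label: unlike a softmax classifier, it lets a single video be simultaneously "food", "indoor", and "vlog", which is what a tag taxonomy for search and recommendation requires.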

Label System Expansion – Horizontal expansion uses feature clustering and manual refinement, while vertical refinement adds fine‑grained food and scene tags using specialized image classification models. Continuous data iteration is achieved through active learning, confidence‑based sampling, and weak supervision from multimodal (visual‑text) signals.
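The confidence-based sampling step of the iteration loop can be illustrated with a minimal sketch: send to annotators the videos whose predictions sit closest to the decision boundary. The 0.5 boundary and the min-over-labels margin are illustrative assumptions, not details from the talk:

```python
import numpy as np

def select_for_labeling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Confidence-based active-learning sampler.

    probs: (N, L) predicted per-label probabilities for N videos.
    Returns indices of the `budget` most uncertain videos, i.e. those
    with some label probability nearest the 0.5 decision boundary.
    """
    margin = np.abs(probs - 0.5).min(axis=1)   # smallest margin per video
    return np.argsort(margin)[:budget]         # lowest margin = most uncertain
```

In a continuous-iteration pipeline, these selected clips would be manually labeled and fed back into training, so annotation effort concentrates where the model is least certain.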

2. Intelligent Video Cover – Covers act as the thumbnail for videos. Two scenarios are addressed: (a) generic covers selected by importance metrics (clarity, motion, information density) using either end‑to‑end or interpretable scoring pipelines; (b) semantic covers aligned with user intent, derived from weakly supervised segment‑level tagging that combines visual and textual cues.
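The interpretable scoring pipeline for generic covers can be sketched as a weighted sum of per-frame importance signals. The gradient-based sharpness proxy, the standard-deviation proxy for information density, and the 0.7/0.3 weights below are all illustrative stand-ins for the clarity/motion/density metrics the article names:

```python
import numpy as np

def score_frames(frames: np.ndarray) -> np.ndarray:
    """Score candidate cover frames by interpretable quality signals.

    frames: (N, H, W) grayscale frames in [0, 1].
    """
    scores = []
    for f in frames:
        gy, gx = np.gradient(f)
        sharpness = (gx**2 + gy**2).var()   # clarity: blurry frames score low
        density = f.std()                   # crude information-density proxy
        scores.append(0.7 * sharpness + 0.3 * density)
    return np.array(scores)

def pick_cover(frames: np.ndarray) -> int:
    """Return the index of the highest-scoring frame."""
    return int(np.argmax(score_frames(frames)))
```

An end-to-end alternative would learn the score directly from click data; the hand-built version above trades some accuracy for the ability to explain why a frame was chosen.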

3. Video Generation – Automated generation of short promotional videos for merchants (e.g., restaurants, hotels) involves AI‑driven material selection, quality assessment, deduplication, aesthetic ranking, smart cropping, and motion rendering. The pipeline processes multi‑modal inputs (images, video, audio, text) and produces a cohesive output for distribution.

Summary and Outlook – With continued advances in AI, 5G, and multimodal learning, video will play an increasingly pivotal role in local‑life services. Future work will focus on unsupervised/self‑supervised learning and deeper multimodal understanding to unlock more value from massive video data.

Tags: computer vision, AI, video generation, video analysis, multi-label classification, Meituan, intelligent cover
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
