
Overview of Recent Meituan Visual Intelligence Research Papers on Content Production, Distribution, and Model Quantization

Meituan’s Visual Intelligence team recently had eight papers accepted at ACM MM and ECCV. They cover weakly supervised segmentation, future‑aware captioning, panoptic narrative grounding, video‑text retrieval, open‑vocabulary detection, counterfactual image‑text matching, zero‑shot video classification, and Vision Transformer quantization, with applications in content creation, distribution, and model efficiency.

Meituan Technology Team

Nov 17, 2022

Artificial intelligence is becoming a core engine for the content industry, with visual AI permeating content creation, review, distribution, user interaction, and monetization. Meituan's Visual Intelligence Department recently had eight papers accepted at top multimedia and computer‑vision conferences (ACM MM and ECCV). This article summarizes the research contributions and their practical applications.

Content Production

Adaptive Spatial‑BCE Loss for Weakly Supervised Semantic Segmentation (ECCV) – Authors: Wu Tong, Gao Guangyu, Huang Junshi, Wei Xiaoming, Wei Xiaolin, Liu Chi. The paper proposes a spatial binary cross‑entropy loss that assigns different optimization directions to foreground and background pixels, which yields pseudo‑labels with clearer contours. The authors report state‑of‑the‑art results on PASCAL VOC 2012 and MS‑COCO 2014 without complex post‑processing. In practice, the method supports efficient parsing of advertising material and generation of white‑background product images.
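To make the idea of per‑pixel optimization directions concrete, here is a minimal NumPy sketch. It assumes a single class activation map and a fixed threshold. The function name `spatial_bce` and the threshold value are illustrative assumptions, and the adaptive threshold described in the paper is not reproduced here.

```python
import numpy as np

def spatial_bce(prob_map, class_present, threshold=0.5, eps=1e-7):
    """Per-pixel BCE whose target depends on each pixel's side of a threshold.

    Illustrative only: the threshold is fixed here, not learned.
    """
    p = np.clip(prob_map, eps, 1.0 - eps)
    if class_present:
        # Pixels above the threshold are pushed toward 1 (foreground),
        # the remaining pixels toward 0 (background).
        target = (p > threshold).astype(p.dtype)
    else:
        # A class absent from the image label pushes every pixel toward 0.
        target = np.zeros_like(p)
    loss = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(loss.mean())

prob_map = np.array([[0.9, 0.8], [0.2, 0.1]])
loss_present = spatial_bce(prob_map, class_present=True)
loss_absent = spatial_bce(prob_map, class_present=False)
```

With this toy map the loss is small when the class is present, because confident pixels are already on the correct side of the threshold, and larger when the class is absent, because the two high‑probability pixels are penalized.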

Efficient Modeling of Future Context for Image Captioning (ACM MM) – Authors: Fei Zhengcong, Huang Junshi, Wei Xiaoming, Wei Xiaolin. The work integrates non‑autoregressive mask‑based modeling into autoregressive captioning, allowing the model to leverage future context without extra inference cost, improving caption quality for advertising copy and product descriptions.
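The contrast between the two modeling styles can be sketched with attention masks. The snippet below is an illustration under stated assumptions, not the paper's training recipe: it assumes the future‑aware signal reaches the autoregressive model through a distillation‑style KL term applied only at training time, which would leave inference cost unchanged. All function names are hypothetical.

```python
import numpy as np

def causal_mask(n):
    # Row i may attend to positions 0..i only (autoregressive decoding).
    return np.tril(np.ones((n, n), dtype=bool))

def masked_lm_mask(n, masked_positions):
    # Every position may attend to every unmasked token, past or future.
    visible = np.ones(n, dtype=bool)
    visible[list(masked_positions)] = False
    return np.tile(visible, (n, 1))

def kl_divergence(teacher_probs, student_probs, eps=1e-9):
    # KL(teacher || student) over one token's vocabulary distribution.
    t = np.clip(teacher_probs, eps, 1.0)
    s = np.clip(student_probs, eps, 1.0)
    return float(np.sum(t * (np.log(t) - np.log(s))))

# Toy distributions: the mask-based model (teacher) sees both sides,
# the autoregressive model (student) sees only the past.
teacher = np.array([0.7, 0.2, 0.1])
student = np.array([0.5, 0.3, 0.2])
distill_loss = kl_divergence(teacher, student)
```

The causal mask is the only one needed at inference time in this sketch; the bidirectional mask and the KL term would be used during training only.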

Content Distribution

PPMN: Pixel‑Phrase Matching Network for One‑Stage Panoptic Narrative Grounding (ACM MM) – Authors: Ding Zihan, Hui Tianrui, Huang Junshi, Wei Xiaoming, Wei Xiaolin, Liu Si. The single‑stage network directly matches each phrase with its corresponding pixels, overcoming the limitations of two‑stage methods and enabling fine‑grained multimodal alignment for user‑comment tagging and cross‑modal retrieval.
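Direct pixel‑phrase matching can be pictured as a similarity between every pixel feature and every phrase feature. The sketch below assumes both are already projected into a shared space of dimension D. The shapes, the random features, and the function name `pixel_phrase_match` are assumptions for illustration, not the PPMN architecture.

```python
import numpy as np

def pixel_phrase_match(pixel_feats, phrase_feats):
    """pixel_feats: (H, W, D); phrase_feats: (N, D).

    Returns (N, H, W) match probabilities, one map per phrase.
    """
    logits = np.einsum("hwd,nd->nhw", pixel_feats, phrase_feats)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

rng = np.random.default_rng(0)
pixel_feats = rng.normal(size=(4, 5, 8))   # placeholder image features
phrase_feats = rng.normal(size=(3, 8))     # placeholder phrase features
probs = pixel_phrase_match(pixel_feats, phrase_feats)
masks = probs > 0.5                        # one binary mask per phrase
```

Each phrase gets its own mask in a single pass, with no intermediate region proposals.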

Concept Propagation via Attentional Knowledge Graph Reasoning for Video‑Text Retrieval (ACM MM) – Authors: Fang Sheng, Wang Shuhui, Zhuo Junbao, Huang Qingming, Ma Bin, Wei Xiaoming, Wei Xiaolin. By incorporating hierarchical concept propagation guided by external knowledge, the method captures fine‑grained video‑text semantics and improves retrieval performance across multiple benchmarks.
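As a rough picture of attention‑guided propagation over a concept graph, the sketch below lets each concept aggregate features from its graph neighbours, weighted by scaled dot‑product attention. The graph, the feature sizes, and the function names are hypothetical, and the paper's hierarchical design is not reproduced.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def propagate_concepts(concept_feats, adjacency, steps=2):
    """concept_feats: (C, D); adjacency: (C, C) 0/1 matrix with self-loops."""
    h = concept_feats
    for _ in range(steps):
        scores = h @ h.T / np.sqrt(h.shape[1])
        # Non-neighbours get -inf so they receive zero attention weight.
        scores = np.where(adjacency > 0, scores, -np.inf)
        h = softmax(scores, axis=1) @ h
    return h

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
chain = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])  # hypothetical concept graph
propagated = propagate_concepts(feats, chain)
```

Self‑loops are required in this sketch so that every row has at least one finite score.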

PromptDet: Towards Open‑Vocabulary Detection using Uncurated Images (ECCV) – Authors: Feng Chengjian, Zhong Yujie, Xie Zequn, Chu Xiangxiang, Ren Haibing, Wei Xiaolin, Ma Lin. The approach aligns region proposals with the text encoder of a pretrained vision‑language model and uses prompt learning to detect novel categories without manual annotations for those categories.
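Once region features live in the same space as text embeddings, open‑vocabulary classification reduces to comparing each region with the embedding of every category name. The sketch below assumes that alignment has already happened. The text embeddings are placeholder vectors standing in for text‑encoder outputs, and `classify_regions` and the temperature value are illustrative assumptions.

```python
import numpy as np

def classify_regions(region_embs, text_embs, temperature=0.01):
    """Softmax over cosine similarities between regions and category names."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = r @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

categories = ["cat", "dog", "kite"]
text_embs = np.eye(3)  # placeholder for text-encoder outputs, one row per name
region_embs = np.array([[0.1, 0.0, 0.9], [0.8, 0.2, 0.0]])
probs = classify_regions(region_embs, text_embs)
labels = [categories[i] for i in probs.argmax(axis=1)]
```

Adding a new category in this setup only requires a new text embedding row, which is what makes the vocabulary open.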

Synthesizing Counterfactual Samples for Effective Image‑Text Matching (ACM MM) – Authors: Wei Hao, Wang Shuhui, Han Xinzhe, Xue Zhe, Ma Bin, Wei Xiaoming, Wei Xiaolin. The Counterfactual Matching (CFM) framework generates hard negative samples via causal reasoning, enhancing fine‑grained image‑text alignment and benefiting downstream tasks such as multimodal retrieval.
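One simple way to picture a counterfactual hard negative is to take an image, replace the region that matters most for the caption, and require the original to outscore the edited copy by a margin. The sketch below does this with toy region features. The selection rule (highest dot product with the text), the max‑pooling score, and all names are assumptions, not the CFM procedure.

```python
import numpy as np

def make_counterfactual(region_feats, text_emb, filler):
    """Swap the region most aligned with the text for an unrelated feature."""
    scores = region_feats @ text_emb
    key = int(np.argmax(scores))
    negative = region_feats.copy()
    negative[key] = filler
    return negative, key

def match_score(region_feats, text_emb):
    # Image-text score: best-matching region (a deliberately simple choice).
    return float(np.max(region_feats @ text_emb))

def hinge_loss(pos_score, neg_score, margin=0.2):
    return max(0.0, margin + neg_score - pos_score)

region_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
text_emb = np.array([1.0, 0.0])
negative, key = make_counterfactual(region_feats, text_emb, filler=np.zeros(2))
loss = hinge_loss(match_score(region_feats, text_emb),
                  match_score(negative, text_emb), margin=0.6)
```

Because the negative differs from the positive in only one region, it is much harder to separate than a randomly sampled image, which is the property a hard negative is meant to have.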

Zero‑Shot Video Classification with Appropriate Web and Task Knowledge Transfer (ACM MM) – Authors: Zhuo Junbao, Zhu Yan, Cui Shuhao, Wang Shuhui, Huang Qingming, Ma Bin, Wei Xiaoming, Wei Xiaolin. The paper builds attribute‑class relations from web‑collected images and external knowledge, using graph neural networks to enable accurate zero‑shot video classification.
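A common way to use a graph neural network for zero‑shot classification is to let one graph‑convolution step turn node embeddings into per‑class classifiers, so that unseen classes borrow information from connected nodes. The sketch below shows a single such layer. The graph, the random features, and the function names are hypothetical, and this is not the paper's network.

```python
import numpy as np

def normalize_adjacency(adj):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}.
    a = adj + np.eye(adj.shape[0])
    d = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d[:, None] * d[None, :]

def gcn_layer(node_feats, adj_norm, weight):
    # One graph-convolution step followed by ReLU.
    return np.maximum(adj_norm @ node_feats @ weight, 0.0)

# Hypothetical 3-node graph linking classes through shared attributes.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
node_feats = rng.normal(size=(3, 4))   # placeholder word/attribute embeddings
weight = rng.normal(size=(4, 5))       # would be learned on seen classes
classifiers = gcn_layer(node_feats, normalize_adjacency(adj), weight)

video_feat = rng.normal(size=5)        # placeholder video feature
predicted = int(np.argmax(classifiers @ video_feat))
```

The video is assigned to whichever node's classifier responds most strongly, including nodes for classes that had no training videos.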

Model Quantization

Towards Accurate Post‑Training Quantization for Vision Transformer (ACM MM) – Authors: Ding Yifu, Qin Haotong, Yan Qinghua, Chai Zhenhua, Liu Junjie, Wei Xiaolin, Liu Xianglong. The APQ‑ViT framework introduces block‑wise error calibration and a Matthew‑effect‑preserving quantization scheme for softmax outputs. The authors report near‑lossless 8‑bit quantization and much better accuracy retention at 4‑bit and 6‑bit precision for Vision Transformers.
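To see why lower bit widths are harder, the sketch below applies a generic uniform quantizer to random weights and measures the reconstruction error at 8, 6, and 4 bits. This is a plain min‑max quantizer used only for illustration. It does not implement APQ‑ViT's block‑wise calibration or its softmax scheme, and the function name is an assumption.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Uniform asymmetric fake-quantization of x to 2**bits levels."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels)  # integer grid
    return q * scale + lo                               # back to float

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)  # placeholder for a layer's weights
errors = {
    bits: float(np.mean((weights - quantize_dequantize(weights, bits)) ** 2))
    for bits in (8, 6, 4)
}
```

The mean squared error grows as the bit width shrinks, which is why calibration methods matter most in the 4‑bit and 6‑bit settings.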

Taken together, these papers cover multimodal understanding, generation, segmentation, detection, and model compression, and each is tied to a concrete content production or distribution scenario at Meituan.
