Overview of Meituan's Selected CVPR 2024 Papers and Online Sharing Event
This article highlights seven papers authored by Meituan's technical team and accepted at CVPR 2024—spanning OCR pre‑training, long‑tail semi‑supervised learning, visual AIGC, audio‑visual segmentation, open‑ended visual storytelling, and synthetic‑data‑enhanced object detection—and announces an online sharing session on June 27 at which four of the authors will present their work. We hope it offers insights and inspiration to researchers working in related fields.
CVPR (IEEE Conference on Computer Vision and Pattern Recognition) is one of the three top conferences in computer vision, alongside ICCV and ECCV. According to the 2022 Google Scholar ranking, CVPR ranks fourth among all academic venues, following Nature, NEJM, and Science.
On June 27 (Thursday), four paper authors will give online talks. Details and registration links are provided at the end of this article.
01. ODM: A Text‑Image Further Alignment Pre‑training Approach for Scene Text Detection and Spotting
Authors: Chen Duan, Pei Fu, Shan Guo, Qianyi Jiang, Xiaoming Wei (Meituan)
Download: PDF
Abstract: Recent text‑image joint pre‑training has achieved strong performance, but aligning textual prompts with corresponding text regions in OCR remains challenging. Existing Masked Image Modeling (MIM) or Masked Language Modeling (MLM) methods have limitations. This work proposes OCR‑Text Destylization Modeling (ODM), which converts diverse text styles in images into a unified style guided by textual prompts, improving alignment and reducing annotation cost. Experiments on multiple benchmarks show significant gains in scene text detection and end‑to‑end recognition.
02. BEM: Balanced and Entropy‑based Mix for Long‑Tailed Semi‑Supervised Learning
Authors: Hongwei Zheng, Linyuan Zhou, Han Li (SJTU), Jinming Su, Xiaoming Wei, Xiaoming Xu (Meituan)
Download: PDF
Abstract: Long‑tail semi‑supervised learning (LTSSL) suffers from class imbalance and uncertainty. BEM introduces a balanced‑and‑entropy‑based mixing strategy that re‑balances both data quantity and class uncertainty via a class‑balanced mix library and entropy‑driven sampling, loss, and selection. Experiments demonstrate consistent performance improvements across benchmarks.
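To make the class‑rebalancing idea concrete, here is a minimal sketch of mixup with inverse‑frequency partner sampling. This is not the paper's implementation: BEM additionally weights its mix bank by class entropy and applies entropy‑based loss and sample selection, and the `class_balanced_mixup` function, its `alpha` default, and the inverse‑frequency weights below are simplifying assumptions for illustration only.

```python
import numpy as np

def class_balanced_mixup(images, labels, num_classes, alpha=0.8, seed=0):
    """Mix each sample with a partner drawn with inverse-frequency
    probability (rare classes are picked more often), then blend with
    a Beta-sampled coefficient as in standard mixup."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    # Per-sample weight: samples from rare classes get higher probability.
    weights = 1.0 / counts[labels]
    weights /= weights.sum()
    partners = rng.choice(len(images), size=len(images), p=weights)
    # One mixing coefficient per image, broadcast over C, H, W.
    lam = rng.beta(alpha, alpha, size=(len(images), 1, 1, 1))
    mixed = lam * images + (1.0 - lam) * images[partners]
    return mixed, labels, labels[partners], lam.reshape(-1)
```

In training, the loss on a mixed batch would typically be `lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b)`; the entropy‑driven components described in the abstract would replace the simple frequency weights shown here.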
03. Animating General Image with Large Visual Motion Model (LVMM)
Authors: Dengsheng Chen, Xiaoming Wei, Xiaolin Wei (Meituan)
Abstract: Traditional optical‑flow‑based image animation is limited to specific scenarios. LVMM leverages diffusion models to predict complex scene motion. It consists of a neural rendering network (R), a flow prediction network (P), compression/reconstruction networks (E, D), and a latent diffusion model (e). The three‑stage training pipeline enables realistic motion synthesis for arbitrary scenes.
04. CustomListener: Text‑guided Responsive Interaction for User‑friendly Listening Head Generation
Authors: Xi Liu*, Ying Guo*, Cheng Zhen, Tong Li, Yingying Ao, Pengfei Yan (Meituan)
Download: PDF
Abstract: This work proposes CustomListener, allowing users to define Listener attributes via free‑form text (identity, personality, habits, relationships). The system combines the textual description with Speaker content to generate realistic Listener responses, using a ChatGPT‑derived static prior, a responsive interaction module (SDP), and a progressive generation module (PGG) for long‑video coherence.
05. Cooperation Does Matter: Exploring Multi‑Order Bilateral Relations for Audio‑Visual Segmentation
Authors: Qi Yang, Xing Nie, Tong Li (Meituan), Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan (Meituan), Shiming Xiang (UCAS, CASIA)
Download: PDF
Abstract: Introduces the audio‑visual segmentation (AVS) task, which requires pixel‑level segmentation of sounding objects. The proposed COMBO framework models three bilateral entanglements—pixel, modality, and temporal—using twin encoders, bilateral fusion, and adaptive frame‑consistency loss, achieving state‑of‑the‑art results on AVSBench datasets.
06. Intelligent Grimm – Open‑ended Visual Storytelling via Latent Diffusion Models
Authors: Chang Liu*, Haoning Wu*, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, Weidi Xie (SJTU, Shanghai AI Lab)
Download: PDF
Abstract: Proposes StoryGen, a learning‑based autoregressive image generation model with a visual‑language context module for open‑ended visual storytelling. A large‑scale paired image‑text dataset (StorySalon) is constructed to train the model, which demonstrates superior performance in generating coherent image sequences with consistent characters.
07. InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
Authors: Chengjian Feng, Yujie Zhong, Zequn Jie (Meituan), Weidi Xie (SJTU), Lin Ma (Meituan)
Download: PDF
Abstract: Presents InstaGen, a synthetic data generation paradigm that integrates a grounding head into a pretrained diffusion model to produce instance‑level annotated images. The generated data improves object detection performance in both open‑vocabulary (+4.5 AP) and data‑scarce scenarios (+1.2~5.2 AP).
For more details and registration for the online sharing sessions, please visit the provided links.
Meituan Technology Team
Over 10,000 engineers powering China's leading lifestyle services e‑commerce platform, serving hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
