
Xiaohongshu Team’s Four ICCV 2023 Papers on Open‑Vocabulary Video Instance Segmentation, One‑Shot 3D Avatar Learning, Test‑Time Personalized Human Pose Forecasting, and MPI‑Flow for Realistic Optical Flow

The Xiaohongshu technical team had four papers accepted at ICCV 2023, including one oral presentation. The works introduce an open-vocabulary video instance segmentation benchmark and model, a one-shot neural-radiance-field avatar method, a test-time personalized 3D pose forecasting framework, and an MPI-based technique for generating realistic optical-flow training data, all achieving state-of-the-art performance.


Recently, the ICCV 2023 paper acceptance results were announced. The Xiaohongshu technical team had four papers accepted, including one Oral paper (the Oral acceptance rate was only 1.88%). The research topics span video instance segmentation, 3D digital human reconstruction, human motion prediction, and optical flow estimation.

ICCV, organized by the IEEE, is one of the three top conferences in computer vision and is held every two years; the 2023 edition will take place in Paris, France. A total of 8,068 papers were submitted worldwide and 2,161 were accepted, for an overall acceptance rate of 26.78%.

Below are the highlights of the four selected papers.

Towards Open-Vocabulary Video Instance Segmentation (Oral)

Authors: Wang Haochen (Xiaohongshu intern & University of Amsterdam), Lei Ge (Xiaohongshu), Tang Shen (Xiaohongshu), Xia Hou (Xiaohongshu), and others.

We are the first to extend video instance segmentation from a small set of closed-set categories to unrestricted open-vocabulary categories. We introduce a new benchmark, LV-VIS, containing precise annotations for 1,196 diverse categories, and provide a baseline model, OV2Seg, based on a memory-guided Transformer architecture that runs at near-real-time speed and demonstrates strong zero-shot generalization to unseen classes.
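As a rough illustration of the open-vocabulary step, the sketch below scores per-instance query embeddings against text embeddings of arbitrary category names (e.g. from a frozen CLIP text encoder) and keeps a simple momentum memory so instances stay consistent across frames. All function and variable names are our own illustrative assumptions, not OV2Seg's released code.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(query_embeds, text_embeds, temperature=0.07):
    """Score each instance query against text embeddings of arbitrary
    category names. Because classes are just text prompts, new categories
    can be added at inference time without retraining.

    query_embeds: (num_queries, dim) per-instance visual embeddings
    text_embeds:  (num_classes, dim) one embedding per category name
    Returns logits of shape (num_queries, num_classes).
    """
    q = F.normalize(query_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    return q @ t.T / temperature

def update_memory(memory, query_embeds, momentum=0.9):
    """Toy long-term memory update in the spirit of a memory-guided
    Transformer: smooth per-instance embeddings over frames so the same
    object stays consistently identified across the video."""
    return momentum * memory + (1.0 - momentum) * query_embeds
```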

One-shot Implicit Animatable Avatars with Model-based Priors

Authors: Huang Yangyi (Xiaohongshu intern & Zhejiang University), Wang Haofan (Xiaohongshu), Zhang Debing (Xiaohongshu), and others.

We propose ELICIT, a method that learns a person-specific neural radiance field from a single image by leveraging two priors: a 3D geometric prior from the SMPL body model and a visual-semantic prior from a pre-trained CLIP model. A segmentation-based sampling strategy further refines local details. Extensive experiments on ZJU-MoCap, Human3.6M, and DeepFashion show that ELICIT outperforms existing baselines when only one image is available.
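The two priors can be pictured as two loss terms on the radiance field. Below is a minimal, hedged formulation under our own assumptions (the loss shapes, an SDF-based geometry term, and the encode_image call from the openai CLIP package); it is a sketch of the idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_semantic_loss(clip_model, rendered_views, reference_image):
    """Visual-semantic prior: pull CLIP embeddings of rendered novel views
    toward the embedding of the single reference photo, so regions unseen
    in the input stay semantically consistent with the subject."""
    z_r = F.normalize(clip_model.encode_image(rendered_views), dim=-1)
    z_ref = F.normalize(clip_model.encode_image(reference_image), dim=-1)
    return 1.0 - (z_r * z_ref).sum(dim=-1).mean()

def smpl_geometry_loss(pred_density_logits, smpl_sdf_values):
    """3D geometric prior (one plausible formulation): push the field's
    density toward occupancy of the SMPL body, using the sign of a
    signed-distance query against the SMPL mesh at sampled 3D points."""
    inside_body = (smpl_sdf_values < 0).float()
    return F.binary_cross_entropy_with_logits(pred_density_logits, inside_body)
```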

Test-time Personalizable Forecasting of 3D Human Poses

Authors: Cui Qiongjie (Xiaohongshu intern & Nanjing University of Science & Technology), Wang Haofan (Xiaohongshu), and others.

We introduce Helper-Predictor (H/P-TTP), a test-time personalization framework that adapts a 3D pose forecasting model to unseen subjects without additional training data. The system combines explicit and implicit enhancers (noise injection and adversarial data generation) to obtain subject-specific parameters, achieving significant accuracy gains on the Human3.6M, GRAB, and HumanEva-I datasets.
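A hedged sketch of what test-time personalization can look like: split the new subject's observed pose history into a pseudo past/future pair and fine-tune a copy of the pretrained forecaster on it, with Gaussian noise injection standing in for the explicit enhancer. The forward signature model(past, horizon=...) and all names are illustrative assumptions, not the paper's code.

```python
import copy
import torch

def personalize(model, observed_poses, steps=10, lr=1e-4, noise_std=0.01):
    """Adapt a pretrained pose forecaster to one unseen subject at test
    time, without any extra labeled training data.

    observed_poses: (T, J, 3) history of the subject's 3D joint positions.
    """
    model = copy.deepcopy(model)          # keep the generic weights intact
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    t = observed_poses.shape[0] // 2
    past, future = observed_poses[:t], observed_poses[t:]
    for _ in range(steps):
        # Explicit enhancer (simplified): perturb the input so the adapted
        # parameters do not overfit a single noisy observation.
        noisy_past = past + noise_std * torch.randn_like(past)
        pred = model(noisy_past, horizon=future.shape[0])
        loss = (pred - future).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model                          # subject-specific parameters
```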

MPI-Flow: Learning Realistic Optical Flow with Multiplane Images

Authors: Liang Yingping (Xiaohongshu intern & Beijing Institute of Technology), Liu Jiaming (Xiaohongshu), Zhang Debing (Xiaohongshu), and others.

We propose a method that generates realistic optical-flow training data from a single real-world image by constructing a multiplane-image (MPI) representation with multiple depth layers. A camera-motion-aware flow computation per plane and an independent object-motion module produce accurate flow fields, and a depth-aware refinement module resolves occlusion artifacts. The approach yields state-of-the-art performance in both supervised and unsupervised optical-flow training, greatly improving generalization to real-world scenes.
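The per-plane, camera-motion-aware flow has a clean geometric core: a plane at depth d with normal n, viewed under camera motion (R, t) and intrinsics K, induces the homography H = K (R - t n^T / d) K^-1, and the flow is each pixel's displacement under H. The sketch below implements that textbook relation; conventions and names are our assumptions, not MPI-Flow's released code.

```python
import numpy as np

def plane_flow(K, R, t, normal, depth, height, width):
    """Optical flow induced on one MPI depth plane by camera motion.

    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation;
    normal: (3,) plane normal, e.g. np.array([0.0, 0.0, 1.0]);
    depth: plane depth d. Returns flow of shape (height, width, 2).
    """
    # Plane-induced homography: H = K (R - t n^T / d) K^-1
    H = K @ (R - np.outer(t, normal) / depth) @ np.linalg.inv(K)

    # Homogeneous pixel grid of the source image.
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T

    # Warp and de-homogenize to get target pixel positions.
    warped = H @ pts
    warped = (warped[:2] / warped[2]).T.reshape(height, width, 2)

    # Flow = displacement from source pixel to warped position.
    return warped - np.stack([xs, ys], axis=-1)
```

Compositing such per-plane flows by plane visibility, plus the independent object-motion term, would yield the full field; occlusions between layers are what the paper's depth-aware refinement module addresses.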

QR codes in the original page provide direct access to the papers and source code for each work.

computer vision, 3D avatar, Human Pose Forecasting, ICCV 2023, optical flow, Video Instance Segmentation
Written by

Xiaohongshu Tech REDtech

The official account of the Xiaohongshu tech team, sharing technical innovations and problem-solving insights.
