Multimodal Large Models: Recent Advances, Industry Impact, and Challenges – An Expert Interview
In a detailed interview, Tsinghua researcher Zhao Sicheng and JD Retail senior director Peng Changping discuss the latest progress in multimodal large models, their practical applications in advertising and e‑commerce, persistent challenges such as hallucinations and data alignment, and the skills engineers need to thrive in the emerging AI era.
2024 continues to be dominated by AI, with Sora, OpenAI's model that generates videos of up to 60 seconds, sparking fresh discussion of multimodal large models. InfoQ invited Tsinghua associate researcher Zhao Sicheng, a globally recognized AI scholar, and Peng Changping, JD Retail senior technology director, to explore recent advances, industry impact, and future challenges.
Both experts agree that multimodal AIGC is reshaping advertising and retail experiences. Traditional ads rely on manual design, while multimodal models enable rapid, voice‑driven content creation, reducing cost and iteration time. However, current models suffer from hallucinations, limited Chinese language understanding, and weaker local object perception, especially in fine‑grained or sentiment analysis tasks.
Peng highlights the need to address hallucinations when applying large models to search and recommendation systems, mentioning retrieval‑augmented generation (RAG) and supervised fine‑tuning (SFT) on domain data as partial solutions, yet reliability remains a major hurdle.
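To make the RAG idea concrete, here is a minimal, self‑contained sketch of the retrieval half of the pipeline. It uses a toy bag‑of‑words "embedding" and cosine similarity purely for illustration; the corpus, product names, and `build_prompt` helper are all hypothetical, and a production system would use a trained encoder and a vector index instead.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; real RAG systems use a trained text encoder.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    # Grounding the model in retrieved passages is what curbs hallucination:
    # the model is asked to answer from evidence, not from memory.
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical product-knowledge snippets.
corpus = [
    "The X100 phone ships with a 5000 mAh battery.",
    "The X100 display is 6.7 inches.",
    "Free shipping applies to orders over 99 yuan.",
]
prompt = build_prompt("What battery does the X100 have?", corpus)
```

The prompt is then passed to the generator model; because the battery fact appears verbatim in the context, the model has no need to invent one.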
Data alignment across modalities is a core challenge: acquiring large‑scale, aligned multimodal datasets is difficult, and modeling images, video, and audio efficiently remains far behind text processing. Unified models that excel across diverse tasks are still lacking.
The interview also covers practical applications: multimodal sentiment analysis improves e‑commerce feedback accuracy; AI‑powered shopping assistants can personalize product displays and anticipate user needs, potentially transforming C‑end experiences, while B‑end scenarios may adopt these technologies earlier due to efficiency gains.
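One common way multimodal sentiment systems combine signals is late fusion: score each modality separately, then merge the scores. The sketch below is an illustrative weighted average with renormalization for missing modalities; the weights and modality names are assumptions for the example, not values from the interview.

```python
def fuse_sentiment(scores, weights=None):
    """Late-fusion sentiment: weighted average of per-modality scores.

    scores  -- dict mapping modality name to a sentiment score in [-1, 1],
               or None if that modality is absent (e.g. a review with no image).
    weights -- dict of relative modality weights (illustrative defaults below).
    """
    if weights is None:
        weights = {"text": 0.5, "image": 0.3, "audio": 0.2}
    present = [m for m, s in scores.items() if s is not None]
    if not present:
        raise ValueError("at least one modality score is required")
    # Renormalize over modalities that are actually present.
    total = sum(weights[m] for m in present)
    return sum(scores[m] * weights[m] for m in present) / total

# A review whose text is positive but whose photo looks negative.
mixed = fuse_sentiment({"text": 0.8, "image": -0.4, "audio": None})
```

Renormalizing over present modalities keeps the output comparable whether a review carries text only, text plus image, or all three channels.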
Both speakers stress the importance of rapid learning and interdisciplinary teams. Younger engineers should dive into papers and follow leading researchers, whereas senior staff should foster knowledge sharing and focus on high‑level direction. Core competencies in the AGI era include co‑design of data, algorithms, and compute, unsupervised task formulation, and large‑scale distributed training.
Finally, they discuss organizational implications: small, cross‑functional teams can iterate faster, and engineering focus should shift from handcrafted features (the logistic‑regression era) and model architecture tuning (the DNN era) to data‑algorithm‑compute co‑design for large‑scale Transformers.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
