Exploration and Practice of Multimodal Large Models at 360
This article presents 360's comprehensive exploration of image‑text multimodal large models, covering background concepts, research routes, three generations of model development, proprietary architectures like SEEChat, 360VL and Inner‑Adaptor, and real‑world AI applications across various products and services.
With the rapid development of large model technology, image‑text multimodal data is increasingly used across the Internet. This article shares 360's exploration and practice of multimodal large models.
It first introduces the background, defining what large models are and what is needed in the AI era, and discusses the rise of visual multimodal capabilities exemplified by GPT‑4, GPT‑4V and GPT‑4o.
The article then describes the fundamentals of multimodal large models (LMM/MLLM) and the two main research routes: native multimodal designs (e.g., KOSMOS, Gemini, GPT‑4o) and expert‑model stitching (e.g., BLIP‑2, Idefics2, InternVL2). It also compares the training costs of the two routes.
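The stitching route can be pictured as a small connector that maps the output of a pretrained vision encoder into the embedding space of a pretrained language model. The sketch below is a minimal, pure‑Python illustration of that data flow only. It assumes a single linear projection as the connector, and the names `linear`, `stitch`, `D_VISION` and `D_LLM` are hypothetical; it does not reproduce the connector of BLIP‑2, Idefics2, InternVL2 or any 360 model.

```python
import random


def linear(rows, weight, bias):
    """Apply y = x @ W + b to every row vector; weight has shape d_in x d_out."""
    return [
        [sum(x * w for x, w in zip(row, col)) + b
         for col, b in zip(zip(*weight), bias)]
        for row in rows
    ]


def stitch(patch_features, text_embeddings, weight, bias):
    """Project vision features to the LLM width and prepend them to the text tokens."""
    visual_tokens = linear(patch_features, weight, bias)
    return visual_tokens + text_embeddings


random.seed(0)
D_VISION, D_LLM = 4, 6  # toy widths (assumption); real models use hundreds or thousands

# Stand-ins for the outputs of a vision encoder (3 patches) and a text embedder (5 tokens).
patch_features = [[random.random() for _ in range(D_VISION)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(D_LLM)] for _ in range(5)]

# The connector weights are the part that would be trained in this toy setup.
weight = [[random.gauss(0.0, 0.02) for _ in range(D_LLM)] for _ in range(D_VISION)]
bias = [0.0] * D_LLM

sequence = stitch(patch_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 8 6
```

The language model then consumes `sequence` like any other token sequence, which is why this route can reuse existing expert models rather than training one model on all modalities from scratch.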
It outlines three generations of LMM development: the first generation (2022‑2023) focused on question answering over low‑resolution images, the second generation added target‑localization capabilities, and the third generation addresses high‑resolution input support and competition between modalities.
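A short calculation suggests why high‑resolution support is a distinct challenge. The sketch below assumes a ViT‑style encoder that emits one token per non‑overlapping square patch, with no token compression; the function name `visual_token_count` and the default patch size of 14 are illustrative assumptions, not details from the article.

```python
def visual_token_count(height, width, patch_size=14):
    """Count patch tokens for an encoder using non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


for side in (224, 336, 448, 896):
    print(side, visual_token_count(side, side))
# 224 256
# 336 576
# 448 1024
# 896 4096
```

Under these assumptions the token count grows with the square of the image side, so doubling the resolution quadruples the visual sequence the language model has to process.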
360’s own work is presented, including the Chinese cross‑modal model R2D2, the large‑scale Zero datasets, and the open‑source multimodal models SEEChat and its successor 360VL, which achieve state‑of‑the‑art performance on benchmarks such as MMMU.
A novel Inner‑Adaptor Architecture (IAA) is introduced that keeps the language model frozen while efficiently injecting visual information, achieving competitive results on multimodal and grounding benchmarks without degrading NLP abilities.
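The general idea of leaving a language model untouched while adding trainable layers for multimodal input can be shown with a toy model. The sketch below is not the IAA implementation: this summary does not describe where or how IAA's adaptor layers are placed, so the class `AdaptedModel`, the `adaptors` mapping and the scalar toy layers are all hypothetical. It only demonstrates that routing text‑only input past the adaptors leaves the original behaviour unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_scale_layer(factor: float) -> Layer:
    """Toy stand-in for a frozen pretrained layer."""
    return lambda hidden: [factor * h for h in hidden]


def make_shift_layer(offset: float) -> Layer:
    """Toy stand-in for a trainable adaptor layer."""
    return lambda hidden: [h + offset for h in hidden]


@dataclass
class AdaptedModel:
    frozen_layers: List[Layer]                                # never updated
    adaptors: Dict[int, Layer] = field(default_factory=dict)  # keyed by depth

    def forward(self, hidden: Vector, multimodal: bool) -> Vector:
        for depth, layer in enumerate(self.frozen_layers):
            # Adaptors run only on the multimodal path, before the frozen layer.
            if multimodal and depth in self.adaptors:
                hidden = self.adaptors[depth](hidden)
            hidden = layer(hidden)
        return hidden


frozen = [make_scale_layer(2.0), make_scale_layer(0.5), make_scale_layer(3.0)]
base = AdaptedModel(frozen)
adapted = AdaptedModel(frozen, adaptors={1: make_shift_layer(1.0)})

x = [1.0, -2.0]
print(base.forward(x, multimodal=False))     # [3.0, -6.0]
print(adapted.forward(x, multimodal=False))  # [3.0, -6.0], text path unchanged
print(adapted.forward(x, multimodal=True))   # [4.5, -4.5]
```

Because the frozen layers are shared and never modified, text‑only outputs of `adapted` match those of `base` exactly, which is the property the article describes as preserving NLP abilities.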
Finally, the article showcases practical deployments of 360’s multimodal models in products such as children’s smart watches, image‑tagging services, video surveillance, open‑world object detection, and security inspection, serving tens of thousands of enterprises.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.