Recent Progress in Vision-Language Models (VLMs)
Over the past year, Vision‑Language Models have surged from early multimodal experiments to competitive open‑source systems that rival GPT‑4, driven by higher‑resolution image processing, richer vision encoders, better projection layers, and larger curated datasets. Even so, they still face evaluation difficulties, hallucinations, speed limits, and limited multimodal output.
Since the release of ChatGPT, the field of artificial intelligence has undergone a rapid transformation, especially in Vision‑Language Models (VLMs), which combine visual perception and natural‑language understanding to enable tasks such as image captioning, visual question answering, and automatic annotation.
This article reviews the major advances of the past year without delving into detailed paper specifics, aiming to present the core ideas of each work in a concise manner.
Introduction: Early language models like ChatGPT lacked visual capabilities. The March 2023 launch of GPT‑4 dramatically changed expectations by achieving strong image‑understanding performance and setting new records on visual benchmarks.
In November 2022 OpenAI released ChatGPT (built on GPT‑3.5), followed by the multimodal GPT‑4 in March 2023. The breakthrough sparked a surge of VLM research, both open‑source and proprietary, as reflected in the rapidly updating model leaderboards.
From the OpenCompass leaderboard we observe:
GPT‑4 remains the undisputed leader, especially the latest GPT‑4o version.
Many emerging models quickly close the gap, surpassing earlier GPT‑4V releases.
Chinese‑developed models still lag behind OpenAI, but are comparable to Google’s Gemini Vision.
Open‑source models now rank among the top, sometimes outperforming closed‑source counterparts.
For example, the open‑source model InternVL‑Chat‑1.5 demonstrates impressive document QA, image description, and visual QA capabilities.
Key challenges highlighted:
Evaluating large models is complex; current metrics serve only as references.
Hallucinations, output instability, and limited generalisation remain issues, especially for smaller models.
Processing speed constraints make small models preferable for real‑time applications.
Summary: Open‑source VLMs have progressed from demo‑level to practical utility within a year, yet several limitations persist.
Representative work – LLaVA (2024): With only a few hours of training, LLaVA converts a large language model (Vicuna‑7B) into a VLM by adding a vision encoder (ViT‑L/14) and a simple projection layer. The training data consists of filtered CC3M image‑text pairs (≈595K) and instruction‑following data generated by GPT‑4 (≈158K), plus a ScienceQA multimodal set (≈21K).
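The wiring described above (a frozen vision encoder, a trainable linear projection, and the LLM consuming the result as a token prefix) can be sketched in a few lines. This is a toy numpy sketch with made‑up dimensions, not LLaVA's actual code; the real model projects 1024‑dim ViT‑L/14 patch features into the LLM's embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real numbers (e.g., 4096 for Vicuna-7B) are much larger.
n_patches, d_vision, d_llm = 16, 64, 128

patch_features = rng.standard_normal((n_patches, d_vision))  # frozen ViT output
W = rng.standard_normal((d_vision, d_llm)) * 0.02            # trainable projection

visual_tokens = patch_features @ W                # (n_patches, d_llm)
text_tokens = rng.standard_normal((8, d_llm))     # embedded prompt tokens

# The LLM sees the concatenated sequence: [visual tokens][text tokens].
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (24, 128)
```

Only `W` is trained in LLaVA's first stage, which is why the alignment step is so cheap.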
Remaining issues of the initial LLaVA version include:
Fixed 224×224 resolution, insufficient for many real‑world tasks.
Questionable suitability of ViT as the vision encoder and of a simple linear layer as the projection.
Limited training data and frozen vision encoder.
Inability to output detection boxes, segmentation masks, or strong OCR results.
High‑resolution solutions proposed in recent works:
Introduce a high‑resolution branch (e.g., Vary) that processes large images (1024×1024) while keeping the token count comparable.
Sliding‑window or patch‑wise processing (e.g., UReader, Monkey, LLaVA‑UHD, InternLM‑XComposer2‑4KHD, InternVL‑1.5) to handle arbitrary resolutions.
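The sliding‑window idea boils down to cutting an arbitrary‑resolution image into fixed‑size crops the encoder can handle. A minimal sketch, assuming non‑overlapping tiles with zero‑padding at the borders (the tile size and padding policy here are illustrative, not any specific model's recipe):

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 448) -> list[np.ndarray]:
    """Split an H x W x C image into non-overlapping tile x tile crops,
    zero-padding the borders so every crop is full-size."""
    h, w, _ = img.shape
    pad_h = (tile - h % tile) % tile
    pad_w = (tile - w % tile) % tile
    padded = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))
    return [
        padded[y:y + tile, x:x + tile]
        for y in range(0, padded.shape[0], tile)
        for x in range(0, padded.shape[1], tile)
    ]

img = np.zeros((1024, 1024, 3), dtype=np.uint8)
tiles = tile_image(img, tile=448)   # 1024 pads to 1344, giving a 3x3 grid
print(len(tiles))  # 9
```

Models in this family typically also append a downscaled thumbnail of the whole image so the LLM retains global context alongside the local crops.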
Data is the bottleneck for VLMs. Useful data categories include:
Basic caption data.
Detection and segmentation annotations.
Document OCR data.
Complex reasoning chains.
3D/video data.
Domain‑specific images (medical, radar, etc.).
Because VLMs currently rely on image‑text pairs, unsupervised pre‑training such as pure image self‑supervision remains limited. Recent efforts such as OmniCorpus and work on dataset selection discuss how to collect and curate high‑quality multimodal data.
Model‑structure innovations in the past year:
Vision encoders: swapping ViT for DeiT, CLIP, MAE, DINO, or dual encoders (SAM + SigLIP) as in DeepSeek‑VL and Fuyu.
Projection layers: beyond MLPs, using Q‑Former, Resampler, or C‑Abstractor for richer feature transformation.
LLM backbones: most works reuse open‑source LLMs; some explore separate token handling for image vs. text (e.g., InternLM‑XComposer2, BEiT‑3).
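The Q‑Former/Resampler‑style projections mentioned above share one mechanism: a small set of learnable queries cross‑attends over the many patch tokens, compressing them into a fixed, short sequence for the LLM. A single‑head numpy sketch with toy dimensions (real modules stack multiple heads and layers):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_img, d = 576, 64        # many patch tokens in...
k_queries = 32            # ...few learned query tokens out

img_feats = rng.standard_normal((n_img, d))
queries = rng.standard_normal((k_queries, d))   # learnable in a real model
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))

# Cross-attention: each query mixes information from all 576 image tokens,
# so the LLM only ever sees 32 visual tokens regardless of input length.
attn = softmax((queries @ Wq) @ (img_feats @ Wk).T / np.sqrt(d))
out = attn @ (img_feats @ Wv)
print(out.shape)  # (32, 64)
```

The fixed output length is the key design choice: it decouples the LLM's context cost from the image resolution, at the price of an information bottleneck.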
Integrating generation into VLMs:
Continuous image embeddings fed to diffusion models (e.g., EMU2, DreamLLM, NExT‑GPT, SEED‑X).
Discrete image tokens via VQ‑VAE (e.g., LaVIT, Unified‑IO 2), with the recent MAGVIT‑v2 expanding the token vocabulary to 262K.
Spatial autoregressive generation of image tokens (e.g., VAR) to match the speed of early diffusion models.
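The discrete‑token route above rests on one core operation: mapping each continuous image feature to the index of its nearest codebook entry, so the image becomes a sequence the LLM can predict like text. A minimal numpy sketch of that lookup (toy codebook size; real tokenizers use far larger vocabularies, up to the 262K mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 512, 32
codebook = rng.standard_normal((vocab, d))   # learned in a real VQ-VAE

def quantize(feats: np.ndarray) -> np.ndarray:
    """Map each feature vector to its nearest codebook index (L2 distance).
    Uses ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2; ||f||^2 is constant per
    row, so it can be dropped from the argmin."""
    d2 = (codebook ** 2).sum(axis=1) - 2 * feats @ codebook.T
    return d2.argmin(axis=1)

feats = rng.standard_normal((16, d))         # e.g., a 4x4 grid of features
token_ids = quantize(feats)
print(token_ids.shape)  # (16,)
```

Decoding reverses the lookup (`codebook[token_ids]`), which is what lets a single autoregressive transformer emit both text and image tokens from one vocabulary.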
Beyond technical details, the article reflects on the relationship between vision and intelligence. It argues that vision provides new “connection” knowledge, expands the knowledge boundary, improves learning efficiency, and helps correct textual biases. However, current VLMs mainly augment large language models and have not yet demonstrated a clear superiority in logical reasoning.
Conclusion: Open‑source VLMs are still in an early stage; data, multimodal output, and evaluation remain open problems. Nevertheless, the rapid progress over the past year suggests that future VLMs will become increasingly capable and play a pivotal role in the pursuit of artificial general intelligence.
DaTaobao Tech
Official account of DaTaobao Technology