How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning
Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.
Zhipu AI announced the GLM-4.1V-Thinking series, releasing the open‑source GLM-4.1V-9B-Thinking model, which achieves leading performance among 10B‑scale visual‑language models and surpasses larger models such as Qwen‑2.5‑VL‑72B and GPT‑4o on multiple benchmarks.
The model can parse videos up to two hours long, perform deep image analysis, and understand sports tactics. It also supports GUI Agent capabilities across web pages, desktop applications, and mobile screens, carrying out actions such as clicking and scrolling.
Key research contributions include curriculum‑sampling reinforcement learning (RL) that boosts reasoning, 2D‑RoPE for handling extreme aspect ratios and high‑resolution inputs, and a 3D‑RoPE extension in the language decoder that enhances spatial understanding of multimodal inputs.
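To make the rotary-embedding idea concrete, here is a minimal sketch of a 2D rotary position embedding in plain Python. It is an illustration under stated assumptions, not the model's actual implementation: the function names `rope_1d` and `rope_2d`, the frequency base of 10000, and the even split of channels between the row and column axes are all assumed here. The property it demonstrates is that the attention score between two rotated vectors depends only on their relative offset on the patch grid.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "rotary embedding needs an even number of channels"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_2d(vec, row, col):
    """Assumed layout: first half of the channels encodes the row, second half the column."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [-0.6, 0.8, 0.1, -0.3, 0.4, 1.5, -0.7, 0.2]

# Both pairs of positions differ by the same offset (2 rows, 3 columns).
near = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
far = dot(rope_2d(q, 13, 45), rope_2d(k, 11, 42))
```

Here `near` and `far` agree up to floating-point error, which is why this kind of encoding can cope with image grids of very different shapes. A 3D variant would follow the same pattern with a third channel group for an extra axis such as time.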
GLM-4.1V-Thinking’s architecture comprises three core modules: a Vision Transformer encoder (AIMv2-Huge), an MLP projector, and a GLM language decoder. To process video efficiently, the encoder replaces its 2-D convolutions with 3-D convolutions. It retains the absolute positional embeddings and adapts them to variable input resolutions through bicubic interpolation.
Training proceeds in three stages: multimodal pre-training (on image-caption, OCR, grounding, and instruction data), long-context continual training (on video frames and sequences longer than 8K tokens), and supervised fine-tuning on a high-quality chain-of-thought dataset covering math, dialogue, and agent planning.
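The efficiency gain from 3-D convolutions can be seen with some simple token arithmetic. The sketch below is an assumption-laden illustration: the helper `count_visual_tokens`, the spatial patch size of 14, and the temporal patch size of 2 are hypothetical values chosen for the example, not figures taken from the source.

```python
import math

def count_visual_tokens(frames, height, width, patch=14, temporal_patch=2):
    """Count the tokens a patch-embedding layer would emit for a clip.

    `patch` and `temporal_patch` are illustrative defaults (assumptions).
    A temporal patch of 1 corresponds to an ordinary per-frame 2-D convolution.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    # Frames are grouped along time; a trailing partial group still costs one step.
    temporal_steps = math.ceil(frames / temporal_patch)
    return temporal_steps * (height // patch) * (width // patch)

per_frame = count_visual_tokens(8, 224, 224, temporal_patch=1)
grouped = count_visual_tokens(8, 224, 224, temporal_patch=2)
```

Under these assumptions, grouping two frames per patch halves the number of visual tokens passed on to the projector and decoder, which matters for long videos.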
Curriculum‑sampling RL combines RL with verifiable rewards (RLVR) and human‑feedback RL (RLHF), applying a difficulty‑ordered curriculum across STEM problem solving, multimodal grounding, GUI agent tasks, and complex instruction execution, yielding strong cross‑domain generalization.
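A difficulty-ordered curriculum can be sketched as follows. This is a simplified illustration, not the sampling scheme from the technical report: the `Task` type, the numeric `difficulty` score, and the rule of unlocking harder tasks stage by stage are all assumptions made for the example.

```python
import math
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    domain: str        # e.g. "stem", "grounding", "gui_agent", "instruction"
    difficulty: float  # hypothetical score in [0, 1]; higher means harder

def curriculum_pool(tasks, stage, num_stages):
    """Return the tasks unlocked at `stage` (0-based), easiest first."""
    ordered = sorted(tasks, key=lambda t: t.difficulty)
    cutoff = max(1, math.ceil(len(ordered) * (stage + 1) / num_stages))
    return ordered[:cutoff]

def sample_batch(tasks, stage, num_stages, batch_size, rng):
    """Draw a training batch uniformly from the currently unlocked pool."""
    pool = curriculum_pool(tasks, stage, num_stages)
    return [rng.choice(pool) for _ in range(batch_size)]

tasks = [
    Task("multi-step proof", "stem", 0.9),
    Task("locate the button", "grounding", 0.2),
    Task("book a flight", "gui_agent", 0.7),
    Task("format as a table", "instruction", 0.4),
]
rng = random.Random(0)
early = sample_batch(tasks, stage=0, num_stages=4, batch_size=5, rng=rng)
late = curriculum_pool(tasks, stage=3, num_stages=4)
```

In the first stage only the easiest task is sampled, and by the last stage every task is in the pool. A real system would presumably estimate difficulty from the model's own success rate rather than from a fixed score.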
The model is available on GitHub, ModelScope, and Hugging Face for local deployment, with an online demo supporting image, video, PPT, and PDF inputs. The technical report and code are publicly released.
In parallel, Zhipu launched an Agent Application Platform that aggregates Agent capabilities and model plugins, offering plug‑and‑play components and flexible orchestration for enterprises without building their own large‑model teams.
Funding updates: Zhipu secured a 1 billion CNY strategic investment from Pudong Capital and Zhangjiang Group, and has attracted local state‑capital backing from five Chinese cities, totaling over 25 billion CNY in 2024.
DataFunTalk
