How OPPO’s AndesVL Is Revolutionizing On‑Device Multimodal AI
OPPO AI Center introduces AndesVL, an open-source family of multimodal large models ranging from 0.6B to 4B parameters and adapted end to end for on-device use. It targets high-performance, privacy-preserving, low-latency AI on mobile devices. This summary covers its architecture, training pipeline, on-device optimizations, and benchmark results.
Technical Background
Current on-device multimodal models are limited in performance, capability, and adaptability, which holds back high-performance, privacy-focused, low-latency AI applications on phones. To address these limitations, OPPO AI Center released AndesVL, an open-source multimodal model adapted across the full stack.
Model Architecture
AndesVL uses one unified architecture across four size tiers (0.6B to 4B parameters) and supports both Instruct and Thinking modes. Each model combines three components: a Vision Transformer encoder (AimV2-300M or SigLIP-2-base), a multi-layer perceptron (MLP) that connects the encoder to the language model, and a Qwen3-based large language model. The design integrates 2D-RoPE and NaViT so that images can be handled at flexible resolutions.
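The source names 2D-RoPE and NaViT-style flexible resolution but does not show how either is implemented. As a rough illustration only, the sketch below cuts an image of arbitrary size into a patch grid and applies one common formulation of 2D rotary position embedding, where half of the channels are rotated by the patch row and the other half by the patch column. The function names, the patch size of 14, and the base of 10000 are assumptions for this example, not details taken from AndesVL.

```python
import math


def patch_positions(height, width, patch=14):
    """Return (row, col) for every patch of an image at its native size.

    The patch size is an assumed value; partial patches at the border
    are counted as full patches (i.e. the image would be padded).
    """
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return [(r, c) for r in range(rows) for c in range(cols)]


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Apply rotary embedding with rows on the first half, columns on the second."""
    if len(vec) % 4 != 0:
        raise ValueError("vector length must be divisible by 4")
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)
```

Because each pair is only rotated, the attention score between two patches depends on their row and column offsets rather than on their absolute positions, which is what lets the same encoder weights serve grids of different shapes.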
Training Scheme
Training proceeds in two phases. Pre-training covers visual-language alignment, joint visual-language training, and multi-task training. Post-training covers supervised fine-tuning, mixed preference optimization (MPO), and reinforcement learning, using techniques such as GRPO. Extensive data filtering yields high-quality multimodal training data.
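The source mentions GRPO without describing it. The central idea in the GRPO literature is to score each sampled response relative to the other responses drawn for the same prompt, instead of training a separate value model. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration, and nothing here reflects OPPO's actual reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.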
On‑Device Deployment Solutions
To enable efficient mobile deployment, OPPO applied model sparsification (up to 75% sparsity, 1.8 bits per weight, or BPW), quantization-aware training (QAT) with a dual-weight framework, and the QALFT method for LoRA-specific quantization. In addition, encoding compression (OKV) and speculative decoding (EAGLE-2, HASS) achieve up to a 6.7× decoding speedup and support a 128K context length.
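To give a feel for why speculative decoding speeds up generation, here is a deliberately simplified sketch. A cheap draft model proposes a few tokens, and the target model checks them in one verification pass, keeping the longest matching prefix plus one token of its own. This is a plain greedy, linear-draft variant and not EAGLE-2 or HASS, which the source names but does not describe. All function names and the toy models are hypothetical.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=3):
    """Greedy speculative decoding with a linear draft of k tokens.

    target_next / draft_next map a token list to the next token.
    Returns the generated sequence and the number of target verification
    passes (on real hardware, each pass is one batched forward call).
    """
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < limit:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model verifies every proposed position.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # keep the target's correction
                break
        else:
            # Every draft token matched: the target adds one bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:limit], target_passes


def toy_target(seq):
    """Toy 'large' model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Toy 'small' model: agrees with the target except right after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


output, passes = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
```

The output is identical to what the target model would produce on its own with greedy decoding. The saving comes from needing fewer target passes than generated tokens whenever the draft model guesses correctly.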
Evaluation Results
AndesVL attains top scores on over 30 benchmarks, surpassing competitors in overall ability, mathematical reasoning, visual-text understanding, multi-image comprehension, general QA, hallucination suppression, and multilingual tasks. Both the 4B model and the smaller models demonstrate leading performance across diverse verticals.
Future Outlook
The team plans to advance visual encoder designs, post-training strategies, and knowledge distillation, and to integrate vision, language, and speech in tri-modal models, further pushing mobile AI capabilities.
DataFunSummit
