How OPPO’s AndesVL Is Revolutionizing On‑Device Multimodal AI

OPPO AI Center introduces AndesVL, an open‑source multimodal large model family ranging from 0.6B to 4B parameters and adapted across the full on‑device stack. It targets high‑performance, privacy‑preserving, low‑latency AI on mobile devices. This summary covers its architecture, training pipeline, on‑device optimizations, and benchmark results.

DataFunSummit

Technical Background

Current on‑device multimodal models are limited in performance, capability, and adaptability, which holds back high‑performance, privacy‑focused, low‑latency applications on AI phones. To address these challenges, OPPO AI Center released AndesVL, an open‑source multimodal model adapted across the full on‑device stack.

Model Architecture

AndesVL uses one unified architecture across four model sizes, from 0.6B to 4B parameters, and supports both Instruct and Thinking modes. It combines a Vision Transformer encoder (AIMv2‑300M or SigLIP‑2‑base), a multi‑layer perceptron (MLP), and a Qwen3‑based large language model. It integrates 2D‑RoPE and NaViT so that images can be processed at flexible resolutions.
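To make the flexible-resolution idea more concrete, here is a minimal sketch of how a native-resolution image can be turned into a patch grid and how a 2D rotary embedding can encode each patch's row and column. It is an illustration under stated assumptions, not AndesVL's implementation: the patch size of 14, the base of 10000, the even split of the head dimension between the two axes, and all function names are hypothetical choices made for this example.

```python
import math


def patch_grid(height, width, patch=14):
    """Patch rows and columns for an image kept at its native resolution.

    patch=14 is an assumed value for illustration only.
    """
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows, cols


def rope_2d_angles(row, col, head_dim, base=10000.0):
    """Rotation angles for one patch.

    Half of the rotated pairs are driven by the row index and
    the other half by the column index (an assumed layout).
    """
    if head_dim % 4 != 0:
        raise ValueError("head_dim must be divisible by 4")
    n_freq = head_dim // 4  # number of rotated pairs per axis
    freqs = [base ** (-i / n_freq) for i in range(n_freq)]
    return [row * f for f in freqs] + [col * f for f in freqs]


def apply_rope(vec, angles):
    """Rotate consecutive (even, odd) pairs of vec by the given angles."""
    out = []
    for k, theta in enumerate(angles):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))
```

With the assumed patch size, a 448×336 image gives a 32×24 grid, so the sequence length follows the image instead of a fixed square crop. Because each pair is rotated by an angle proportional to the position, the dot product between a rotated query and a rotated key depends only on the row and column offsets between the two patches.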

Training Scheme

Training consists of a pre‑training phase (visual‑language alignment, joint visual‑language training, and multi‑task training) followed by a post‑training phase (supervised fine‑tuning, mixed preference optimization, and reinforcement learning). Post‑training relies on techniques such as MPO (Mixed Preference Optimization) and GRPO (Group Relative Policy Optimization), and extensive data filtering is used to build high‑quality multimodal training data.
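The core idea behind GRPO can be sketched in a few lines: several responses are sampled for the same prompt, each gets a scalar reward, and each response's advantage is its reward normalized against the group's mean and standard deviation. The snippet below shows only that normalization step. The function name, the `eps` term, and the use of the population standard deviation are assumptions for this sketch; the article does not describe how AndesVL configures GRPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    rewards: scalar rewards for responses to the same prompt.
    eps: small constant (assumed) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers, two judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.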

On‑Device Deployment Solutions

To enable efficient mobile deployment, OPPO applied model sparsification (up to 75% sparsity, down to 1.8 bits per weight, BPW), quantization‑aware training (QAT) with a dual‑weight framework, and the QALFT method for LoRA‑specific quantization. In addition, encoding compression (OKV) and speculative decoding (EAGLE‑2, HASS) deliver up to a 6.7× decoding speedup and support a 128K context length.
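The speedup from speculative decoding comes from letting a cheap draft model propose several tokens that the target model then verifies. The sketch below shows a plain greedy, single-chain version of that draft-then-verify loop. It is deliberately simpler than EAGLE‑2 or HASS, which the article names but does not detail, and every name in it (`speculative_step`, `generate`, the `target` and `draft` callables, the draft length `k`) is hypothetical.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int = 4) -> List[int]:
    """One draft-then-verify step; returns the tokens to append to prefix."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks each proposal. A real system scores all
    #    positions in one batched forward pass; this loop is for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target(ctx)
        if expected != token:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return the number of verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out += speculative_step(target, draft, out, k)
        steps += 1
    return out[:len(prompt) + max_new], steps
```

With greedy verification the output is identical to decoding with the target model alone. The gain is that each verify step can emit up to `k + 1` tokens, so the number of target passes drops whenever the draft model agrees with the target.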

Evaluation Results

AndesVL attains top scores on more than 30 benchmarks, surpassing competing models in overall ability, mathematical reasoning, visual‑text understanding, multi‑image comprehension, general QA, hallucination suppression, and multilingual tasks. Both the 4B model and the smaller variants show leading performance across these domains.

Future Outlook

The team plans to advance visual encoder design, post‑training strategies, and knowledge distillation, and to extend the work toward vision‑language‑speech tri‑modal models, further pushing mobile AI capabilities.
