Alibaba Qwen‑3.5 Small Models: 0.8B Parameters Enable Video on Edge Devices

Alibaba released four Qwen‑3.5 models (0.8B‑9B) that use a Gated DeltaNet hybrid‑attention architecture and native multimodal training to achieve 262k‑token contexts, outperform larger rivals on visual, reasoning, and math benchmarks, and run video analysis on phones and laptops, though they still demand significant VRAM.


Alibaba Tongyi Qwen released the Qwen‑3.5 series, comprising four compact models (0.8B, 2B, 4B, and 9B) built on a Gated DeltaNet hybrid‑attention architecture. The design interleaves three linear‑attention layers with one full‑attention layer, so attention‑state memory stays roughly constant as context grows, and full attention is activated only where precise token‑to‑token computation is needed. This 3:1 ratio is what lets even the 0.8B model support a 262k‑token context window.
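To make the layer pattern concrete, here is a minimal PyTorch sketch of a 3:1 hybrid stack. The LinearAttention module below is a generic kernelized‑attention stand‑in, not Alibaba's actual Gated DeltaNet layer, and all class names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FullAttention(nn.Module):
    """Standard softmax attention: O(n^2) compute, KV cache grows with context."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class LinearAttention(nn.Module):
    """Stand-in for a linear-attention layer such as Gated DeltaNet:
    O(n) compute and a fixed-size state instead of a growing KV cache.
    (Non-causal kernelized attention for brevity; the real layer also
    applies gating and a delta-rule state update.)"""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax over the feature dim (q) / sequence dim (k) acts as the
        # kernel feature map; (k^T v) is a dim x dim summary, so cost is O(n).
        q = self.q(x).softmax(dim=-1)
        k = self.k(x).softmax(dim=-2)
        return q @ (k.transpose(-2, -1) @ self.v(x))

class HybridStack(nn.Module):
    """Interleave linear and full attention 3:1, as described above."""
    def __init__(self, dim: int, depth: int = 12):
        super().__init__()
        self.layers = nn.ModuleList(
            FullAttention(dim) if (i + 1) % 4 == 0 else LinearAttention(dim)
            for i in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = x + layer(x)  # residual connection
        return x

model = HybridStack(dim=256)
print(model(torch.randn(2, 128, 256)).shape)  # torch.Size([2, 128, 256])
```

Because only every fourth layer keeps a quadratic‑cost KV cache, cache memory grows at roughly a quarter the rate of a pure‑transformer stack, which is what makes a 262k‑token window tractable at 0.8B parameters.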

Native Multimodal Design

The models are trained on multimodal tokens from the beginning (early fusion) rather than having a vision encoder bolted on afterward. The visual encoder uses 3‑D convolutions to capture motion across frames, allowing the 4B and 9B variants to understand UI screens and identify objects in video, capabilities that previously required models an order of magnitude larger.
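A toy sketch of the 3‑D convolution idea: a patch embedder whose temporal kernel mixes adjacent frames into each token, so motion is visible at the token level rather than only across per‑frame tokens. Kernel sizes and dimensions here are illustrative, not the released encoder's.

```python
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """Turn a clip (B, C, T, H, W) into a token sequence with one 3-D conv.

    The temporal kernel (t_patch=2 here) spans adjacent frames, so each
    output token encodes local motion, unlike per-frame 2-D patching.
    """
    def __init__(self, dim: int = 768, patch: int = 14, t_patch: int = 2):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels=3, out_channels=dim,
            kernel_size=(t_patch, patch, patch),
            stride=(t_patch, patch, patch),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.proj(clip)                  # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim) tokens

# 8 frames of 224x224 video -> 4 temporal slices x 16x16 spatial patches
tokens = VideoPatchEmbed()(torch.randn(1, 3, 8, 224, 224))
print(tokens.shape)  # torch.Size([1, 1024, 768])
```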

Performance Highlights

On the MMMU‑Pro visual‑reasoning benchmark the 9B model scores 70.1, beating Gemini 2.5 Flash‑Lite's 59.7. On GPQA‑Diamond it reaches 81.7, surpassing GPT‑OSS‑120B's 80.1 despite that model being roughly 13× larger. Video‑MME results are 84.5 (9B) and 83.5 (4B) versus Gemini 2.5's 74.6. On the Harvard‑MIT Mathematics Tournament (HMMT) benchmark, the 9B model attains 83.2 and the 4B variant scores 74.0.

Developer Feedback

Developers responded enthusiastically. One noted that the 4B model feels as strong as the earlier 80B‑A3B, and that the 9B model matches GPT‑OSS‑120B while being 13× smaller and able to run on a laptop. Karan Kendre reported running the models locally, at no cost, on an M1 MacBook Air, and Xenova from Hugging Face highlighted browser‑based video analysis.

Practical Use Cases

Visual workflow automation: Pixel‑level positioning enables desktop or mobile UI navigation, form filling, and file organization via natural‑language commands.

Document parsing: Exceeds 90% on document‑understanding benchmarks, replacing separate OCR and layout pipelines.

Code processing: Accepts up to 1M tokens (≈400k lines of code) for production‑grade refactoring or automated debugging.

Edge analysis: The 0.8B and 2B models run offline on mobile devices, delivering video summarization (up to 60 s at 8 FPS) and spatial reasoning without draining the battery; a minimal local‑inference sketch follows this list.
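For a rough picture of what local inference looks like, the sketch below follows the Hugging Face Transformers chat‑template pattern used by recent Qwen vision‑language releases. The model ID is a placeholder (the actual Qwen‑3.5 repository names may differ), and exact processor behavior should be checked against the released checkpoints.

```python
# pip install transformers accelerate pillow torch
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "Qwen/Qwen3.5-2B-Instruct"  # placeholder; check the actual repo name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

frame = Image.open("frame.jpg")  # e.g., one frame sampled from a 60 s clip
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe what is happening in this frame."},
]}]

# Render the chat template, then bind the actual image to its placeholder.
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = processor(text=[prompt], images=[frame],
                   return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```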

Known Limitations

Hallucination cascades can occur in multi‑step workflows, where early mistakes propagate into nonsensical plans. Debugging complex legacy code remains challenging despite strong code‑generation ability. Even the "small" 9B model demands substantial VRAM for high‑throughput inference.

Summary

The Qwen‑3.5 series demonstrates that Gated DeltaNet and native multimodal training give small models capabilities once reserved for much larger systems. A 0.8B model can process video on phones, while the 9B model outperforms 120B‑class competitors on several benchmarks. The models are publicly available on Hugging Face and ModelScope, and Ollama and Unsloth already support them (see the quick‑start sketch below).
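For local experimentation, Ollama exposes a small HTTP API once a model has been pulled. A minimal sketch, assuming `ollama serve` is running; the `qwen3.5:9b` tag is a placeholder, so check the registry for the actual name.

```python
import json
import urllib.request

# Assumes the Ollama daemon is running and the model has been pulled,
# e.g. `ollama pull qwen3.5:9b` (tag is a placeholder; check the registry).
payload = {
    "model": "qwen3.5:9b",
    "prompt": "In two sentences, what does a 3:1 linear-to-full "
              "attention ratio buy you?",
    "stream": False,  # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```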

Tags: multimodal AI, edge AI, benchmark, small language models, Qwen3.5, Gated DeltaNet