Open-Source Small LLMs Reach GPT‑5‑Level Intelligence: One‑Stop Evaluation of Qwen 3.5, Gemma 4 and Other Top Models
A recent Artificial Analysis report finds that the 27‑billion‑parameter Qwen 3.5 and 31‑billion‑parameter Gemma 4 achieve Intelligence Index scores comparable to GPT‑5. This article walks through their benchmark results, multimodal capabilities, and single‑GPU deployment on an NVIDIA H100, and collects one‑click notebook tutorials for several open‑source LLMs.
Benchmark Findings
On April 14, Artificial Analysis released a comparative report on open‑source large language models under 32 B parameters. The report shows that Qwen 3.5‑27B (reasoning variant) reaches an Intelligence Index of 42, matching GPT‑5's medium tier, while Gemma 4‑31B (reasoning variant) scores 39, comparable to GPT‑5's low tier. On the Agentic Index, Qwen 3.5‑27B scores 55, surpassing GPT‑5‑medium's 46, and Gemma 4‑31B leads GPT‑5‑low on complex tasks such as TerminalBench Hard and HLE. Both models natively support multimodal input and rank near the top of open‑source models on visual‑understanding benchmarks such as MMMU‑Pro.
Limitations
Despite these gains, the smaller models lag in knowledge accuracy and hallucination control. Their AA‑Omniscience scores are –42 (Qwen 3.5) and –45 (Gemma 4), compared with –10 for the corresponding GPT‑5 versions, indicating that parameter scale still limits factual recall.
Deployment Advantages
Both models run on a single NVIDIA H100 GPU and, after quantization, can be deployed locally on personal devices, lowering the barrier to use. The open‑weights ecosystem is rapidly closing the gap with frontier models; GLM‑5.1, for example, now scores within a few points of the leaders.
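As a rough sanity check on the single‑GPU claim, the weight footprint can be estimated as parameter count times bytes per weight. A minimal sketch; the precision choices below are illustrative assumptions, not figures from the report, and the estimate ignores KV‑cache and activation memory:

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone (excludes KV cache)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight  # 1e9 params x bytes / 1e9 = GB

# A 27B model in FP16 needs ~54 GB, comfortably within an 80 GB H100;
# 4-bit quantization cuts that to ~13.5 GB, within reach of consumer GPUs.
print(model_memory_gb(27, 16))  # 54.0
print(model_memory_gb(27, 4))   # 13.5
```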
Additional Model Highlights
The article also introduces NVIDIA’s Nemotron‑3‑Super‑120B, a 120‑billion‑parameter model that activates roughly 12 billion parameters per token, built on a LatentMoE architecture with a 1 M‑token context window and support for reasoning‑mode toggling and tool‑calling workflows.
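To make the sparsity concrete: a mixture‑of‑experts model runs only a fraction of its weights on each token, so per‑token compute scales with the active parameter count rather than the total. A minimal sketch using the 120 B / 12 B figures above:

```python
def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of parameters exercised per token in a sparse MoE model."""
    return active_params_b / total_params_b

# Nemotron-3-Super-120B: 120B total, ~12B active per token (per the article),
# so each token costs roughly as much compute as a 12B dense model.
print(moe_active_fraction(120, 12))  # 0.1
```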
Jackrong’s Qwen 3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled, released in March 2026, incorporates reasoning knowledge distilled from Claude 4.6 Opus, improving performance on mathematics, logic, planning, and multi‑step task decomposition.
Google DeepMind’s Gemma 4 series, spanning sizes from 2 B to 31 B, achieves top‑three placement on the Arena AI leaderboard and, despite its smaller parameter count, rivals larger competitors such as Qwen 3.5‑397B. The 31 B variant supports up to 256 K tokens, multimodal I/O, function calling, system prompts, and over 140 languages, excelling in high‑quality QA, code assistance, and agent services.
Practical Tutorials
The article aggregates one‑click notebook tutorials hosted on HyperAI’s tutorial page for the models above, letting developers quickly launch inference services. For example, the Qwen 3.5‑9B GGUF weights can be served with llama.cpp to expose an OpenAI‑compatible backend, which then connects to OpenWebUI for browser‑based chat.
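As an illustration of what "OpenAI‑compatible" means in practice, the sketch below builds the JSON request body such a backend accepts. The endpoint URL, port, and model name are assumptions for a default local llama.cpp server, not values from the tutorial:

```python
import json

# Assumed default for a local llama.cpp server; adjust host/port as needed.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "qwen3.5-9b-gguf",
                       temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "stream": False,
    }

body = build_chat_request("Hello!")
print(json.dumps(body, indent=2))
# Send with any HTTP client, e.g.:
#   curl -s $ENDPOINT -H 'Content-Type: application/json' -d '<body>'
```

Because the wire format matches OpenAI's, OpenWebUI (or the official `openai` client pointed at the local base URL) can talk to the llama.cpp backend without code changes.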
Online Demos
Live demo links are provided for each model (e.g., https://go.hyper.ai/WJmbe for Nemotron‑3‑Super, https://go.hyper.ai/PTR8m for the reasoning‑distilled Qwen 3.5, https://go.hyper.ai/NzyGq for Gemma 4‑31B, and https://go.hyper.ai/sT3nm for Qwen 3.5‑9B GGUF).