Open-Source Small LLMs Reach GPT‑5‑Level Intelligence: One‑Stop Evaluation of Qwen 3.5, Gemma 4 and Other Top Models
A recent Artificial Analysis report finds that the 27‑billion‑parameter Qwen 3.5 and 31‑billion‑parameter Gemma 4 achieve Intelligence Index scores comparable to GPT‑5. This article walks through their benchmark results, multimodal capabilities, and single‑GPU deployment on an NVIDIA H100, and collects one‑click notebook tutorials for several open‑source LLMs.
Benchmark Findings
On April 14, Artificial Analysis released a comparative report on open‑source large language models under 32 B parameters. The report shows that Qwen 3.5‑27B (reasoning variant) reaches an Intelligence Index of 42, matching GPT‑5's medium tier, while Gemma 4‑31B (reasoning variant) scores 39, comparable to GPT‑5's low tier. On the Agentic Index, Qwen 3.5‑27B scores 55, surpassing GPT‑5‑medium's 46, and Gemma 4‑31B leads GPT‑5‑low on complex tasks such as TerminalBench Hard and HLE. Both models natively support multimodal input and rank near the top of open‑source models on visual‑understanding benchmarks such as MMMU‑Pro.
Limitations
Despite these gains, the smaller models lag in knowledge accuracy and hallucination control. Their AA‑Omniscience scores are –42 (Qwen 3.5) and –45 (Gemma 4), compared with –10 for the corresponding GPT‑5 versions, indicating that parameter scale still limits factual recall.
Deployment Advantages
Both models run on a single NVIDIA H100 GPU and, after quantization, can be deployed locally on personal devices, lowering the barrier to use. The open‑weights ecosystem is rapidly closing the gap with frontier models; GLM‑5.1, for example, now scores within a few points of the leaders.
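As a rough sanity check on the single‑GPU claim, the weight footprint can be estimated as parameter count times bytes per weight. A minimal sketch; the precision choices below are illustrative assumptions, not figures from the report, and the estimate ignores KV‑cache and activation memory:

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone (excludes KV cache)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight  # 1e9 params x bytes / 1e9 = GB

# A 27B model in FP16 needs ~54 GB, comfortably within an 80 GB H100;
# 4-bit quantization cuts that to ~13.5 GB, within reach of consumer GPUs.
print(model_memory_gb(27, 16))  # 54.0
print(model_memory_gb(27, 4))   # 13.5
```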
Additional Model Highlights
The article also introduces NVIDIA’s Nemotron‑3‑Super‑120B, a 120‑billion‑parameter model that activates roughly 12 billion parameters per token, built on a LatentMoE architecture with a 1 M‑token context window and support for reasoning‑mode toggling and tool‑calling workflows.
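To make the sparsity concrete: a mixture‑of‑experts model runs only a fraction of its weights on each token, so per‑token compute scales with the active parameter count rather than the total. A minimal sketch using the 120 B / 12 B figures above:

```python
def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of parameters exercised per token in a sparse MoE model."""
    return active_params_b / total_params_b

# Nemotron-3-Super-120B: 120B total, ~12B active per token (per the article),
# so each token costs roughly as much compute as a 12B dense model.
print(moe_active_fraction(120, 12))  # 0.1
```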
Jackrong’s Qwen 3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled, released in March 2026, incorporates reasoning knowledge distilled from Claude 4.6 Opus, improving performance on mathematics, logic, planning, and multi‑step task decomposition.
Google DeepMind’s Gemma 4 series, spanning sizes from 2 B to 31 B, achieves top‑three placement on the Arena AI leaderboard and, despite its smaller parameter count, rivals larger competitors such as Qwen 3.5‑397B. The 31 B variant supports up to 256 K tokens, multimodal I/O, function calling, system prompts, and over 140 languages, excelling in high‑quality QA, code assistance, and agent services.
Practical Tutorials
The article aggregates one‑click notebook tutorials hosted on HyperAI’s tutorial page for the models above, letting developers quickly launch inference services. For example, the Qwen 3.5‑9B GGUF weights can be served with llama.cpp to expose an OpenAI‑compatible backend, which then connects to OpenWebUI for browser‑based chat.
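As an illustration of what "OpenAI‑compatible" means in practice, the sketch below builds the JSON request body such a backend accepts. The endpoint URL, port, and model name are assumptions for a default local llama.cpp server, not values from the tutorial:

```python
import json

# Assumed default for a local llama.cpp server; adjust host/port as needed.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "qwen3.5-9b-gguf",
                       temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "stream": False,
    }

body = build_chat_request("Hello!")
print(json.dumps(body, indent=2))
# Send with any HTTP client, e.g.:
#   curl -s $ENDPOINT -H 'Content-Type: application/json' -d '<body>'
```

Because the wire format matches OpenAI's, OpenWebUI (or the official `openai` client pointed at the local base URL) can talk to the llama.cpp backend without code changes.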
Online Demos
Live demo links are provided for each model (e.g., https://go.hyper.ai/WJmbe for Nemotron‑3‑Super, https://go.hyper.ai/PTR8m for the reasoning‑distilled Qwen 3.5, https://go.hyper.ai/NzyGq for Gemma 4‑31B, and https://go.hyper.ai/sT3nm for Qwen 3.5‑9B GGUF).