Is GLM-4.7-Flash the New 30B‑Level LLM King? Open‑Source and Ollama‑Ready

GLM‑4.7‑Flash, a 30B‑parameter MoE LLM with only about 3B parameters active per token, is fully open‑source and free, posts 30B‑class results across six benchmarks, runs locally with a single Ollama command, and has a faster cloud‑hosted variant with modest token‑based pricing, though local hardware costs still apply.

Model Overview

GLM‑4.7‑Flash is the latest member of the GLM family, built as a 30B‑parameter A3B Mixture‑of‑Experts (MoE) model. Although the full model contains 30 billion parameters, only about 3 billion are activated for any given token, giving it the capability of a 30B‑class model while requiring roughly the compute of a 7B model.
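As a rough illustration of the trade‑off implied by the A3B split, the sketch below estimates weight memory and per‑token compute directly from the parameter counts; the bytes‑per‑parameter and FLOPs‑per‑parameter figures are back‑of‑the‑envelope assumptions for illustration, not official specifications.

# Back-of-the-envelope sketch: memory vs. per-token compute for a 30B-A3B MoE.
# All constants below are rough assumptions, not official figures.

TOTAL_PARAMS = 30e9    # every expert must be resident in memory
ACTIVE_PARAMS = 3e9    # parameters actually used per token

def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights at a given precision."""
    return params * bytes_per_param / 1e9

# Memory footprint is driven by TOTAL parameters...
print(f"fp16 weights: {weight_memory_gb(TOTAL_PARAMS, 2.0):.0f} GB")
print(f"4-bit weights: {weight_memory_gb(TOTAL_PARAMS, 0.5):.0f} GB")

# ...while per-token compute scales with ACTIVE parameters (~2 FLOPs per parameter).
print(f"compute vs. dense 30B: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%} per token")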

Benchmark Performance

The official benchmark suite shows that GLM‑4.7‑Flash achieves 30B‑level results on six core tests:

SWE‑bench Verified (software engineering): 59.2, far surpassing Qwen3‑30B (22.0) and GPT‑OSS‑20B (34.0).

AIME‑25 (US math contest): 91.6, comparable to GPT‑OSS‑20B.

GPQA (graduate‑level science QA): 75.2, slightly higher than competing models.

τ²‑Bench (tool use & complex reasoning): 79.5, about 1.6× the score of Qwen3‑30B.

BrowseComp (web‑understanding): 42.8; note that the official data for Qwen3 was later corrected from 22.9 to 2.3, highlighting test variability.

Local Deployment via Ollama

Ollama added support for GLM‑4.7‑Flash immediately after its release. With Ollama v0.14.3 or newer, the model can be started with a single command:

ollama run glm-4.7-flash

For Anthropic‑compatible usage, set the environment variables and launch Claude Code:

# Set Anthropic compatible mode
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434

# Start Claude Code
claude --model glm-4.7-flash
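Before launching Claude Code, it can be worth confirming that the local server is up and the model has been pulled. The snippet below is a minimal check against Ollama's /api/tags endpoint; the model tag glm-4.7-flash mirrors the run command above and should match whatever ollama list reports on your machine.

# Minimal check that the local Ollama server is running and the model is available.
# Uses only the Python standard library; the model tag is taken from the article.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]

print("available models:", models)
if not any(name.startswith("glm-4.7-flash") for name in models):
    print("glm-4.7-flash not found; run `ollama pull glm-4.7-flash` first")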

Developers can also call the model directly through the Anthropic SDK. Example in Python:

import anthropic

# Point the Anthropic SDK at the local Ollama server; the API key is a placeholder.
client = anthropic.Anthropic(
    base_url='http://localhost:11434',
    api_key='ollama',
)

# max_tokens is required by the SDK; adjust as needed.
message = client.messages.create(
    model='glm-4.7-flash',
    max_tokens=1024,
    messages=[{'role': 'user', 'content': 'Write a quicksort algorithm for me'}]
)
print(message.content[0].text)
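For longer generations, the same SDK can also stream tokens as they are produced. The sketch below assumes the same local endpoint and model tag as the example above.

# Streaming variant of the call above: text is printed as it arrives.
import anthropic

client = anthropic.Anthropic(
    base_url='http://localhost:11434',
    api_key='ollama',
)

with client.messages.stream(
    model='glm-4.7-flash',
    max_tokens=1024,
    messages=[{'role': 'user', 'content': 'Write a quicksort algorithm for me'}],
) as stream:
    for text in stream.text_stream:
        print(text, end='', flush=True)
print()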

Cloud‑Hosted Variant and Pricing

GLM‑4.7‑Flash also has a cloud‑hosted variant, GLM‑4.7‑FlashX, which runs faster but is accessed via Z.AI’s API and billed per token:

Input: $0.07 per million tokens

Output: $0.40 per million tokens

By comparison, the earlier GLM‑4.5 model charges $0.?? per million input tokens and $2.20 per million output tokens, making the new pricing comparatively cheap.
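To put the per‑token rates in perspective, here is a small cost estimator using the FlashX prices quoted above; the monthly token volumes are illustrative assumptions, not measurements.

# Rough cost estimator for the cloud-hosted GLM-4.7-FlashX rates quoted above.
INPUT_PRICE_PER_M = 0.07    # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.40   # USD per million output tokens

def monthly_cost(input_tokens: float, output_tokens: float) -> float:
    """Total USD cost for a given monthly token volume."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + \
           (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# Example workload (assumed): 50M input tokens and 10M output tokens per month.
print(f"${monthly_cost(50e6, 10e6):.2f} per month")  # -> $7.50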

Cost Considerations

While the open‑source model itself is free, running it locally requires hardware. A single RTX 4090 GPU costs around ¥15,000, and an Apple Silicon M2 Ultra Mac Studio starts at ¥30,000, which may be prohibitive for individual developers.

Thus, the model is truly cost‑free only when the user already owns suitable hardware; otherwise, electricity and hardware acquisition represent the primary expenses.
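One rough way to frame that trade‑off is to ask how many output tokens the cloud price would have to cover before a local GPU pays for itself; the exchange rate below is an assumption, and electricity is ignored for simplicity.

# Rough break-even sketch: local GPU cost vs. cloud output-token pricing.
GPU_COST_CNY = 15_000           # approximate RTX 4090 price cited above
CNY_PER_USD = 7.2               # assumed exchange rate
OUTPUT_PRICE_PER_M_USD = 0.40   # GLM-4.7-FlashX output price

gpu_cost_usd = GPU_COST_CNY / CNY_PER_USD
break_even_tokens = gpu_cost_usd / OUTPUT_PRICE_PER_M_USD * 1e6

print(f"GPU cost ≈ ${gpu_cost_usd:,.0f}")
print(f"Break-even ≈ {break_even_tokens / 1e9:.1f}B output tokens")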

Figure: GLM-4.7-Flash performance comparison