Is GLM-4.7-Flash the New 30B‑Level LLM King? Open‑Source and Ollama‑Ready
GLM‑4.7‑Flash, a 30B‑parameter MoE LLM released as fully open‑source and free, delivers 30B‑class performance across core benchmarks, runs locally with a single Ollama command, and offers a faster cloud‑hosted variant with modest token‑based pricing, though hardware costs still apply for local use.
Model Overview
GLM‑4.7‑Flash is the latest member of the GLM family. It is built as a 30B‑parameter A3B Mixture‑of‑Experts (MoE) model. Although the full model contains 30 billion parameters, only about 3 billion are activated per token, giving it the capability of a 30B model at roughly the compute cost of a 7B dense model.
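The "30B total, ~3B active" trade‑off comes from the MoE router, which sends each token to only a few experts. The following is a minimal illustrative sketch of top‑k expert routing, not the actual GLM‑4.7‑Flash implementation; the expert count and scores here are made‑up values.

```python
# Illustrative sketch of Mixture-of-Experts top-k routing.
# The real GLM-4.7-Flash router is not described in this article.
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(gate_scores, k=2):
    """Pick the top-k experts for one token and renormalize their weights."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))

random.seed(0)
num_experts = 64  # hypothetical expert count, for illustration only
scores = [random.gauss(0, 1) for _ in range(num_experts)]
active = route_token(scores, k=2)
print(active)  # only 2 of the 64 experts run for this token
```

Because only the chosen experts execute their feed‑forward pass, per‑token compute scales with the active parameters rather than the full parameter count.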
Benchmark Performance
The official benchmark suite shows that GLM‑4.7‑Flash achieves 30B‑level results on the following core tests:
SWE‑bench Verified (software engineering): 59.2, far surpassing Qwen3‑30B (22.0) and GPT‑OSS‑20B (34.0).
AIME‑25 (US math contest): 91.6, comparable to GPT‑OSS‑20B.
GPQA (graduate‑level science QA): 75.2, slightly higher than competing models.
τ²‑Bench (tool use & complex reasoning): 79.5, about 1.6× the score of Qwen3‑30B.
BrowseComp (web‑understanding): 42.8; note that the official data for Qwen3 was later corrected from 22.9 to 2.3, highlighting test variability.
Local Deployment via Ollama
Ollama added support for GLM‑4.7‑Flash immediately after its release. With Ollama v0.14.3 or newer, the model can be started with a single command:
ollama run glm-4.7-flash
For Anthropic‑compatible usage, set the environment variables and launch Claude Code:
# Set Anthropic compatible mode
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
# Start Claude Code
claude --model glm-4.7-flash
Developers can also call the model directly through the Anthropic SDK. Example in Python:
import anthropic

# Point the Anthropic client at the local Ollama server.
client = anthropic.Anthropic(
    base_url='http://localhost:11434',
    api_key='ollama',  # Ollama accepts any placeholder key
)

message = client.messages.create(
    model='glm-4.7-flash',
    max_tokens=1024,  # required by the Anthropic SDK
    messages=[{'role': 'user', 'content': 'Write a quicksort algorithm for me'}]
)
print(message.content[0].text)
Cloud‑Hosted Variant and Pricing
GLM‑4.7‑Flash also has a cloud version, GLM‑4.7‑FlashX, which runs faster but is accessed via Z.AI’s API and billed per token:
Input: $0.07 per million tokens
Output: $0.40 per million tokens
By comparison, the earlier GLM‑4.5 model charges $0.?? per million input tokens and $2.20 per million output tokens, making the FlashX pricing substantially cheaper.
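Token‑based billing is easy to estimate from the FlashX rates quoted above. A quick sketch, assuming the stated prices of $0.07 per million input tokens and $0.40 per million output tokens; the example token counts are invented:

```python
# Rough API-cost estimate at the FlashX rates quoted above.
INPUT_PER_M = 0.07    # USD per million input tokens
OUTPUT_PER_M = 0.40   # USD per million output tokens

def flashx_cost(input_tokens, output_tokens):
    """Return the USD cost of one request at FlashX rates."""
    return (input_tokens / 1e6 * INPUT_PER_M
            + output_tokens / 1e6 * OUTPUT_PER_M)

# Example: a heavy coding session of 5M input and 1M output tokens.
print(f"${flashx_cost(5_000_000, 1_000_000):.2f}")  # $0.75
```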
Cost Considerations
While the open‑source model itself is free, running it locally still requires capable hardware: a single RTX 4090 GPU costs around ¥15,000, and an Apple Silicon M2 Ultra Mac Studio starts at about ¥30,000, which may be prohibitive for individual developers.
The model is therefore truly cost‑free only for users who already own suitable hardware; otherwise, hardware acquisition and electricity are the primary expenses.
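A back‑of‑envelope comparison makes the trade‑off concrete. This sketch uses the article's ¥15,000 RTX 4090 figure and the FlashX output price; the exchange rate and the output‑only usage mix are my assumptions, and electricity is ignored:

```python
# Back-of-envelope break-even: how many FlashX output tokens would cost
# as much as a local GPU?  Exchange rate and usage mix are assumptions.
GPU_COST_CNY = 15_000      # RTX 4090 figure from the article
CNY_PER_USD = 7.0          # assumed exchange rate
OUTPUT_PER_M_USD = 0.40    # FlashX output price per million tokens

gpu_cost_usd = GPU_COST_CNY / CNY_PER_USD
breakeven_m_tokens = gpu_cost_usd / OUTPUT_PER_M_USD
print(f"~{breakeven_m_tokens:,.0f}M output tokens")  # ~5,357M
```

Under these assumptions, buying the GPU only pays off after billions of generated tokens, so light users may find the hosted API cheaper overall.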
