Is GLM‑4‑9B the New Powerhouse? A Deep Dive into Its Performance and Usage
This article reviews the open‑source 9‑billion‑parameter GLM‑4‑9B model, covering installation, quick‑start inference code, quirky Chinese riddles that highlight its strengths over GPT‑4, extensive benchmark tables for dialogue, multilingual, tool‑calling and multimodal tasks, and its broader impact on the Chinese AI ecosystem.
Model Overview
GLM-4-9B is an open‑source 9‑billion‑parameter large language model released by Zhipu AI. It is part of the GLM‑4 series and comes in four variants:
GLM‑4‑9B – base model, 8K token context.
GLM‑4‑9B‑Chat – chat‑tuned, 128K token context.
GLM‑4‑9B‑Chat‑1M – chat‑tuned, 1 million token context.
GLM‑4V‑9B – multimodal variant (vision + language), 8K token context.
The models are released under an open‑source license, with code and weights hosted at:
https://github.com/THUDM/GLM-4/tree/main
The chat models support Retrieval‑Augmented Generation (RAG) and tool‑calling. The GLM‑4‑9B‑Chat‑1M variant extends the context window to 1 M tokens, where it is reported to reach 100 % retrieval success in “needle‑in‑a‑haystack” experiments.
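Since the chat variants differ mainly in context length, choosing between them comes down to a token budget. The sketch below is a minimal illustration, not part of the official tooling: the helper `pick_chat_checkpoint` is hypothetical, the repository name `THUDM/glm-4-9b-chat-1m` is assumed by analogy with `THUDM/glm-4-9b-chat`, and "128K" and "1M" are assumed to mean 131072 and 1048576 tokens.

```python
# Chat checkpoints and their context windows in tokens, smallest first.
# The 1M repository name and the exact power-of-two sizes are assumptions.
CHAT_CHECKPOINTS = [
    ("THUDM/glm-4-9b-chat", 131_072),
    ("THUDM/glm-4-9b-chat-1m", 1_048_576),
]


def pick_chat_checkpoint(prompt_tokens: int, max_new_tokens: int = 1024) -> str:
    """Return the smallest chat checkpoint whose window fits prompt plus output."""
    needed = prompt_tokens + max_new_tokens
    for name, window in CHAT_CHECKPOINTS:
        if needed <= window:
            return name
    raise ValueError(f"{needed} tokens exceed the largest context window")


print(pick_chat_checkpoint(4_000))     # short prompt: the 128K model is enough
print(pick_chat_checkpoint(300_000))   # long document: needs the 1M model
```

Preferring the smaller window when it suffices is a design choice made here because longer windows generally cost more memory at inference time.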
Installation and Basic API Call
Install the official SDK and query the model via the REST API:
# pip install zhipuai
from zhipuai import ZhipuAI
client = ZhipuAI(api_key="YOUR_API_KEY")  # replace with your own key
response = client.chat.completions.create(
    model="glm-4-9b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"},  # "Hello"
    ],
    top_p=0.7,
    temperature=0.95,
    max_tokens=1024,
    # Python booleans are capitalized: True, not the JSON-style true
    tools=[{"type": "web_search", "web_search": {"search_result": True}}],
    stream=True,
)
# With stream=True the response is iterated chunk by chunk
for chunk in response:
    print(chunk)
Quick‑Start Inference – Transformers Backend
Run the model locally with transformers and torch (single‑GPU recommended). The example uses the chat‑tuned checkpoint THUDM/glm-4-9b-chat:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
# trust_remote_code=True is needed because the checkpoint ships custom code
tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat", trust_remote_code=True
)
query = "你好"  # "Hello"
# Build the chat prompt and tokenize it in one step
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to(device).eval()
# top_k=1 keeps only the most likely token, so sampling is effectively greedy
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so only the newly generated text is decoded
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quick‑Start Inference – vLLM Backend
For higher throughput, use the vllm engine. The following script runs the 128K‑context chat model:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
max_model_len, tp_size = 131072, 1  # 128K tokens, single GPU
model_name = "THUDM/glm-4-9b-chat"
prompt = "你好"  # "Hello"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
)
# Token IDs at which generation should stop
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
# Tokenize the prompt with the model's own chat format
inputs = tokenizer.build_chat_input(prompt, history=None, role='user')["input_ids"].tolist()
outputs = llm.generate(prompt_token_ids=inputs, sampling_params=sampling_params)
print([output.outputs[0].text for output in outputs])
Multimodal Inference (GLM‑4V‑9B)
GLM‑4V‑9B adds vision capabilities. Example using transformers:
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
query = "描述这张图片"  # "Describe this image"
image = Image.open("your_image.jpg").convert('RGB')
# The image is passed alongside the text in the same chat message
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to(device).eval()
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so only the newly generated text is decoded
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0]))
Benchmark Highlights
On the benchmarks listed here, GLM‑4‑9B‑Chat scores higher than Llama‑3‑8B‑Instruct across dialogue, coding, and reasoning tasks, and on tool calling it comes close to GPT‑4‑Turbo.
AlignBench: 7.01 (GLM‑4‑9B‑Chat) vs. 6.40 (Llama‑3‑8B‑Instruct)
MT‑Bench: 8.35 vs. 8.00
HumanEval: 71.8 vs. 62.2
MMLU (base model): 74.7 vs. 66.6
C‑Eval (base model): 77.1 vs. 51.2
GSM8K (base model): 84.0 vs. 45.8
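To make the size of these gaps easier to compare across benchmarks with different scales, the scores above can be turned into absolute and relative margins. This is a small arithmetic sketch that uses only the numbers listed above; the helper name `margin` is introduced here for illustration.

```python
# (GLM-4-9B score, Llama-3-8B score) pairs copied from the list above
scores = {
    "AlignBench": (7.01, 6.40),
    "MT-Bench": (8.35, 8.00),
    "HumanEval": (71.8, 62.2),
    "MMLU (base)": (74.7, 66.6),
    "C-Eval (base)": (77.1, 51.2),
    "GSM8K (base)": (84.0, 45.8),
}


def margin(glm: float, llama: float) -> tuple[float, float]:
    """Return the absolute gap and the gap relative to the Llama score, in percent."""
    diff = glm - llama
    return diff, 100.0 * diff / llama


for name, (glm, llama) in scores.items():
    diff, rel = margin(glm, llama)
    print(f"{name:14s} {diff:+6.2f} points  ({rel:+.1f}% relative)")
```

The relative figures show that the largest gaps are on C‑Eval and GSM8K, while the dialogue benchmarks differ by well under 10 %.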
Multilingual evaluation covers six datasets, three of which are shown here; GLM‑4‑9B‑Chat leads Llama‑3‑8B‑Instruct on all three:
M‑MMLU: 56.6 vs. 49.6
FLORES: 28.8 vs. 25.0
XStoryCloze: 90.7 vs. 84.7
Tool‑calling performance on the Berkeley Function Calling Leaderboard:
Overall accuracy: 81.00 (GLM‑4‑9B‑Chat) – comparable to GPT‑4‑Turbo (81.24) and far above ChatGLM3‑6B (57.88).
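For readers unfamiliar with what a function‑calling benchmark measures: the model must emit a structured call that names the right function with the right arguments, which the application then executes. The sketch below illustrates only the application side. The JSON shape, the `get_weather` tool, and the `dispatch` helper are all assumptions made for illustration and do not reflect GLM‑4's actual output format.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# Registry mapping tool names to Python callables
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse a JSON tool call and run the matching registered function."""
    call = json.loads(tool_call_json)
    func = TOOLS.get(call["name"])
    if func is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return func(**call["arguments"])


# Stand-in for a model response, in an assumed JSON format
model_output = '{"name": "get_weather", "arguments": {"city": "Beijing"}}'
print(dispatch(model_output))
```

A call counts as correct in this kind of setup only if both the function name and the arguments are right, which is why accuracy on such a benchmark is a fairly strict measure.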
Multimodal visual‑language benchmarks (MMBench, SEEDBench_IMG, MMStar, MMMU, MME, HallusionBench, AI2D, OCRBench) place GLM‑4V‑9B on par with or above leading models such as GPT‑4o and InternVL‑Chat‑V1.5.
Evaluation Script Example
For the 1 M‑token “needle‑in‑a‑haystack” experiment, the evaluation script is available at:
https://github.com/LargeWorldModel/LWM/blob/main/scripts/eval_needle.py
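The idea behind a needle‑in‑a‑haystack test can be sketched in a few lines: hide one fact at a chosen depth inside long filler text, ask the model for it, and check the answer. The code below is a simplified illustration and not the linked script. It measures length in characters rather than tokens, and the helper names, filler sentence, and passcode are made up for this example.

```python
def build_haystack(filler: str, needle: str, depth: float, target_chars: int) -> str:
    """Repeat filler up to target_chars and insert needle at a relative depth (0.0 to 1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    repeats = target_chars // len(filler) + 1
    haystack = (filler * repeats)[:target_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + needle + haystack[cut:]


def found(answer: str, expected: str) -> bool:
    """Score one trial: the expected fact must appear in the model's answer."""
    return expected.lower() in answer.lower()


NEEDLE = " The secret passcode is 7421. "
context = build_haystack(
    "The quick brown fox jumps over the lazy dog. ",
    NEEDLE,
    depth=0.5,
    target_chars=2000,
)
prompt = context + "\n\nWhat is the secret passcode?"
# The prompt would then be sent to the model and its reply scored with found(reply, "7421")
print(len(prompt))
```

A full experiment repeats this over a grid of context lengths and depths and reports the fraction of trials in which the needle is retrieved.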
GLM‑4‑9B's pretraining mix included mathematics, reasoning, and code instruction data, which is why the base model is compared against the instruction‑tuned Llama‑3‑8B‑Instruct.
Key Takeaways
Despite its modest 9 B parameter count, GLM‑4‑9B delivers competitive performance across dialogue, multilingual, tool‑calling, and multimodal tasks, while supporting single‑GPU deployment and extremely long context windows (up to 1 M tokens). This makes it a practical choice for researchers and developers seeking high‑quality open‑source LLMs without large hardware requirements.
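The single‑GPU claim can be sanity‑checked with a rough estimate of the memory needed just to hold the weights. The sketch below assumes a round 9 billion parameters and 2 bytes per parameter for bfloat16. It ignores the KV cache and activations, which grow with context length, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone, in GiB (excludes KV cache and activations)."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 9e9  # assumed round figure for a "9B" model

bf16 = weight_memory_gib(NUM_PARAMS, 2)    # bfloat16: 2 bytes per parameter
int4 = weight_memory_gib(NUM_PARAMS, 0.5)  # hypothetical 4-bit quantization
print(f"bfloat16 weights: {bf16:.1f} GiB")
print(f"4-bit weights:    {int4:.1f} GiB")
```

At roughly 17 GiB for bfloat16 weights, the model fits on a single 24 GB card for short contexts, while very long contexts need considerably more memory for the KV cache.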
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.