Is GLM‑4‑9B the New Powerhouse? A Deep Dive into Its Performance and Usage

This article reviews the open‑source 9‑billion‑parameter GLM‑4‑9B model, covering installation, quick‑start inference code, quirky Chinese riddles that highlight its strengths over GPT‑4, extensive benchmark tables for dialogue, multilingual, tool‑calling and multimodal tasks, and its broader impact on the Chinese AI ecosystem.

Baobao Algorithm Notes

Model Overview

GLM‑4‑9B is an open‑source 9‑billion‑parameter large language model released by Zhipu AI. It is part of the GLM‑4 series and comes in four variants:

GLM‑4‑9B – base model, 8K token context.

GLM‑4‑9B‑Chat – chat‑tuned, 128K token context.

GLM‑4‑9B‑Chat‑1M – chat‑tuned, 1 million token context.

GLM‑4V‑9B – multimodal variant (vision + language), 8K token context.
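
To make the context limits above concrete, here is a minimal sketch that picks the smallest listed chat variant able to hold a prompt of a given length. The exact token counts are an assumption (8K, 128K and 1M read as 8 × 1024, 128 × 1024 and 1024 × 1024 tokens), and the helper name pick_chat_variant is hypothetical, not part of any GLM tooling.

```python
# Context windows as listed above. The exact token counts are an assumption:
# 8K = 8 * 1024, 128K = 128 * 1024, 1M = 1024 * 1024.
CONTEXT_TOKENS = {
    "GLM-4-9B": 8 * 1024,
    "GLM-4-9B-Chat": 128 * 1024,
    "GLM-4-9B-Chat-1M": 1024 * 1024,
    "GLM-4V-9B": 8 * 1024,
}

CHAT_VARIANTS = ("GLM-4-9B-Chat", "GLM-4-9B-Chat-1M")


def pick_chat_variant(prompt_tokens: int, reply_tokens: int = 1024) -> str:
    """Return the smallest chat variant whose window fits prompt plus reply."""
    needed = prompt_tokens + reply_tokens
    for name in sorted(CHAT_VARIANTS, key=CONTEXT_TOKENS.get):
        if needed <= CONTEXT_TOKENS[name]:
            return name
    raise ValueError(f"{needed} tokens exceeds every listed context window")


if __name__ == "__main__":
    print(pick_chat_variant(50_000))
    print(pick_chat_variant(400_000))
```

Reserving room for the reply matters because the prompt and the generated tokens share the same window.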

The models are released under an open‑source license, with code and weights hosted at:

https://github.com/THUDM/GLM-4/tree/main

They support Retrieval‑Augmented Generation (RAG) and tool calling, and the longest‑context variant accepts up to 1 M tokens. At that length the authors report 100 % retrieval success in “needle‑in‑a‑haystack” experiments.

Installation and Basic API Call

Install the official zhipuai SDK, which wraps the hosted API, and send a streaming chat request:

# pip install zhipuai
from zhipuai import ZhipuAI
client = ZhipuAI(api_key="YOUR_API_KEY")
response = client.chat.completions.create(
    model="glm-4-9b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"}
    ],
    top_p=0.7,
    temperature=0.95,
    max_tokens=1024,
    tools=[{"type": "web_search", "web_search": {"search_result": True}}],
    stream=True,
)
for chunk in response:
    print(chunk)
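
The loop above prints each streamed chunk object as it arrives. To get the reply as one string, the text fragments have to be joined. The sketch below assumes that each chunk follows the OpenAI‑style shape chunk.choices[0].delta.content; that shape is an assumption to check against the SDK version in use, and fake_chunk is a hypothetical stand‑in so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream_text(chunks: Iterable) -> str:
    """Join the text deltas of a streamed reply into one string.

    Assumes each chunk looks like chunk.choices[0].delta.content, where
    content may be None or missing on chunks that carry no text.
    """
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk for offline testing (not an SDK object)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    stream = [fake_chunk("你"), fake_chunk("好"), fake_chunk(None), fake_chunk("！")]
    print(collect_stream_text(stream))
```

With a real client, the response iterator from the call above would be passed in place of the stand‑in list.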

Quick‑Start Inference – Transformers Backend

Run the model locally with transformers and torch (single‑GPU recommended). The example uses the chat‑tuned checkpoint THUDM/glm-4-9b-chat:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat", trust_remote_code=True
)
query = "你好"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
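
The gen_kwargs above combine do_sample=True with top_k=1. Under the usual definition of top‑k sampling, keeping a single candidate leaves only the highest‑scoring token, so decoding is effectively greedy (ties aside). The toy illustration below uses hand‑written logits and a hypothetical helper, sample_top_k; it is not the transformers implementation.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits, shifted by the max for stability.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 2.5, -1.0, 2.4]
    rng = random.Random(0)
    greedy = max(range(len(logits)), key=lambda i: logits[i])
    draws = {sample_top_k(logits, 1, rng) for _ in range(100)}
    # With k=1 every draw should match the argmax index.
    print(draws == {greedy})
```

Raising top_k, or dropping it and relying on temperature and top_p, is what reintroduces randomness into the output.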

Quick‑Start Inference – vLLM Backend

For higher throughput, use the vllm engine. The following script runs the 128K‑context chat model:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1  # 128K tokens, single GPU
model_name = "THUDM/glm-4-9b-chat"
prompt = "你好"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
# Render the chat template to a prompt string, as in the transformers example.
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
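
The stop_token_ids in the script above tell the engine to end generation as soon as one of those ids is produced. A simplified picture of that effect on a list of generated ids is sketched below; the helper truncate_at_stop is hypothetical, and the real check happens inside the engine during decoding. Whether the stop token itself shows up in the returned text depends on the engine's settings; this sketch drops it.

```python
STOP_TOKEN_IDS = frozenset({151329, 151336, 151338})  # values from the script above


def truncate_at_stop(token_ids, stop_ids=STOP_TOKEN_IDS):
    """Return the tokens before the first stop id (the stop id is dropped)."""
    kept = []
    for token_id in token_ids:
        if token_id in stop_ids:
            break
        kept.append(token_id)
    return kept


if __name__ == "__main__":
    generated = [101, 102, 103, 151336, 104, 105]
    print(truncate_at_stop(generated))
```

Without stop ids, a chat model can run past the end of its turn and continue until max_tokens is reached.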

Multimodal Inference (GLM‑4V‑9B)

GLM‑4V‑9B adds vision capabilities. Example using transformers:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
query = "描述这张图片"
image = Image.open("your_image.jpg").convert('RGB')
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0]))

Benchmark Highlights

GLM‑4‑9B‑Chat scores higher than Llama‑3‑8B‑Instruct on every dialogue and reasoning benchmark listed below and, on some tasks, approaches GPT‑4‑Turbo. Rows marked “(base)” compare the base checkpoints rather than the chat models.

AlignBench : 7.01 (GLM‑4‑9B‑Chat) vs. 6.40 (Llama‑3‑8B‑Instruct)

MT‑Bench : 8.35 vs. 8.00

HumanEval : 71.8 vs. 62.2

MMLU (base) : 74.7 vs. 66.6

C‑Eval (base) : 77.1 vs. 51.2

GSM8K (base) : 84.0 vs. 45.8
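
Because these benchmarks use different scales (AlignBench and MT‑Bench are scored out of 10, the others out of 100), absolute gaps are hard to compare directly. The short sketch below computes both the absolute and the relative gap for the pairs listed above; the numbers are copied from the list and the helper name gaps is hypothetical.

```python
# (GLM score, Llama score) pairs copied from the list above.
SCORES = {
    "AlignBench": (7.01, 6.40),
    "MT-Bench": (8.35, 8.00),
    "HumanEval": (71.8, 62.2),
    "MMLU (base)": (74.7, 66.6),
    "C-Eval (base)": (77.1, 51.2),
    "GSM8K (base)": (84.0, 45.8),
}


def gaps(scores):
    """Return {benchmark: (absolute gap, relative gap in percent)}."""
    result = {}
    for name, (glm, llama) in scores.items():
        absolute = glm - llama
        relative = 100.0 * absolute / llama
        result[name] = (round(absolute, 2), round(relative, 1))
    return result


if __name__ == "__main__":
    for name, (absolute, relative) in gaps(SCORES).items():
        print(f"{name:<14} +{absolute:<6} ({relative:+.1f}%)")
```

On these figures the relative gap is smallest on MT‑Bench and largest on the base‑model GSM8K and C‑Eval rows.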

Multilingual evaluation covered six datasets, with GLM‑4‑9B‑Chat ahead of Llama‑3‑8B‑Instruct; three of them are shown here:

M‑MMLU: 56.6 vs. 49.6

FLORES: 28.8 vs. 25.0

XStoryCloze: 90.7 vs. 84.7

Tool‑calling performance on the Berkeley Function Calling Leaderboard:

Overall accuracy: 81.00 (GLM‑4‑9B‑Chat) – comparable to GPT‑4‑Turbo (81.24) and far above ChatGLM3‑6B (57.88).

On multimodal visual‑language benchmarks (MMBench, SEEDBench_IMG, MMStar, MMMU, MME, HallusionBench, AI2D, OCRBench), GLM‑4V‑9B is reported to be on par with, and on some benchmarks ahead of, leading models such as GPT‑4o and InternVL‑Chat‑V1.5.

Evaluation Script Example

For the 1 M‑token “needle‑in‑a‑haystack” experiment, the evaluation script is available at:

https://github.com/LargeWorldModel/LWM/blob/main/scripts/eval_needle.py
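
The linked script is the reference harness. The sketch below only illustrates the general idea of a needle‑in‑a‑haystack test: hide one fact at varying depths in filler text, ask for it, and count how often the answer contains it. Every name here (build_haystack, retrieval_rate, toy_model) is hypothetical, and toy_model is a string‑search stand‑in rather than a real LLM call.

```python
def build_haystack(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Place `needle` among `n_filler` copies of `filler` at relative `depth`.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * n_filler)
    sentences = [filler] * n_filler
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieval_rate(model, secret: str, depths, n_filler: int = 1000) -> float:
    """Fraction of depths at which the model's answer contains `secret`."""
    needle = f"The magic number is {secret}."
    filler = "The sky was clear and the market was quiet."
    question = "\n\nWhat is the magic number?"
    hits = 0
    for depth in depths:
        prompt = build_haystack(needle, filler, n_filler, depth) + question
        hits += secret in model(prompt)
    return hits / len(depths)


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call: finds the number with string search."""
    marker = "The magic number is "
    start = prompt.find(marker)
    if start < 0:
        return "I don't know."
    return prompt[start + len(marker):].split(".")[0]


if __name__ == "__main__":
    depths = [i / 10 for i in range(11)]
    print(retrieval_rate(toy_model, "48213", depths))
```

To test a real model, toy_model would be replaced by a function that sends the prompt to the model and returns its reply, and n_filler would be scaled until the prompt approaches the context length under test.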

Because GLM‑4‑9B's pretraining mix included some mathematics, reasoning, and code instruction data, the base model is also compared against the instruction‑tuned Llama‑3‑8B‑Instruct.

Key Takeaways

Despite its modest 9 B parameter count, GLM‑4‑9B delivers competitive performance across dialogue, multilingual, tool‑calling, and multimodal tasks, while supporting single‑GPU deployment and extremely long context windows (up to 1 M tokens). This makes it a practical choice for researchers and developers seeking high‑quality open‑source LLMs without large hardware requirements.
