Optimizing ChatGLM-6B Deployment with MNN: Model Conversion, Quantization, and Edge Inference
The article details a workflow that converts the PyTorch ChatGLM‑6B model to MNN, splits and compresses embeddings, applies int4/int8 quantization, supports dynamic shapes, and uses hybrid GPU/CPU or CPU‑only loading to enable low‑memory edge inference on PCs and mobile devices with competitive token‑per‑second performance.
Large language models (LLMs) such as ChatGLM-6B provide strong bilingual dialogue capabilities but their massive parameter count makes deployment on limited hardware difficult. This article presents a workflow that converts the PyTorch ChatGLM-6B model to an MNN model, applies low‑bit quantization, and restructures the model for efficient edge‑side inference.
Model Export – The model is exported from PyTorch to ONNX (or TorchScript) and then converted to MNN. The export splits the network into the embedding, the 28 GLMBlocks, and the final linear layer, and also trims the vocabulary to shrink the embedding table.
Export code example:

```python
torch.onnx.export(
    model, model_args,
    f=output_path,
    input_names=[...],
    output_names=[...],
    dynamic_axes={...},
    opset_version=14)
```
Structure Splitting & Memory Reduction – The embedding parameters (150528×4096) are stored as a binary file and read on demand via fseek/fread, saving ~2.3 GB of memory. The ~21 GB of GLMBlock weights are split so that each block can be loaded onto GPU or CPU independently; quantized int4/int8 variants reduce the block weights to as little as ~2.6 GB.
Embedding slimming code:

```python
import numpy as np

embed = np.fromfile('transformer.word_embeddings.weight', dtype=np.float32)
embed = embed.reshape(-1, 4096)
# drop the first 20000 rows (trimmed vocabulary) to shrink the table
embed = embed[20000:, :]
embed.tofile('slim_word_embeddings.bin')
```
BF16 conversion (C++) example:

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

const size_t num = 130528 * 4096;  // (150528 - 20000) rows × 4096
FILE* src_f = fopen("slim_word_embeddings.bin", "rb");
std::vector<float> src_buffer(num);
fread(src_buffer.data(), 1, num * sizeof(float), src_f);
// fp32 -> bf16: keep the high 16 bits of each float (little-endian)
std::vector<int16_t> dst_buffer(num);
for (size_t i = 0; i < num; i++) {
    dst_buffer[i] = reinterpret_cast<int16_t*>(src_buffer.data())[2 * i + 1];
}
FILE* dst_f = fopen("slim_word_embeddings_bf16.bin", "wb");
fwrite(dst_buffer.data(), 1, num * sizeof(int16_t), dst_f);
```
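To illustrate the on-demand fseek/fread lookup described above, here is a minimal Python sketch (the function name is illustrative; note that after the slimming step, token ids below 20000 no longer exist in the file, so a real caller would subtract that offset first):

```python
import numpy as np

HIDDEN = 4096               # ChatGLM-6B hidden size
BYTES_PER_ROW = HIDDEN * 2  # bf16 stores 2 bytes per value

def lookup_embedding(path, token_id):
    """Seek to a single row of the bf16 table and decode it to fp32,
    instead of keeping the whole embedding table in memory."""
    with open(path, 'rb') as f:
        f.seek(token_id * BYTES_PER_ROW)   # fseek to the row
        raw = f.read(BYTES_PER_ROW)        # fread one row
    # bf16 -> fp32: the stored 16 bits are the high half of a float32
    as_u32 = np.frombuffer(raw, dtype=np.uint16).astype(np.uint32)
    return (as_u32 << 16).view(np.float32)
```

This trades one small disk read per token for the ~2.3 GB the full fp32 table would otherwise occupy in RAM.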
Dynamic Shape Support – Export specifies dynamic axes for inputs (sequence length, history length) to allow variable‑length inference.
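The dynamic_axes argument in the export call maps tensor names to the dimensions that should stay symbolic. A hedged sketch of what such a mapping looks like (the tensor names and axis indices here are illustrative, not necessarily the exact ones in the article's export script):

```python
# Mark sequence-length and KV-history dimensions as dynamic so the
# exported graph accepts variable-length prompts and growing history.
dynamic_axes = {
    "input_ids":       {0: "seq_len"},
    "position_ids":    {2: "seq_len"},
    "attention_mask":  {2: "seq_len", 3: "seq_len"},
    "past_key_values": {1: "history_len"},
}
```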
Code Adjustments for Export – Tuple past states are changed to tensors, and view operations are replaced with squeeze / unsqueeze to keep shapes dynamic.
Tuple‑to‑Tensor change example:

```python
# before
past_key, past_value = layer_past[0], layer_past[1]

# after
key_layer = torch.cat((past_key_value[0], key_layer), dim=0)
value_layer = torch.cat((past_key_value[1], value_layer), dim=0)
present = torch.stack((key_layer, value_layer), dim=0)
```
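The view-to-squeeze/unsqueeze substitution mentioned above can be sketched in the same spirit (shapes here are illustrative, not taken from the model code):

```python
import torch

x = torch.randn(5, 1, 32, 128)
# before: x.view(5, 32, 128) bakes the literal sizes into the traced graph
# after: squeeze removes the singleton axis while every remaining
# dimension stays symbolic, so the exported shape remains dynamic
y = x.squeeze(1)
```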
Low‑Memory Inference – On PC, a hybrid GPU/CPU strategy loads as many blocks as GPU memory permits (e.g., (gpu_memory-2)*1024/385 blocks). On mobile, int4 quantized models run entirely on CPU with the MNN low_memory option that performs de‑quantization inside the GEMM kernel.
PC loading example (the truncated loop is reconstructed here; blockFileName is an illustrative helper for the i-th block's model path):

```cpp
void ChatGLM::loadModel(const char* fileName, bool cuda, int i) {
    Module::Config config;
    config.shapeMutable = true;
    config.rearrange = true;
    auto rtmgr = cuda ? mGPURtmgr : mCPURtmgr;
    std::shared_ptr<Module> net(Module::load({}, {}, fileName, rtmgr, &config));
    mModules[i] = std::move(net);
}

// put the first gpu_run_layers blocks on GPU, the rest on CPU
int gpu_run_layers = (gpu_memory - 2) * 1024.0 / 385.0;
for (int i = 0; i < 28; i++) {
    loadModel(blockFileName(i), i < gpu_run_layers, i);
}
```
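As a concrete check of the block-budget formula, assuming the article's ~385 MB per fp32 block and 2 GB reserved for activations and the runtime (the function name is illustrative):

```python
def gpu_run_layers(gpu_memory_gb, reserve_gb=2, block_mb=385):
    """How many ~385 MB fp32 GLMBlocks fit on the GPU after
    reserving memory for activations and the runtime."""
    return int((gpu_memory_gb - reserve_gb) * 1024 / block_mb)

print(gpu_run_layers(11))  # 23
```

On the 11 GB 2080Ti used in the performance section, 23 of the 28 blocks land on the GPU and the remaining 5 run on CPU.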
Performance – PC (2080Ti, 11 GB VRAM) with fp32 mixed GPU/CPU achieves 3.5 tok/s; CPU‑only 1.2 tok/s. Mobile (Xiaomi 12) with int4 model reaches 1.5 tok/s using ~2.9 GB RAM.
Demo Interfaces – Provides both command‑line and web UI for PC, and an Android app for mobile.
Conclusion – By segmenting model loading and applying low‑bit quantization, ChatGLM‑6B can be deployed on consumer‑grade GPUs without accuracy loss and on mobile devices with modest accuracy trade‑off, overcoming the primary memory bottleneck of large LLM inference.
DaTaobao Tech
Official account of DaTaobao Technology