Optimizing ChatGLM-6B Deployment with MNN: Model Conversion, Quantization, and Edge Inference
The article details a workflow that converts the PyTorch ChatGLM‑6B model to MNN, splits and compresses embeddings, applies int4/int8 quantization, supports dynamic shapes, and uses hybrid GPU/CPU or CPU‑only loading to enable low‑memory edge inference on PCs and mobile devices with competitive token‑per‑second performance.
Large language models (LLMs) such as ChatGLM-6B provide strong bilingual dialogue capabilities but their massive parameter count makes deployment on limited hardware difficult. This article presents a workflow that converts the PyTorch ChatGLM-6B model to an MNN model, applies low‑bit quantization, and restructures the model for efficient edge‑side inference.
Model Export – The model is exported from PyTorch to ONNX (or TorchScript) and then converted to MNN. The export splits the network into the embedding, the 28 GLMBlocks, and the final linear layer, and also trims the vocabulary to shrink the embedding table.
Export code example:

```python
torch.onnx.export(
    model, model_args,
    f=output_path,
    input_names=[...],
    output_names=[...],
    dynamic_axes={...},
    opset_version=14)
```
Structure Splitting & Memory Reduction – The embedding parameters (150528×4096) are stored as a binary file and read on demand via fseek/fread, saving ~2.3 GB of memory. The ~21 GB of GLMBlock weights are split so that each block can be loaded onto GPU or CPU independently; quantized int4/int8 variants reduce the block weights to as little as ~2.6 GB.
Embedding slimming code:

```python
import numpy as np

embed = np.fromfile('transformer.word_embeddings.weight', dtype=np.float32)
embed = embed.reshape(-1, 4096)
# drop the first 20000 rows (trimmed vocabulary) to shrink the table
embed = embed[20000:, :]
embed.tofile('slim_word_embeddings.bin')
```
BF16 conversion (C++) example:

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

const size_t num = 130528 * 4096;  // (150528 - 20000) rows × 4096
FILE* src_f = fopen("slim_word_embeddings.bin", "rb");
std::vector<float> src_buffer(num);
fread(src_buffer.data(), 1, num * sizeof(float), src_f);
// fp32 -> bf16: keep the high 16 bits of each float (little-endian)
std::vector<int16_t> dst_buffer(num);
for (size_t i = 0; i < num; i++) {
    dst_buffer[i] = reinterpret_cast<int16_t*>(src_buffer.data())[2 * i + 1];
}
FILE* dst_f = fopen("slim_word_embeddings_bf16.bin", "wb");
fwrite(dst_buffer.data(), 1, num * sizeof(int16_t), dst_f);
```
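To illustrate the on-demand fseek/fread lookup described above, here is a minimal Python sketch (the function name is illustrative; note that after the slimming step, token ids below 20000 no longer exist in the file, so a real caller would subtract that offset first):

```python
import numpy as np

HIDDEN = 4096               # ChatGLM-6B hidden size
BYTES_PER_ROW = HIDDEN * 2  # bf16 stores 2 bytes per value

def lookup_embedding(path, token_id):
    """Seek to a single row of the bf16 table and decode it to fp32,
    instead of keeping the whole embedding table in memory."""
    with open(path, 'rb') as f:
        f.seek(token_id * BYTES_PER_ROW)   # fseek to the row
        raw = f.read(BYTES_PER_ROW)        # fread one row
    # bf16 -> fp32: the stored 16 bits are the high half of a float32
    as_u32 = np.frombuffer(raw, dtype=np.uint16).astype(np.uint32)
    return (as_u32 << 16).view(np.float32)
```

This trades one small disk read per token for the ~2.3 GB the full fp32 table would otherwise occupy in RAM.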
Dynamic Shape Support – Export specifies dynamic axes for inputs (sequence length, history length) to allow variable‑length inference.
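The dynamic_axes argument in the export call maps tensor names to the dimensions that should stay symbolic. A hedged sketch of what such a mapping looks like (the tensor names and axis indices here are illustrative, not necessarily the exact ones in the article's export script):

```python
# Mark sequence-length and KV-history dimensions as dynamic so the
# exported graph accepts variable-length prompts and growing history.
dynamic_axes = {
    "input_ids":       {0: "seq_len"},
    "position_ids":    {2: "seq_len"},
    "attention_mask":  {2: "seq_len", 3: "seq_len"},
    "past_key_values": {1: "history_len"},
}
```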
Code Adjustments for Export – Tuple past states are changed to tensors, and view operations are replaced with squeeze / unsqueeze to keep shapes dynamic.
Tuple‑to‑Tensor change example:

```python
# before
past_key, past_value = layer_past[0], layer_past[1]

# after
key_layer = torch.cat((past_key_value[0], key_layer), dim=0)
value_layer = torch.cat((past_key_value[1], value_layer), dim=0)
present = torch.stack((key_layer, value_layer), dim=0)
```
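The view-to-squeeze/unsqueeze substitution mentioned above can be sketched in the same spirit (shapes here are illustrative, not taken from the model code):

```python
import torch

x = torch.randn(5, 1, 32, 128)
# before: x.view(5, 32, 128) bakes the literal sizes into the traced graph
# after: squeeze removes the singleton axis while every remaining
# dimension stays symbolic, so the exported shape remains dynamic
y = x.squeeze(1)
```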
Low‑Memory Inference – On PC, a hybrid GPU/CPU strategy loads as many blocks as GPU memory permits (e.g., (gpu_memory-2)*1024/385 blocks). On mobile, int4 quantized models run entirely on CPU with the MNN low_memory option that performs de‑quantization inside the GEMM kernel.
PC loading example (the truncated loop is reconstructed here; blockFileName is an illustrative helper for the i-th block's model path):

```cpp
void ChatGLM::loadModel(const char* fileName, bool cuda, int i) {
    Module::Config config;
    config.shapeMutable = true;
    config.rearrange = true;
    auto rtmgr = cuda ? mGPURtmgr : mCPURtmgr;
    std::shared_ptr<Module> net(Module::load({}, {}, fileName, rtmgr, &config));
    mModules[i] = std::move(net);
}

// put the first gpu_run_layers blocks on GPU, the rest on CPU
int gpu_run_layers = (gpu_memory - 2) * 1024.0 / 385.0;
for (int i = 0; i < 28; i++) {
    loadModel(blockFileName(i), i < gpu_run_layers, i);
}
```
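As a concrete check of the block-budget formula, assuming the article's ~385 MB per fp32 block and 2 GB reserved for activations and the runtime (the function name is illustrative):

```python
def gpu_run_layers(gpu_memory_gb, reserve_gb=2, block_mb=385):
    """How many ~385 MB fp32 GLMBlocks fit on the GPU after
    reserving memory for activations and the runtime."""
    return int((gpu_memory_gb - reserve_gb) * 1024 / block_mb)

print(gpu_run_layers(11))  # 23
```

On the 11 GB 2080Ti used in the performance section, 23 of the 28 blocks land on the GPU and the remaining 5 run on CPU.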
Performance – PC (2080Ti, 11 GB VRAM) with fp32 mixed GPU/CPU achieves 3.5 tok/s; CPU‑only 1.2 tok/s. Mobile (Xiaomi 12) with int4 model reaches 1.5 tok/s using ~2.9 GB RAM.
Demo Interfaces – Provides both command‑line and web UI for PC, and an Android app for mobile.
Conclusion – By segmenting model loading and applying low‑bit quantization, ChatGLM‑6B can be deployed on consumer‑grade GPUs without accuracy loss and on mobile devices with modest accuracy trade‑off, overcoming the primary memory bottleneck of large LLM inference.
DaTaobao Tech
Official account of DaTaobao Technology