Insights from Zhihu's ZhiLight Large‑Model Inference Framework: Architecture, Parallelism, and Performance Optimizations
The article summarizes Zhihu's machine‑learning platform lead Wang Xin's presentation on the ZhiLight large‑model inference framework, covering model execution mechanisms, GPU workload analysis, pipeline and tensor parallelism, GPU architecture evolution, open‑source engine comparisons, ZhiLight's compute‑communication overlap and quantization optimizations, benchmark results, supported models, and future directions.
Overview
The Zhihu technical salon featured Wang Xin, head of Zhihu's machine‑learning platform, who shared practical experiences from the open‑source large‑model inference framework ZhiLight (https://github.com/zhihu/ZhiLight). The talk focused on the operation principles of large language models (LLMs), inference basics, the use of open‑source technologies, and the implementation details of Zhihu's self‑developed framework.
1. How Large Models Run
LLM inference places massive computational and memory demands on GPUs. Visual models such as ResNet illustrate how convolutional workloads translate to matrix multiplications, highlighting the need for high compute density. Modern transformer‑based models (e.g., GPT series) are decoder‑only, with the bulk of computation residing in linear layers (attention and feed‑forward networks), which are essentially large matrix‑multiply operations.
Model parameter scales have grown from millions to billions, dramatically increasing both compute and memory requirements.
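To make the scale concrete, a rough back‑of‑the‑envelope sketch (the helper below and its 70B example are illustrative, not figures from the talk):

```python
def llm_footprint(n_params_b: float, bytes_per_param: int = 2) -> dict:
    """Rough resource estimate for a dense decoder-only LLM.

    n_params_b: parameter count in billions.
    bytes_per_param: 2 for FP16/BF16 weights.
    """
    n_params = n_params_b * 1e9
    weight_mem_gb = n_params * bytes_per_param / 1e9
    # A dense transformer spends roughly 2 FLOPs per parameter per
    # generated token (one multiply + one add per weight).
    flops_per_token = 2 * n_params
    return {
        "weight_mem_gb": round(weight_mem_gb, 1),
        "gflops_per_token": round(flops_per_token / 1e9, 1),
    }

# A 70B-parameter model in FP16 needs ~140 GB just for weights --
# more than a single 80 GB A100 can hold.
print(llm_footprint(70))  # → {'weight_mem_gb': 140.0, 'gflops_per_token': 140.0}
```

This is why the decode phase is memory‑bandwidth bound: every generated token must stream all the weights through the GPU.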
2. Multi‑GPU Parallelism
Because a single GPU cannot handle the workload of very large models, the model must be split across multiple GPUs. Three parallelism strategies were discussed:
Pipeline parallelism – distributes consecutive layers across GPUs (e.g., first 32 layers on GPU 1, next 32 on GPU 2).
Tensor parallelism – partitions individual matrix multiplications, splitting each weight matrix along its K dimension (row-parallel) or N dimension (column-parallel), enabling finer‑grained distribution.
Expert parallelism – used in Mixture‑of‑Experts (MoE) models to increase batch size and reduce per‑GPU memory traffic.
Experimental results on LLaMA‑2 showed tensor parallelism cutting latency by more than half compared with pipeline parallelism.
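The row‑parallel flavor of tensor parallelism can be simulated on CPU with NumPy; this is an illustrative sketch of the math, not ZhiLight code:

```python
import numpy as np

def row_parallel_matmul(x, w, n_gpus=2):
    """Row-parallel (K-dimension) tensor parallelism, simulated on CPU.

    Each "GPU" holds a K/n_gpus slice of the weight and the matching
    slice of the activations; the partial products are summed, which
    is the step a real deployment performs with an AllReduce.
    """
    k = w.shape[0]
    splits = np.array_split(np.arange(k), n_gpus)
    partials = [x[:, idx] @ w[idx, :] for idx in splits]  # per-GPU matmul
    return np.sum(partials, axis=0)  # stands in for AllReduce

x = np.random.randn(4, 8)   # (M, K) activations
w = np.random.randn(8, 16)  # (K, N) weights
assert np.allclose(row_parallel_matmul(x, w, n_gpus=4), x @ w)
```

The AllReduce at the end of every partitioned layer is exactly the communication step that later sections show dominating latency on PCIe‑only cards.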
3. GPU Architecture Evolution
The article reviews NVIDIA GPU generations relevant to LLM inference:
Ampere (A100) – flagship FP16 performance of 312 TFLOPS.
Ada Lovelace – fourth‑generation Tensor Cores with modest FP16 gains, but consumer cards drop NVLink, leaving only PCIe for inter‑GPU traffic.
Hopper (H100) – FP16 performance of 989 TFLOPS, at the cost of doubled TDP and limited architectural novelty.
Inter‑GPU communication bandwidth (NVLink vs. PCIe) is identified as a critical bottleneck, especially for cards like RTX 4090 that lack NVLink.
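The gap is easy to quantify with nominal peak bandwidths (the payload size and the simplified 2× ring‑AllReduce factor below are illustrative assumptions):

```python
def allreduce_transfer_ms(payload_mb: float, bw_gb_s: float) -> float:
    """Time to move a ring-AllReduce payload at a given link bandwidth.

    Ring AllReduce moves roughly 2*(n-1)/n of the payload per GPU;
    for simplicity we use the full 2x factor here.
    """
    return 2 * payload_mb / 1024 / bw_gb_s * 1000

# Nominal peak bandwidths: PCIe 4.0 x16 ~32 GB/s, A100 NVLink ~600 GB/s.
payload_mb = 64  # e.g., one layer's FP16 activations at a large batch
for name, bw in [("PCIe 4.0 x16", 32), ("NVLink (A100)", 600)]:
    print(f"{name}: {allreduce_transfer_ms(payload_mb, bw):.2f} ms")
```

At these nominal numbers the same payload takes roughly 19× longer over PCIe than over A100 NVLink, which is why PCIe‑only cards like the RTX 4090 need the communication optimizations described next.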
4. Open‑Source Inference Engines in Production
Both vLLM (2023) and SGLang (2024) have been deployed at Zhihu. In internal benchmarks, SGLang consistently delivered lower latency than vLLM, along with better stability, broader model coverage, and richer observability.
However, limitations were observed: similar‑spec cards (e.g., A100‑80G vs. RTX 4090) exhibited large performance gaps due to communication‑compute inefficiencies.
5. Zhihu's Self‑Developed Engine ZhiLight
ZhiLight introduces two key optimizations:
Compute‑communication overlap – pipelines computation and AllReduce communication to reduce per‑layer latency from 19 ms to 12 ms.
AllReduce data quantization – quantizes FP16 payloads to INT8 before communication, further cutting decode latency to 10 ms and reducing overall inference time by over 40%.
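The quantization idea can be sketched with symmetric per‑tensor INT8 scaling; this is a minimal illustration of the technique, not ZhiLight's actual kernel:

```python
import numpy as np

def quantize_int8(t: np.ndarray):
    """Symmetric per-tensor INT8 quantization of an FP16 payload."""
    scale = float(np.abs(t).max()) / 127.0 or 1.0  # guard all-zero input
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float16) * np.float16(scale)

payload = np.random.randn(1024).astype(np.float16)
q, s = quantize_int8(payload)
restored = dequantize(q, s)

# Wire traffic is halved (1 byte vs 2 per element), at a small
# bounded quantization error.
assert q.nbytes == payload.nbytes // 2
assert np.abs(restored.astype(np.float32) - payload.astype(np.float32)).max() < s
```

Each rank would send `q` plus the single scale over the AllReduce and dequantize on receipt, halving link traffic at the cost of a small, bounded error.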
Benchmark charts show ZhiLight achieving lower first‑token latency (both average and P95) than vLLM and SGLang across model sizes, with the advantage growing for larger models.
6. Model and Hardware Support
ZhiLight currently targets PCIe‑based GPUs (e.g., RTX 4090) and supports Ampere and Ada Lovelace architectures. Supported models include CPM 1/2/3, MiniCPM, Llama 1/2/3, Mixtral MoE, Command‑R, Qwen 1/2/3, DeepSeek MoE V2/V3, DeepSeek R1, and DeepSeek VL.
7. Future Outlook
The team plans to extend multi‑card inference to NVLink- and RDMA‑based architectures with prefill/decode (PD) separation, and to add multimodal model support.
Overall, the presentation highlighted the challenges of LLM inference at scale and demonstrated how ZhiLight's overlapping and quantization techniques can substantially improve performance.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.