Insights from Zhihu's ZhiLight Large‑Model Inference Framework: Architecture, Parallelism, and Performance Optimizations
The article summarizes Zhihu's machine‑learning platform lead Wang Xin's presentation on the ZhiLight large‑model inference framework, covering model execution mechanisms, GPU workload analysis, pipeline and tensor parallelism, GPU architecture evolution, open‑source engine comparisons, ZhiLight's compute‑communication overlap and quantization optimizations, benchmark results, supported models, and future directions.
Overview
The Zhihu technical salon featured Wang Xin, head of Zhihu's machine‑learning platform, who shared practical experiences from the open‑source large‑model inference framework ZhiLight (https://github.com/zhihu/ZhiLight). The talk focused on the operation principles of large language models (LLMs), inference basics, the use of open‑source technologies, and the implementation details of Zhihu's self‑developed framework.
1. How Large Models Run
LLM inference places massive computational and memory demands on GPUs. Visual models such as ResNet illustrate how convolutional workloads translate to matrix multiplications, highlighting the need for high compute density. Modern transformer‑based models (e.g., GPT series) are decoder‑only, with the bulk of computation residing in linear layers (attention and feed‑forward networks), which are essentially large matrix‑multiply operations.
Model parameter scales have grown from millions to billions, dramatically increasing both compute and memory requirements.
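To make the scale concrete, a rough back‑of‑the‑envelope sketch (the helper below and its 70B example are illustrative, not figures from the talk):

```python
def llm_footprint(n_params_b: float, bytes_per_param: int = 2) -> dict:
    """Rough resource estimate for a dense decoder-only LLM.

    n_params_b: parameter count in billions.
    bytes_per_param: 2 for FP16/BF16 weights.
    """
    n_params = n_params_b * 1e9
    weight_mem_gb = n_params * bytes_per_param / 1e9
    # A dense transformer spends roughly 2 FLOPs per parameter per
    # generated token (one multiply + one add per weight).
    flops_per_token = 2 * n_params
    return {
        "weight_mem_gb": round(weight_mem_gb, 1),
        "gflops_per_token": round(flops_per_token / 1e9, 1),
    }

# A 70B-parameter model in FP16 needs ~140 GB just for weights --
# more than a single 80 GB A100 can hold.
print(llm_footprint(70))  # → {'weight_mem_gb': 140.0, 'gflops_per_token': 140.0}
```

This is why the decode phase is memory‑bandwidth bound: every generated token must stream all the weights through the GPU.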
2. Multi‑GPU Parallelism
Because a single GPU cannot handle the workload of very large models, the model must be split across multiple GPUs. Three parallelism strategies were discussed:
Pipeline parallelism – distributes consecutive layers across GPUs (e.g., first 32 layers on GPU 1, next 32 on GPU 2).
Tensor parallelism – partitions individual matrix multiplications, splitting each weight matrix along its K dimension (row-parallel) or N dimension (column-parallel), enabling finer‑grained distribution.
Expert parallelism – used in Mixture‑of‑Experts (MoE) models to increase batch size and reduce per‑GPU memory traffic.
Experimental results on LLaMA‑2 showed tensor parallelism cutting latency by more than half compared with pipeline parallelism.
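The row‑parallel flavor of tensor parallelism can be simulated on CPU with NumPy; this is an illustrative sketch of the math, not ZhiLight code:

```python
import numpy as np

def row_parallel_matmul(x, w, n_gpus=2):
    """Row-parallel (K-dimension) tensor parallelism, simulated on CPU.

    Each "GPU" holds a K/n_gpus slice of the weight and the matching
    slice of the activations; the partial products are summed, which
    is the step a real deployment performs with an AllReduce.
    """
    k = w.shape[0]
    splits = np.array_split(np.arange(k), n_gpus)
    partials = [x[:, idx] @ w[idx, :] for idx in splits]  # per-GPU matmul
    return np.sum(partials, axis=0)  # stands in for AllReduce

x = np.random.randn(4, 8)   # (M, K) activations
w = np.random.randn(8, 16)  # (K, N) weights
assert np.allclose(row_parallel_matmul(x, w, n_gpus=4), x @ w)
```

The AllReduce at the end of every partitioned layer is exactly the communication step that later sections show dominating latency on PCIe‑only cards.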
3. GPU Architecture Evolution
The article reviews NVIDIA GPU generations relevant to LLM inference:
Ampere (A100) – flagship FP16 performance of 312 TFLOPS.
Ada Lovelace – fourth‑generation Tensor Cores with modest FP16 gains, but consumer cards drop NVLink, leaving only PCIe for inter‑GPU traffic.
Hopper (H100) – FP16 performance of 989 TFLOPS, at the cost of doubled TDP and limited architectural novelty.
Inter‑GPU communication bandwidth (NVLink vs. PCIe) is identified as a critical bottleneck, especially for cards like RTX 4090 that lack NVLink.
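The gap is easy to quantify with nominal peak bandwidths (the payload size and the simplified 2× ring‑AllReduce factor below are illustrative assumptions):

```python
def allreduce_transfer_ms(payload_mb: float, bw_gb_s: float) -> float:
    """Time to move a ring-AllReduce payload at a given link bandwidth.

    Ring AllReduce moves roughly 2*(n-1)/n of the payload per GPU;
    for simplicity we use the full 2x factor here.
    """
    return 2 * payload_mb / 1024 / bw_gb_s * 1000

# Nominal peak bandwidths: PCIe 4.0 x16 ~32 GB/s, A100 NVLink ~600 GB/s.
payload_mb = 64  # e.g., one layer's FP16 activations at a large batch
for name, bw in [("PCIe 4.0 x16", 32), ("NVLink (A100)", 600)]:
    print(f"{name}: {allreduce_transfer_ms(payload_mb, bw):.2f} ms")
```

At these nominal numbers the same payload takes roughly 19× longer over PCIe than over A100 NVLink, which is why PCIe‑only cards like the RTX 4090 need the communication optimizations described next.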
4. Open‑Source Inference Engines in Production
Both vLLM (2023) and SGLang (2024) have been deployed at Zhihu. In internal benchmarks, SGLang consistently delivered lower latency than vLLM, along with better stability, broader model coverage, and richer observability.
However, limitations were observed: similar‑spec cards (e.g., A100‑80G vs. RTX 4090) exhibited large performance gaps due to communication‑compute inefficiencies.
5. Zhihu's Self‑Developed Engine ZhiLight
ZhiLight introduces two key optimizations:
Compute‑communication overlap – pipelines computation and AllReduce communication to reduce per‑layer latency from 19 ms to 12 ms.
AllReduce data quantization – quantizes FP16 payloads to INT8 before communication, further cutting decode latency to 10 ms and reducing overall inference time by over 40%.
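The quantization idea can be sketched with symmetric per‑tensor INT8 scaling; this is a minimal illustration of the technique, not ZhiLight's actual kernel:

```python
import numpy as np

def quantize_int8(t: np.ndarray):
    """Symmetric per-tensor INT8 quantization of an FP16 payload."""
    scale = float(np.abs(t).max()) / 127.0 or 1.0  # guard all-zero input
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float16) * np.float16(scale)

payload = np.random.randn(1024).astype(np.float16)
q, s = quantize_int8(payload)
restored = dequantize(q, s)

# Wire traffic is halved (1 byte vs 2 per element), at a small
# bounded quantization error.
assert q.nbytes == payload.nbytes // 2
assert np.abs(restored.astype(np.float32) - payload.astype(np.float32)).max() < s
```

Each rank would send `q` plus the single scale over the AllReduce and dequantize on receipt, halving link traffic at the cost of a small, bounded error.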
Benchmark charts show ZhiLight achieving lower first‑token latency (both average and P95) than vLLM and SGLang across model sizes, with the advantage growing for larger models.
6. Model and Hardware Support
ZhiLight currently targets PCIe‑based GPUs (e.g., RTX 4090) and supports Ampere and Ada Lovelace architectures. Supported models include CPM 1/2/3, MiniCPM, Llama 1/2/3, Mixtral MoE, Command‑R, Qwen 1/2/3, DeepSeek MoE V2/V3, DeepSeek R1, and DeepSeek VL.
7. Future Outlook
The team plans to extend multi‑card inference to NVLink- and RDMA‑based architectures with prefill/decode (PD) separation, and to add multimodal model support.
Overall, the presentation highlighted the challenges of LLM inference at scale and demonstrated how ZhiLight's overlapping and quantization techniques can substantially improve performance.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.