Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations
The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.
In the first Zhihu Tech Salon, Wang Xin, head of Zhihu's Machine Learning Platform, presented the design and deployment experience of Zhihu's large‑model inference framework ZhiLight.
The talk covered three themes:
- How large models operate and the basics of model inference
- Application of open-source inference frameworks in production
- Zhihu's self-built ZhiLight framework and practical lessons from its deployment
Analysis of model load and GPU design shows that single‑GPU inference cannot handle the compute demand of modern LLMs, motivating multi‑GPU parallelism strategies such as pipeline parallelism, tensor parallelism, and expert parallelism. Experiments on LLaMA‑2 models demonstrated that tensor parallelism more than halves latency compared with pipeline parallelism.
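As an illustration of the idea behind tensor parallelism (not ZhiLight's actual implementation), the sketch below simulates a column‑parallel linear layer in numpy: the weight matrix is sharded by output columns across hypothetical GPUs, each device computes a partial output, and the shards are gathered to reproduce the full result.

```python
import numpy as np

def tensor_parallel_linear(x, W, n_gpus):
    """Simulate a column-parallel linear layer: each 'GPU' holds a slice
    of the weight columns, computes its partial output locally, and the
    partials are concatenated (the all-gather step in a real system)."""
    shards = np.split(W, n_gpus, axis=1)      # shard weights along the output dim
    partials = [x @ w for w in shards]        # each device computes independently
    return np.concatenate(partials, axis=-1)  # gather partial outputs

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 16))

# The sharded computation matches the single-device result exactly.
assert np.allclose(tensor_parallel_linear(x, W, 4), x @ W)
```

Because each shard's matmul is independent, the per‑device compute shrinks with the GPU count; the price is the gather/all‑reduce communication after every sharded layer, which is why interconnect bandwidth dominates the trade‑off discussed below.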
The talk also reviewed the evolution of GPU architectures (Ampere, Ada Lovelace, Hopper) and their compute capabilities, highlighting differences in inter‑GPU communication bandwidth (NVLink vs. PCIe) and the impact of that bandwidth on inference performance.
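A bandwidth‑only back‑of‑envelope calculation shows why the interconnect matters so much for tensor parallelism. The figures below are illustrative assumptions (roughly PCIe 4.0 x16 vs. NVLink on Hopper, per direction), not numbers from the talk, and the model ignores latency and protocol overhead.

```python
def transfer_time_ms(payload_bytes, link_gb_per_s):
    """Ideal (bandwidth-only) time to move a payload over one link,
    ignoring latency, topology, and protocol overhead."""
    return payload_bytes / (link_gb_per_s * 1e9) * 1e3

# One token's FP16 activations for an assumed hidden size of 8192.
payload = 8192 * 2  # bytes

pcie4  = transfer_time_ms(payload, 32)   # assumed ~32 GB/s for PCIe 4.0 x16
nvlink = transfer_time_ms(payload, 450)  # assumed ~450 GB/s for NVLink (Hopper)

# NVLink moves the same payload an order of magnitude faster.
assert nvlink < pcie4 / 10
```

On PCIe machines this per‑layer communication cost recurs for every transformer layer and every generated token, which is exactly the overhead ZhiLight's overlap and quantization optimizations target.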
Open‑source engines vLLM and SGLang were benchmarked on Zhihu’s hardware; SGLang showed strong performance and stability, while vLLM lagged in some scenarios.
Zhihu’s ZhiLight framework implements compute‑communication overlap and communication data quantization (FP16→INT8) to cut per‑layer latency from 19 ms to 12 ms and further to 10 ms, achieving roughly a 40 % overall speedup.
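To make the FP16→INT8 communication quantization concrete, here is a minimal sketch of symmetric per‑tensor INT8 quantization in numpy. ZhiLight's actual scheme may differ (e.g., per‑channel scales or fused kernels); the point is only that halving the bytes on the wire halves the ideal transfer time.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: scale so the largest
    magnitude maps to 127, then round to the nearest integer."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate FP32 tensor from the INT8 payload."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
act = rng.standard_normal(4096).astype(np.float32)

q, s = quantize_int8(act)

# Half the wire traffic of an FP16 payload of the same shape.
assert q.nbytes == act.astype(np.float16).nbytes // 2
# Round-trip error is bounded by one quantization step.
assert np.max(np.abs(dequantize(q, s) - act)) < s
```

In exchange for a bounded rounding error, the all‑reduce payload shrinks 2x versus FP16, which is consistent with the per‑layer latency dropping further once communication quantization is stacked on top of compute‑communication overlap.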
Benchmark results indicate ZhiLight consistently outperforms vLLM and SGLang across model sizes, especially on PCIe‑based GPUs such as RTX 4090.
Future work focuses on multi‑GPU inference on PCIe devices, extending to NVLink/RDMA‑based Prefill‑Decode (PD)‑separated architectures and supporting multimodal models.
Q&A addressed challenges of first‑token latency and load balancing under high concurrency, proposing compute‑communication overlap and Prefill‑Decode separation as solutions.
Zhihu Tech Column
Sharing Zhihu tech posts and exploring community technology innovations.