Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations
The article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, detailing model execution mechanisms, GPU load analysis, multi‑GPU parallel strategies, open‑source engine comparisons, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.
In the first Zhihu Tech Salon, Wang Xin, head of Zhihu's Machine Learning Platform, presented the design and deployment experience of Zhihu's large‑model inference framework ZhiLight.
The talk covered three themes:
- How large models operate and the basics of model inference
- Application of open-source inference frameworks in production
- Zhihu's self-built ZhiLight framework and practical lessons from its deployment
Analysis of model load and GPU design shows that single‑GPU inference cannot handle the compute demand of modern LLMs, motivating multi‑GPU parallelism strategies such as pipeline parallelism, tensor parallelism, and expert parallelism. Experiments on LLaMA‑2 models demonstrated that tensor parallelism more than halves latency compared with pipeline parallelism.
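As an illustration of the idea behind tensor parallelism (not ZhiLight's actual implementation), the sketch below simulates a column‑parallel linear layer in numpy: the weight matrix is sharded by output columns across hypothetical GPUs, each device computes a partial output, and the shards are gathered to reproduce the full result.

```python
import numpy as np

def tensor_parallel_linear(x, W, n_gpus):
    """Simulate a column-parallel linear layer: each 'GPU' holds a slice
    of the weight columns, computes its partial output locally, and the
    partials are concatenated (the all-gather step in a real system)."""
    shards = np.split(W, n_gpus, axis=1)      # shard weights along the output dim
    partials = [x @ w for w in shards]        # each device computes independently
    return np.concatenate(partials, axis=-1)  # gather partial outputs

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 16))

# The sharded computation matches the single-device result exactly.
assert np.allclose(tensor_parallel_linear(x, W, 4), x @ W)
```

Because each shard's matmul is independent, the per‑device compute shrinks with the GPU count; the price is the gather/all‑reduce communication after every sharded layer, which is why interconnect bandwidth dominates the trade‑off discussed below.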
The talk also reviewed the evolution of GPU architectures (Ampere, Ada Lovelace, Hopper) and their compute capabilities, highlighting differences in inter‑GPU communication bandwidth (NVLink vs. PCIe) and the impact of that bandwidth on inference performance.
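A bandwidth‑only back‑of‑envelope calculation shows why the interconnect matters so much for tensor parallelism. The figures below are illustrative assumptions (roughly PCIe 4.0 x16 vs. NVLink on Hopper, per direction), not numbers from the talk, and the model ignores latency and protocol overhead.

```python
def transfer_time_ms(payload_bytes, link_gb_per_s):
    """Ideal (bandwidth-only) time to move a payload over one link,
    ignoring latency, topology, and protocol overhead."""
    return payload_bytes / (link_gb_per_s * 1e9) * 1e3

# One token's FP16 activations for an assumed hidden size of 8192.
payload = 8192 * 2  # bytes

pcie4  = transfer_time_ms(payload, 32)   # assumed ~32 GB/s for PCIe 4.0 x16
nvlink = transfer_time_ms(payload, 450)  # assumed ~450 GB/s for NVLink (Hopper)

# NVLink moves the same payload an order of magnitude faster.
assert nvlink < pcie4 / 10
```

On PCIe machines this per‑layer communication cost recurs for every transformer layer and every generated token, which is exactly the overhead ZhiLight's overlap and quantization optimizations target.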
Open‑source engines vLLM and SGLang were benchmarked on Zhihu’s hardware; SGLang showed strong performance and stability, while vLLM lagged in some scenarios.
Zhihu’s ZhiLight framework implements compute‑communication overlap and communication data quantization (FP16→INT8) to cut per‑layer latency from 19 ms to 12 ms and further to 10 ms, achieving roughly a 40 % overall speedup.
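To make the FP16→INT8 communication quantization concrete, here is a minimal sketch of symmetric per‑tensor INT8 quantization in numpy. ZhiLight's actual scheme may differ (e.g., per‑channel scales or fused kernels); the point is only that halving the bytes on the wire halves the ideal transfer time.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: scale so the largest
    magnitude maps to 127, then round to the nearest integer."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate FP32 tensor from the INT8 payload."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
act = rng.standard_normal(4096).astype(np.float32)

q, s = quantize_int8(act)

# Half the wire traffic of an FP16 payload of the same shape.
assert q.nbytes == act.astype(np.float16).nbytes // 2
# Round-trip error is bounded by one quantization step.
assert np.max(np.abs(dequantize(q, s) - act)) < s
```

In exchange for a bounded rounding error, the all‑reduce payload shrinks 2x versus FP16, which is consistent with the per‑layer latency dropping further once communication quantization is stacked on top of compute‑communication overlap.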
Benchmark results indicate ZhiLight consistently outperforms vLLM and SGLang across model sizes, especially on PCIe‑based GPUs such as RTX 4090.
Future work focuses on multi‑GPU inference on PCIe devices, extending to NVLink/RDMA‑based Prefill‑Decode (PD)‑separated architectures and supporting multimodal models.
Q&A addressed challenges of first‑token latency and load balancing under high concurrency, proposing compute‑communication overlap and Prefill‑Decode separation as solutions.
Zhihu Tech Column
Sharing Zhihu tech posts and exploring community technology innovations.