How Inspur Metabrain R1 Server Enables 1000+ Concurrent Users for DeepSeek 671B via SGLang Optimization

The Inspur Metabrain R1 inference server, equipped with FP8 acceleration and a 1128 GB HBM3e memory pool, has been tightly integrated with SGLang 0.4.3 to run the 671‑billion‑parameter DeepSeek R1 model. In benchmarks it sustains more than 1,000 concurrent user sessions with a total throughput of about 3,976 tokens/s.

Architects' Tech Alliance

DeepSeek R1 is a 671‑billion‑parameter model that combines Multi‑head Latent Attention (MLA) with a hybrid Mixture‑of‑Experts (MoE) architecture, which makes it challenging to serve efficiently at inference time. The Inspur team tackled these challenges by jointly optimizing the AI server hardware and the inference framework.

The Metabrain R1 inference server (model NF5688G7) natively incorporates an FP8 compute engine, 1128 GB of HBM3e high‑speed memory, and 4.8 TB/s of memory bandwidth. This covers the minimum of 800 GB of memory needed to run the 671B model at FP8 precision while leaving ample space for the KV cache. GPU‑to‑GPU (P2P) bandwidth reaches 900 GB/s, which keeps tensor‑parallel communication from becoming a bottleneck.
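The memory figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the vendor's sizing method: it assumes one byte per parameter at FP8 and decimal gigabytes, and the function names are hypothetical helpers introduced only for illustration.

```python
# Figures quoted in the article.
TOTAL_HBM_GB = 1128          # HBM3e pool of the NF5688G7
MIN_FP8_FOOTPRINT_GB = 800   # stated minimum memory for the 671B model at FP8
PARAMS_BILLION = 671         # DeepSeek R1 parameter count, in billions

# Assumption (not from the article): FP8 stores one byte per parameter,
# and 1 GB is taken as 1e9 bytes.
BYTES_PER_PARAM_FP8 = 1


def raw_weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight size in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def headroom_gb(total_gb: float, footprint_gb: float) -> float:
    """Memory left over once the stated minimum footprint is reserved."""
    return total_gb - footprint_gb


weights = raw_weight_gb(PARAMS_BILLION, BYTES_PER_PARAM_FP8)
headroom = headroom_gb(TOTAL_HBM_GB, MIN_FP8_FOOTPRINT_GB)

print(f"Raw FP8 weights:         {weights} GB")
print(f"Stated minimum footprint: {MIN_FP8_FOOTPRINT_GB} GB")
print(f"Headroom for KV cache:    {headroom} GB "
      f"({headroom / TOTAL_HBM_GB:.0%} of the pool)")
```

Under these assumptions the raw weights alone come to 671 GB, which is consistent with the stated 800 GB minimum once runtime overhead is included, and roughly 328 GB of the pool remains available for KV cache and other buffers.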

SGLang is an emerging open‑source inference framework with targeted engineering optimizations for MLA and MoE inference. SGLang 0.4.3 was fully adapted to the NF5688G7 platform through hardware tuning, operator optimization, hybrid parallelism, and multi‑token prediction.

Benchmark results on the adapted system show a single‑user decoding speed of 33 tokens/s. Concurrency scaling demonstrates approximately 20 tokens/s per user at 16 users, 10.4 tokens/s per user at 64 users, and a total throughput of 3,975.76 tokens/s with 1,024 concurrent users, confirming the server’s capability to support ultra‑high‑concurrency scenarios.
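To compare the data points on a common footing, per-user and aggregate rates can be converted into each other. The sketch below is illustrative only: the aggregate values for 1, 16, and 64 users and the per-user value for 1,024 users are derived here from the quoted figures, not reported by the source, and the 16-user point inherits the "approximately" from the article.

```python
def total_rate(per_user_tps: float, users: int) -> float:
    """Aggregate throughput implied by a per-user decoding rate."""
    return per_user_tps * users


def per_user_rate(total_tps: float, users: int) -> float:
    """Average per-user decoding rate implied by an aggregate throughput."""
    return total_tps / users


# (concurrent users, aggregate tokens/s). The first three aggregates are
# derived from the per-user rates in the article; the last is quoted directly.
rows = [
    (1, total_rate(33.0, 1)),
    (16, total_rate(20.0, 16)),    # "approximately 20 tokens/s per user"
    (64, total_rate(10.4, 64)),
    (1024, 3975.76),
]

for users, total in rows:
    print(f"{users:>5} users: {total:8.2f} tokens/s total, "
          f"{per_user_rate(total, users):6.2f} tokens/s per user")
```

Read this way, the numbers show the usual batching trade-off: aggregate throughput keeps rising as concurrency grows, while the average rate each user sees falls from 33 tokens/s for a single user to a little under 4 tokens/s at 1,024 users.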

These achievements underline the effectiveness of co‑designing hardware and software for large‑scale MoE models and position the Metabrain R1 server as a high‑performance, cost‑effective solution for deploying DeepSeek‑type models in production.

[Figure: Server hardware diagram]
[Figure: Single‑user performance log]
[Figure: 1024‑user concurrency log]
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Inference Optimization, performance benchmark, DeepSeek, MoE, SGLang, AI server