How Baidu Cloud Slashes Inference Costs: DeepSeek Model Optimizations Unveiled

Baidu Cloud's Qianfan platform launched DeepSeek‑R1 and DeepSeek‑V3 with ultra‑low inference pricing, leveraging advanced engine performance tweaks, a split Prefill/Decode architecture, and comprehensive security measures that together boost throughput, cut costs, and ensure enterprise‑grade reliability.


Background and Launch

On February 3, Baidu Intelligent Cloud's Qianfan large-model platform introduced DeepSeek-R1 and DeepSeek-V3, attracting more than 15,000 customers on the first day. Inference is priced at only 30-50% of DeepSeek's official rates, and a limited-time free tier is available.

Inference Engine Performance Optimization

Building on Baidu's extensive experience in large-model inference, the team optimized the MLA (multi-head latent attention) structure of the DeepSeek models to extract maximum performance. By overlapping compute, communication, and memory operators and adopting an efficient Prefill/Decode split architecture, the system meets SLA targets for TTFT (time to first token) and TPOT (time per output token) while substantially increasing throughput and reducing inference cost.
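
To make those SLA metrics concrete, the short Python sketch below measures TTFT and TPOT on the client side from any token stream. The stream_with_metrics helper and the simulated fake_stream generator are illustrative stand-ins, not part of Qianfan's API.

import time

def stream_with_metrics(generate_stream):
    # Consume a token stream and report TTFT and TPOT.
    # `generate_stream` is any iterator yielding decoded tokens;
    # its name and shape are assumptions for this example only.
    start = time.perf_counter()
    token_times = []
    tokens = []
    for tok in generate_stream:
        token_times.append(time.perf_counter())
        tokens.append(tok)
    if not token_times:
        return None
    ttft = token_times[0] - start  # time to first token
    if len(token_times) > 1:
        # average inter-token latency over the decode phase
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft_s": round(ttft, 4), "tpot_s": round(tpot, 4), "tokens": len(tokens)}

def fake_stream():
    # Stand-in generator that simulates tokens arriving from a server.
    for t in ["Hello", ",", " world", "!"]:
        time.sleep(0.02)
        yield t

print(stream_with_metrics(fake_stream()))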

Engineering Architecture Innovations

The platform adopts a push‑pull model for request handling, which outperforms traditional pull‑only designs in success rate, latency, and throughput. A novel request‑failure continuation mechanism improves fault tolerance and SLA compliance. KV‑Cache reuse and a global‑cache‑aware traffic scheduling strategy eliminate redundant token calculations, further lowering latency and boosting throughput.
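
As an illustration of how global-cache-aware scheduling can work in principle, the following Python sketch routes requests that share a prompt prefix to the same decode worker so its KV cache can be reused, and sends unseen prefixes to the least-loaded worker. The class, the worker names, and the prefix length are assumptions for the example, not Qianfan internals.

import hashlib

class CacheAwareRouter:
    # Hypothetical sketch of cache-aware traffic scheduling:
    # requests with a shared prefix go to the worker that likely
    # already holds the corresponding KV cache.
    def __init__(self, workers, prefix_len=256):
        self.workers = list(workers)
        self.load = {w: 0 for w in self.workers}
        self.prefix_owner = {}          # prefix hash -> worker
        self.prefix_len = prefix_len

    def route(self, prompt: str) -> str:
        key = hashlib.sha1(prompt[: self.prefix_len].encode()).hexdigest()
        worker = self.prefix_owner.get(key)
        if worker is None:
            # No cached prefix known: pick the least-loaded worker.
            worker = min(self.workers, key=lambda w: self.load[w])
            self.prefix_owner[key] = worker
        self.load[worker] += 1
        return worker

router = CacheAwareRouter(["decode-0", "decode-1", "decode-2"])
shared_prefix = "You are a helpful assistant. " * 8
print(router.route(shared_prefix + "Summarize this report."))
print(router.route(shared_prefix + "Translate this paragraph."))  # same worker, cache reuse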

Stability and Security Guarantees

Leveraging Baidu’s proprietary content‑security operators, Qianfan provides enterprise‑grade high‑availability and data‑life‑cycle protection. Specialized security optimizations ensure that DeepSeek‑R1 and DeepSeek‑V3 remain safe for enterprise usage, with end‑to‑end safeguards across the model’s lifecycle.

Platform Capabilities

Qianfan ModelBuilder offers an end‑to‑end AI service suite, including data preprocessing, model fine‑tuning, evaluation, and quantization. It supports major inference frameworks such as vLLM, LMDeploy, TensorRT‑LLM, and SGLang, and allows custom model import and deployment for flexible development.
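
For context, a minimal self-deployment sketch using vLLM, one of the supported frameworks, is shown below; the model path is a placeholder, and Qianfan-specific import and deployment steps are omitted.

from vllm import LLM, SamplingParams

# Placeholder path: point this at local weights or a Hugging Face model ID.
llm = LLM(model="/path/to/your-model")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV-cache reuse in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)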

Future Outlook

Baidu recently brought online its Kunlun P800 cluster of 10,000 cards, the first domestically built AI cluster at that scale, with an expansion to 30,000 cards planned. Ongoing technical documentation releases aim to share best practices and accelerate innovation for developers and enterprises alike.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: performance optimization, large language models, security, AI inference, model serving, Baidu Cloud
Written by Baidu Geek Talk