Why Huawei’s Ascend 950 PR and DT Have Different Names – The Technical Rationale
Huawei’s Ascend 950 series pairs a single core die with two memory configurations, yielding two variants: PR (Prefill & Recommendation), optimized for compute‑intensive inference at low cost, and DT (Decode & Training), tuned for memory‑bandwidth‑heavy generation and training. The split illustrates a scenario‑driven, P/D‑separated architecture that maximizes efficiency.
Huawei’s Ascend 950 family adopts a "one‑chip, dual‑architecture" strategy, offering two variants distinguished by the suffixes PR and DT.
What are PR and DT?
Prefill and Recommendation (PR)
PR stands for Prefill & Recommendation. In this mode the chip processes an entire prompt in parallel, builds the KV cache, and quickly produces the first token. It is compute‑intensive, favors low latency and high throughput, and suits workloads such as e‑commerce recommendation.
Decode and Training (DT)
DT stands for Decode & Training. This mode generates tokens sequentially and supports large‑scale model training. It is memory‑bandwidth‑heavy, requiring large capacity and high bandwidth to handle massive parameter reads and writes.
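To make the contrast concrete, here is a minimal NumPy sketch of single‑head causal attention (toy dimensions, nothing Ascend‑specific): prefill pushes the whole prompt through a few large matmuls and builds the KV cache once, while every decode step does little arithmetic but must re‑read the entire cache.

```python
import numpy as np

# Toy single-head causal attention; dimensions are illustrative and
# not tied to any real Ascend kernel or model.
d_model, prompt_len = 64, 512
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
              for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_embeddings):
    """Whole prompt in one batched pass (compute-bound): a few large
    matmuls, and the KV cache is built exactly once."""
    q, k, v = (prompt_embeddings @ W for W in (Wq, Wk, Wv))
    scores = q @ k.T / np.sqrt(d_model)
    scores += np.triu(np.full_like(scores, -1e9), k=1)  # causal mask
    out = softmax(scores) @ v
    return out[-1], (k, v)      # last position drives the first token

def decode_step(token_embedding, kv_cache):
    """One token per pass (bandwidth-bound): tiny compute, but the
    whole KV cache is streamed from memory on every step."""
    k_cache, v_cache = kv_cache
    k_cache = np.vstack([k_cache, token_embedding @ Wk])
    v_cache = np.vstack([v_cache, token_embedding @ Wv])
    q = token_embedding @ Wq                   # shape (1, d_model)
    out = softmax(q @ k_cache.T / np.sqrt(d_model)) @ v_cache
    return out, (k_cache, v_cache)

x = rng.standard_normal((prompt_len, d_model))
first, cache = prefill(x)               # PR-style work: one big pass
tok = rng.standard_normal((1, d_model))
out, cache = decode_step(tok, cache)    # DT-style work: repeated per token
```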
Why two variants?
Large‑model inference consists of two fundamentally different phases. Using a single chip for both would be like making a sprinter run a marathon—neither phase would be optimal. By separating the functions, each variant can be tuned for its dominant resource: compute for PR and bandwidth/capacity for DT.
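A roofline‑style back‑of‑the‑envelope makes the analogy quantitative. The sketch below compares each phase's arithmetic intensity (FLOPs per byte of weight traffic) against a chip's ridge point; the peak‑FLOPs and bandwidth figures are illustrative assumptions, not published Ascend 950 numbers.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one dense matmul layer:
    each parameter does 2 FLOPs (multiply + add) per token, but its
    bytes only have to be fetched from memory once per pass."""
    return 2.0 * tokens_per_pass / bytes_per_param

# Hypothetical accelerator ridge point (peak FLOPs / bandwidth);
# illustrative numbers only, not published Ascend 950 figures.
peak_flops = 500e12             # 500 TFLOPS
bandwidth = 1.6e12              # 1.6 TB/s
ridge = peak_flops / bandwidth  # ~312 FLOP/byte

print(arithmetic_intensity(2048))  # prefill, 2048-token prompt: 2048 FLOP/byte, above the ridge -> compute-bound
print(arithmetic_intensity(1))     # decode, 1 token per step: 1 FLOP/byte, far below the ridge -> bandwidth-bound
```

Prefill lands above the ridge point, so more FLOPS help; decode lands far below it, so only more bandwidth helps.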
Technical specifications
Both variants share the same 950 core die.
PR uses Huawei‑designed HiBL 1.0 HBM, offering 128 GB of memory with 1.6 TB/s bandwidth.
DT uses Huawei‑designed HiZQ 2.0 HBM, offering 144 GB of memory with up to 4 TB/s bandwidth; the practical effect of this gap on decode speed is sketched just after this list.
PR is positioned for fast first‑token response, strong concurrency and cost‑effectiveness.
DT is positioned for stable long‑text generation and high‑speed training without bandwidth bottlenecks.
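Bandwidth translates directly into a ceiling on decode speed. The sketch below estimates that ceiling for a hypothetical INT8‑quantized 70B‑parameter model (model size, quantization, and KV‑cache size are all assumptions); only the 1.6 TB/s and 4 TB/s figures come from the specifications above.

```python
def decode_ceiling_tokens_per_sec(bandwidth_tbs: float,
                                  params_billions: float,
                                  bytes_per_param: float = 1.0,  # INT8 assumed
                                  kv_cache_gb: float = 20.0) -> float:
    """Upper bound on sequential decode speed: every generated token
    requires streaming all weights plus the KV cache out of HBM."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param + kv_cache_gb * 1e9
    return bandwidth_tbs * 1e12 / bytes_per_token

# Same hypothetical 70B model on the two memory systems:
print(decode_ceiling_tokens_per_sec(1.6, 70))  # HiBL 1.0-class: ~17.8 tokens/s ceiling
print(decode_ceiling_tokens_per_sec(4.0, 70))  # HiZQ 2.0-class: ~44.4 tokens/s ceiling
```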
Naming logic
The suffixes are purposeful: PR = Prefill + Recommendation (compute‑first, cost‑focused) and DT = Decode + Training (bandwidth‑first, performance‑focused). This reflects a P/D (Prefill/Decode) separation that assigns dedicated silicon to each stage, avoiding resource contention and yielding better power efficiency, latency, and cost than a one‑size‑fits‑all part.
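In software terms, P/D separation looks like a scheduler that runs prefill on compute‑optimized workers and hands the KV cache to bandwidth‑optimized workers for decode. The sketch below is a hypothetical illustration of that flow: the worker and pool interfaces are invented stubs, and real disaggregated‑serving systems additionally manage KV‑cache transfer across the interconnect.

```python
from dataclasses import dataclass, field

class StubWorker:
    """Stand-in for an accelerator; returns fake tokens and caches."""
    def __init__(self, kind): self.kind = kind
    def prefill(self, prompt):          # PR-side: whole prompt at once
        return {"cached_prompt": prompt}, "<tok0>"
    def decode(self, kv_cache, last):   # DT-side: one token per call
        return f"<tok-after-{last}>"

class Pool:
    def __init__(self, kind, n): self.free = [StubWorker(kind) for _ in range(n)]
    def acquire(self): return self.free.pop()
    def release(self, w): self.free.append(w)

@dataclass
class Request:
    prompt: str
    kv_cache: dict | None = None
    tokens: list = field(default_factory=list)

class PDScheduler:
    def __init__(self, pr_pool: Pool, dt_pool: Pool):
        self.pr_pool, self.dt_pool = pr_pool, dt_pool

    def serve(self, req: Request, max_new_tokens: int) -> Request:
        # Stage 1: prefill on a compute-optimized PR worker,
        # then hand off the KV cache.
        w = self.pr_pool.acquire()
        req.kv_cache, first = w.prefill(req.prompt)
        req.tokens.append(first)
        self.pr_pool.release(w)
        # Stage 2: sequential decode on a bandwidth-optimized DT worker.
        w = self.dt_pool.acquire()
        while len(req.tokens) < max_new_tokens:
            req.tokens.append(w.decode(req.kv_cache, req.tokens[-1]))
        self.dt_pool.release(w)
        return req

sched = PDScheduler(Pool("PR", 2), Pool("DT", 2))
print(sched.serve(Request("hello"), 3).tokens)
```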
Broader implications
The design demonstrates architectural maturity: rather than merely increasing parameters or raw FLOPS, Huawei partitions the workload by scenario and load. The 950 family can handle both inference and training while keeping costs low. For users it means paying only for the performance they need; for the industry it signals a pragmatic, scenario‑driven AI‑chip roadmap.