
High Bandwidth Memory (HBM) Technology Overview and Its Integration in Modern Processors

High Bandwidth Memory (HBM), introduced in 2014 using TSV stacking, has evolved through HBM2, HBM2e, and HBM3 standards and is now integrated into CPUs, GPUs, and accelerators from AMD, NVIDIA, Intel, and others, with advanced interconnects like CoWoS, EMIB, and Foveros enabling high‑capacity, high‑bandwidth packaging.


This article originates from the 2023 New‑type Computing Center Research Report and references additional reports on Hai Guang CPU+DCU and Loongson CPU technologies.

HBM (High Bandwidth Memory) was jointly announced by AMD and SK Hynix in 2014, employing TSV technology to stack multiple DRAM dies, dramatically increasing capacity and data‑transfer rates.

Subsequent participation by Samsung, Micron, NVIDIA, Synopsys, and others led JEDEC to standardize HBM2 (JESD235A) and iterate to HBM2e (JESD235B) and, later, HBM3 (JESD238). The stacked package and 1024‑bit interface width give HBM far higher bandwidth and capacity than DDR, LPDDR, or GDDR memories.

The typical implementation connects HBM to the processor die via 2.5D packaging, and it appears in CPUs, GPUs, and other products. Early views treated HBM as an L4 cache; from a bandwidth perspective this is reasonable, and because its capacity far exceeds what SRAM or eDRAM can offer, it can serve both cache‑like and high‑performance memory roles.

AMD was an early HBM adopter; its Instinct MI250X accelerator integrates two compute dies and eight HBM2e stacks, for a total of 128 GB of capacity and 3276.8 GB/s of bandwidth.
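These totals follow directly from the interface width and pin rate. A minimal sketch in Python, assuming a 3.2 Gbps HBM2e pin rate (inferred from the stated total, not given in the article):

```python
# Hedged sketch: HBM bandwidth arithmetic behind the MI250X figures above.
# The 3.2 Gbps pin rate is an assumption inferred from the stated total.

BITS_PER_BYTE = 8
HBM_INTERFACE_WIDTH_BITS = 1024  # 8 channels x 128 bits per stack

def stack_bandwidth_gbs(pin_rate_gbps: float) -> float:
    """Peak bandwidth of one HBM stack in GB/s."""
    return HBM_INTERFACE_WIDTH_BITS * pin_rate_gbps / BITS_PER_BYTE

per_stack = stack_bandwidth_gbs(3.2)  # 409.6 GB/s
total = 8 * per_stack                 # 8 stacks -> 3276.8 GB/s
print(f"per stack: {per_stack:.1f} GB/s, total: {total:.1f} GB/s")
```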

NVIDIA’s professional GPUs have used HBM since the 2016 Tesla P100 (16 GB HBM2); the subsequent V100 (32 GB HBM2), A100 (up to 80 GB HBM2e, ~2 TB/s), and H100 (HBM3, ~3.9 TB/s) continue the trend.

Huawei’s Ascend 910 processor integrates four HBM stacks, and HBM is also well established in compute cards, SmartNICs, and high‑end FPGAs.

Fujitsu’s A64FX CPU, which powers the Fugaku supercomputer that topped the TOP500, integrates four HBM2 stacks (32 GB total) delivering 1 TB/s of bandwidth.

Intel’s Xeon Max series (released Jan 2023) pairs the fourth‑generation Xeon Scalable processor with 64 GB HBM2e, usable in “HBM Only” mode or combined with DDR5 in “HBM Flat” and “HBM Caching” modes.
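As a practical note on the Flat mode: on Linux, HBM configured this way typically appears as memory‑only NUMA nodes with no attached CPUs. A minimal sketch, assuming a standard Linux sysfs layout rather than any Intel‑specific tooling, that lists such candidate nodes:

```python
# Minimal sketch: find memory-only (CPU-less) NUMA nodes on Linux.
# In "HBM Flat" mode, Xeon Max HBM typically shows up this way
# (CXL-attached memory can look the same, so this is a heuristic).
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    mem_line = (node / "meminfo").read_text().splitlines()[0].strip()
    if not cpulist:  # no CPUs attached: candidate HBM node
        print(f"{node.name}: {mem_line}")
```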

Interposer Layer: CoWoS and EMIB

HBM must be assembled with processors via a silicon interposer because traditional organic substrates cannot handle the ultra‑high‑density contacts and high‑frequency signals. Two main silicon‑interposer technologies are TSMC’s CoWoS (chip‑on‑wafer‑on‑substrate) and Intel’s EMIB (Embedded Multi‑Die Interconnect Bridge).

CoWoS‑S uses a silicon interposer that fully underlies the processor and HBM dies, so its size is dictated by their combined area, typically limiting designs to four HBM stacks. The interposer is fabricated in a mature 65 nm process, which keeps cost moderate, but its maximum area is constrained by the lithography mask (reticle) size. Early HBM adoption was limited by this constraint until TSMC’s 2016 breakthrough allowed a 1.5× mask‑size interposer, making four‑HBM configurations mainstream.

TSMC announced a 2× mask‑size interposer in 2019, enabling six HBM stacks: NEC’s SX‑Aurora TSUBASA (2020) integrated six HBM2 stacks (48 GB), and NVIDIA’s A100 (2020) carried six HBM2e stacks (40 GB, with one stack disabled). Future chips may pack up to twelve HBM stacks, requiring roughly 3200 mm² of silicon, at which point wafer‑dicing efficiency becomes the next bottleneck.
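For scale, a standard lithography reticle field is about 26 mm × 33 mm, or roughly 858 mm²; that figure is a common industry number assumed here, not taken from the article. A short sketch relating the interposer generations above to that limit:

```python
# Interposer area as multiples of an assumed 26 mm x 33 mm reticle field.
RETICLE_MM2 = 26 * 33  # ~858 mm^2 (assumption)

for label, multiple in [("baseline CoWoS-S", 1.0),
                        ("2016, four HBM stacks", 1.5),
                        ("2019, six HBM stacks", 2.0)]:
    print(f"{label}: {multiple * RETICLE_MM2:.0f} mm^2")

# A twelve-stack design needing ~3200 mm^2 would span ~3.7x the reticle:
print(f"twelve stacks: {3200 / RETICLE_MM2:.1f}x reticle")
```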

Intel’s EMIB instead uses a much smaller silicon bridge embedded in the package substrate. In renderings of the fourth‑generation Xeon Scalable package, the small brown squares are EMIB bridges that stitch the four XCC dies into a single package; in Xeon Max, each die additionally uses EMIB to attach its HBM stacks. Because only the memory and processor PHYs cross the EMIB, with other signals remaining on the organic substrate, cost is reduced but assembly complexity rises.

Intel’s Data Center Max GPU series combines Foveros 3D stacking with EMIB 2.5D packaging: a 650 mm² base die on the Intel 7 process hosts the high‑speed I/O PHYs (HBM, Xe Link, PCIe 5.0) and cache, while eight compute tiles (TSMC N5) and four RAMBO cache tiles (Intel 7) are stacked on top. Each compute tile carries 4 MB of L1 cache; each RAMBO tile provides 15 MB of L3, and the base die adds 144 MB, yielding 204 MB of L2/L3 per group and 408 MB in total.

The Max GPU’s L2 cache bandwidth reaches 13 TB/s; split across the two chip groups, each group provides about 6.5 TB/s, far surpassing the L2/L3 bandwidth of current Xeon and EPYC processors.
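Both cache figures reduce to simple sums over the tile counts given above; a quick sketch that reproduces them:

```python
# Data Center Max GPU cache totals, per the figures in this section.
RAMBO_TILES, RAMBO_MB = 4, 15   # RAMBO cache tiles per group
BASE_DIE_MB = 144               # cache in each base die
GROUPS = 2

per_group_mb = RAMBO_TILES * RAMBO_MB + BASE_DIE_MB  # 204 MB
total_mb = GROUPS * per_group_mb                     # 408 MB
per_group_bw_tbs = 13.0 / GROUPS                     # 6.5 TB/s per group
print(per_group_mb, total_mb, per_group_bw_tbs)
```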

Integrating cache into the base die also improves thermal management, since the high‑power compute tiles sit on top, closest to the heatsink. In mesh processor architectures, L3 cache is distributed across many nodes; the base die can host SRAM units that connect directly to these mesh nodes, with 30‑50 µm bump pitches providing roughly 400 to 1,100 connections per square millimeter.
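The achievable connection density follows from the bump pitch: on a square grid it is simply 1/pitch². Evaluating the 30‑50 µm range quoted above:

```python
# Micro-bump density on a square grid: 1 / pitch^2.
for pitch_um in (30, 40, 50):
    per_mm2 = 1 / (pitch_um / 1000) ** 2  # pitch converted to mm
    print(f"{pitch_um} um pitch -> ~{per_mm2:.0f} bumps/mm^2")
# 30 um -> ~1111/mm^2; 50 um -> ~400/mm^2
```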

HBM serves as a GDDR alternative for GPUs and accelerators. Where GDDR uses many narrow, fast channels, each HBM stack provides eight wide 128‑bit channels running at roughly 2 Gbps per pin, delivering higher throughput with lower power and a smaller board footprint. HBM2 supports up to 2.4 Gbps per pin and typically 4 GB or 8 GB per stack.
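The wide‑and‑slow tradeoff is easiest to see side by side. A sketch comparing one HBM2 stack with a hypothetical 384‑bit GDDR6 board configuration (the GDDR6 figures are illustrative assumptions, not from the article):

```python
# "Wide and slow" (HBM) vs "narrow and fast" (GDDR).
# GDDR6 numbers below are illustrative assumptions for comparison.
def bandwidth_gbs(width_bits: int, pin_rate_gbps: float) -> float:
    return width_bits * pin_rate_gbps / 8

hbm2_stack = bandwidth_gbs(1024, 2.4)    # 307.2 GB/s per stack
gddr6_board = bandwidth_gbs(384, 16.0)   # 768.0 GB/s across the board
print(hbm2_stack, gddr6_board)
# Four HBM2 stacks (~1229 GB/s) exceed the GDDR6 board at far lower
# per-pin speed, hence lower I/O power and a smaller footprint.
```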

HBM2 also introduces pseudo‑channel mode, which splits each 128‑bit channel into two 64‑bit sub‑channels, and optional ECC (16 check bits per 128 data bits). Samsung’s latest HBM2 chips go further, embedding an AI processor capable of 1.2 TFLOPS and enabling in‑memory compute for tasks usually handled by CPUs, GPUs, ASICs, or FPGAs.
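A toy sketch of the pseudo‑channel split and the ECC overhead implied by those figures (a simplified model; the actual address mapping is controller‑specific):

```python
# Pseudo-channel mode: each 128-bit channel splits into two 64-bit
# sub-channels; optional ECC adds 16 check bits per 128 data bits.
CHANNELS, CHANNEL_BITS, SPLIT = 8, 128, 2

pseudo_channels = CHANNELS * SPLIT    # 16 pseudo-channels per stack
pseudo_width = CHANNEL_BITS // SPLIT  # 64 bits each
ecc_overhead = 16 / 128               # 12.5% extra bits
print(pseudo_channels, pseudo_width, f"{ecc_overhead:.1%}")
```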

Disclaimer: The article is reproduced with attribution to the original author. For copyright issues, please contact us.


Tags: CPU, GPU, HBM, High Bandwidth Memory, Memory Technology, Chiplet, Interposer

Written by Architects' Tech Alliance: sharing project experiences and insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices, and solutions.