Why Liquid‑Cooled Cold‑Plate Designs Are Critical for PCIe AI Accelerators

This whitepaper explains how liquid‑cooled cold‑plate technology addresses the high power density of PCIe‑based AI accelerator cards, outlines standardized design requirements, and provides detailed guidelines for thermal, mechanical, and reliability aspects to improve data‑center PUE and enable greener AI servers.

Architects' Tech Alliance

Aug 27, 2024

Background

AI servers consume large amounts of power, and traditional air cooling can no longer meet the thermal demands of modern PCIe‑based AI accelerator cards. Liquid‑cooled cold‑plate solutions have become the mainstream way to lower data‑center Power Usage Effectiveness (PUE) and improve reliability.
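PUE is the ratio of total facility power to the power delivered to IT equipment, so any reduction in cooling overhead moves it toward the ideal value of 1.0. The following is a minimal sketch of that arithmetic. The function name `pue` and all power figures are hypothetical and chosen only for illustration; they are not measurements from the whitepaper.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical figures: the same 1000 kW IT load with two different
# cooling overheads. These numbers are illustrative, not measured.
air_cooled = pue(total_facility_kw=1500.0, it_equipment_kw=1000.0)
liquid_cooled = pue(total_facility_kw=1150.0, it_equipment_kw=1000.0)

print(f"air-cooled PUE:    {air_cooled:.2f}")
print(f"liquid-cooled PUE: {liquid_cooled:.2f}")
```

With these assumed inputs, the difference between the two values is entirely the facility power spent on something other than IT load, which is where cooling savings show up.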

Current Landscape

Many vendors have released proprietary cold‑plate AI accelerator designs, but the industry lacks a common standard, which makes integration and scaling difficult.

AI Server Composition

An AI server typically consists of a general‑purpose compute subsystem, a heterogeneous acceleration subsystem (GPGPU, AI ASIC, FPGA), storage, interconnect, monitoring, power, structural, and thermal subsystems, and I/O devices. The acceleration subsystem requires high‑performance cooling.

Cold‑Plate AI Server Design

CPUs and AI accelerator cards should both use cold plates for heat removal, and other high‑power components (e.g., memory) should also adopt cold‑plate cooling to maximize liquid‑cooling coverage and lower PUE. Designs must minimize internal tubing and incorporate leak‑detection mechanisms for reliability.
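To make the leak‑detection point concrete, the sketch below checks a sensor against the ≤ 0.5 ml trigger volume listed in the design requirements of this article and estimates how long a leak would run before the sensor trips. The function names and the leak rate are hypothetical; a real design would take the trigger volume from the sensor datasheet.

```python
LEAK_TRIGGER_LIMIT_ML = 0.5  # limit stated in the design requirements


def sensor_meets_requirement(trigger_volume_ml: float) -> bool:
    """True if the sensor trips at or below the 0.5 ml limit."""
    return 0 < trigger_volume_ml <= LEAK_TRIGGER_LIMIT_ML


def seconds_to_detection(trigger_volume_ml: float,
                         leak_rate_ml_per_s: float) -> float:
    """Time for a constant leak to accumulate the trigger volume."""
    if leak_rate_ml_per_s <= 0:
        raise ValueError("leak rate must be positive")
    return trigger_volume_ml / leak_rate_ml_per_s


# Hypothetical sensor and leak rate, for illustration only.
trigger_ml = 0.5
leak_rate = 0.05  # ml/s, assumed slow drip
print("sensor compliant:", sensor_meets_requirement(trigger_ml))
print(f"time to detection: {seconds_to_detection(trigger_ml, leak_rate):.1f} s")
```

The estimate assumes a constant leak rate and that all leaked coolant reaches the sensor, which is why sensor placement matters as much as the trigger threshold.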

Cold‑Plate AI Accelerator Card Composition

A cold‑plate AI accelerator card consists of the accelerator board, a liquid‑cooled cold plate, and the card’s enclosure. The cold plate covers the main chip and all heat‑generating components (e.g., VR, memory) to improve heat‑transfer efficiency. The card provides a pair of fluid quick‑connect ports (two male connectors) for integration with the server’s cooling loop.

Design Requirements for AI Accelerator Cards

Design the cooling solution based on the AI chip’s size, thermal characteristics, and the host system’s internal layout to achieve high heat‑transfer efficiency while keeping flow resistance low.

Ensure the design meets the mechanical load and other structural requirements of the AI chip socket.

Plan pipe routing and inlet/outlet locations so that they do not interfere with other electronic components.

Use copper or aluminum alloy for the cold‑plate substrate and flow channels; avoid mixing metals with large electrochemical potential differences within the same cooling loop, to limit galvanic corrosion.

Select a coolant that is compatible with all materials it contacts in the secondary loop.

Leak‑detection sensors must trigger at a leakage volume ≤ 0.5 ml.

Cold‑plate weight must satisfy chip‑specified limits.

Design the installation and removal sequence to comply with chip handling procedures.

Ensure the mounting surface flatness and chip‑to‑cold‑plate pressure meet the required technical specifications.
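The trade‑off between heat‑transfer efficiency and flow resistance in the first requirement can be explored with the steady‑state heat balance Q = ṁ · c_p · ΔT, which gives the coolant flow needed to carry away a given heat load at a chosen temperature rise. The sketch below is a first‑order estimate only. The card power, the temperature rise, and the use of plain‑water properties (density about 997 kg/m³, specific heat about 4180 J/(kg·K)) are assumptions; the whitepaper does not specify them, and a real loop would use the properties of its actual coolant.

```python
def required_flow_l_per_min(heat_w: float,
                            delta_t_k: float,
                            density_kg_m3: float = 997.0,
                            cp_j_per_kg_k: float = 4180.0) -> float:
    """Coolant flow (L/min) needed to absorb heat_w with a rise of delta_t_k.

    Uses Q = m_dot * c_p * delta_T. Default properties approximate water
    near room temperature (an assumption, not a value from the source).
    """
    if heat_w < 0:
        raise ValueError("heat load must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("temperature rise must be positive")
    mass_flow_kg_s = heat_w / (cp_j_per_kg_k * delta_t_k)
    volume_flow_m3_s = mass_flow_kg_s / density_kg_m3
    return volume_flow_m3_s * 1000.0 * 60.0  # m^3/s -> L/min


# Hypothetical card: 400 W heat load, 5 K coolant temperature rise.
flow = required_flow_l_per_min(heat_w=400.0, delta_t_k=5.0)
print(f"required flow: {flow:.2f} L/min")
```

Under these assumptions the result is roughly 1.15 L/min. Allowing a larger temperature rise lowers the required flow, and with it the pressure drop, at the cost of warmer coolant at the outlet.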

Cold‑Plate Design Requirements for the Accelerator

Material should have high thermal conductivity and chemical compatibility with the coolant (e.g., copper); the cold plate must fully cover all heat sources.

Fixation should use four spring screws.

Pressure between the main chip and cold plate must satisfy thermal performance needs.

Contact surfaces must be smooth; flatness ≤ 0.05 mm, roughness Ra ≤ 1.6 µm.
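The numeric limits above lend themselves to a simple acceptance check. The sketch below validates flatness, roughness, and screw count against the values in this section and estimates contact pressure as total spring force divided by contact area. The class and function names, the per‑screw force, the contact area, and the allowed pressure window are all hypothetical; the source only says that pressure must satisfy thermal performance needs, so the real window would come from the chip vendor.

```python
from dataclasses import dataclass

MAX_FLATNESS_MM = 0.05       # from this section
MAX_ROUGHNESS_RA_UM = 1.6    # from this section
REQUIRED_SPRING_SCREWS = 4   # from this section


@dataclass
class ColdPlateMount:
    flatness_mm: float
    roughness_ra_um: float
    spring_screws: int
    force_per_screw_n: float   # hypothetical input
    contact_area_mm2: float    # hypothetical input

    def contact_pressure_mpa(self) -> float:
        # N/mm^2 is numerically equal to MPa.
        return self.spring_screws * self.force_per_screw_n / self.contact_area_mm2


def check_mount(m: ColdPlateMount,
                min_pressure_mpa: float,
                max_pressure_mpa: float) -> list:
    """Return a list of violated requirements (empty if all pass)."""
    problems = []
    if m.flatness_mm > MAX_FLATNESS_MM:
        problems.append("flatness exceeds 0.05 mm")
    if m.roughness_ra_um > MAX_ROUGHNESS_RA_UM:
        problems.append("roughness exceeds Ra 1.6 um")
    if m.spring_screws != REQUIRED_SPRING_SCREWS:
        problems.append("mount does not use four spring screws")
    if not (min_pressure_mpa <= m.contact_pressure_mpa() <= max_pressure_mpa):
        problems.append("contact pressure outside the vendor window")
    return problems


# Hypothetical mount and pressure window, for illustration only.
mount = ColdPlateMount(0.04, 1.2, 4, force_per_screw_n=50.0, contact_area_mm2=900.0)
print(f"contact pressure: {mount.contact_pressure_mpa():.3f} MPa")
print("violations:", check_mount(mount, min_pressure_mpa=0.15, max_pressure_mpa=0.35))
```

Returning a list of violations rather than a single boolean makes it easier to see which requirement a given design misses.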

Standard PCIe Interface Considerations

Accelerator cards should avoid perforating the PCIe slot shroud; the cold plate must fully cover the card for complete liquid‑cooling. Quick‑connect ports can be placed on the card’s side or rear, and the card dimensions must comply with PCIe CEM specifications (single‑ or double‑slot, full‑height, length ≤ 266.7 mm).
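The envelope constraints in this section can be expressed as a small check. The sketch below encodes only the limits stated here (single‑ or double‑slot, full‑height, length ≤ 266.7 mm, ports on the side or rear). The class and field names are hypothetical, and the check does not replace verification against the PCIe CEM specification itself.

```python
from dataclasses import dataclass

MAX_LENGTH_MM = 266.7                      # from this section
ALLOWED_SLOT_WIDTHS = (1, 2)               # single- or double-slot
ALLOWED_PORT_LOCATIONS = ("side", "rear")  # quick-connect placement


@dataclass
class CardEnvelope:
    length_mm: float
    slot_width: int
    full_height: bool
    port_location: str


def envelope_violations(card: CardEnvelope) -> list:
    """Return the constraints from this section that the card violates."""
    problems = []
    if card.length_mm > MAX_LENGTH_MM:
        problems.append("length exceeds 266.7 mm")
    if card.slot_width not in ALLOWED_SLOT_WIDTHS:
        problems.append("card must be single- or double-slot")
    if not card.full_height:
        problems.append("card must be full-height")
    if card.port_location not in ALLOWED_PORT_LOCATIONS:
        problems.append("quick-connect ports must be on the side or rear")
    return problems


# Hypothetical cards, for illustration only.
ok_card = CardEnvelope(length_mm=266.7, slot_width=2, full_height=True, port_location="rear")
bad_card = CardEnvelope(length_mm=280.0, slot_width=3, full_height=True, port_location="top")
print("ok_card:", envelope_violations(ok_card))
print("bad_card:", envelope_violations(bad_card))
```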

Thermal Performance Parameters

The whitepaper provides measured thermal performance data for various AI accelerator cards, demonstrating the effectiveness of cold‑plate liquid cooling in reducing hotspot temperatures and overall system fan power.

[Figure: Thermal performance chart of AI accelerator cards]

Conclusion

Standardized cold‑plate liquid‑cooling designs for PCIe‑based AI accelerator cards are essential to achieve efficient thermal management, lower data‑center PUE, and support the widespread adoption of high‑performance AI workloads. The guidelines presented aim to unify design practices across vendors, reduce integration effort, and promote greener data‑center operations.
