Open Acceleration Specification AI Server Design Guide (2023): Architecture, OAM Modules, UBB Board, and System Design
The 2023 Open Acceleration Specification AI Server Design Guide details the hardware architecture, OAM module and UBB board specifications, cooling, management, fault diagnosis, and software platform needed to build high‑performance, scalable AI compute clusters for large‑model training.
Large language models now exceed hundreds of billions of parameters and train on terabyte-scale datasets; GPT‑3 (175 billion parameters), for example, required roughly 3,640 PFLOP/s‑days of training compute, driving demand for AI servers built around kilowatt-class AI accelerators.
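The compute figure above can be sanity-checked with the widely used approximation C ≈ 6·N·D FLOPs for dense transformer training; a minimal sketch, assuming the publicly reported ~300‑billion‑token GPT‑3 training set:

```python
# Back-of-envelope check of the GPT-3 training-compute figure cited above,
# using the common approximation C ≈ 6 * N * D (FLOPs per parameter-token).
# The 300-billion-token dataset size is an assumption from public reports.
N = 175e9  # model parameters
D = 300e9  # training tokens (assumed)

flops = 6 * N * D                  # total training FLOPs, ~3.15e23
pf_s_days = flops / (1e15 * 86400) # convert FLOPs to PetaFLOP/s-days

print(f"{pf_s_days:.0f} PFLOP/s-days")  # ~3646, matching the ~3640 figure
```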
To support higher-power, higher-bandwidth accelerator cards, the OCP Open Accelerator Infrastructure (OAI) group defined an open accelerator card form factor in 2019 and released the OAI‑UBB 1.0 specification, enabling interoperable hardware across vendors.
For generative AI, system architecture moves beyond the single server to integrated clusters that combine compute, storage, networking, software, cooling, and power into a unified AI compute fabric, improving deployment efficiency, stability, and availability.
The OAM v1.5 specification defines the accelerator module for the GPU node, supporting 7P×8 fully-connected and 6P×8 hybrid cube mesh interconnect topologies, four PCIe x16 links to the CPUs through PCIe switches, and up to 32 DIMM slots, ensuring high-bandwidth CPU‑OAM communication.
The UBB board hosts up to eight OAM modules, measures 16.7 × 21 inches, fits 19‑ or 21‑inch racks, and offers configurable interconnect topologies while limiting external links to ×8, with port 1H reserved for expansion.
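The port arithmetic behind the 7P×8 fully-connected topology can be sketched directly: with eight OAM modules on the UBB and one direct link per module pair, each module consumes seven of its ports. A minimal illustration:

```python
from itertools import combinations

# Sketch: link and port counts for the 7P x 8 fully-connected topology.
# Eight OAM modules, one direct board-level link between every pair.
modules = range(8)
links = list(combinations(modules, 2))  # every unordered module pair

# Degree of each module = number of links it terminates = ports used.
degree = {m: sum(m in link for link in links) for m in modules}

print(len(links))            # 28 board-level links in total
print(set(degree.values()))  # {7}: every OAM uses 7 interconnect ports
```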
Hardware design must follow UBB and OAM electrical, timing, and layout specifications, ensuring proper placement and routing of components.
Interconnect loss budgets target 56 Gbps PAM4 signaling: total channel loss below 30 dB at the fundamental frequency, TX/RX-side loss under 8 dB, and cable loss under 5 dB for QSFP‑DD assemblies.
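Since channel segments add in dB, the budget above reduces to a simple sum-and-compare check. A minimal sketch using the cited limits; the per-segment example values are illustrative, not from the specification:

```python
# End-to-end loss budget check for a 56 Gbps PAM4 channel, using the
# limits cited above: TX/RX side < 8 dB each, cable < 5 dB, total < 30 dB.
LIMITS_DB = {"tx": 8.0, "rx": 8.0, "cable": 5.0, "total": 30.0}

def check_budget(tx_db, rx_db, cable_db, board_db):
    """Return the list of violated limits (empty list = budget passes)."""
    violations = []
    if tx_db > LIMITS_DB["tx"]:
        violations.append("tx")
    if rx_db > LIMITS_DB["rx"]:
        violations.append("rx")
    if cable_db > LIMITS_DB["cable"]:
        violations.append("cable")
    if tx_db + rx_db + cable_db + board_db > LIMITS_DB["total"]:
        violations.append("total")
    return violations

print(check_budget(6.5, 7.0, 4.2, 10.0))  # [] -> 27.7 dB, within budget
print(check_budget(6.5, 7.0, 4.2, 13.0))  # ['total'] -> 30.7 dB exceeds 30 dB
```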
Cooling design employs high‑efficiency fan walls, side‑flow, multi‑duct isolation for OAM and CPU zones, hot‑swap fans, N+1 redundancy, and thermal‑resistance calculations for UBB and OAM modules.
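The thermal-resistance calculation mentioned above follows the standard series-resistance model, where junction temperature is inlet temperature plus power times the summed resistances along the heat path. A minimal sketch with illustrative numbers (the resistance values are assumptions, not taken from the guide):

```python
# Series thermal-resistance model: T_j = T_inlet + P * (R_jc + R_ca),
# where R_jc is junction-to-case and R_ca is case-to-ambient resistance.
# All numeric values below are illustrative, not from the specification.
def junction_temp(t_inlet_c, power_w, r_jc_c_per_w, r_ca_c_per_w):
    """Estimate junction temperature in degrees C."""
    return t_inlet_c + power_w * (r_jc_c_per_w + r_ca_c_per_w)

# A 700 W OAM at 35 C inlet with 0.02 + 0.03 C/W of thermal resistance:
tj = junction_temp(35.0, 700.0, 0.02, 0.03)
print(tj)  # 70.0 C -> compare against the chip's maximum junction temperature
```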
System management provides asset information, register access, firmware updates, and out‑of‑band monitoring of voltage, power, temperature, ECC status, PCIe errors, and other health metrics.
Fault diagnosis includes detection of uncorrectable ECC errors, PCIe bus errors, ESL connection anomalies, and card loss, with BMC‑based monitoring and link‑level diagnostics.
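The detection logic described above amounts to classifying out-of-band telemetry against fault conditions. A hedged sketch of that pattern; the field names and thresholds are illustrative assumptions, not the BMC's actual interface:

```python
# Sketch of BMC-style fault classification over out-of-band telemetry.
# Field names ("uncorrectable_ecc", "pcie_fatal_errs", "card_present")
# are hypothetical placeholders, not the specification's register map.
def classify(telemetry):
    """Map a telemetry snapshot to the fault categories listed above."""
    faults = []
    if telemetry.get("uncorrectable_ecc", 0) > 0:
        faults.append("uncorrectable ECC error")
    if telemetry.get("pcie_fatal_errs", 0) > 0:
        faults.append("PCIe bus error")
    if not telemetry.get("card_present", True):
        faults.append("card loss")
    return faults

print(classify({"uncorrectable_ecc": 1, "card_present": True}))
# ['uncorrectable ECC error']
```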
The software platform creates a unified AI compute resource pool using containerization and scheduling to abstract hardware differences, applying adaptive policies and agile frameworks for precise resource allocation.
The guide aggregates detailed specifications for architecture, OAM modules, UBB boards, hardware, cooling, management, diagnostics, and software, and provides download links for the full documentation.
Architects' Tech Alliance
Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.