How Intel BF16 with IPEX and oneDNN Boosts PyTorch Performance
This article explains how Intel and Facebook's BF16 support, combined with the Intel Extension for PyTorch (IPEX) and oneDNN, automates type and layout conversions and adds graph‑fusion optimizations, delivering 1.4×‑4.3× inference and up to 2.4× training speedups on Xeon CPUs for models such as DLRM, BERT‑Large, and ResNext‑101‑32x4d.
Introduction
Intel and Facebook enabled BF16 as a first‑class data type in PyTorch. BF16 operations are accelerated by oneAPI Deep Neural Network Library (oneDNN, formerly MKL‑DNN) which provides multithreaded, vectorized CPU kernels. The third‑generation Intel® Xeon® Scalable processors (code‑named Cooper Lake) add AVX‑512 BF16 instructions that fuse BF16→FP32 and FP32→BF16 in a single FMA, theoretically doubling throughput compared with FP32 FMA.
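To make the data type concrete, the short snippet below is an illustration (not from the original article): casting a tensor to torch.bfloat16 keeps FP32's 8‑bit exponent, so dynamic range is preserved, but truncates the mantissa to 7 bits, halving memory per element at the cost of a small rounding error.
import torch

x = torch.tensor([3.1415926, 0.0001, 65504.0], dtype=torch.float32)
x_bf16 = x.to(torch.bfloat16)            # same exponent range as FP32, but a 7-bit mantissa
print(x_bf16.element_size())             # 2 bytes per element vs. 4 for FP32
print(x_bf16.to(torch.float32) - x)      # per-element rounding error from the shorter mantissa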
Background
Intel Extension for PyTorch (IPEX) is an open‑source PyTorch extension maintained by Intel and released as part of the Intel® AI Analytics Toolkit. IPEX extends PyTorch’s scheduling mechanism to automatically handle type and layout conversions, provides a graph‑optimization pass, and supplies custom kernels built on oneDNN.
Its main features are:
API‑driven performance gains on Intel hardware.
A graph‑optimization pass and custom kernels that maximize hardware efficiency.
Custom composite operations for key DL modules, such as the interaction layer in DLRM.
oneDNN is an open‑source, cross‑platform performance library optimized for Intel CPUs, GPUs, and Xe‑based accelerators. It supports FP32, BF16, and INT8 data types and offers fused kernels for common DL patterns (e.g., conv + relu, linear + gelu).
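As a rough sketch (assumed for illustration, not code from the article), the eager‑mode module below contains the conv + relu pattern that oneDNN's fused kernels target; after graph optimization, the two operations can map to a single convolution primitive with a ReLU post‑op, saving one pass over memory.
import torch
import torch.nn as nn

class ConvRelu(nn.Module):
    # A conv + relu subgraph of the kind oneDNN can execute as one fused kernel.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))   # two eager ops; one fused primitive after graph optimization

fused_candidate = torch.jit.script(ConvRelu().eval())
out = fused_candidate(torch.randn(1, 64, 56, 56))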
Easy‑to‑Use IPEX API
IPEX provides a three‑step user‑facing API that enables BF16 and oneDNN optimizations on CPU without code‑level type or layout changes:
Step 1: Import the intel_pytorch_extension Python module to register IPEX optimizations.
Step 2: Call ipex.enable_auto_mixed_precision(mixed_dtype=torch.bfloat16) to enable BF16 automatic mixed precision (layout conversion is enabled by default).
Step 3: Move the model and input tensors to ipex.DEVICE to run with IPEX optimizations.
Code Examples
import torch
import intel_pytorch_extension as ipex
from my_models import SomeModel
# Step 1: Register IPEX optimizations
# Step 2: Enable BF16 auto‑mixed‑precision
ipex.enable_auto_mixed_precision(mixed_dtype=torch.bfloat16)
# Step 3: Move the model (and later the inputs) to ipex.DEVICE
model = SomeModel().to(ipex.DEVICE).eval()
model = torch.jit.script(model)
# Inference example (input_tensor is assumed to be a prepared example input)
out = model(input_tensor.to(ipex.DEVICE))

# Training example
import torch
import intel_pytorch_extension as ipex
from my_models import SomeModel

ipex.enable_auto_mixed_precision(mixed_dtype=torch.bfloat16)
model = SomeModel().to(ipex.DEVICE)
optimizer = torch.optim.SGD(model.parameters(), ...)

# criterion and data_loader are assumed to be defined elsewhere by the user
for inputs, labels in data_loader:
    inputs, labels = inputs.to(ipex.DEVICE), labels.to(ipex.DEVICE)
    preds = model(inputs)
    loss = criterion(preds, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Experimental Results
Benchmarks evaluate BF16 training and inference on three representative DL models—DLRM (recommendation), BERT‑Large (NLP), and ResNext‑101‑32x4d (computer vision)—using IPEX and oneDNN on Intel Xeon Platinum 8380H CPUs.
Table 1 (single‑instance training) shows BF16 speedups of 1.55×‑2.42× over FP32 on a single socket (28 cores). DLRM trains with a 2K‑sample mini‑batch on the Criteo Terabyte dataset, BERT‑Large with a 24‑sample mini‑batch on WikiText‑2, and ResNext‑101‑32x4d with a 128‑sample mini‑batch on ILSVRC‑2012.
Table 2 (multi‑instance inference) reports BF16 inference speedups of 1.40×‑4.26× across 8 sockets (224 instances total). The same models and datasets are used, with DLRM running 64‑sample mini‑batches per instance.
ResNext‑101‑32x4d achieves the highest acceleration because it benefits from layout‑conversion optimizations and graph‑fusion patterns such as batch‑norm folding, conv + relu fusion, and conv + add + relu fusion.
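For context, the tail of a ResNeXt‑style residual block is exactly the conv + add + relu shape named above; the simplified sketch below (illustrative only, not the benchmarked model) shows the pattern that graph fusion collapses into a single kernel.
import torch
import torch.nn as nn

class ResidualTail(nn.Module):
    # Simplified residual tail: conv -> batch norm -> add skip connection -> relu.
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)   # foldable into the conv weights at inference time

    def forward(self, x):
        # After batch-norm folding, conv + add + relu can be emitted as one fused oneDNN kernel.
        return torch.relu(self.bn(self.conv(x)) + x)

block = ResidualTail().eval()
out = block(torch.randn(1, 256, 14, 14))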
Conclusion
The latest IPEX and oneDNN releases provide automatic BF16 type conversion, layout handling, and graph‑level fusion, delivering 1.40×‑4.26× inference and 1.55×‑2.42× training speedups on Intel Xeon CPUs. Both projects are open source and part of Intel’s AI Analytics Toolkit.
Configuration Details
Hardware: Intel® Xeon® Platinum 8380H, 8 sockets, 28 cores per socket, 1536 GB DDR4 memory.
Firmware: BIOS WLYDCRB1.SYS.0017.P06.2008230904 (microcode 0x700001e).
OS: Ubuntu 20.04.1 LTS, kernel 5.4.0‑48‑generic.
Compiler: GCC 7.5.0.
Frameworks and Libraries:
PyTorch v1.5.0‑rc3 – https://github.com/pytorch/pytorch.git
Intel Extension for PyTorch (IPEX) 1.1.0 preview – https://github.com/intel/intel-extension-for-pytorch/tree/1.1.0_preview
oneDNN v1.5‑rc – https://github.com/oneapi-src/oneDNN