How We Tripled CTR Model Training Speed in the Alibaba‑Intel DeepRec Challenge
The MetaSpore team detailed a three‑pronged optimization—sparse model tuning, training‑pipeline acceleration, and low‑level framework tweaks—that boosted DeepRec CTR model training efficiency by over three times without sacrificing AUC, securing first place in the global AI competition.
Alibaba Cloud and Intel co‑hosted the “Innovation Master Cup” Global AI Geek Challenge focusing on PAI‑DeepRec CTR model performance optimization. The MetaSpore team (members from Beijing Shuyuan Ling Technology) participated, achieving over three‑fold training speed improvement while keeping AUC stable, and ranked first in both the global preliminaries and finals.
Solution Overview
The optimization was divided into three main areas:
CTR sparse model training optimization
DeepRec training acceleration parameter tuning
DeepRec framework performance tuning
1. CTR Sparse Model Training Optimization
Replace GRUCell with Faster GRUBlockCellV2
DIEN originally uses tf.nn.rnn_cell.GRUCell, which is Python‑based and serial. We switched to tf.contrib.rnn.GRUBlockCellV2, a C++ implementation with forward and backward kernels, yielding noticeable speed gains.
Explore SRU as a GRU Alternative
After improving GRU, we investigated SRU (Simple Recurrent Unit), which reduces sequential dependencies between time steps. Replacing GRU with SRU maintained AUC while cutting training time by ~80 s (see the SRU paper: https://arxiv.org/pdf/1709.02755.pdf).
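To illustrate why SRU has fewer sequential dependencies than GRU, here is a minimal NumPy sketch of one SRU layer. It assumes the simplified form of the recurrence (no extra cell-state terms inside the gates) and equal input and hidden sizes so that the highway connection is valid. The function and variable names are illustrative and are not taken from the MetaSpore code, which runs on TensorFlow rather than NumPy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, W_f, b_f, W_r, b_r):
    """Simplified SRU layer over one sequence (illustrative sketch).

    x: (T, d) inputs; weight matrices are (d, d); biases are (d,).
    Returns the hidden states with shape (T, d).
    """
    # Every matrix product depends only on the inputs, so all time
    # steps are computed in one shot (the parallel part).
    x_tilde = x @ W
    f = sigmoid(x @ W_f + b_f)  # forget gate
    r = sigmoid(x @ W_r + b_r)  # highway (reset) gate

    # Only cheap elementwise operations remain sequential.
    c = np.zeros(x.shape[1])
    h = np.empty_like(x)
    for t in range(x.shape[0]):
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]
        h[t] = r[t] * np.tanh(c) + (1.0 - r[t]) * x[t]
    return h

# Hypothetical sizes: a sequence of 5 steps with 4 features.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
W, W_f, W_r = (rng.standard_normal((4, 4)) for _ in range(3))
b_f = np.zeros(4)
b_r = np.zeros(4)
h = sru_layer(x, W, W_f, b_f, W_r, b_r)
```

In a GRU, by contrast, the gate computations at step t multiply the previous hidden state by a weight matrix, so the matrix products themselves have to run one step at a time.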
Simplify SRU Further (Unsubmitted)
We prototyped a simplified SRU that kept AUC unchanged and reduced runtime by ~50 s, but did not submit due to limited theoretical analysis.
2. Sparse Feature Representation Optimization
Profiling DeepFM revealed heavy OneHot operator costs. Replacing indicator_column with embedding_column accelerated feature handling.
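The cost difference can be illustrated with a small NumPy sketch. It assumes the one-hot vector is multiplied by a dense weight matrix, which is mathematically the same as gathering rows from that matrix. This is only an illustration of the arithmetic, not TensorFlow's or DeepRec's internal implementation, and the sizes are hypothetical.

```python
import numpy as np

vocab_size, embed_dim = 1000, 8  # hypothetical sizes
rng = np.random.default_rng(0)
table = rng.standard_normal((vocab_size, embed_dim))
ids = np.array([3, 17, 999, 3])  # a batch of categorical feature ids

# Indicator-style path: materialize a dense (batch, vocab_size) one-hot
# matrix, then multiply it by the weight matrix.
one_hot = np.zeros((ids.size, vocab_size))
one_hot[np.arange(ids.size), ids] = 1.0
via_one_hot = one_hot @ table

# Embedding-style path: gather the needed rows directly.
via_lookup = table[ids]

print(one_hot.size)  # 4000 stored values, of which only 4 are non-zero
print(np.allclose(via_one_hot, via_lookup))  # True
```

The lookup touches only batch × embed_dim values, while the one-hot path allocates and multiplies batch × vocab_size values that are almost all zero.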
Resulting performance gain is shown below.
3. Training Acceleration Parameter Tuning
Enable AutoMicroBatch Pipeline
AutoMicroBatch aggregates gradients over multiple micro‑batches before updating variables, improving throughput.
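The gradient aggregation described above can be sketched in NumPy. This is a simplified illustration for a linear model with a mean-squared-error loss, not DeepRec's actual graph-level implementation. The helper names are hypothetical, and micro_batch_num is borrowed from the parameter name used in this article.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def accumulated_grad(w, X, y, micro_batch_num):
    """Average gradients over equal-sized micro-batches.

    The weights are not updated between micro-batches; a single
    update is applied by the caller afterwards.
    """
    total = np.zeros_like(w)
    for X_mb, y_mb in zip(np.split(X, micro_batch_num),
                          np.split(y, micro_batch_num)):
        total += mse_grad(w, X_mb, y_mb)
    return total / micro_batch_num

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
w = np.zeros(3)

full = mse_grad(w, X, y)                              # one large batch
accum = accumulated_grad(w, X, y, micro_batch_num=2)  # two micro-batches
w_new = w - 0.1 * accum                               # single variable update
```

For equal-sized micro-batches and a mean-reduced loss, the accumulated gradient matches the full-batch gradient, so the update itself is unchanged while each forward and backward pass works on a smaller slice of data.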
Wide & Deep could not use the default micro‑batch due to a conflict with tf.feature_column.linear_model. We rewrote the linear model interface to resolve the crash.
Experiments showed that setting micro_batch_num=2 for most models yields a speedup of ~900 s while preserving AUC. DIEN required a custom setting (micro_batch_num=2 for DIEN, default 8 for others) to avoid AUC loss.
4. Framework Performance Tuning
Optimize Compilation Options
Building with the Intel-optimized MKL thread pool and other flags improves TensorFlow performance.
bazel build -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true
This change alone saved ~130 s.
bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package
Other Low-Level Optimizations
We explored using Microsoft’s mimalloc allocator (≈4 % time reduction) and selective MKL operator usage, though these were not fully integrated due to time constraints.
Summary
By addressing sparse model computation, feature representation, the training pipeline, and compilation settings, we made training more than three times faster (cutting training time by ≈70 % overall) without compromising model accuracy, securing the top rank in the competition.
All experiments were conducted on a local 8‑core, 16 GB machine, so results may differ from production environments.
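The speedup factor and the percentage reduction in training time quoted in this summary are two views of the same measurement. A minimal sketch of the conversion, with hypothetical helper names:

```python
def time_reduction(speedup):
    """Fraction of training time saved for a given speedup factor."""
    return 1.0 - 1.0 / speedup

def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional reduction in training time."""
    return 1.0 / (1.0 - reduction)

print(round(time_reduction(3.0), 3))           # 0.667
print(round(speedup_from_reduction(0.70), 2))  # 3.33
```

A speedup of exactly three corresponds to a reduction of about 67 %, so a reduction of roughly 70 % is consistent with a speedup slightly above threefold.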
GitHub repository: https://github.com/meta-soul/DeepRec/tree/tianchi
DeepRec open‑source project: https://github.com/alibaba/DeepRec