How We Tripled CTR Model Training Speed in the Alibaba‑Intel DeepRec Challenge

The MetaSpore team detailed a three‑pronged optimization—sparse model tuning, training‑pipeline acceleration, and low‑level framework tweaks—that boosted DeepRec CTR model training efficiency by over three times without sacrificing AUC, securing first place in the global AI competition.

Alibaba Cloud Big Data AI Platform

Alibaba Cloud and Intel co‑hosted the “Innovation Master Cup” Global AI Geek Challenge, which focused on performance optimization of PAI‑DeepRec CTR models. The MetaSpore team, with members from Beijing Shuyuan Ling Technology, made training more than three times faster while keeping AUC stable, and ranked first in both the global preliminaries and the finals.

Solution Overview

The optimization covered three main areas (the sections below split the first into a model part and a feature‑representation part):

CTR sparse model training optimization

DeepRec training acceleration parameter tuning

DeepRec framework performance tuning

1. CTR Sparse Model Training Optimization

Replace GRUCell with Faster GRUBlockCellV2

DIEN originally uses tf.nn.rnn_cell.GRUCell, which is assembled in Python from many small ops that execute serially. We switched to tf.contrib.rnn.GRUBlockCellV2, a C++ implementation with dedicated forward and backward kernels, which gave a noticeable speedup.
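The swap itself is a one-line change where the cell is constructed. A minimal sketch, assuming a TensorFlow 1.x build in which tf.contrib is available; make_gru_cell and its use_block_cell flag are hypothetical names, and the tf module is passed in as an argument so the snippet has no hard import:

```python
def make_gru_cell(tf, num_units, use_block_cell=True):
    """Return a GRU cell for the sequence layer (hypothetical helper).

    `tf` is the imported TensorFlow 1.x module. Both constructors take the
    number of hidden units as their first argument.
    """
    if use_block_cell:
        # C++ block kernel: one op per time step instead of many small ops.
        return tf.contrib.rnn.GRUBlockCellV2(num_units)
    # Reference cell composed from individual TensorFlow ops in Python.
    return tf.nn.rnn_cell.GRUCell(num_units)
```

Keeping the choice behind a flag makes it easy to compare AUC and step time between the two cells on the same data.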

Explore SRU as a GRU Alternative

After improving GRU, we investigated SRU (Simple Recurrent Unit), which reduces sequential dependencies between time steps. Replacing GRU with SRU maintained AUC while cutting training time by ~80 s (see the SRU paper: https://arxiv.org/pdf/1709.02755.pdf).
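To see why SRU has fewer sequential dependencies, here is a pure-Python sketch of a basic SRU layer in which both gates are computed from the input only. It is an illustration under that assumption, not the team's implementation; the function names are hypothetical and the input and hidden sizes are taken to be equal so the highway term applies directly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sru_forward(xs, W, Wf, bf, Wr, br):
    """Run a basic SRU layer over a sequence xs of equal-length vectors."""
    d = len(xs[0])
    # Step 1: every matrix product depends only on the inputs, so all time
    # steps can be computed up front (one batched matmul in a real kernel).
    x_tilde = [matvec(W, x) for x in xs]
    f = [[sigmoid(a + b) for a, b in zip(matvec(Wf, x), bf)] for x in xs]
    r = [[sigmoid(a + b) for a, b in zip(matvec(Wr, x), br)] for x in xs]
    # Step 2: the only sequential part is a cheap element-wise recurrence.
    c = [0.0] * d
    hs = []
    for t, x in enumerate(xs):
        c = [f[t][i] * c[i] + (1.0 - f[t][i]) * x_tilde[t][i] for i in range(d)]
        h = [r[t][i] * math.tanh(c[i]) + (1.0 - r[t][i]) * x[i] for i in range(d)]
        hs.append(h)
    return hs, c
```

In a GRU, by contrast, the gate matrix products take the previous hidden state as input, so they cannot be hoisted out of the time loop.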

Simplify SRU Further (Unsubmitted)

We prototyped a simplified SRU that kept AUC unchanged and reduced runtime by ~50 s, but did not submit due to limited theoretical analysis.

2. Sparse Feature Representation Optimization

Profiling DeepFM revealed heavy OneHot operator costs. Replacing indicator_column with embedding_column accelerated feature handling.
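The cost difference can be illustrated without TensorFlow. Feeding a one-hot vector into a dense layer touches every row of the weight matrix, while an embedding lookup reads a single row and produces the same values. The sketch below uses hypothetical helper names and a toy weight table:

```python
def one_hot(index, vocab_size):
    """Dense one-hot encoding: vocab_size floats for a single category id."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def dense_on_one_hot(index, weights):
    """One-hot followed by a dense layer: O(vocab_size * dim) multiply-adds."""
    vec = one_hot(index, len(weights))
    dim = len(weights[0])
    return [sum(vec[i] * weights[i][j] for i in range(len(weights)))
            for j in range(dim)]

def embedding_lookup(index, weights):
    """Embedding lookup: copy one row, O(dim) work."""
    return list(weights[index])

table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(dense_on_one_hot(1, table) == embedding_lookup(1, table))  # True
```

With large vocabularies the one-hot path spends most of its time multiplying by zeros, which is consistent with the OneHot cost seen in the profile.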

[Figure: performance gain from replacing indicator_column with embedding_column.]

3. Training Acceleration Parameter Tuning

Enable AutoMicroBatch Pipeline

AutoMicroBatch aggregates gradients over multiple micro‑batches before updating variables, improving throughput.
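The idea of aggregating gradients before a single update can be sketched on a toy one-parameter regression. This shows generic gradient accumulation, not DeepRec's internal pipeline; the function names are hypothetical, and the batch size is assumed to divide evenly by micro_batch_num.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y ~ w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def step_full_batch(w, batch, lr):
    """One SGD update computed from the whole batch at once."""
    return w - lr * grad_mse(w, batch)

def step_micro_batched(w, batch, lr, micro_batch_num):
    """One SGD update from gradients accumulated over micro-batches."""
    size = len(batch) // micro_batch_num  # assumes even division
    acc = 0.0
    for k in range(micro_batch_num):
        micro = batch[k * size:(k + 1) * size]
        acc += grad_mse(w, micro)  # accumulate, do not update yet
    # A single variable update using the averaged gradient.
    return w - lr * acc / micro_batch_num
```

With equally sized micro-batches the averaged gradient matches the full-batch gradient, so the update is unchanged while each forward and backward pass works on a smaller slice of data.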

Wide & Deep could not use the default micro‑batch due to a conflict with tf.feature_column.linear_model. We rewrote the linear model interface to resolve the crash.

Experiments showed that enabling micro‑batching yields a ~900 s speedup while preserving AUC. DIEN required a custom setting to avoid AUC loss: micro_batch_num=2 for DIEN, versus the default of 8 for the other models.

4. Framework Performance Tuning

Optimize Compilation Options

Building TensorFlow with the Intel‑optimized MKL thread pool and related flags improves performance.

bazel build -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true

This change alone saved ~130 s.

bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Other Low‑Level Optimizations

We explored using Microsoft’s mimalloc allocator (≈4 % time reduction) and selective MKL operator usage, though these were not fully integrated due to time constraints.

Summary

By addressing sparse model computation, feature representation, the training pipeline, and compilation settings, we made training more than three times faster (roughly a 70% reduction in total training time) without compromising model accuracy, securing the top rank in the competition.
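The link between a speedup factor and the share of time saved is simple arithmetic. As a small sketch with illustrative function names:

```python
def time_reduction(speedup):
    """Fraction of training time saved for a given speedup factor."""
    return 1.0 - 1.0 / speedup

def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional reduction in training time."""
    return 1.0 / (1.0 - reduction)

print(f"{time_reduction(3.0):.1%}")            # 66.7%
print(f"{speedup_from_reduction(0.70):.2f}x")  # 3.33x
```

A 3x speedup therefore corresponds to about a 67% reduction, and a 70% reduction to a speedup slightly above 3x.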

All experiments were conducted on a local 8‑core, 16 GB machine, so results may differ from production environments.

GitHub repository: https://github.com/meta-soul/DeepRec/tree/tianchi

DeepRec open‑source project: https://github.com/alibaba/DeepRec
