How We Won the DeepRec CTR Contest: 36% Faster Training with Operator Tweaks

The NicePerf team, after clinching the top spot in the Tianchi DeepRec CTR model performance competition, shares a detailed walkthrough of their CPU‑only training optimizations—including operator selection, custom C++ kernels, and workflow tweaks—that cut overall training time by over a third.

Alibaba Cloud Big Data AI Platform

The NicePerf team (Li Yang and Guo Lin) won the Tianchi DeepRec CTR model performance optimization contest (rank 1/3802) and present a comprehensive recap of their experience.

Background

The competition required speeding up single‑machine CPU training for six classic models (WDL, DeepFM, DLRM, DIN, DIEN, MMoE) built on the DeepRec framework (TensorFlow 1.15). Distributed training and I/O optimizations were unavailable, so the team focused on model‑level and framework‑level tweaks.

Optimization Overview

After profiling per‑step latency, DeepFM emerged as the biggest bottleneck, followed by DIEN. The team applied a series of low‑hanging‑fruit optimizations and deeper operator‑level improvements.

(1) IndicatorColumn Operator Selection (DeepFM)

Profiling showed that the OneHot operator, which originates from IndicatorColumn, dominated DeepFM's runtime. The team created a subclass, IndicatorColumnV2, that replaces the sparse‑tensor‑to‑dense + one_hot + reduce_sum pipeline with a single tf.scatter_nd call. Step time dropped from 500 ms to 75 ms.

tf.scatter_nd(indices=multi_hot_nonzero_indices, updates=all_ones_vector, shape=original_sparse_tensor_dense_shape)
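The equivalence of the two pipelines can be checked on a toy example. The sketch below is plain Python rather than TensorFlow, and the function names and sample data are hypothetical. It assumes that adding 1 at each (row, id) index mirrors what tf.scatter_nd does with all-ones updates, including summing duplicate indices.

```python
def multi_hot_via_one_hot(rows, depth):
    """Slow path: build one one-hot vector per id, then sum them per row."""
    out = []
    for ids in rows:
        one_hots = [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]
        if one_hots:
            out.append([sum(col) for col in zip(*one_hots)])
        else:
            out.append([0.0] * depth)
    return out


def multi_hot_via_scatter(indices, dense_shape):
    """Fast path: start from zeros and add 1 at each (row, id) index."""
    n_rows, depth = dense_shape
    out = [[0.0] * depth for _ in range(n_rows)]
    for r, c in indices:
        out[r][c] += 1.0  # duplicate indices accumulate
    return out


# Toy batch: ids per example, vocabulary size 4 (made-up data).
rows = [[0, 3], [2], []]
indices = [(r, c) for r, ids in enumerate(rows) for c in ids]
dense = multi_hot_via_scatter(indices, (len(rows), 4))
```

The slow path materializes one vector of length depth per id before reducing, while the scatter path touches only the nonzero positions, which is a plausible reason for the gap reported above.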

Performance tables (see images) confirm the reduction, and the ConcatV2 operator moved from rank 9 to rank 1, indicating better parallelism.

(2) RNN Cell Fusion (DIEN)

The DIEN model contains GRU and VecAttGRU layers. The team wrote custom C++ forward and gradient kernels for both cells, replacing the original sub‑graph of multiple operators. This reduced DIEN’s total runtime by ~67.96 seconds.
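To make the idea of cell fusion concrete, here is a minimal plain-Python sketch of what a single fused GRU step computes. It is not the team's C++ kernel. The gate convention follows the AUGRU formulation in the DIEN paper (the attention score scales the update gate), and the parameter layout is an assumption made for illustration; the contest kernels and the model code may order or define the gates differently.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _affine(weights, bias, vec):
    """Return weights @ vec + bias, with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def fused_gru_step(x, h_prev, params, att_score=1.0):
    """Compute one GRU step in a single call instead of many small ops.

    params is a hypothetical tuple (W_r, b_r, W_u, b_u, W_c, b_c); each W maps
    the concatenation [x, h] to the hidden size. att_score scales the update
    gate (assumed AUGRU convention); 1.0 gives a plain GRU step.
    """
    W_r, b_r, W_u, b_u, W_c, b_c = params
    xh = x + h_prev                                   # concat [x, h_prev]
    r = [_sigmoid(v) for v in _affine(W_r, b_r, xh)]  # reset gate
    u = [att_score * _sigmoid(v) for v in _affine(W_u, b_u, xh)]  # update gate
    x_rh = x + [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(v) for v in _affine(W_c, b_c, x_rh)]  # candidate state
    return [(1.0 - ui) * hi + ui * ci for ui, hi, ci in zip(u, h_prev, c)]


# All-zero weights make the result easy to verify by hand.
zeros_w = [[0.0] * 4 for _ in range(2)]
zeros_b = [0.0, 0.0]
params = (zeros_w, zeros_b, zeros_w, zeros_b, zeros_w, zeros_b)
x, h_prev = [1.0, -1.0], [0.2, -0.4]
h_plain = fused_gru_step(x, h_prev, params)
h_frozen = fused_gru_step(x, h_prev, params, att_score=0.0)
```

In a graph framework, the concat, matrix multiply, bias add, activation, and element-wise steps above are typically separate operators, each with its own scheduling cost and intermediate tensor. A fused kernel performs them in one operator.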

(3) Attention Layer Optimization (DIN & DIEN)

Both models suffer from excessive padding in the attention layer because user histories vary in length. The team recombined operators to drop the padded positions, so the tensor fed to the MLP shrank from shape [B,T,4C] to [N,4C], where N counts only the non‑padded positions across the batch. This saved 141.38 seconds.
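The shape change can be illustrated with a small plain-Python sketch. The helper names, the stand-in MLP, and the sample data are hypothetical, and the step that writes scores back into a padded layout is an assumption, since the article does not describe how the results are mapped back.

```python
def pack_valid(padded, lengths):
    """Flatten [B, T, D] to [N, D], keeping only the first lengths[b] steps."""
    packed, positions = [], []
    for b, (seq, n) in enumerate(zip(padded, lengths)):
        for t in range(n):
            packed.append(seq[t])
            positions.append((b, t))
    return packed, positions


def unpack(values, positions, batch, max_len, fill=0.0):
    """Write per-position results back into a padded [B, T] layout."""
    out = [[fill] * max_len for _ in range(batch)]
    for v, (b, t) in zip(values, positions):
        out[b][t] = v
    return out


def toy_mlp(row):
    """Stand-in for the attention MLP: maps one feature row to a score."""
    return sum(row)


# B=2, T=3, D=2; the second history has only one real step (made-up data).
padded = [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
          [[5.0, 6.0], [0.0, 0.0], [0.0, 0.0]]]
lengths = [2, 1]

packed, positions = pack_valid(padded, lengths)
scores = [toy_mlp(row) for row in packed]       # N = 3 rows instead of B*T = 6
scores_padded = unpack(scores, positions, batch=2, max_len=3)
```

Here the MLP runs on 3 rows rather than 6. The more uneven the history lengths in a batch, the larger the share of work that is skipped.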

(4) Sequence Feature Parsing Fusion

Sequence features were parsed by sub‑graphs made up of many tiny operators (for example, computing the mean of a history_price string). The team implemented two C++ ops, SparseSequenceLength and StringSplitToNumberAndMean, to replace these sub‑graphs, cutting about 88.47 seconds.
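The article does not give the signatures of these two ops, so the following plain-Python sketch only illustrates the behavior their names suggest. The delimiter, the handling of empty strings, and the definition of sequence length as the largest column index plus one are all assumptions.

```python
def string_split_to_number_and_mean(values, sep=","):
    """Split each string on sep, parse the tokens as floats, return the mean.

    Empty strings yield 0.0 (an assumed convention).
    """
    means = []
    for s in values:
        nums = [float(tok) for tok in s.split(sep) if tok]
        means.append(sum(nums) / len(nums) if nums else 0.0)
    return means


def sparse_sequence_length(indices, batch_size):
    """Per-row sequence length from sparse (row, col) indices.

    Assumes the length of a row is its largest column index plus one.
    """
    lengths = [0] * batch_size
    for row, col in indices:
        lengths[row] = max(lengths[row], col + 1)
    return lengths


# Made-up inputs for illustration.
history_price = ["10.0,20.0,30.0", "5.5", ""]
price_means = string_split_to_number_and_mean(history_price)

sparse_indices = [(0, 0), (0, 1), (0, 2), (1, 0)]
seq_lengths = sparse_sequence_length(sparse_indices, batch_size=3)
```

Each function does in one pass what a graph would otherwise express as a chain of split, cast, reduce, and divide operators.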

(5) Workflow Scheduling Optimizations

Two workflow tweaks were added. An asynchronous checkpoint saver hook (AsynchronousCheckpointSaverHook) reduced checkpoint overhead by 31.7 seconds, and tuning the threading parameters (intra_op and inter_op thread counts, stage prefetch settings, etc.) saved another 33.4 seconds.
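The general idea of overlapping checkpoint writes with training can be sketched with the standard library alone. This is not how the hook named above is implemented, which the article does not describe. The class, the writer function, and the toy training loop are hypothetical.

```python
import copy
import threading
import time


class AsyncCheckpointer:
    """Writes snapshots on a background thread so the training loop keeps going."""

    def __init__(self, write_fn):
        self._write_fn = write_fn  # hypothetical slow writer, e.g. disk I/O
        self._thread = None

    def save(self, step, state):
        self.wait()                      # allow at most one write in flight
        snapshot = copy.deepcopy(state)  # freeze values before training mutates them
        self._thread = threading.Thread(target=self._write_fn, args=(step, snapshot))
        self._thread.start()

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None


saved = {}


def slow_write(step, snapshot):
    time.sleep(0.05)  # simulate slow storage
    saved[step] = snapshot


ckpt = AsyncCheckpointer(slow_write)
state = {"w": 0}
for step in range(1, 7):
    state["w"] += 1          # stand-in for one training step
    if step % 3 == 0:
        ckpt.save(step, state)
ckpt.wait()                  # flush the last write before exiting
```

The training loop only pays for the in-memory copy. The slow write happens while later steps run, and the final wait() makes sure the last checkpoint is complete.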

Overall Impact and Summary

All optimizations together reduced total training time from 2063.25 seconds to 1307.17 seconds, a 36.65% improvement.
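The headline figure can be reproduced from the numbers quoted in this article. The short calculation below also sums the savings that were itemized in seconds. The dictionary keys are descriptive labels chosen here, not names from the contest.

```python
baseline_s = 2063.25
optimized_s = 1307.17

saved_s = baseline_s - optimized_s
improvement_pct = 100.0 * saved_s / baseline_s
speedup = baseline_s / optimized_s

# Savings that the article reports in seconds.
itemized_s = {
    "rnn_cell_fusion": 67.96,
    "attention_padding_removal": 141.38,
    "sequence_parsing_fusion": 88.47,
    "async_checkpoint": 31.7,
    "thread_parameter_tuning": 33.4,
}
itemized_total_s = sum(itemized_s.values())
not_itemized_s = saved_s - itemized_total_s

print(f"saved {saved_s:.2f} s, {improvement_pct:.2f}% less time, {speedup:.2f}x speedup")
print(f"itemized {itemized_total_s:.2f} s, not itemized {not_itemized_s:.2f} s")
```

The itemized savings add up to 362.91 seconds of the 756.08 seconds saved. The remaining 393.17 seconds are not broken down in seconds here; the IndicatorColumn change, for instance, is reported only as a per-step time.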

The main techniques were operator selection and operator fusion, often implemented via custom macro ops. Additional gains came from asynchronous checkpointing and from tuning the threading parameters. The team thanks the organizers and looks forward to future DeepRec enhancements.

DeepRec open‑source repository: https://github.com/alibaba/DeepRec
