Online Deep Learning (ODL) Model Optimization for Real‑Time Recommendation
The team enhanced real‑time recommendation by redesigning TensorFlow graphs—using constant‑folding, a custom CallGraphOP cache, a simplified dense layer, and CUDA‑Graph compatibility—boosting single‑machine throughput by ~40%, raising GPU utilization from 30% to 43%, cutting latency, and saving roughly 30% of hardware resources.
Online Deep Learning (ODL) enables real‑time updates of deep models, which is crucial for content recommendation where user behavior changes rapidly. Traditional daily batch training lags behind business needs, while the existing serving stack suffered GPU under‑utilization (peaking at only ~30%) and service instability.
To address this, the team applied several optimizations:
ConstantFolding Optimization – TensorFlow’s built‑in constant‑folding pass merges operations whose inputs are all constants (e.g., replacing C = A + B with C = 5 when A and B are constants). In recommendation models this mainly affects BatchNormalization nodes. However, ODL’s weight nodes (WeightsOp) are updated online and so cannot be treated as constants; they block folding, leaving unnecessary matrix operations and a launch‑bound bottleneck.
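To make the idea concrete, here is a minimal, self‑contained sketch of constant folding on a toy expression graph. The `Node` type and `fold_constants` helper are illustrative inventions, not TensorFlow’s actual Grappler implementation; the point is only that an op whose inputs are all constants can be collapsed, while a variable input (like ODL’s WeightsOp) blocks the fold.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Node:
    op: str                          # "Const", "Add", "Variable", ...
    value: Optional[float] = None    # set only for Const nodes
    inputs: List["Node"] = field(default_factory=list)

def fold_constants(node: Node) -> Node:
    """Recursively replace ops whose inputs are all Const with a single Const."""
    node.inputs = [fold_constants(i) for i in node.inputs]
    if node.op == "Add" and node.inputs and all(i.op == "Const" for i in node.inputs):
        return Node("Const", value=sum(i.value for i in node.inputs))
    return node

# C = A + B with A, B constant folds into one Const(5.0) node ...
folded = fold_constants(Node("Add", inputs=[Node("Const", value=2.0),
                                            Node("Const", value=3.0)]))
# ... but an online-updated weight (modeled here as "Variable") blocks the fold.
blocked = fold_constants(Node("Add", inputs=[Node("Variable"),
                                             Node("Const", value=3.0)]))
```

The second case mirrors the ODL situation: because WeightsOp values change at serving time, the pass must conservatively leave the surrounding ops in the graph.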
CallGraphOP Optimization – By extracting the non‑foldable nodes into a separate sub‑graph (ConstantFoldable), wrapping it with a custom CallGraphOP operator, and rerouting tensors, the model can reuse cached results and only invoke the sub‑graph when needed.
The following C++ snippet shows the cache‑decision logic used in CallGraphOP:
```cpp
// Determine whether to use the local cache or invoke the sub‑graph
bool useCached() {
    // ...
    int64 curTime = TimeUtility::currentTimeInSeconds();
    // _countInterval and _timeInterval are configurable
    if (_currCount++ % _countInterval == 0 ||
        curTime - _lastTime >= _timeInterval) {
        _lastTime = curTime;
        return false;
    }
    return true;
}
```
Fully‑Connected Network Simplification – The original TensorFlow 1.x implementation of keras.layers.Dense creates a complex graph with tensordot and Gather ops that are incompatible with CUDA Graph. The team replaced it with a streamlined Reshape‑MatMul‑Reshape pattern, eliminating unsupported ops and allowing CUDA Graph to capture the entire computation.
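The Reshape‑MatMul‑Reshape pattern can be sketched in plain numpy (TensorFlow and the exact function names here are my illustration, not the team’s code): flatten the leading dimensions, run a single 2‑D matmul, then restore the original shape. The result is numerically equivalent to the tensordot that `keras.layers.Dense` would build for a 3‑D input, but uses only ops that CUDA Graph can capture.

```python
import numpy as np

def dense_via_reshape_matmul(x, kernel, bias):
    """Apply a dense layer to a [batch, seq, in_dim] tensor via
    Reshape -> MatMul -> Reshape, mirroring the simplified graph."""
    batch, seq, in_dim = x.shape
    out_dim = kernel.shape[1]
    flat = x.reshape(batch * seq, in_dim)      # Reshape to 2-D
    out = flat @ kernel + bias                 # MatMul + BiasAdd
    return out.reshape(batch, seq, out_dim)    # Reshape back to 3-D

rng = np.random.default_rng(0)
x = rng.random((2, 4, 8)).astype(np.float32)
k = rng.random((8, 16)).astype(np.float32)
b = np.zeros(16, dtype=np.float32)

y = dense_via_reshape_matmul(x, k, b)
# Same math as the tensordot-based Dense graph:
ref = np.tensordot(x, k, axes=[[2], [0]]) + b
```

Because reshape is a metadata‑only operation, the simplification changes the op layout of the graph without changing the computed values.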
CUDA Graph Compatibility – After enabling CUDA Graph, crashes were traced to CallGraphOP, which is not supported. Adding CallGraphOP to the CUDA Graph blacklist excluded it from capture, preserving stability.
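A blacklist of this kind can be sketched as a simple set‑membership partition (the names `CUDA_GRAPH_BLACKLIST` and `partition_for_capture` are hypothetical, not the team’s API): blacklisted ops run outside the captured graph, everything else is replayed inside it.

```python
# Hypothetical sketch: ops on the blacklist are executed eagerly,
# outside CUDA Graph capture; all other ops are captured and replayed.
CUDA_GRAPH_BLACKLIST = {"CallGraphOP"}

def partition_for_capture(op_names):
    """Split a graph's ops into (captured, excluded) lists."""
    captured = [op for op in op_names if op not in CUDA_GRAPH_BLACKLIST]
    excluded = [op for op in op_names if op in CUDA_GRAPH_BLACKLIST]
    return captured, excluded

captured, excluded = partition_for_capture(
    ["Reshape", "MatMul", "CallGraphOP", "Reshape"])
```

Excluding only the unsupported operator keeps the rest of the dense computation inside the capture, so the stability fix costs almost none of the CUDA Graph benefit.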
Results – The combined optimizations increased single‑machine throughput by ~40%, raised peak GPU utilization from 30% to 43%, reduced model latency (RT and P99), and saved ~30% of hardware resources while supporting rapid ODL model iteration.
DaTaobao Tech
Official account of DaTaobao Technology