Online Deep Learning (ODL) Model Optimization for Real‑Time Recommendation
The team enhanced real‑time recommendation by redesigning TensorFlow graphs—using constant‑folding, a custom CallGraphOP cache, a simplified dense layer, and CUDA‑Graph compatibility—boosting single‑machine throughput by ~40%, raising GPU utilization from 30% to 43%, cutting latency, and saving roughly 30% of hardware resources.
Online Deep Learning (ODL) enables real‑time updates of deep models, which is crucial for content recommendation where user behavior changes rapidly. Traditional daily batch training lags behind business needs, while the existing serving stack suffered GPU under‑utilization (peaking at only ~30%) and service instability.
To address this, the team applied several optimizations:
ConstantFolding Optimization – TensorFlow’s built‑in constant‑folding pass merges operations whose inputs are all constants (e.g., replacing C = A + B with C = 5 when A and B are constants). In recommendation models this mainly affects BatchNormalization nodes. However, ODL’s weight nodes (WeightsOp) are updated online and so cannot be treated as constants; they block folding, leaving unnecessary matrix operations and a launch‑bound bottleneck.
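To make the idea concrete, here is a minimal, self‑contained sketch of constant folding on a toy expression graph. The `Node` type and `fold_constants` helper are illustrative inventions, not TensorFlow’s actual Grappler implementation; the point is only that an op whose inputs are all constants can be collapsed, while a variable input (like ODL’s WeightsOp) blocks the fold.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Node:
    op: str                          # "Const", "Add", "Variable", ...
    value: Optional[float] = None    # set only for Const nodes
    inputs: List["Node"] = field(default_factory=list)

def fold_constants(node: Node) -> Node:
    """Recursively replace ops whose inputs are all Const with a single Const."""
    node.inputs = [fold_constants(i) for i in node.inputs]
    if node.op == "Add" and node.inputs and all(i.op == "Const" for i in node.inputs):
        return Node("Const", value=sum(i.value for i in node.inputs))
    return node

# C = A + B with A, B constant folds into one Const(5.0) node ...
folded = fold_constants(Node("Add", inputs=[Node("Const", value=2.0),
                                            Node("Const", value=3.0)]))
# ... but an online-updated weight (modeled here as "Variable") blocks the fold.
blocked = fold_constants(Node("Add", inputs=[Node("Variable"),
                                             Node("Const", value=3.0)]))
```

The second case mirrors the ODL situation: because WeightsOp values change at serving time, the pass must conservatively leave the surrounding ops in the graph.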
CallGraphOP Optimization – By extracting the non‑foldable nodes into a separate sub‑graph (ConstantFoldable), wrapping it with a custom CallGraphOP operator, and rerouting tensors, the model can reuse cached results and only invoke the sub‑graph when needed.
The following C++ snippet shows the cache‑decision logic used in CallGraphOP:
```cpp
// Determine whether to use the local cache or invoke the sub‑graph
bool useCached() {
    // ...
    int64 curTime = TimeUtility::currentTimeInSeconds();
    // _countInterval and _timeInterval are configurable
    if (_currCount++ % _countInterval == 0 ||
        curTime - _lastTime >= _timeInterval) {
        _lastTime = curTime;
        return false;
    }
    return true;
}
```
Fully‑Connected Network Simplification – The original TensorFlow 1.x implementation of keras.layers.Dense creates a complex graph with tensordot and Gather ops that are incompatible with CUDA Graph. The team replaced it with a streamlined Reshape‑MatMul‑Reshape pattern, eliminating unsupported ops and allowing CUDA Graph to capture the entire computation.
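The Reshape‑MatMul‑Reshape pattern can be sketched in plain numpy (TensorFlow and the exact function names here are my illustration, not the team’s code): flatten the leading dimensions, run a single 2‑D matmul, then restore the original shape. The result is numerically equivalent to the tensordot that `keras.layers.Dense` would build for a 3‑D input, but uses only ops that CUDA Graph can capture.

```python
import numpy as np

def dense_via_reshape_matmul(x, kernel, bias):
    """Apply a dense layer to a [batch, seq, in_dim] tensor via
    Reshape -> MatMul -> Reshape, mirroring the simplified graph."""
    batch, seq, in_dim = x.shape
    out_dim = kernel.shape[1]
    flat = x.reshape(batch * seq, in_dim)      # Reshape to 2-D
    out = flat @ kernel + bias                 # MatMul + BiasAdd
    return out.reshape(batch, seq, out_dim)    # Reshape back to 3-D

rng = np.random.default_rng(0)
x = rng.random((2, 4, 8)).astype(np.float32)
k = rng.random((8, 16)).astype(np.float32)
b = np.zeros(16, dtype=np.float32)

y = dense_via_reshape_matmul(x, k, b)
# Same math as the tensordot-based Dense graph:
ref = np.tensordot(x, k, axes=[[2], [0]]) + b
```

Because reshape is a metadata‑only operation, the simplification changes the op layout of the graph without changing the computed values.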
CUDA Graph Compatibility – After enabling CUDA Graph, crashes were traced to CallGraphOP, which is not supported. Adding CallGraphOP to the CUDA Graph blacklist excluded it from capture, preserving stability.
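A blacklist of this kind can be sketched as a simple set‑membership partition (the names `CUDA_GRAPH_BLACKLIST` and `partition_for_capture` are hypothetical, not the team’s API): blacklisted ops run outside the captured graph, everything else is replayed inside it.

```python
# Hypothetical sketch: ops on the blacklist are executed eagerly,
# outside CUDA Graph capture; all other ops are captured and replayed.
CUDA_GRAPH_BLACKLIST = {"CallGraphOP"}

def partition_for_capture(op_names):
    """Split a graph's ops into (captured, excluded) lists."""
    captured = [op for op in op_names if op not in CUDA_GRAPH_BLACKLIST]
    excluded = [op for op in op_names if op in CUDA_GRAPH_BLACKLIST]
    return captured, excluded

captured, excluded = partition_for_capture(
    ["Reshape", "MatMul", "CallGraphOP", "Reshape"])
```

Excluding only the unsupported operator keeps the rest of the dense computation inside the capture, so the stability fix costs almost none of the CUDA Graph benefit.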
Results – The combined optimizations increased single‑machine throughput by ~40%, raised peak GPU utilization from 30% to 43%, reduced model latency (RT and P99), and saved ~30% of hardware resources while supporting rapid ODL model iteration.
DaTaobao Tech
Official account of DaTaobao Technology