Memory Leak Diagnosis and Fixes for TensorFlow Serving in iQIYI’s Deep Learning Platform
The iQIYI deep‑learning platform traced two TensorFlow Serving memory leaks: an executor cache that accumulated strings because Protocol Buffers maps deliver inputs in arbitrary order, and an uncontrolled surge of gRPC worker threads under heavy load. Patches submitted upstream sort the inputs and cap the thread count, eliminating the OOM crashes and stabilizing the service in production.
TensorFlow Serving is a high‑performance inference system open‑sourced by Google and widely used in click‑through‑rate (CTR) prediction scenarios because of its stability and convenience. However, the iQIYI deep‑learning platform observed that the serving containers sometimes experience continuous memory growth, eventually leading to OOM (out‑of‑memory) crashes.
The article describes two distinct memory‑leak problems discovered in TensorFlow Serving, the root causes, and the patches submitted to the upstream projects.
Background
TensorFlow Serving supports both gRPC and HTTP interfaces, multi‑model and multi‑version deployments, and hot model updates. iQIYI also open‑sourced XGBoost Serving, which inherits these features.
In production, the platform frequently received OOM reports from the container runtime, and profiling with gperftools revealed that the memory footprint kept growing without bound.
Issue 1 – DirectSession::GetOrCreateExecutors Memory Leak
Profiling showed that DirectSession::GetOrCreateExecutors creates a large number of string objects in an unordered_map named executors_, which maps a key built from the request's input names to ExecutorsAndKeys. When a request carries many input features, the number of possible orderings explodes (10 features already allow 10! ≈ 3.6 million permutations), and each previously unseen ordering inserts a new entry, consuming hundreds of megabytes or even gigabytes of memory.
The underlying cause is that the PredictRequest sent to TensorFlow Serving carries its inputs in a map<string, TensorProto>. Protocol Buffers does not define an iteration order for map entries, and a recent client change caused the feature order to vary between requests. Because the derived key changed with the order, the server never found a matching entry in executors_ and kept inserting new ones, producing the leak.
To fix the problem, two pull requests were submitted:
tensorflow/tensorflow#39743 – modifies the GetOrCreateExecutors implementation.
tensorflow/serving#1638 – sorts the inputs map inside TensorFlow Serving before the executor key is built, so requests containing the same features always map to the same executors_ entry.
Issue 2 – gRPC Thread Explosion Under High Concurrency
During traffic spikes, the serving containers also experienced OOM due to a rapid increase in the number of gRPC worker threads. Monitoring showed that the number of grpcpp_sync_ser threads grew dramatically, eventually exhausting container memory.
Investigation revealed that the gRPC server creates a new worker thread for each incoming request when the resource quota is not reached. Under heavy load, many threads remain alive simultaneously, consuming large amounts of memory.
The resolution was to add a maximum thread limit to the gRPC server. A pull request was submitted to TensorFlow Serving:
tensorflow/serving#1785 – introduces a resource‑quota‑based limit on the number of gRPC threads.
The authors recommend that any code using a gRPC server should enforce a maximum thread count to prevent service collapse during traffic bursts.
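In the C++ gRPC API, such a cap can be expressed with a ResourceQuota attached to the ServerBuilder. The sketch below is a minimal server‑configuration fragment under assumed values: the thread limit of 64, the quota name, and the listening port are placeholders, not values taken from the article or the patch:

```cpp
#include <memory>
#include <grpcpp/grpcpp.h>
#include <grpcpp/resource_quota.h>

// Sketch: cap the number of worker threads the synchronous gRPC server
// may create. The limit (64) and the service are placeholders.
void BuildCappedServer(grpc::Service* service) {
  grpc::ResourceQuota quota("serving_quota");
  quota.SetMaxThreads(64);  // hard ceiling on gRPC worker threads

  grpc::ServerBuilder builder;
  builder.SetResourceQuota(quota);
  builder.AddListeningPort("0.0.0.0:8500", grpc::InsecureServerCredentials());
  builder.RegisterService(service);

  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();
}
```

With a quota in place, the sync server stops spawning new threads once the ceiling is reached and queues further requests instead, trading some tail latency under bursts for a bounded memory footprint.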
Conclusion
The two memory‑leak issues were fixed, and the TensorFlow Serving service has been stable in production since the patches were merged. The article also provides references to the relevant GitHub repositories and documentation.
iQIYI Technical Product Team