Deploying TensorFlow 2.x Models with TensorFlow Serving: Concepts, Setup, and Usage
This guide explains the core concepts of TensorFlow Serving, shows how to prepare Docker images, save TensorFlow 2.x models in various formats, configure version policies, warm‑up models, start the service, and invoke it via gRPC or HTTP with complete code examples.
This guide follows on from the earlier article on deploying TensorFlow 1.x models with TensorFlow Serving and describes how to deploy TensorFlow 2.x models.
Core concepts of TensorFlow Serving include:
Servables – abstracted model services typically exposed via HTTP REST and gRPC servers.
Sources – discover models from directories and create servable streams.
Loader – API for loading and unloading servables.
Aspired version – the set of versions that should be loaded, provided by sources and managed by the manager.
Manager – controls the full lifecycle of servables, applying a version policy.
VersionPolicy – governs version transitions; the built-in policies are Availability Preserving (load the new version before unloading the old one, so at least one version is always available; the default) and Resource Preserving (unload the old version before loading the new one, so two versions never hold resources at once).
ServableHandler – handles client requests for a specific servable.
The servable lifecycle proceeds as follows: a source creates a loader for a specific version, notifies the manager, the manager decides (based on the version policy) whether to load a new version or unload an old one, allocates resources, and returns a handle to clients.
TensorFlow 2.x model deployment
Prepare the TensorFlow Serving environment using Docker images. Pull the CPU image:
docker pull tensorflow/serving:latest

Pull the GPU image (requires nvidia-docker):

docker pull tensorflow/serving:latest-gpu

Save models in one of the supported formats:
Checkpoint format:

model.save_weights("./xxx.ckpt", save_format="tf")

H5 format:

model.save("./xxx.h5")
model.save_weights("./xxx.h5", save_format="h5")

SavedModel format (recommended for TF Serving):

model.save("./xxx", save_format="tf")
tf.saved_model.save(obj, "./xxx")

The exported SavedModel directory contains saved_model.pb alongside the variables/ and assets/ subdirectories.
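To make the SavedModel layout concrete, here is a minimal sketch (the Adder module, the "demo" model name, and version 1 are illustrative stand-ins, not from the article) that exports into the model_name/version/ directory structure TensorFlow Serving scans:

```python
import os
import tempfile
import tensorflow as tf

# Illustrative model: a tf.Module with a single serving function.
class Adder(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        return x + 1.0

# TensorFlow Serving expects <base_path>/<model_name>/<version>/.
export_dir = os.path.join(tempfile.mkdtemp(), "demo", "1")
tf.saved_model.save(Adder(), export_dir)

# The export contains saved_model.pb plus a variables/ subdirectory.
print(sorted(os.listdir(export_dir)))
```

Pointing the serving container's base_path at the parent "demo" directory lets the Source discover version 1 automatically.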
Warm‑up model
Because TensorFlow lazily loads components, the first inference can be slow. Generate a TFRecord warm‑up file and place it in assets.extra of the model version:
# coding=utf-8
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

def main():
    # Write one warm-up request as a PredictionLog record.
    with tf.io.TFRecordWriter("tf_serving_warmup_requests") as writer:
        request = predict_pb2.PredictRequest(
            model_spec=model_pb2.ModelSpec(name="demo", signature_name="serving_default"),
            inputs={"x": tf.make_tensor_proto(["warm"]),
                    "y": tf.make_tensor_proto(["up"])})
        log = prediction_log_pb2.PredictionLog(
            predict_log=prediction_log_pb2.PredictLog(request=request))
        writer.write(log.SerializeToString())

if __name__ == "__main__":
    main()

Copy the generated tf_serving_warmup_requests file into the model's assets.extra directory.
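The placement step can be sketched in shell. The paths below are illustrative (a temp directory stands in for the real model base path, and an empty placeholder stands in for the generated TFRecord):

```shell
BASE="$(mktemp -d)"              # stands in for the real model base path
mkdir -p "$BASE/demo/1/assets.extra"

: > tf_serving_warmup_requests   # placeholder for the generated TFRecord
cp tf_serving_warmup_requests "$BASE/demo/1/assets.extra/"

ls "$BASE/demo/1/assets.extra"
```

On startup, the model server replays the records found under assets.extra before marking the version as available.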
Model maintenance
Configure version policies via a model.config file. Example for loading specific versions 1 and 2:
model_config_list {
  config {
    name: "demo"
    base_path: "/models/demo"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
}

Start the container with the config file:
docker run -p 8500:8500 -p 8501:8501 \
--mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" \
-e MODEL_NAME=demo \
tensorflow/serving:latest \
--model_config_file=/models/demo/model.config \
    --model_config_file_poll_wait_seconds=60

For multiple models, list several config blocks in the same file and mount the parent directory.
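Such a multi-model config might look like the sketch below. The second model's name and path are assumptions for illustration; it also shows the built-in latest policy, which keeps only the newest N versions loaded, as an alternative to specific:

```
model_config_list {
  config {
    name: "demo"
    base_path: "/models/demo"
    model_platform: "tensorflow"
  }
  config {
    name: "demo2"
    base_path: "/models/demo2"
    model_platform: "tensorflow"
    model_version_policy {
      latest {
        num_versions: 2
      }
    }
  }
}
```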
Service start
CPU model deployment command:
docker run -p 8500:8500 -p 8501:8501 \
--mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" \
    -e MODEL_NAME=demo tensorflow/serving:latest

GPU model deployment command (requires --runtime nvidia):
docker run -p 8500:8500 -p 8501:8501 \
--runtime nvidia \
--mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" \
    -e MODEL_NAME=demo tensorflow/serving:latest-gpu

Service call
Two interfaces are available: gRPC (default port 8500) and HTTP (default port 8501). Example gRPC client:
# coding=utf-8
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

def test_grpc():
    channel = grpc.insecure_channel("127.0.0.1:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "demo"
    request.model_spec.signature_name = "test_concat"
    request.inputs["a"].CopyFrom(tf.make_tensor_proto("xxx"))
    result = stub.Predict(request, 10.0)  # 10-second timeout
    return result.outputs

Example HTTP client:
# coding=utf-8
import json
import requests

def test_http():
    # Serialize the request body exactly once.
    data = json.dumps({
        "signature_name": "test_concat",
        "inputs": {"a": "xxx"}
    })
    resp = requests.post(
        "http://127.0.0.1:8501/v1/models/demo/versions/1:predict",
        data=data)
    return resp.text

Reference: TensorFlow Serving Documentation
360 Tech Engineering