
Deploying TensorFlow 2.x Models with TensorFlow Serving: Architecture, Setup, and Usage

This article explains the core concepts of TensorFlow Serving, shows how to prepare the environment with Docker, convert TensorFlow 2.x models to the SavedModel format, configure version policies, warm‑up the service, and invoke predictions via gRPC or HTTP interfaces.


The article continues a previous guide on TensorFlow Serving by focusing on TensorFlow 2.x model deployment. It first introduces the architecture of TensorFlow Serving, describing key components such as servables, sources, loaders, aspired versions, managers, version policies, and servable handlers.

Core Concepts

Servables: the central abstraction; the underlying objects (typically loaded models) that perform the computation behind client HTTP/gRPC requests.

Sources: discover models from directories and create servable streams.

Loader: API for loading and unloading servables.

Aspired Version: set of servables that should be loaded.

Manager: controls the lifecycle of servables according to version policies.

VersionPolicy: Availability Preserving (load new version before unloading old) and Resource Preserving (avoid loading two versions simultaneously).

ServableHandler: handles client requests.

TensorFlow Serving Environment Preparation

Install TensorFlow Serving using Docker images:

docker pull tensorflow/serving:latest

For GPU support, first install NVIDIA Docker on Ubuntu:

sudo apt-get install -y nvidia-docker2

Then pull the GPU image (latest-gpu is the runtime image used by the docker run command below; latest-devel-gpu additionally bundles build tools):

docker pull tensorflow/serving:latest-gpu

Model Saving Formats (TensorFlow 2.x)

model.save_weights("./xxx.ckpt", save_format="tf")  # weights only, TensorFlow checkpoint
model.save("./xxx.h5")                              # full model, single HDF5 file
model.save_weights("./xxx.h5", save_format="h5")    # weights only, HDF5
model.save("./xxx", save_format="tf")               # full model, SavedModel directory
tf.saved_model.save(obj, "./xxx")                   # low-level SavedModel export
model.to_json()                                     # architecture only, as a JSON string

TensorFlow Serving loads models in the SavedModel format; the exported directory contains saved_model.pb (the serialized graph), a variables/ subdirectory (the weights), and an optional assets/ subdirectory (extra files such as vocabularies).
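A typical SavedModel directory tree, using the demo model name and version 1 from the examples in this article, looks like this:

```
/models/demo/1/
├── saved_model.pb
├── variables/
│   ├── variables.data-00000-of-00001
│   └── variables.index
└── assets/
```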

Example Model Export

# coding=utf-8
import tensorflow as tf

class TestTFServing(tf.Module):
    def __init__(self):
        # a string variable carries no gradient, so it is not trainable
        self.x = tf.Variable("hello", dtype=tf.string, trainable=False)

    @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.string)])
    def concat_str(self, a):
        # "+" is not defined for string tensors; join and assign back instead
        self.x.assign(tf.strings.join([self.x, a]))
        return self.x

    @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.string)])
    def cp_str(self, b):
        self.x.assign(b)
        return self.x

if __name__ == '__main__':
    demo = TestTFServing()
    tf.saved_model.save(demo, "model/test/1", signatures={"test_assign": demo.cp_str, "test_concat": demo.concat_str})

Another example shows a custom Keras model saved in TensorFlow format:

# coding=utf-8
import tensorflow as tf

class DenseNet(tf.keras.Model):
    def __init__(self):
        super().__init__()
    def build(self, input_shape):
        self.dense1 = tf.keras.layers.Dense(15, activation='relu')
        self.dense2 = tf.keras.layers.Dense(10, activation='relu')
        self.dense3 = tf.keras.layers.Dense(1, activation='sigmoid')
        super().build(input_shape)
    def call(self, x):
        x = self.dense1(x)
        x = self.dense2(x)
        x = self.dense3(x)
        return x

if __name__ == '__main__':
    model = DenseNet()
    model.build(input_shape=(None, 15))
    model.summary()
    inputs = tf.random.uniform(shape=(10, 15))
    model._set_inputs(inputs=inputs)  # required in early TF 2.0 so save() can record the input signature
    # two equivalent ways to export the SavedModel
    model.save("./model/test/2", save_format="tf")
    tf.keras.models.save_model(model, "./model/test/3", save_format="tf")

Service Startup

Run the CPU version container:

docker run -p 8500:8500 -p 8501:8501 --mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" -e MODEL_NAME=demo tensorflow/serving:latest

Run the GPU version container:

docker run -p 8500:8500 -p 8501:8501 --runtime nvidia --mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" -e MODEL_NAME=demo tensorflow/serving:latest-gpu
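Once a container is running, the model status endpoint can confirm the model loaded. A minimal sketch, assuming the container above is reachable on localhost:

```python
import requests

def model_status(host="127.0.0.1", port=8501, model="demo"):
    """Query TensorFlow Serving's model status REST endpoint."""
    url = f"http://{host}:{port}/v1/models/{model}"
    try:
        # a healthy server reports each loaded version's state, e.g. "AVAILABLE"
        return requests.get(url, timeout=5).json()
    except requests.exceptions.RequestException:
        return None  # server not reachable

print(model_status())
```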

Warm‑up Model

Generate a TFRecord warm-up file (it must be named tf_serving_warmup_requests) and place it in the assets.extra subdirectory of the model version directory:

# coding=utf-8
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

def main():
    with tf.io.TFRecordWriter("tf_serving_warmup_requests") as writer:
        request = predict_pb2.PredictRequest(
            model_spec=model_pb2.ModelSpec(name="demo", signature_name='serving_default'),
            inputs={"x": tf.make_tensor_proto(["warm"]), "y": tf.make_tensor_proto(["up"])})
        log = prediction_log_pb2.PredictionLog(predict_log=prediction_log_pb2.PredictLog(request=request))
        writer.write(log.SerializeToString())

if __name__ == "__main__":
    main()
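The generated file goes next to the exported model version. A minimal sketch of the expected layout, assuming a hypothetical local model root under /tmp/models (in practice this would be the host path mounted into the container, e.g. /home/test/ybq/model/demo):

```python
import os
import shutil

# hypothetical local model root; adjust to the host path mounted into the container
version_dir = "/tmp/models/demo/1"
extra_dir = os.path.join(version_dir, "assets.extra")
os.makedirs(extra_dir, exist_ok=True)

# copy the record produced by the warm-up script, if it has been generated
if os.path.exists("tf_serving_warmup_requests"):
    shutil.copy("tf_serving_warmup_requests", extra_dir)
```

TensorFlow Serving replays the requests found in assets.extra/tf_serving_warmup_requests when it loads the version, so the first real request does not pay one-time initialization costs.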

Model Maintenance

Configure version policies via a model.config file:

model_config_list {
  config {
    name: "demo"
    base_path: "/models/demo"
    model_platform: "tensorflow"
    model_version_policy {
      specific { versions: 1 versions: 2 }
    }
  }
}

For multiple models, list multiple config blocks in the same file.
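For example, a second, hypothetical model demo2 mounted at /models/demo2 would be listed as an additional config block:

```
model_config_list {
  config {
    name: "demo"
    base_path: "/models/demo"
    model_platform: "tensorflow"
  }
  config {
    name: "demo2"
    base_path: "/models/demo2"
    model_platform: "tensorflow"
  }
}
```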

Start serving with the config file (the MODEL_NAME variable is unnecessary here, since the config file lists the models; the poll flag makes the server re-read the config every 60 seconds):

docker run -p 8500:8500 -p 8501:8501 \
  --mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" \
  tensorflow/serving:latest \
  --model_config_file=/models/demo/model.config \
  --model_config_file_poll_wait_seconds=60

Service Invocation

TensorFlow Serving offers gRPC (port 8500) and HTTP (port 8501) endpoints. Use saved_model_cli to inspect model signatures, e.g. saved_model_cli show --dir /home/test/ybq/model/demo/1 --all.
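The HTTP endpoints follow the path scheme /v1/models/&lt;name&gt;[/versions/&lt;N&gt;]:&lt;method&gt;. A small hypothetical helper (not part of TensorFlow Serving) makes the scheme explicit:

```python
def rest_url(host, model, version=None, method="predict", port=8501):
    # TensorFlow Serving REST path scheme:
    #   /v1/models/<model>[/versions/<version>]:<method>
    base = f"http://{host}:{port}/v1/models/{model}"
    if version is not None:
        base += f"/versions/{version}"
    return f"{base}:{method}"

print(rest_url("127.0.0.1", "demo"))             # http://127.0.0.1:8501/v1/models/demo:predict
print(rest_url("127.0.0.1", "demo", version=1))  # http://127.0.0.1:8501/v1/models/demo/versions/1:predict
```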

gRPC client example:

# coding=utf-8
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

def test_grpc():
    channel = grpc.insecure_channel('127.0.0.1:8500')  # gRPC port
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "demo"
    request.model_spec.signature_name = "test_concat"
    request.inputs['a'].CopyFrom(tf.make_tensor_proto("xxx"))
    result = stub.Predict(request, 10.0)  # 10-second timeout
    return result.outputs

HTTP client example:

# coding=utf-8
import json
import requests

def test_http():
    params = json.dumps({"signature_name": "test_concat", "inputs": {"a": "xxx"}})
    # path scheme: /v1/models/<name>[/versions/<N>]:predict
    rep = requests.post('http://127.0.0.1:8501/v1/models/demo/versions/1:predict', data=params)
    return rep.text
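The REST predict API accepts two equivalent request-body styles. A minimal sketch, reusing the test_concat signature from the examples above:

```python
import json

# "instances" is row-oriented: a list of per-example dicts
row_payload = json.dumps({"signature_name": "test_concat", "instances": [{"a": "xxx"}]})

# "inputs" is column-oriented: one entry per named input tensor
col_payload = json.dumps({"signature_name": "test_concat", "inputs": {"a": "xxx"}})
```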

The guide concludes with references to the official TensorFlow Serving documentation.

Tags: Docker, Python, Model Deployment, gRPC, HTTP, TensorFlow Serving, TensorFlow 2.x
Written by

360 Quality & Efficiency

360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
