
Deploying TensorFlow 2.x Models with TensorFlow Serving: Concepts, Setup, and Usage

This guide explains the core concepts of TensorFlow Serving, shows how to prepare Docker images, save TensorFlow 2.x models in various formats, configure version policies, warm‑up models, start the service, and invoke it via gRPC or HTTP with complete code examples.

360 Tech Engineering

This guide continues our previous article on deploying TensorFlow 1.x models with TensorFlow Serving and describes how to deploy TensorFlow 2.x models.

Core concepts of TensorFlow Serving include:

Servables – the underlying objects (typically loaded models) that perform computation for client requests, exposed via gRPC and HTTP REST servers.

Sources – discover models from directories and create servable streams.

Loader – API for loading and unloading servables.

Aspired versions – the set of servable versions a source wants loaded; the manager reconciles the currently loaded set toward it.

Manager – controls the full lifecycle of servables, applying a version policy.

VersionPolicy – default policies are Availability Preserving (load the new version before unloading the old one, so at least one version is always available) and Resource Preserving (unload the old version before loading the new one, so two versions never consume resources simultaneously).

ServableHandler – handles client requests for a specific servable.

The servable lifecycle proceeds as follows: a source discovers a new version and creates a loader for it, then notifies the manager; the manager consults the version policy to decide whether to load the new version and/or unload an old one, allocates the required resources, instructs the loader to load, and finally hands clients a handle to the servable.
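The availability-preserving transition described above can be illustrated with a toy sketch in Python (this is purely pedagogical; the real manager, loaders, and policies are C++ components inside TensorFlow Serving):

```python
# Toy illustration of how a manager applies the availability-preserving
# policy when a source reports a new set of aspired versions.
class ToyManager:
    def __init__(self):
        self.loaded = []  # version numbers currently loaded

    def set_aspired_versions(self, versions):
        # Load every newly aspired version first ("Loader.load()" in real
        # TF Serving) ...
        for v in versions:
            if v not in self.loaded:
                self.loaded.append(v)
        # ... and only then unload versions that are no longer aspired,
        # so at least one version stays available during the transition.
        self.loaded = [v for v in self.loaded if v in versions]

m = ToyManager()
m.set_aspired_versions([1])
m.set_aspired_versions([2])  # version 2 loads before version 1 unloads
print(m.loaded)              # [2]
```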

TensorFlow 2.x model deployment

Prepare the TensorFlow Serving environment using Docker images. Pull the CPU image:

docker pull tensorflow/serving:latest

Pull the GPU image (requires nvidia‑docker):

docker pull tensorflow/serving:latest-gpu

Save models in one of the supported formats:

Checkpoint format (weights only; not servable by TF Serving on its own):

model.save_weights("./xxx.ckpt", save_format="tf")

H5 format:

model.save("./xxx.h5")
model.save_weights("./xxx.h5", save_format="h5")

SavedModel format (recommended for TF Serving):

model.save("./xxx", save_format="tf")
tf.saved_model.save(obj, "./xxx")

A SavedModel directory contains saved_model.pb (the serialized graph and signatures) plus variables/ and assets/ subdirectories. TensorFlow Serving additionally expects each model version in a numbered subdirectory, e.g. /models/demo/1/.

Warm‑up model

Because TensorFlow lazily loads components, the first inference can be slow. Generate a TFRecord warm‑up file and place it in assets.extra of the model version:

# coding=utf-8
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

def main():
    # TF Serving replays the PredictionLog records in this TFRecord file
    # when it loads the model version.
    with tf.io.TFRecordWriter("tf_serving_warmup_requests") as writer:
        # Build a representative request; input names and dtypes must match
        # the model's serving signature.
        request = predict_pb2.PredictRequest(
            model_spec=model_pb2.ModelSpec(name="demo", signature_name="serving_default"),
            inputs={"x": tf.make_tensor_proto(["warm"]),
                    "y": tf.make_tensor_proto(["up"])})
        log = prediction_log_pb2.PredictionLog(
            predict_log=prediction_log_pb2.PredictLog(request=request))
        writer.write(log.SerializeToString())

if __name__ == "__main__":
    main()

Copy the generated tf_serving_warmup_requests file into the model’s assets.extra directory.
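The expected on-disk placement can be sketched with the standard library. The paths here are illustrative: in practice base_dir is the model directory mounted into the container (e.g. /home/test/ybq/model/demo), and the file content is the real TFRecord produced by the script above.

```python
import os
import shutil
import tempfile

# Illustrative layout only; in production base_dir is the mounted model dir.
base_dir = tempfile.mkdtemp()
extra_dir = os.path.join(base_dir, "1", "assets.extra")  # version 1's assets.extra
os.makedirs(extra_dir)

# Stand-in for the TFRecord produced by the warm-up script.
src = os.path.join(base_dir, "tf_serving_warmup_requests")
with open(src, "wb") as f:
    f.write(b"")

# TF Serving looks for <base_path>/<version>/assets.extra/tf_serving_warmup_requests.
shutil.copy(src, extra_dir)
```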

Model maintenance

Configure version policies via a model.config file. Example for loading specific versions 1 and 2:

model_config_list {
  config {
    name: "demo"
    base_path: "/models/demo"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
}
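Besides specific, the model_version_policy field also supports latest (serve the N most recent versions) and all (serve every version found). A sketch of the latest variant:

```
model_config_list {
  config {
    name: "demo"
    base_path: "/models/demo"
    model_platform: "tensorflow"
    model_version_policy {
      latest {
        num_versions: 2
      }
    }
  }
}
```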

Start the container with the config file. When --model_config_file is supplied, the MODEL_NAME environment variable is ignored, so it is omitted here; --model_config_file_poll_wait_seconds makes Serving re-read the config periodically:

docker run -p 8500:8500 -p 8501:8501 \
  --mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" \
  tensorflow/serving:latest \
  --model_config_file=/models/demo/model.config \
  --model_config_file_poll_wait_seconds=60

For multiple models, list several config blocks in the same file and mount the parent directory.
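A minimal sketch of such a multi-model config (the second model name and path are illustrative):

```
model_config_list {
  config {
    name: "demo"
    base_path: "/models/demo"
    model_platform: "tensorflow"
  }
  config {
    name: "demo2"
    base_path: "/models/demo2"
    model_platform: "tensorflow"
  }
}
```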

Service start

CPU model deployment command:

docker run -p 8500:8500 -p 8501:8501 \
  --mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" \
  -e MODEL_NAME=demo tensorflow/serving:latest

GPU model deployment command (requires the NVIDIA container runtime, i.e. --runtime nvidia):

docker run -p 8500:8500 -p 8501:8501 \
  --runtime nvidia \
  --mount "type=bind,source=/home/test/ybq/model/demo,target=/models/demo" \
  -e MODEL_NAME=demo tensorflow/serving:latest-gpu

Service call

Two interfaces are available: gRPC (default port 8500) and HTTP (default port 8501). Example gRPC client:

# coding=utf-8
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

def test_grpc():
    # The gRPC endpoint listens on port 8500 by default.
    channel = grpc.insecure_channel('127.0.0.1:8500')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "demo"
    request.model_spec.signature_name = "test_concat"
    request.inputs['a'].CopyFrom(tf.make_tensor_proto("xxx"))
    result = stub.Predict(request, 10.0)  # second argument is the timeout in seconds
    return result.outputs

Example HTTP client:

# coding=utf-8
import requests
import json

def test_http():
    # The HTTP/REST endpoint listens on port 8501 by default.
    data = json.dumps({
        "signature_name": "test_concat",
        "inputs": {"a": "xxx"}
    })
    # To target a specific version, use .../v1/models/demo/versions/1:predict
    resp = requests.post('http://127.0.0.1:8501/v1/models/demo:predict', data=data)
    return resp.text
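The REST API accepts two request-body layouts: the columnar "inputs" format and the row-oriented "instances" format. A quick sketch of both, reusing the signature and input names from the example above:

```python
import json

# Columnar "inputs" format: one object keyed by input tensor name,
# each value holding a whole batch.
columnar = {"signature_name": "test_concat", "inputs": {"a": ["xxx"]}}

# Row "instances" format: a list of per-example objects.
row = {"signature_name": "test_concat", "instances": [{"a": "xxx"}]}

body = json.dumps(row)
decoded = json.loads(body)
```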

Reference: TensorFlow Serving Documentation

Tags: Docker, Model Deployment, gRPC, HTTP, TensorFlow Serving, Warmup, Version Policy
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
