
Running TensorFlow on Kubernetes: A Practical Guide to Scalable AI Workloads

This article explains how to deploy TensorFlow on Kubernetes, covering resource isolation, GPU scheduling, and distributed training. It introduces a custom TensorFlow‑on‑K8s system built from client, task, and autospec modules, along with a container design for reliable job execution.

360 Zhihui Cloud Developer

TensorFlow has become a mature deep‑learning framework that supports multiple languages and heterogeneous platforms, but deploying it on GPU clusters raises issues such as resource isolation, uneven GPU scheduling, lingering processes, manual data handling, and inconvenient log management.

To solve these problems, a cluster scheduling and management system is needed. While Hadoop YARN can also manage cluster resources, Kubernetes (K8s) has offered built‑in GPU scheduling support since version 1.6 and can serve as a unified scheduler for TensorFlow jobs.

Design Goals

Support both single‑node and distributed TensorFlow tasks.

Automatically generate ClusterSpec for distributed jobs, eliminating manual configuration.

Persist training data, models, and logs across container lifecycles.

Architecture

The TensorFlow‑on‑K8s solution consists of three main components: client, task, and autospec.

The client module receives user task requests and forwards them to the task module.

The task module determines the execution flow based on the task type:

If type=single, it launches a container with the requested resources to run a single‑node TensorFlow job.

If type=distribute, it runs a distributed job, invoking the autospec module to generate the required ClusterSpec automatically.
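The dispatch logic above can be sketched in a few lines. This is an illustrative outline, not the actual implementation; the `TaskRequest` fields and function names are hypothetical:

```python
# Illustrative sketch of the task module's dispatch logic.
# TaskRequest, dispatch, and all field names are hypothetical.

from dataclasses import dataclass


@dataclass
class TaskRequest:
    task_type: str    # "single" or "distribute"
    image: str        # container image to run
    gpus: int         # requested GPU count
    archive_url: str  # S3 location of the packaged code and data


def dispatch(req: TaskRequest) -> str:
    if req.task_type == "single":
        # Launch one container with the requested resources.
        return f"single job: {req.gpus} GPU(s), archive {req.archive_url}"
    elif req.task_type == "distribute":
        # Hand off to the autospec module to build a ClusterSpec first.
        return "distributed job: autospec will generate the ClusterSpec"
    raise ValueError(f"unknown task type: {req.task_type}")
```

In practice the two branches would create the corresponding Kubernetes objects rather than return strings; the sketch only shows the branching the article describes.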

Client Module

Three methods are offered to provide code and data to the container:

Create a new image containing both.

Mount code and data via a volume.

Fetch code and data from external storage (e.g., S3).

The third approach is chosen to allow frequent code updates without rebuilding images. A tshell client packages the code (e.g., cifar10-multigpu) and data, uploads the archive to S3, and submits the task.

Task Module

Single‑node mode : The task module simply calls the Python client API to start a container that downloads the uploaded archive, runs the TensorFlow job, and uploads logs and model artifacts back to S3.

Distributed mode: TensorFlow's distributed execution relies on gRPC and a ClusterSpec that defines Parameter Server (PS) and worker jobs. Previously, users had to configure this spec by hand, which was error‑prone.

The autospec module now collects container IPs and ports, automatically generates the ClusterSpec, and distributes it to the appropriate containers.

Autospec Module

The autospec component’s sole purpose is to gather IP and port information from containers during distributed job startup and generate a valid ClusterSpec , which it then sends to the relevant containers.
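The core step can be sketched as follows: given the PS and worker endpoints collected from the containers, build the dictionary that `tf.train.ClusterSpec` accepts. The function name and inputs are illustrative, not autospec's real interface:

```python
# Illustrative sketch of autospec's core step. The real module also
# discovers the endpoints and pushes the spec to each container.

def build_cluster_spec(ps_endpoints, worker_endpoints):
    """Assemble a ClusterSpec dict from collected container IP:port pairs.

    Inside each container, this dict can be passed to
    tf.train.ClusterSpec(...), together with the container's own
    job name ("ps" or "worker") and task index.
    """
    return {
        "ps": list(ps_endpoints),
        "worker": list(worker_endpoints),
    }


# Example: two parameter servers and two workers discovered at startup.
spec = build_cluster_spec(
    ["10.0.0.1:2222", "10.0.0.2:2222"],
    ["10.0.1.1:2222", "10.0.1.2:2222"],
)
```

Each container then starts its gRPC server with `tf.train.Server(cluster, job_name=..., task_index=...)` (TensorFlow 1.x API), which is why every container must receive the same spec plus its own identity.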

Container Design

TensorFlow jobs are modeled as Kubernetes Job objects, whose containers are torn down once the job completes. Kubernetes lifecycle hooks are used to preserve artifacts across that turnover:

postStart to fetch code and data as soon as the container starts.

preStop to back up models and logs before container termination.
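The shape of such a Job can be sketched as a manifest, shown here as a Python dict for illustration. The image tag, resource limit, and hook scripts are hypothetical placeholders:

```python
# Sketch of a Kubernetes Job manifest with lifecycle hooks, expressed as a
# Python dict for illustration. Image tag and script paths are hypothetical.

tf_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "tf-single-example"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "tensorflow",
                    "image": "tensorflow/tensorflow:1.6.0-gpu",
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                    "lifecycle": {
                        # Fetch code and data from S3 when the container starts.
                        "postStart": {"exec": {"command": ["/bin/sh", "/scripts/fetch.sh"]}},
                        # Back up models and logs before the container stops.
                        "preStop": {"exec": {"command": ["/bin/sh", "/scripts/backup.sh"]}},
                    },
                }],
            }
        }
    },
}
```

Note that preStop fires only when Kubernetes terminates the container; a job should also upload its artifacts on normal completion, as the single‑node flow above does.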

This design ensures that both single‑node and distributed workloads preserve their artifacts despite container turnover.

Conclusion

The presented TensorFlow‑on‑Kubernetes system demonstrates a clear workflow for submitting, managing, and tracking AI workloads on a GPU‑enabled cluster. Future work includes a web UI for job submission and monitoring, GPU affinity support, and further automation.

Tags: Kubernetes, TensorFlow, AI deployment, GPU scheduling, distributed training
Written by 360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
