Integrating Distributed TensorFlow with Kubernetes: Architecture and Deployment
The article explains how to combine Distributed TensorFlow with Kubernetes—using GlusterFS storage, Deployments for parameter servers, Jobs for workers, service discovery, monitoring, and a Jinja2‑generated YAML template—to create isolated, scalable training clusters with Jupyter and TensorBoard access.
TensorFlow (70K+ GitHub stars) and Kubernetes (27K+ stars) are the leading open-source projects in deep learning and container orchestration, respectively. This article reviews how to run Distributed TensorFlow on Kubernetes, discussing the motivation, architecture, and practical deployment details.
1. Distributed TensorFlow
In April 2016, TensorFlow 0.8 introduced Distributed TensorFlow, enabling training to span multiple servers. Very large models, such as a mixture-of-experts (MoE) layer with 68 billion parameters, are only feasible to train when the work is distributed. Distributed TensorFlow lets a TensorFlow cluster accelerate training by leveraging many machines.
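To make the distributed setup concrete, the sketch below (plain Python, no TensorFlow required; the host names are illustrative, not from the article) shows the cluster description each task would pass to tf.train.ClusterSpec, and how trainable variables are typically spread round-robin across parameter servers, mimicking the default strategy of tf.train.replica_device_setter:

```python
from itertools import cycle

# Illustrative between-graph setup: this dict is what each task would pass
# to tf.train.ClusterSpec in a real Distributed TensorFlow job.
cluster = {
    "ps":     ["imagenet-ps-0:2222", "imagenet-ps-1:2222"],
    "worker": ["imagenet-worker-0:2222", "imagenet-worker-1:2222",
               "imagenet-worker-2:2222"],
}

def place_variables(var_names, num_ps):
    """Round-robin variable placement across PS tasks, mimicking the
    default strategy of tf.train.replica_device_setter."""
    ps_cycle = cycle(range(num_ps))
    return {name: "/job:ps/task:%d" % next(ps_cycle) for name in var_names}

placement = place_variables(["w1", "b1", "w2", "b2"], len(cluster["ps"]))
# w1/w2 land on PS task 0, b1/b2 on PS task 1
```

Each worker builds its own copy of the graph (Between-Graph replication) and pushes gradient updates to whichever PS task holds each variable.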
2. Why TensorFlow on Kubernetes
Although Distributed TensorFlow provides scalability, it lacks resource isolation and suffers from parameter‑server (PS) lifecycle issues. Kubernetes excels at isolation, scheduling, and service discovery, making it a natural platform for running TensorFlow clusters.
The authors chose GlusterFS as the distributed storage backend after finding HDFS read performance insufficient for their workloads.
3. Integrated Architecture
The architecture supports both Between‑Graph and In‑Graph replication scenarios. PS tasks are deployed as Kubernetes Deployments, while worker tasks run as Jobs. Service discovery is handled by Kubernetes Service and KubeDNS. Each TensorFlow cluster creates two PersistentVolumes (PV) via a StorageClass that integrates with GlusterFS through Heketi: one for training data (/data) and one for logs (/log). Users receive isolated namespaces, Jupyter Notebook services (exposed via NodePort), and optional TensorBoard services.
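Because each task's Service is named {{ name }}-{{ job }}-{{ i }} and resolvable via KubeDNS, the host lists every task needs can be derived from just the algorithm name, the replica counts, and the port. A minimal sketch of that naming convention (mirroring the Jinja2 macros in the template later in this article):

```python
def task_hosts(name, job, replicas, port=2222):
    """Comma-separated host list for one job type, following the
    {name}-{job}-{i} Service naming convention resolved by KubeDNS."""
    return ",".join("%s-%s-%d:%d" % (name, job, i, port)
                    for i in range(replicas))

ps_hosts = task_hosts("imagenet", "ps", 2)
worker_hosts = task_hosts("imagenet", "worker", 3)
# ps_hosts     -> "imagenet-ps-0:2222,imagenet-ps-1:2222"
# worker_hosts -> "imagenet-worker-0:2222,imagenet-worker-1:2222,imagenet-worker-2:2222"
```

These strings are exactly what gets passed to each task as --ps_hosts and --worker_hosts.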
4. Core Components
TensorFlow 1.3.0, Kubernetes 1.7.4, Docker 1.12.6, GlusterFS 3.10.5, Harbor 1.1.2, Contiv netplugin, Keepalived, HAProxy, Etcd2/3, fluentd + Kafka + Elasticsearch + Kibana for logging, and cAdvisor + Prometheus + Grafana for monitoring.
5. Demo
A demo based on Kyle Bai's GitHub repository shows a simple TensorFlow-on-Kubernetes setup with a NodePort-exposed Jupyter Notebook. The demo includes an In-Graph cluster with a sample master_client.ipynb notebook.
6. Thinking (Q&A)
Q: How to recycle PS pods after training? A: A DevOps TaaS module watches job completions; when all workers finish, it waits 30 seconds and deletes the PS Deployment/Service via the Kubernetes API.
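A minimal sketch of that recycling logic, with the Kubernetes API calls stubbed out as callables (all function names here are illustrative, not the authors' actual TaaS code):

```python
import time

GRACE_SECONDS = 30  # wait before tearing PS down, as described above

def all_workers_finished(worker_job_phases):
    """worker_job_phases: the phase of each worker Job, e.g. 'Succeeded'."""
    return all(phase == "Succeeded" for phase in worker_job_phases)

def maybe_recycle_ps(worker_job_phases, delete_ps_deployment,
                     delete_ps_service, sleep=time.sleep):
    """Delete the PS Deployment and Service once every worker Job is done."""
    if not all_workers_finished(worker_job_phases):
        return False
    sleep(GRACE_SECONDS)     # grace period for in-flight RPCs
    delete_ps_deployment()   # in practice: a Kubernetes API call
    delete_ps_service()
    return True

# Demo with stubbed deletes and a no-op sleep:
deleted = []
recycled = maybe_recycle_ps(
    ["Succeeded", "Succeeded", "Succeeded"],
    delete_ps_deployment=lambda: deleted.append("deployment"),
    delete_ps_service=lambda: deleted.append("service"),
    sleep=lambda _: None,
)
```

In production this loop would watch Job status events via the Kubernetes API rather than polling a list of phases.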
Q: How to checkpoint when PS is stateful? A: Workers use tf.train.Saver to fetch parameters from PS tasks and persist checkpoints.
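In real code this is tf.train.Saver writing to the shared /log volume; the stdlib sketch below only illustrates the flow (fetch the parameter values from the PS tasks, persist them to the shared volume, restore after a restart), with pickle standing in for TensorFlow's checkpoint format:

```python
import os
import pickle
import tempfile

def save_checkpoint(params, log_dir, step):
    """Persist a parameter snapshot, analogous to tf.train.Saver.save()."""
    path = os.path.join(log_dir, "model.ckpt-%d" % step)
    with open(path, "wb") as f:
        pickle.dump(params, f)
    return path

def restore_checkpoint(path):
    """Load a snapshot back, analogous to tf.train.Saver.restore()."""
    with open(path, "rb") as f:
        return pickle.load(f)

log_dir = tempfile.mkdtemp()  # stands in for the GlusterFS-backed /log volume
ckpt = save_checkpoint({"w1": [0.1, 0.2], "b1": [0.0]}, log_dir, step=100)
restored = restore_checkpoint(ckpt)
```

Because /log is a GlusterFS-backed ReadWriteMany volume, any replacement pod can read the latest checkpoint and resume training.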
Q: How to generate Kubernetes YAML from a few user parameters? A: A Jinja2 template (see code block) is used to render the necessary Service, Job, Deployment, and PersistentVolumeClaim resources.
7. Jinja2 Template Example
{% set name = "imagenet" %}{# algorithm name #}
{% set worker_replicas = 3 %}{# number of workers #}
{% set ps_replicas = 2 %}{# number of PS tasks #}
{% set script = "http://xxx.xx.xx.xxx:80/imagenet/imagenet.py" %}{# training script URL #}
{% set image = "tensorflow/tensorflow:1.3.0" %}
{% set data_dir = "/data" %}
{% set log_dir = "/log" %}
{% set port = 2222 %}
{% set replicas = {"worker": worker_replicas, "ps": ps_replicas} %}
{%- macro worker_hosts() -%}
{%- for i in range(worker_replicas) -%}
{{ name }}-worker-{{ i }}:{{ port }}{% if not loop.last %},{% endif %}
{%- endfor -%}
{%- endmacro -%}
{%- macro ps_hosts() -%}
{%- for i in range(ps_replicas) -%}
{{ name }}-ps-{{ i }}:{{ port }}{% if not loop.last %},{% endif %}
{%- endfor -%}
{%- endmacro -%}
{% for job in ["worker", "ps"] %}
{% for i in range(replicas[job]) %}
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  selector:
    name: {{ name }}
    job: {{ job }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222
{% if job == "worker" %}
---
kind: Job
apiVersion: batch/v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ job }}
        task: "{{ i }}"
    spec:
      containers:
      - name: {{ name }}-{{ job }}-{{ i }}
        image: {{ image }}
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c"]
        args: ["curl {{ script }} -o /opt/{{ name }}.py; python /opt/{{ name }}.py \
               --ps_hosts={{ ps_hosts() }} \
               --worker_hosts={{ worker_hosts() }} \
               --job_name={{ job }} \
               --task_index={{ i }} \
               --log_path={{ log_dir }} \
               --data_dir={{ data_dir }};"]
        volumeMounts:
        - name: data
          mountPath: {{ data_dir }}
        - name: log
          mountPath: {{ log_dir }}
      restartPolicy: Never
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: {{ name }}-data-pvc
      - name: log
        persistentVolumeClaim:
          claimName: {{ name }}-log-pvc
{% endif %}
{% if job == "ps" %}
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ job }}
        task: "{{ i }}"
    spec:
      containers:
      - name: {{ name }}-{{ job }}-{{ i }}
        image: {{ image }}
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c"]
        args: ["curl {{ script }} -o /opt/{{ name }}.py; python /opt/{{ name }}.py \
               --ps_hosts={{ ps_hosts() }} \
               --worker_hosts={{ worker_hosts() }} \
               --job_name={{ job }} \
               --task_index={{ i }} \
               --log_path={{ log_dir }};"]
        volumeMounts:
        - name: log
          mountPath: {{ log_dir }}
      restartPolicy: Always
      volumes:
      - name: log
        persistentVolumeClaim:
          claimName: {{ name }}-log-pvc
{% endif %}
---
{% endfor %}
{% endfor %}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ name }}-log-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ name }}-data-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

Running python render_template.py tfcluster_template.yaml.jinja | kubectl apply -f - then creates the Between‑Graph TensorFlow cluster.
8. Summary
Combining TensorFlow and Kubernetes unlocks the full power of Distributed TensorFlow. The article provides a practical overview, architecture diagrams, component lists, a demo, and a Jinja2 template for automated deployment. Future work includes custom scheduling, network I/O tuning, TaaS development, and rapid TensorFlow Serving deployment.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.