
ElasticDL: An Open‑Source Elastic Deep Learning System Built on TensorFlow 2.0 and Kubernetes

ElasticDL, the first industry‑level open‑source system for elastic deep learning on TensorFlow, leverages Kubernetes‑native scheduling, fault‑tolerance, and TensorFlow 2.0 Eager Execution to dramatically improve cluster utilization, simplify distributed training, and integrate seamlessly with tools like Kubeflow and SQLFlow.

AntTech

On September 11, Ant Financial announced the open‑source release of ElasticDL at the 2019 Google Developer Conference Shanghai, positioning it as the first industry‑level elastic deep‑learning system implemented on TensorFlow.

The project’s source code is available at elasticdl.org, and the team was interviewed by Open Source China for a detailed technical overview.

ElasticDL is built on TensorFlow 2.0 and Kubernetes and offers four core characteristics: fault‑tolerance, elastic scheduling, ease of use, and high efficiency, with fault‑tolerance and elastic scheduling being the most distinctive.

Fault‑tolerance: training jobs run to completion even as the number of worker processes varies.

Elastic scheduling: the system dynamically adjusts the number of workers based on cluster workload, so jobs can start on whatever resources are free rather than waiting for a full allocation, dramatically increasing overall cluster utilization.

Ease of use: developers can invoke the standard Keras model.fit or Estimator APIs without dealing with low‑level runtime details.

High efficiency: the architecture maximizes cluster resource usage and reduces job pending time.
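The utilization gain from elastic scheduling can be illustrated with a toy comparison (all numbers are hypothetical): under gang scheduling a job idles until its full worker count is available, while under elastic scheduling it starts immediately on free resources and scales up later.

```python
def utilization(worker_counts, cluster_size):
    """Average fraction of the cluster kept busy over a job's timeline."""
    return sum(worker_counts) / (len(worker_counts) * cluster_size)

cluster = 8  # GPUs in the cluster (hypothetical)

# Gang scheduling: the job wants all 8 GPUs, but 4 are occupied for the
# first 3 time steps, so it waits idle, then runs at full strength.
gang = [0, 0, 0, 8, 8, 8]

# Elastic scheduling: the job starts immediately on the 4 free GPUs and
# scales up to 8 once the other workload finishes.
elastic = [4, 4, 4, 8, 8, 8]

print(utilization(gang, cluster))     # 0.5
print(utilization(elastic, cluster))  # 0.75
```

The absolute numbers are made up, but the shape of the argument matches the article: elastic jobs convert wait time into useful training time.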

ElasticDL achieves these capabilities through a Kubernetes‑native architecture: the command‑line client calls the Kubernetes API to launch a master process, which in turn creates workers and parameter‑server processes via the same API. The master’s task queues (todo, doing, done) are persisted in etcd, enabling seamless recovery if the master crashes.
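The master’s bookkeeping can be sketched with those three queues. The names todo/doing/done come from the article; the class below is a hypothetical in‑memory stand‑in (the real system persists this state to etcd), showing why a worker failure does not stop the job.

```python
from collections import deque

class TaskQueues:
    """Minimal sketch of the master's task tracking -- an in-memory
    stand-in for the etcd-persisted todo/doing/done queues."""

    def __init__(self, tasks):
        self.todo = deque(tasks)   # tasks not yet dispatched
        self.doing = {}            # task -> worker currently processing it
        self.done = []             # completed tasks

    def dispatch(self, worker):
        task = self.todo.popleft()
        self.doing[task] = worker
        return task

    def report_done(self, task):
        del self.doing[task]
        self.done.append(task)

    def worker_failed(self, worker):
        # Fault tolerance: tasks owned by a dead worker return to todo,
        # so training continues with the remaining workers.
        for task, w in list(self.doing.items()):
            if w == worker:
                del self.doing[task]
                self.todo.append(task)

q = TaskQueues(["shard-0", "shard-1", "shard-2"])
q.dispatch("worker-a")
q.dispatch("worker-b")
q.worker_failed("worker-a")   # shard-0 goes back to todo
q.report_done("shard-1")
print(list(q.todo))           # ['shard-2', 'shard-0']
```

Because the queues survive in etcd, a restarted master can reload them and resume dispatching exactly where the crashed one left off.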

Unlike traditional TensorFlow 1.x graph‑mode distributed training, which lacks fault‑tolerance, ElasticDL leverages TensorFlow 2.0’s Eager Execution and the GradientTape (“tape”) API to obtain gradients directly in Python, avoiding complex graph hacking.
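The idea behind the tape is that operations are recorded during the forward pass and replayed in reverse to produce gradients, all in Python. A library‑free toy of the same mechanism (the Var class and gradient helper are hypothetical, standing in for tf.GradientTape; scalar ops only):

```python
class Var:
    """Scalar variable whose operations are recorded on a shared tape --
    a toy analogue of what tf.GradientTape does in TensorFlow 2.0."""

    def __init__(self, value, tape):
        self.value, self.grad, self.tape = value, 0.0, tape

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # Local derivatives: d(self*other)/d(self) = other.value, and vice versa.
        self.tape.append((out, [(self, other.value), (other, self.value)]))
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        self.tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

def gradient(output, source):
    # Replay the tape backwards, accumulating gradients -- the role
    # played by tape.gradient(loss, variables) in TensorFlow 2.0.
    output.grad = 1.0
    for out, inputs in reversed(output.tape):
        for var, local in inputs:
            var.grad += local * out.grad
    return source.grad

tape = []
x = Var(3.0, tape)
y = x * x + x * Var(2.0, tape)  # y = x^2 + 2x
print(gradient(y, x))           # dy/dx = 2x + 2 = 8.0 at x = 3
```

Because the whole record/replay loop lives in Python, a process that crashes mid‑step loses nothing that cannot be recomputed, which is what makes the approach compatible with elastic worker membership.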

The system’s design contrasts with other TensorFlow‑based distributed training solutions that modify the TensorFlow runtime (written in C++) to achieve high‑performance communication but cannot easily integrate with external cluster managers for elastic scheduling.

ElasticDL also positions itself as an alternative to Kubeflow for distributed TensorFlow jobs: while Kubeflow relies on TensorFlow’s native, non‑fault‑tolerant runtime, ElasticDL provides both fault‑tolerance and elastic scheduling, making it more suitable for large‑scale, shared‑cluster environments.

Integration with SQLFlow further enhances usability: developers can describe end‑to‑end AI pipelines with extended SQL syntax, and SQLFlow translates these statements into calls to engines such as XGBoost, PyTorch, or ElasticDL, automatically handling data preprocessing, feature engineering, model training, and prediction.

Future plans for ElasticDL include adding asynchronous SGD with delayed model updates, Kubernetes‑native AllReduce, and automated selection of distribution strategies, as the project remains in an early exploratory stage.

The interviewee, Wang Yi, leads AI infrastructure at Ant Financial and has a background spanning Google China, Tencent, LinkedIn, and Baidu Silicon Valley, with a Ph.D. from Tsinghua University.

Tags: Kubernetes, fault tolerance, TensorFlow, AI infrastructure, elastic scheduling, Distributed Deep Learning, ElasticDL
Written by AntTech

Technology is the core driver of Ant’s future creation.
