
ElasticDL: An Open‑Source Distributed Deep Learning Framework with Elastic Scheduling

ElasticDL is an open‑source distributed deep learning framework built on TensorFlow 2.x and Kubernetes that simplifies programming by letting users define models with the Keras API, while providing elastic scheduling, fault tolerance, and significant performance gains demonstrated through extensive benchmarks.

AntTech

ElasticDL is a distributed deep‑learning programming framework built on TensorFlow 2.x and Kubernetes. First released at the 2019 Google Developer Day, it has been continuously improved with a focus on performance optimization and real‑world deployments.

The primary design goal of ElasticDL is to simplify distributed programming. Users only need to provide a model written with the TensorFlow 2.x Keras API; no explicit distributed training code is required. Once a model runs locally, ElasticDL can train it at scale, dramatically improving development efficiency.

Existing solutions each fall short: TensorFlow Estimator runs only in graph mode and does not support eager execution, the Keras API offers a limited set of distributed strategies, and Horovod requires intrusive changes to user code. None of the three supports elastic scheduling, and all depend on Kubernetes operators, which are difficult for AI practitioners to deploy and manage.

ElasticDL introduces a master process per job that coordinates workers without relying on a Kubernetes operator. The master dynamically partitions data into small tasks (file/table name, offset, record count) and stores them in a durable queue (e.g., etcd). Workers pull tasks via gRPC, process minibatches, and report completion. The system supports both asynchronous and synchronous SGD, using a Go‑implemented parameter server or a fault‑tolerant AllReduce library (FTlib). Faulty workers trigger task re‑queueing, ensuring robustness.
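The master's dynamic task queue is the mechanism behind both elasticity and fault tolerance: because data is owned by tasks rather than pinned to workers, a worker can join, leave, or fail at any time. A minimal sketch in plain Python (hypothetical class and method names; the real master serves workers over gRPC and persists the queue in a durable store such as etcd):

```python
from collections import deque


class Task:
    """A small shard of data: a file (or table) name plus a record range."""

    def __init__(self, shard_name, start, count):
        self.shard_name = shard_name
        self.start = start
        self.count = count


class Master:
    """Hands out tasks to workers and re-queues tasks from failed workers."""

    def __init__(self, shard_name, num_records, records_per_task):
        self.todo = deque(
            Task(shard_name, s, min(records_per_task, num_records - s))
            for s in range(0, num_records, records_per_task)
        )
        self.doing = {}  # worker_id -> Task currently owned by that worker

    def get_task(self, worker_id):
        if not self.todo:
            return None
        task = self.todo.popleft()
        self.doing[worker_id] = task
        return task

    def report(self, worker_id, success):
        task = self.doing.pop(worker_id)
        if not success:
            # Worker failed or was preempted: re-queue the task so
            # another worker picks it up. No training data is lost.
            self.todo.append(task)


master = Master("mnist.recordio", num_records=100, records_per_task=32)
t = master.get_task(worker_id=0)           # first task covers records [0, 32)
master.report(worker_id=0, success=False)  # failure returns it to the queue
```

Because a new worker only needs to start pulling tasks, scaling up a job is as cheap as starting one more pod.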

The programming model requires four user‑defined functions: `forward`, `loss`, `optimizer`, and `feed`.

```python
def forward():
    inputs = tf.keras.Input(shape=(28, 28), name="image")
    x = tf.keras.layers.Reshape((28, 28, 1))(inputs)
    x = tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
    x = tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = tf.keras.layers.Dropout(0.25)(x)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(10)(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="mnist_model")


def loss(labels, predictions):
    labels = tf.reshape(labels, [-1])
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=predictions, labels=labels
        )
    )


def optimizer(lr=0.1):
    return tf.optimizers.SGD(lr)


def feed(dataset, mode, _):
    def _parse_data(record):
        if mode == Mode.PREDICTION:
            feature_description = {
                "image": tf.io.FixedLenFeature([28, 28], tf.float32)
            }
        else:
            feature_description = {
                "image": tf.io.FixedLenFeature([28, 28], tf.float32),
                "label": tf.io.FixedLenFeature([1], tf.int64),
            }
        r = tf.io.parse_single_example(record, feature_description)
        features = {
            "image": tf.math.divide(tf.cast(r["image"], tf.float32), 255.0)
        }
        if mode == Mode.PREDICTION:
            return features
        return features, tf.cast(r["label"], tf.int32)

    dataset = dataset.map(_parse_data)
    if mode == Mode.TRAINING:
        dataset = dataset.shuffle(buffer_size=1024)
    return dataset
```

These functions can be unit‑tested independently and run locally before being submitted to a Kubernetes cluster.
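To make the loss concrete: sparse softmax cross‑entropy for a single example is simply `-log(softmax(logits)[label])`. A dependency‑free sketch (plain Python rather than TensorFlow, using the numerically stable log‑sum‑exp form):

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), shifted by max(logits) for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# A confident, correct prediction yields a near-zero loss;
# the same logits scored against the wrong label yield a large loss.
low = sparse_softmax_cross_entropy([10.0, 0.0, 0.0], label=0)
high = sparse_softmax_cross_entropy([10.0, 0.0, 0.0], label=1)
```

TensorFlow's `tf.nn.sparse_softmax_cross_entropy_with_logits` computes the same quantity per example; the `loss` function above then averages it over the minibatch with `tf.reduce_mean`.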

Recent performance work replaced the original Redis‑based parameter server with a custom Go server capable of performing lightweight deep‑learning computations, reducing communication overhead. Optimizations such as lazy embedding initialization, sharding embedding tables across multiple parameter‑server processes, deduplication of embedding IDs, and gradient aggregation cut training time by roughly 13× (e.g., DeepFM experiment: 1350 s → 106 s for 10 epochs).
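Two of those optimizations are easy to illustrate. Deduplicating embedding IDs means each unique ID in a minibatch is fetched from the parameter server once, and gradient aggregation means gradients targeting the same ID are summed before being sent back. A plain-Python sketch with hypothetical helper names (the real parameter server is written in Go):

```python
def dedup_lookup(ids, fetch_fn):
    """Fetch each unique embedding ID once, then re-expand to the
    original order of `ids` — one small RPC instead of one per occurrence."""
    unique_ids = sorted(set(ids))
    vectors = fetch_fn(unique_ids)
    table = dict(zip(unique_ids, vectors))
    return [table[i] for i in ids]


def aggregate_gradients(ids, grads):
    """Sum gradients that target the same embedding ID before sending,
    so the server applies one update per ID, not one per occurrence."""
    agg = {}
    for i, g in zip(ids, grads):
        agg[i] = agg.get(i, 0.0) + g
    return agg


# e.g. a minibatch in which sparse feature ID 7 occurs three times:
ids = [7, 3, 7, 7]
rows = dedup_lookup(ids, fetch_fn=lambda u: [[float(i)] for i in u])
updates = aggregate_gradients(ids, [0.1, 0.2, 0.1, 0.1])
```

For recommendation models such as DeepFM, whose minibatches contain many repeated sparse-feature IDs, both tricks cut parameter-server traffic substantially.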

Elastic scheduling dramatically improves cluster utilization. In a two‑job overlap experiment, gang scheduling required sequential execution (total ≈ 795 s), whereas ElasticDL started the second job after 30 s and finished both in ≈ 580 s. A mixed workload experiment showed ElasticDL automatically yielding CPU to an NGINX service during traffic spikes while maintaining > 90 % overall utilization. Convergence tests on Wide & Deep and xDeepFM models demonstrated that varying worker counts does not affect model convergence.
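The savings in the two‑job experiment are easy to sanity-check with the approximate figures above:

```python
# Gang scheduling: the second job cannot start until the first releases
# all of its resources, so the two jobs run back to back.
gang_total = 795     # seconds, approximate

# Elastic scheduling: the second job starts ~30 s into the first by
# borrowing idle resources, and both jobs finish together.
elastic_total = 580  # seconds, approximate

savings = (gang_total - elastic_total) / gang_total  # ~27% shorter makespan
```

The gain comes entirely from overlap: neither job runs faster in isolation, but the cluster is never left idle waiting for a full gang of pods.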

In summary, ElasticDL leverages TensorFlow 2.x eager execution and Kubernetes to provide a fault‑tolerant, elastically scheduled distributed training platform that reduces job start latency, maximizes hardware utilization, and accelerates AI development for complex financial services at Ant Group.
