Large-Scale Distributed Reinforcement Learning Solution Based on TKE

The project replaces cumbersome manual management of thousands of heterogeneous CPU and GPU nodes for large‑scale reinforcement learning with a TKE‑based, containerized actor‑learner architecture that automates batch start/stop, provides elastic autoscaling, fault‑tolerant processes, shared model storage, and CI‑driven image deployment, cutting costs by up to two‑thirds while dramatically speeding experiment cycles.

Tencent Cloud Developer

Oct 11, 2019

Large-scale reinforcement learning (RL) requires massive heterogeneous compute resources, rapid batch start/stop of training tasks, frequent model-parameter updates, and model sharing across machines and processes. Managing all of this by hand is cumbersome and error-prone, and it does not scale to such workloads.

1. Project Challenges

Budget constraints: A single full-scale experiment may need tens of thousands of CPU cores and hundreds of GPU cards for one to two weeks. Hardware bought for that peak sits idle between experiments, so overall utilization is low and cost is high.

Complexity of managing thousands of machines: Handling IPs, accounts, passwords, driver installation, and environment setup by hand is error-prone, and code updates are hard to roll out consistently.

Efficiency issues: Distributed training must start and stop tens of thousands of role processes quickly; SSH-based scripts are slow and unreliable.

Process fault tolerance: The many processes have no automatic monitoring or restart, so fault tolerance is low.

Elastic scaling of training tasks: When actors produce data too slowly, the actor count has to be raised by hand, which limits throughput.

Code version management: Multiple modules deployed at differing versions make debugging and consistency hard.

2. Training Architecture

The system follows an Actor‑Learner architecture (e.g., IMPALA) with four roles:

Actor: Generates observation trajectories.

Learner: Consumes observations and updates the neural-network model via gradient descent.

ModelPool: Intermediate storage for the model; Learners push updated models, Actors pull the latest version.

Manager: Handles self-play scheduling, hyper-parameter mutation, checkpointing, etc.
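
The push/pull relationship between Learners, the ModelPool, and Actors can be illustrated with a minimal in-process sketch. The class and method names below (`ModelPool`, `push`, `pull`) are hypothetical and not taken from the project; the real ModelPool is a networked service, whereas this stand-in only keeps the newest parameters per key in memory.

```python
import threading
from typing import Dict, Optional, Tuple


class ModelPool:
    """In-memory stand-in for the ModelPool role (hypothetical API)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # key -> (version, serialized parameters)
        self._models: Dict[str, Tuple[int, bytes]] = {}

    def push(self, key: str, params: bytes) -> int:
        """Called by a Learner: store new parameters, return the new version."""
        with self._lock:
            version = self._models.get(key, (0, b""))[0] + 1
            self._models[key] = (version, params)
            return version

    def pull(self, key: str) -> Optional[Tuple[int, bytes]]:
        """Called by an Actor: return (version, params) of the latest model, or None."""
        with self._lock:
            return self._models.get(key)


pool = ModelPool()
pool.push("agent", b"weights-v1")  # Learner finishes one update
pool.push("agent", b"weights-v2")  # ... and another
latest = pool.pull("agent")        # Actor refreshes before its next episode
```

Actors only ever see the most recent version, which is why the pool can stay small even when Learners update frequently.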

Typical resource allocation per role:

Actor – dozens to tens of thousands, each using 3–4 CPU cores.

Learner – a few to several hundred, each using 1 GPU.

ModelPool & Manager – deployed on high‑bandwidth (≥25 Gbps) nodes.
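
To get a feel for what the per-role figures above add up to, the following sketch sums the demand for one hypothetical experiment. The replica counts and the 64-core node size are assumptions picked for illustration; the article states only the ranges. Learner CPU needs are ignored for simplicity.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RoleSpec:
    name: str
    replicas: int
    cpu_cores: int = 0
    gpus: int = 0


def total_demand(roles: Iterable[RoleSpec]) -> Tuple[int, int]:
    """Sum CPU cores and GPUs over all replicas of all roles."""
    roles = list(roles)
    cpu = sum(r.replicas * r.cpu_cores for r in roles)
    gpu = sum(r.replicas * r.gpus for r in roles)
    return cpu, gpu


# Hypothetical experiment size, chosen inside the ranges quoted above.
roles = [
    RoleSpec("actor", replicas=10_000, cpu_cores=4),
    RoleSpec("learner", replicas=200, gpus=1),
]
cpu, gpu = total_demand(roles)

# Assumed node size of 64 cores; the article does not state one.
cpu_nodes = math.ceil(cpu / 64)
print(cpu, gpu, cpu_nodes)
```

With these assumed numbers the experiment needs 40,000 CPU cores and 200 GPUs, or 625 CPU nodes of the assumed size, which is in line with the "tens of thousands of CPU cores and hundreds of GPU cards" mentioned earlier.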

3. Business Requirements

Batch start/stop of multiple role processes.

No manual IP/account/password management; only resource specifications per process.

Fault‑tolerant data‑producer processes with horizontal scaling.

Shared network storage for model exchange between training and evaluation.

Non‑intrusive logging, fast log search, dashboard‑style cluster monitoring.

Web‑based visualization of training/evaluation results.

Elastic resource usage with pay‑as‑you‑go billing.

4. TKE‑Based Solution

The solution leverages Tencent Kubernetes Engine (TKE) to integrate cloud CVM resources, providing the required CPU and GPU capacity.

LoadBalancer services expose TensorBoard and AI win-rate visualizations.

A shared CFS volume enables model and result sharing across pods.

CI integration (Orange-CI + webhook) automates image building and pushes to the TKE image registry.

Jinja templates generate deployment YAMLs, allowing rapid scaling of actors, learners, and resource specifications.

kubectl commands handle batch start/stop, edit, and delete operations.

ReplicaSets manage actor pods for automatic restart and scaling.

The cluster autoscaler provides elastic scaling and cost control.
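
The templating step described above can be sketched as follows. The project uses Jinja templates that render YAML; this sketch swaps in a plain Python function that emits JSON, which `kubectl` also accepts. The image name, label scheme, and default resource requests are placeholders, not values from the project.

```python
import json


def actor_replicaset(image: str, replicas: int,
                     cpu: str = "4", memory: str = "8Gi") -> dict:
    """Build an apps/v1 ReplicaSet manifest for actor pods.

    Only resource requests are declared; no machine names or IPs appear.
    """
    labels = {"role": "actor"}  # hypothetical label scheme
    return {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "metadata": {"name": "actor", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "actor",
                        "image": image,
                        "resources": {
                            "requests": {"cpu": cpu, "memory": memory},
                        },
                    }],
                },
            },
        },
    }


# Placeholder registry path, for illustration only.
manifest = actor_replicaset("registry.example.com/rl/actor:v1", replicas=1000)
manifest_json = json.dumps(manifest, indent=2)
```

Saved as `actor.json`, such a manifest could be started with `kubectl apply -f actor.json`, resized with `kubectl scale replicaset actor --replicas=2000`, and stopped with `kubectl delete -f actor.json`.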

5. Innovations

Resource‑centric scheduling: declare CPU, memory, GPU per role, simplifying management.

Elastic resource usage with automatic cluster scaling and pay‑per‑use billing.

Fault‑tolerant processes with automatic restart and horizontal scaling.

Leverage Tencent Cloud services (logging, monitoring, shared storage, image registry) to avoid reinventing the wheel.
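
One way to decide how far to scale actors horizontally is to match their combined data production to what the learners consume. The rule below is an illustrative assumption, not the project's actual policy; the function name, the frame-rate parameters, and the bounds are all hypothetical.

```python
import math


def desired_actor_count(learner_frames_per_sec: float,
                        frames_per_actor_per_sec: float,
                        min_actors: int = 1,
                        max_actors: int = 50_000) -> int:
    """Return the actor count whose combined output covers learner demand,
    clamped to [min_actors, max_actors]."""
    if frames_per_actor_per_sec <= 0:
        raise ValueError("per-actor production rate must be positive")
    needed = math.ceil(learner_frames_per_sec / frames_per_actor_per_sec)
    return max(min_actors, min(max_actors, needed))


# Hypothetical rates: learners consume 120,000 frames/s, each actor yields 60.
target = desired_actor_count(learner_frames_per_sec=120_000,
                             frames_per_actor_per_sec=60)
```

The resulting number would become the replica count of the actor ReplicaSet; the cluster autoscaler then adds or removes nodes to fit the pods.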

6. Value Delivered by TKE

Significant improvement in experiment efficiency by eliminating manual machine management.

Faster release cycles: container images enable one‑click updates, reducing rollout time from hours to minutes.

Cost reduction: elastic resource usage saves up to two‑thirds of the cost compared with owning physical machines.

Cluster autoscaling dynamically adjusts node count, reducing human effort.

Resource‑oriented management abstracts away individual machines.

Dynamic scheduling improves resource utilization.

Containerization ensures environment consistency and easy version rollback.

CI integration accelerates development workflow.

Persistent storage (CFS, CBS) simplifies data sharing and result persistence.

7. Encountered Issues

etcd performance bottleneck: Tens of thousands of nodes stress etcd; TKE auto-scales etcd based on node count, apiserver latency, and cluster identifiers.

Image registry concurrency: Hundreds of thousands of pods pulling images simultaneously can overload the registry; this was mitigated by pre-pulling images and staging the pulls.
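
The article does not describe how the staged pulls were implemented. A minimal sketch of the idea, assuming nodes are simply split into fixed-size waves that pull one after another, might look like this; `pull_waves` and the node names are hypothetical.

```python
from typing import Iterator, List, Sequence


def pull_waves(nodes: Sequence[str], wave_size: int) -> Iterator[List[str]]:
    """Yield successive groups of nodes so that at most `wave_size`
    nodes hit the image registry at the same time."""
    if wave_size <= 0:
        raise ValueError("wave_size must be positive")
    for start in range(0, len(nodes), wave_size):
        yield list(nodes[start:start + wave_size])


nodes = [f"node-{i}" for i in range(10)]
waves = list(pull_waves(nodes, wave_size=4))
# Each wave would pre-pull the image before the next wave starts.
```

Capping the wave size bounds the peak load on the registry at the cost of a longer total warm-up time.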

8. References

[1] Horgan, Dan, et al. "Distributed prioritized experience replay." arXiv preprint arXiv:1803.00933 (2018).

[2] Espeholt, Lasse, et al. "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures." arXiv preprint arXiv:1802.01561 (2018).
