
Building a Containerized Scientific Computing Platform on the Cloud

The talk details XtalPi’s journey from early PBS‑based supercomputers to a modern Kubernetes‑driven, multi‑cloud platform that uses Tencent Cloud TKE to run massive containerized drug‑discovery simulations, covering scaling strategies, image optimization, CI pipelines, checkpoint‑restart, and future serverless and bare‑metal enhancements.

Tencent Cloud Developer

The article introduces the challenges of constructing large‑scale compute clusters in the cloud and describes how to manage massive containerized workloads for scientific computing.

Speaker Lin Shuai‑kang, Technical Director of XtalPi’s cloud computing platform, explains that XtalPi focuses on shortening early‑stage drug R&D cycles through molecular simulation, quantum algorithms, and AI, running these workloads on high‑performance computing (HPC) systems such as Tianhe‑2, formerly China’s largest supercomputer.

To support drug‑compound screening, XtalPi needs to process billions of molecular structures, generating massive intermediate data despite the original input being only a few kilobytes. This drives the requirement for a large, elastic compute pool that can scale to hundreds of thousands of CPU cores.

The evolution of XtalPi’s compute platform is outlined:

First generation (2015): PBS scheduler on traditional supercomputers, simple NFS storage.

Second generation (2016‑2018): Migration to Mesos for container orchestration, adoption of Docker for packaging scientific algorithms, and multi‑cloud resource pooling (Tencent Cloud, AWS, Google Cloud).

Third generation (current): Kubernetes (K8s) based clusters, leveraging Tencent Cloud TKE, with support for both Mesos and K8s workloads.

The talk then focuses on practical experiences with Tencent Cloud’s TKE service:

Container images for scientific software can be tens of gigabytes; image size reduction and layered pulling strategies are essential.
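Multi‑stage builds are one common way to trim such images: the compiler toolchain stays in a throwaway build stage, and only the artifacts reach the runtime image. A minimal sketch (the paths, package names, and binary are illustrative, not XtalPi’s actual stack):

```dockerfile
# Build stage: full toolchain, discarded after the build
FROM ubuntu:22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential gfortran && rm -rf /var/lib/apt/lists/*
COPY src/ /src/
RUN make -C /src install PREFIX=/opt/app

# Runtime stage: only runtime libraries plus the built artifacts
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        libgfortran5 && rm -rf /var/lib/apt/lists/*
COPY --from=build /opt/app /opt/app
ENTRYPOINT ["/opt/app/bin/simulate"]
```

Ordering the stable base layers first also helps the “layered pulling” goal: nodes that already cached the base and library layers only fetch the thin application layer on top.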

Cluster scaling must meet strict latency requirements (e.g., adding 24‑core nodes within 20 minutes) while maintaining high utilization (80‑90%).
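On Kubernetes, this kind of elasticity is typically driven by the Cluster Autoscaler; a sketch of the relevant tuning flags (the provider name, node‑group name, and bounds are assumptions for illustration):

```yaml
# Excerpt from a cluster-autoscaler Deployment's container args
command:
  - ./cluster-autoscaler
  - --cloud-provider=tencentcloud            # provider name is illustrative
  - --nodes=0:500:hpc-24core-pool            # min:max:node-group (illustrative)
  - --scan-interval=10s                      # how often to check for pending pods
  - --max-node-provision-time=15m            # give up on nodes that never join
  - --scale-down-utilization-threshold=0.8   # pack nodes tightly before scaling down
```

A high scale‑down threshold keeps utilization in the 80‑90% band the talk targets, at the cost of more frequent rescheduling when the queue drains.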

CI pipelines use Drone to build Docker images from proprietary SDKs, then push them to both AWS and Tencent registries.
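A Drone pipeline for this dual‑registry push might look like the following sketch (repository names, secrets, and the AWS registry address are placeholders; only the Tencent registry domain and the `plugins/docker` image are real, publicly documented names):

```yaml
# .drone.yml — build once, push the image to two registries
kind: pipeline
type: docker
name: build-and-push

steps:
  - name: push-tencent
    image: plugins/docker              # official Drone Docker plugin
    settings:
      registry: ccr.ccs.tencentyun.com
      repo: ccr.ccs.tencentyun.com/example/sim-sdk   # illustrative repo
      tags: ${DRONE_COMMIT_SHA:0:8}
      username: { from_secret: tcr_user }
      password: { from_secret: tcr_pass }

  - name: push-aws
    image: plugins/docker
    settings:
      registry: 123456789012.dkr.ecr.us-east-1.amazonaws.com  # illustrative
      repo: 123456789012.dkr.ecr.us-east-1.amazonaws.com/example/sim-sdk
      tags: ${DRONE_COMMIT_SHA:0:8}
      username: { from_secret: ecr_user }
      password: { from_secret: ecr_pass }
```

Tagging by commit SHA keeps the two registries in lockstep, so a job scheduled on either cloud pulls bit‑identical code.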

K8s scalability limits (≤5 000 nodes, ≤150 000 pods, ≤300 000 containers per cluster) are weighed against XtalPi’s typical workload of 100‑3 000 nodes and up to 100 000 tasks submitted at once.

Common bottlenecks include API‑server request timeouts, scheduler latency, DNS overload, and etcd performance under massive job queues.

Strategies such as custom master sizing, SSD‑backed etcd, and pod‑affinity rules are employed to improve stability.
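A pod‑affinity rule can, for example, spread the pods of one batch across nodes so a single node failure does not wipe out a whole group. A minimal sketch using standard Kubernetes scheduling fields (the label key and value are illustrative):

```yaml
# Pod spec fragment: prefer spreading pods of the same batch
# across distinct nodes (label key/value are illustrative)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              job-group: screening-batch-7
          topologyKey: kubernetes.io/hostname
```

Using the `preferred` rather than `required` form keeps the scheduler free to co‑locate pods when the cluster is nearly full, which matters at the 80‑90% utilization levels described above.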

To address long‑running HPC jobs, XtalPi implements checkpoint‑restart mechanisms: intermediate results are stored in object storage, allowing failed pods to resume without re‑computing completed work.

Future plans involve serverless Lambda for short‑duration tasks, more sophisticated workflow orchestration with explicit job dependencies, and leveraging bare‑metal instances with high‑speed RDMA/InfiniBand networks for MPI‑intensive workloads.

Tags: cloud computing, High Performance Computing, Kubernetes, containerization, Mesos, Tencent Cloud, HPC
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
