Scaling Zhongtong Cloud: From Single‑Cluster to Multi‑Cluster Governance

Drawing from Yang Xiaofei’s SACC2022 talk, this article details Zhongtong Cloud’s two‑year journey from initial containerization to a multi‑cluster architecture, covering challenges, custom scheduler extensions, fixed‑IP handling, container crash‑site preservation, node rebalancing, application migration, cross‑cluster load balancing, and future plans for unified gateways.

ITPUB

Background and Challenges

Before 2019, Zhongtong Cloud’s DevOps platform ran almost entirely on virtual machines (VMs). Starting in 2019 the team began a containerization effort, connecting clusters with BGP and supporting mixed VM‑container workloads. By mid‑2022 most production workloads were running in containers across six large clusters, with >18,000 Pods on >700 nodes. The main challenges were:

Strict security requirements for external access (all outbound traffic must be approved).

Single‑cluster failure risk – a fault in one cluster could affect the whole business.

Integration with an existing, separate DevOps platform.

Fixed‑IP requirements for a few legacy services (e.g., Snowflake‑algorithm service).

Resource scarcity and low overall utilization (≈15% before containerization).

Native Kubernetes scheduler could not reflect real‑time node load, leading to imbalance.

Loss of crash‑site information when containers were restarted.

From Single‑Cluster to Multi‑Cluster

Real‑time Load‑Aware Scheduler

The team built a custom scheduler extension that reads per‑node load metrics written to node annotations by a collector. The scheduler parses the annotation, computes a score for each node, and adjusts the nodeScore used by the default scheduler. This enables decisions based on actual CPU, memory and network usage rather than static requests / limits.
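The scoring step can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the annotation key `example.com/node-load`, its JSON layout, the weights, and the 50/50 blend with the default score are all assumptions made for the example.

```python
import json

# Hypothetical annotation key and JSON layout; the talk does not name the real ones.
LOAD_ANNOTATION = "example.com/node-load"


def load_score(annotations, weights=(0.5, 0.3, 0.2)):
    """Return a 0-100 score from a node's load annotation (higher = less loaded).

    Returns None when the annotation is missing or malformed, so the caller
    can fall back to the default scheduler's score.
    """
    raw = annotations.get(LOAD_ANNOTATION)
    if raw is None:
        return None
    try:
        load = json.loads(raw)
        usage = [float(load[key]) for key in ("cpu", "memory", "network")]
    except (ValueError, KeyError, TypeError):
        return None
    if any(u < 0 or u > 1 for u in usage):
        return None  # utilization is expected as a fraction between 0 and 1
    weighted = sum(w * u for w, u in zip(weights, usage))
    return round(100 * (1 - weighted))


def adjust_scores(default_scores, node_annotations, blend=0.5):
    """Blend each node's default score with its live-load score."""
    adjusted = {}
    for node, base in default_scores.items():
        live = load_score(node_annotations.get(node, {}))
        if live is None:
            adjusted[node] = base  # no usable load data: keep the default score
        else:
            adjusted[node] = round((1 - blend) * base + blend * live)
    return adjusted
```

Returning `None` for missing or malformed data, rather than a score of zero or 100, keeps a broken collector from silently making a node look either idle or saturated.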

Fixed‑IP Support via IPAM

An internal IP Address Management (IPAM) service reserves a pool of static IPs and annotates the corresponding nodes. Pods that require a stable address request an IP from IPAM; the scheduler pins the pod to a node that holds the reserved IP, guaranteeing the pod’s IP does not change across restarts.
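A toy, in-memory version of that reservation logic might look like the sketch below. The class name, the pod-key format, and the node-to-IP mapping are hypothetical; a real IPAM service would persist its state and coordinate with the CNI plugin.

```python
import ipaddress


class FixedIPPool:
    """In-memory stand-in for an IPAM service handing out reserved static IPs."""

    def __init__(self, reserved):
        # reserved: {node_name: [ip strings reserved on that node]} (assumed layout)
        self._free = {
            node: sorted(ipaddress.ip_address(ip) for ip in ips)
            for node, ips in reserved.items()
        }
        self._bound = {}  # pod key -> (node, ip)

    def allocate(self, pod_key):
        """Return (node, ip) for the pod; repeated calls return the same binding."""
        if pod_key in self._bound:
            return self._bound[pod_key]
        for node in sorted(self._free):
            if self._free[node]:
                ip = self._free[node].pop(0)
                self._bound[pod_key] = (node, str(ip))
                return self._bound[pod_key]
        raise RuntimeError("no reserved fixed IP available")

    def release(self, pod_key):
        """Return the pod's IP to the pool once the workload is deleted for good."""
        node, ip = self._bound.pop(pod_key)
        self._free[node].append(ipaddress.ip_address(ip))
        self._free[node].sort()
```

The key property is that `allocate` is idempotent per pod key: a restarted pod asks again and gets the same node and address, which is what lets the scheduler pin it.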

Container Crash‑Site Preservation

To avoid losing diagnostic data when a container crashes, a CRI hook is registered on the host. The hook performs the following steps:

Invoke a dump command inside the container to capture stack traces, heap dumps, or core files.

Use crictl to cut the container’s network interface, isolating the faulty instance.

Mark the workload as “failed” so the higher‑level controller (Deployment, StatefulSet) can create a replacement pod.

This approach preserves the failure context without relying on ephemeral containers, which require a newer Kubernetes version.
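The ordering of the three hook steps can be sketched as a small orchestration function. The actual dump, isolation, and marking commands are not given in the talk, so they are passed in here as hypothetical callables instead of being spelled out.

```python
def preserve_crash_site(container_id, dump, isolate, mark_failed):
    """Run the three preservation steps in order and report each outcome.

    dump, isolate and mark_failed are caller-supplied callables (hypothetical
    stand-ins for the real commands); each receives the container ID.
    """
    results = {}
    steps = (("dump", dump), ("isolate", isolate), ("mark_failed", mark_failed))
    for name, step in steps:
        try:
            step(container_id)
            results[name] = "ok"
        except Exception as exc:
            # Keep going: a failed dump should not leave a faulty instance
            # serving traffic or block its replacement.
            results[name] = f"error: {exc}"
    return results
```

Continuing after a failed step is a design choice of this sketch, not something the source specifies.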

Secondary Node Balancing Job

Even with the load‑aware scheduler, occasional load spikes persisted: node‑exporter sometimes reported zero load for a node, so the scheduler treated it as idle and kept placing pods on it until it was overloaded. To mitigate this, a periodic job (implemented as a Kubernetes CronJob) runs a lightweight descheduler‑like algorithm:

Collects real‑time load from node annotations.

Identifies pods on overloaded nodes (CPU > 80% or memory > 75%).

Applies anti‑affinity rules and a whitelist/blacklist of namespaces and apps.

Evicts selected pods, allowing the scheduler to place them on less‑loaded nodes.

The job runs every 5 minutes and logs actions to a central audit store.
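The selection logic of such a job might look like the sketch below. The 80% CPU and 75% memory thresholds come from the description above; the `Pod` fields, the one-eviction-per-node cap, and the "largest CPU consumer first" ordering are assumptions, and anti-affinity checks are left out for brevity.

```python
from dataclasses import dataclass

CPU_LIMIT = 0.80     # thresholds from the article
MEMORY_LIMIT = 0.75


@dataclass(frozen=True)
class Pod:
    name: str
    namespace: str
    app: str
    node: str
    cpu: float  # fraction of the node's CPU this pod is using (assumed metric)


def is_overloaded(load):
    return load["cpu"] > CPU_LIMIT or load["memory"] > MEMORY_LIMIT


def pick_evictions(node_load, pods, blocked_namespaces=frozenset(),
                   blocked_apps=frozenset(), max_per_node=1):
    """Choose pods to evict from overloaded nodes, honoring the blocklists."""
    picks = []
    for node, load in node_load.items():
        if not is_overloaded(load):
            continue
        candidates = [
            p for p in pods
            if p.node == node
            and p.namespace not in blocked_namespaces
            and p.app not in blocked_apps
        ]
        # Evict the heaviest eligible pods first, capped per node per run.
        candidates.sort(key=lambda p: p.cpu, reverse=True)
        picks.extend(candidates[:max_per_node])
    return picks
```

Capping evictions per node per run keeps a 5-minute job from draining a node in one pass and causing the load to oscillate between nodes.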

Multi‑Cluster Governance

Application Migration

Replica counts can be weighted per cluster. When a new cluster is added, a migration controller redistributes replicas according to the configured weights; once all resources (ConfigMaps, Secrets, PVCs) have been copied to the target cluster, the source can be decommissioned.
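Weighted redistribution can be illustrated with a largest-remainder split, shown below. This is one plausible way to turn weights into whole replica counts, not necessarily the controller's method; integer weights and the cluster names are assumptions.

```python
def split_replicas(total, weights):
    """Split `total` replicas across clusters in proportion to integer weights.

    Uses the largest-remainder method so the counts always sum to `total`.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one cluster needs a positive weight")
    # Integer arithmetic avoids floating-point rounding surprises.
    plan = {c: total * w // weight_sum for c, w in weights.items()}
    leftover = total - sum(plan.values())
    # Hand the remaining replicas to the clusters with the largest remainders;
    # ties are broken by cluster name to keep the result deterministic.
    order = sorted(weights, key=lambda c: (-(total * weights[c] % weight_sum), c))
    for cluster in order[:leftover]:
        plan[cluster] += 1
    return plan
```

Migration then becomes a sequence of weight changes: for example, moving an application from weights `{"old": 1, "new": 0}` through intermediate ratios to `{"old": 0, "new": 1}`, at which point the old cluster holds no replicas.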

Application Scoring

A scoring engine evaluates each service on dimensions such as department, product line, CPU usage, memory usage, replica count, and deviation from defined thresholds. Scores drive remediation tickets and prioritize resource‑rebalancing actions.
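A deviation-based score of this kind could be computed as in the sketch below. The target utilization bands, the 100-point scale, and the field names are illustrative assumptions; the talk does not give the actual formula.

```python
from dataclasses import dataclass


@dataclass
class ServiceUsage:
    name: str
    department: str
    cpu_usage: float     # used / requested, as a fraction (assumed definition)
    memory_usage: float  # used / requested, as a fraction (assumed definition)
    replicas: int


def deviation(value, low, high):
    """Distance of `value` outside the band [low, high]; 0 when inside it."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def score(service, cpu_band=(0.3, 0.7), memory_band=(0.3, 0.8)):
    """Return 0-100; 100 means both usages sit inside their target bands."""
    penalty = (deviation(service.cpu_usage, *cpu_band)
               + deviation(service.memory_usage, *memory_band))
    return max(0, round(100 - 100 * penalty))


def worst_first(services):
    """Order services for remediation: lowest score first, larger fleets first on ties."""
    return sorted(services, key=lambda s: (score(s), -s.replicas))
```

Grouping the sorted output by `department` would give the per-team view that the remediation tickets are driven from.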

Cross‑Cluster Load Balancing

For workloads managed by Horizontal Pod Autoscaler (HPA), a cross‑cluster controller aggregates remaining requests across clusters and distributes new replicas proportionally. The controller also supports:

Cron‑based HPA adjustments for predictable load patterns.

Future message‑queue‑driven scaling (e.g., based on Kafka lag).
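The proportional placement of new replicas can be sketched as follows, reading "remaining requests" as each cluster's remaining requestable CPU. That reading, the millicore units, and the function name are assumptions made for this example.

```python
def place_new_replicas(new_replicas, remaining_millicores, request_millicores):
    """Spread new replicas across clusters in proportion to free capacity.

    remaining_millicores: {cluster: CPU still available for requests}
    request_millicores:   CPU request of one replica
    Returns (plan, unplaced), where `unplaced` counts replicas that fit nowhere.
    """
    if request_millicores <= 0:
        raise ValueError("request_millicores must be positive")
    # How many replicas each cluster could still accept.
    slots = {c: free // request_millicores for c, free in remaining_millicores.items()}
    total_slots = sum(slots.values())
    placeable = min(new_replicas, total_slots)
    if placeable <= 0:
        return {c: 0 for c in slots}, max(new_replicas, 0)
    plan = {c: placeable * s // total_slots for c, s in slots.items()}
    leftover = placeable - sum(plan.values())
    # Largest remainders get the rounding leftovers; names break ties.
    order = sorted(slots, key=lambda c: (-(placeable * slots[c] % total_slots), c))
    for cluster in order[:leftover]:
        plan[cluster] += 1
    return plan, new_replicas - placeable
```

Because the split is proportional to free slots and never exceeds their total, no cluster is assigned more replicas than it can fit; anything beyond total capacity is reported back instead of being silently dropped.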

Success and Outcomes

Key results after adopting the multi‑cluster practice:

Resource utilization rose from ~15% before containerization to ~40% of cluster capacity.

Legacy VMs and physical machines were reclaimed as their workloads moved to containers, reducing infrastructure cost.

Full lifecycle management (deploy, rollback, scaling) became faster than the previous VM‑centric process.

Multi‑cluster control enables adding or removing clusters with minimal downtime.

Remaining technical gaps include version drift among clusters, tighter integration with a unified ingress gateway, and more expressive network‑policy handling for east‑west traffic.

Future Plans

Planned enhancements are:

Deploy a unified multi‑cluster ingress gateway that exposes pod IPs, simplifying north‑south traffic and enabling direct pod‑level debugging.

Leverage idle resources during off‑peak periods for batch jobs and big‑data offline tasks.

Extend support for Dubbo services, mixed VM‑container workloads, and richer monitoring integrations (Prometheus, centralized logging).
