Scaling Zhongtong Cloud: From Single‑Cluster to Multi‑Cluster Governance
Drawing from Yang Xiaofei’s SACC2022 talk, this article details Zhongtong Cloud’s two‑year journey from initial containerization to a multi‑cluster architecture, covering challenges, custom scheduler extensions, fixed‑IP handling, container crash‑site preservation, node rebalancing, application migration, cross‑cluster load balancing, and future plans for unified gateways.
Background and Challenges
Before 2019, Zhongtong Cloud’s DevOps platform ran almost entirely on virtual machines (VMs). In 2019 the team began a containerization effort, connecting clusters with BGP and supporting mixed VM‑container workloads. By mid‑2022 most production workloads were running in containers across six large clusters, with more than 18,000 Pods on more than 700 nodes. The main challenges were:
Strict security requirements for external access (all outbound traffic must be approved).
Single‑cluster failure risk – a fault in one cluster could affect the whole business.
Integration with an existing, separate DevOps platform.
Fixed‑IP requirements for a few legacy services (e.g., Snowflake‑algorithm service).
Resource scarcity and low overall utilization (≈15% before containerization).
The native Kubernetes scheduler places Pods by declared requests and does not reflect real‑time node load, leading to imbalance.
Loss of crash‑site information when containers were restarted.
From Single‑Cluster to Multi‑Cluster
Real‑time Load‑Aware Scheduler
The team built a custom scheduler extension that reads per‑node load metrics written to node annotations by a collector. The scheduler parses the annotation, computes a score for each node, and adjusts the nodeScore used by the default scheduler. This enables decisions based on actual CPU, memory and network usage rather than static requests / limits.
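The scoring step can be sketched as follows. This is a minimal illustration, not the team's implementation: the annotation key `example.com/node-load`, the JSON payload layout, the metric weights and the blending factor are all assumptions, since the talk summary does not specify them. The only borrowed fact is the 0 to 100 score range used by the Kubernetes scheduling framework.

```python
import json

# Hypothetical annotation key and payload format; the source does not specify them.
LOAD_ANNOTATION = "example.com/node-load"
MAX_NODE_SCORE = 100  # same 0-100 range the Kubernetes scheduling framework uses

# Assumed relative weights for CPU, memory and network usage.
WEIGHTS = {"cpu": 0.5, "mem": 0.3, "net": 0.2}


def load_score(annotations):
    """Return a 0-100 score from a node's load annotation; higher means less loaded."""
    raw = annotations.get(LOAD_ANNOTATION)
    if raw is None:
        return None  # no metrics published: the caller decides what to do
    usage = json.loads(raw)  # e.g. {"cpu": 35.0, "mem": 60.0, "net": 10.0}, in percent
    # Clamp each metric to 0-100 before weighting so bad samples cannot skew the score.
    weighted = sum(
        WEIGHTS[k] * min(max(float(usage[k]), 0.0), 100.0) for k in WEIGHTS
    )
    return round(MAX_NODE_SCORE - weighted)


def adjust_node_score(default_score, annotations, load_weight=0.7):
    """Blend the default scheduler score with the load-based score."""
    score = load_score(annotations)
    if score is None:
        return default_score  # fall back to the default score when load is unknown
    return round((1 - load_weight) * default_score + load_weight * score)
```

Falling back to the default score when the annotation is missing is one possible choice; a stricter variant could instead penalize nodes whose metrics are absent.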
Fixed‑IP Support via IPAM
An internal IP Address Management (IPAM) service reserves a pool of static IPs and annotates the corresponding nodes. Pods that require a stable address request an IP from IPAM; the scheduler pins the pod to a node that holds the reserved IP, guaranteeing the pod’s IP does not change across restarts.
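The allocation logic described above can be illustrated with a toy in‑memory model. The class name `FixedIPAM`, its methods and the workload key format are hypothetical; the real IPAM service and its API are not described in the source. The sketch shows only the key property: a workload that asks again gets the same IP and the same node.

```python
class FixedIPAM:
    """Toy in-memory IPAM; a hypothetical stand-in for the internal service."""

    def __init__(self, ip_to_node):
        self.ip_to_node = dict(ip_to_node)  # reserved IP -> node that holds it
        self.bindings = {}                  # workload key -> reserved IP

    def allocate(self, workload):
        """Return (ip, node) for a workload, reusing any earlier binding."""
        if workload not in self.bindings:
            used = set(self.bindings.values())
            free = sorted(ip for ip in self.ip_to_node if ip not in used)
            if not free:
                raise RuntimeError("no reserved IP left in the pool")
            self.bindings[workload] = free[0]
        ip = self.bindings[workload]
        # The scheduler would pin the pod to this node, e.g. via node affinity.
        return ip, self.ip_to_node[ip]

    def release(self, workload):
        """Return the workload's IP to the pool when the service is retired."""
        self.bindings.pop(workload, None)
```

Calling `allocate("prod/snowflake-0")` before and after a restart returns the same pair, which is what keeps the pod's address stable.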
Container Crash‑Site Preservation
To avoid losing diagnostic data when a container crashes, a CRI hook is registered on the host. The hook performs the following steps:
Invoke a dump command inside the container to capture stack traces, heap dumps, or core files.
Use crictl to cut the container’s network interface, isolating the faulty instance.
Mark the workload as “failed” so the higher‑level controller (Deployment, StatefulSet) can create a replacement pod.
This approach preserves the failure context without relying on Kubernetes ephemeral containers (sometimes translated as “temporary containers”), a feature that requires a newer Kubernetes version.
Secondary Node Balancing Job
Even with the load‑aware scheduler, occasional load spikes persisted: node‑exporter sometimes reported zero load for a node, which made that node look idle, so the scheduler kept placing Pods on it until it was overloaded. To mitigate this, a periodic job (implemented as a Kubernetes CronJob) runs a lightweight descheduler‑like algorithm:
Collects real‑time load from node annotations.
Identifies pods on overloaded nodes (CPU > 80% or memory > 75%).
Applies anti‑affinity rules and a whitelist/blacklist of namespaces and apps.
Evicts selected pods, allowing the scheduler to place them on less‑loaded nodes.
The job runs every 5 minutes and logs actions to a central audit store.
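The candidate‑selection part of such a job might look like the sketch below. It covers only the threshold check and a namespace blacklist; anti‑affinity handling and the actual eviction call are omitted. The `Pod` fields, the `protected_namespaces` parameter, the per‑node eviction cap and the "heaviest CPU requester first" ordering are assumptions made for illustration.

```python
from dataclasses import dataclass

CPU_LIMIT = 80.0  # percent, from the thresholds listed above
MEM_LIMIT = 75.0  # percent


@dataclass
class Pod:
    name: str
    namespace: str
    node: str
    cpu_millicores: int  # CPU request, used here only to rank candidates


def pick_evictions(node_load, pods, protected_namespaces=frozenset(), max_per_node=1):
    """Return names of pods to evict from overloaded nodes.

    node_load maps node name -> {"cpu": percent, "mem": percent}.
    """
    overloaded = {
        node
        for node, load in node_load.items()
        if load["cpu"] > CPU_LIMIT or load["mem"] > MEM_LIMIT
    }
    evictions = []
    for node in sorted(overloaded):
        candidates = [
            p for p in pods
            if p.node == node and p.namespace not in protected_namespaces
        ]
        # Assumed heuristic: move the heaviest CPU requesters first.
        candidates.sort(key=lambda p: p.cpu_millicores, reverse=True)
        evictions.extend(p.name for p in candidates[:max_per_node])
    return evictions
```

Capping evictions per node on each run keeps the job from draining a node in one pass; with a 5‑minute interval, the next run can re‑evaluate the load before moving more pods.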
Multi‑Cluster Governance
Application Migration
Replica counts can be weighted per cluster. When a new cluster is added, a migration controller redistributes an application’s replicas according to the configured weights. To retire a source cluster, its weight is reduced step by step; the controller then copies all remaining resources (ConfigMaps, Secrets, PVCs) to the target cluster and the source is decommissioned.
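A weighted split of this kind can be sketched with integer arithmetic. The function name `split_replicas` and the largest‑remainder rounding rule are assumptions; the source says only that replicas follow configured weights.

```python
def split_replicas(total, weights):
    """Split `total` replicas across clusters in proportion to integer weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one cluster needs a positive weight")
    # Whole replicas each cluster gets, plus the remainder lost to rounding down.
    base = {c: total * w // weight_sum for c, w in weights.items()}
    rem = {c: total * w % weight_sum for c, w in weights.items()}
    leftover = total - sum(base.values())
    # Hand leftover replicas to the largest remainders; ties break by cluster name.
    order = sorted(weights, key=lambda c: (-rem[c], c))
    for cluster in order[:leftover]:
        base[cluster] += 1
    return base
```

Under these assumptions, a migration is a sequence of weight changes: for example `{"old": 3, "new": 1}`, then `{"old": 1, "new": 1}`, and finally `{"old": 0, "new": 1}`, at which point the old cluster holds no replicas and can be drained.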
Application Scoring
A scoring engine evaluates each service on dimensions such as department, product line, CPU usage, memory usage, replica count, and deviation from defined thresholds. Scores drive remediation tickets and prioritize resource‑rebalancing actions.
Cross‑Cluster Load Balancing
For workloads managed by the Horizontal Pod Autoscaler (HPA), a cross‑cluster controller aggregates each cluster’s remaining requestable resources (capacity not yet claimed by Pod requests) and distributes new replicas in proportion to them. The controller also supports:
Cron‑based HPA adjustments for predictable load patterns.
Future message‑queue‑driven scaling (e.g., based on Kafka lag).
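One way to read the proportional distribution is sketched below: each cluster reports how much CPU is still unclaimed, and new replicas are spread in proportion to how many replicas each cluster can still fit. The function name, the use of CPU millicores as the only dimension, and the tie‑breaking rule are assumptions, not details from the talk.

```python
def place_new_replicas(new_replicas, remaining_cpu, cpu_per_replica):
    """Distribute new replicas in proportion to each cluster's remaining CPU.

    remaining_cpu maps cluster name -> unclaimed CPU in millicores.
    """
    # How many more replicas each cluster can fit.
    room = {c: cpu // cpu_per_replica for c, cpu in remaining_cpu.items()}
    total_room = sum(room.values())
    if new_replicas > total_room:
        raise RuntimeError("not enough remaining capacity across clusters")
    if total_room == 0:
        return {c: 0 for c in room}  # nothing requested and nothing available
    plan = {c: new_replicas * r // total_room for c, r in room.items()}
    # Integer division rounds down, so place leftovers where the most room remains.
    for _ in range(new_replicas - sum(plan.values())):
        target = max(sorted(room), key=lambda c: room[c] - plan[c])
        plan[target] += 1
    return plan
```

Because a cluster never receives more replicas than its remaining room, a nearly full cluster absorbs little of a scale‑up while emptier clusters take most of it.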
Success and Outcomes
Key results after adopting the multi‑cluster practice:
Container utilization rose from ~15% to ~40% of cluster capacity.
Legacy VM and physical‑machine workloads were reclaimed, reducing infrastructure cost.
Full lifecycle management (deploy, rollback, scaling) became faster than the previous VM‑centric process.
Multi‑cluster control enables adding or removing clusters with minimal downtime.
Remaining technical gaps include version drift among clusters, tighter integration with a unified ingress gateway, and more expressive network‑policy handling for east‑west traffic.
Future Plans
Planned enhancements are:
Deploy a unified multi‑cluster ingress gateway that exposes pod IPs, simplifying north‑south traffic and enabling direct pod‑level debugging.
Leverage idle resources during off‑peak periods for batch jobs and big‑data offline tasks.
Extend support for Dubbo services, mixed VM‑container workloads, and richer monitoring integrations (Prometheus, centralized logging).
