How Guangdong Mobile Built a Resilient Container Cloud from Scratch
This article details Guangdong Mobile's end‑to‑end journey of designing, constructing, and operating a production‑grade container cloud platform, covering architecture decisions, monitoring, logging, high‑availability, scaling, network optimization, upgrade challenges, and lessons learned for cloud‑native practitioners.
Why is it called "Piercing Thorns"?
In 2018, when container cloud technologies were flourishing, Guangdong Mobile repeatedly hit difficulties during adoption and transformation: the team lacked experience, investment, and senior engineers, and had to rely on itself for architecture, design, technology selection, and implementation.
The team used fully open‑source technologies, which evolve rapidly and demand strong technical reserves, while also needing to ensure stability, reliability, monitoring, and a simple user experience for developers of varying skill levels.
Building the Container Cloud from 0 to 1
Starting in 2017‑2018, a small cluster was built for pilot projects, and the DevOps platform was tightly integrated with the container cloud for a joint launch.
Initially the cluster was simple; 2019 focused on operational experience and tooling, while 2020 emphasized promotion, version upgrades, scaling, architectural and configuration optimization, and exploration of new technologies.
CaaS First Generation Launch
The first version used the mature Docker + K8s stack with a highly available K8s cluster; limited hardware forced etcd to be stacked onto the control-plane nodes, saving resources at the cost of reduced fault tolerance.
High‑availability load balancing was achieved with Nginx+Keepalived, and Calico was chosen for IPv6 and network policy support. A CephFS cluster was added for shared persistent storage.
CaaS Monitoring System
Monitoring is essential for stability. The team initially deployed Prometheus but struggled with fragmented configuration templates. Adoption of kube‑prometheus simplified monitoring, covering 80% of scenarios.
What metrics to monitor? Where do metrics come from? When should alerts fire? These questions were initially unclear.
The monitoring pipeline consists of four stages:
Metric collection from K8s components, node‑exporter, kube‑state‑metrics, ServiceMonitor and custom PodMonitor.
Visualization via Grafana querying Prometheus.
Alert generation using Prometheus CRDs, routed to Alertmanager, then dispatched via email, SMS, or webhook.
Alert closure through a custom intelligent operations platform.
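The webhook dispatch step in the pipeline above can be sketched as a small payload handler. The JSON shape follows Alertmanager's standard webhook format; the summary line chosen here (and the sample alert) are illustrative, not the team's actual format.

```python
import json

def summarize_alerts(payload: str) -> list[str]:
    """Turn an Alertmanager webhook payload into short, human-readable lines."""
    body = json.loads(payload)
    lines = []
    for alert in body.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(
            f"[{alert.get('status', 'unknown').upper()}] "
            f"{labels.get('alertname', '?')} on {labels.get('instance', '?')}: "
            f"{alert.get('annotations', {}).get('summary', '')}"
        )
    return lines

sample = json.dumps({
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "KubeNodeNotReady", "instance": "node-3"},
        "annotations": {"summary": "Node node-3 has been NotReady for 5m"},
    }]
})
print(summarize_alerts(sample))
```

A receiver like this is typically the bridge between Alertmanager and an in-house operations platform such as the one mentioned above.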
CaaS Unified Log Management
Logs are scattered across nodes, containers, and clusters, making retrieval inefficient. The solution is an ELK-style pipeline: Filebeat collects logs from four sources and forwards them to Logstash, which parses them and stores them in Elasticsearch for Kibana-based search and visualization. The four sources are:
Component logs (docker/kubelet)
APIServer audit logs
K8s events (persisted via kube-event-exporter)
Container stdout/stderr
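For the container stdout/stderr source, Docker's json-file log driver writes one JSON object per line with `log`, `stream`, and `time` fields. A minimal sketch of the parse step Logstash performs before indexing (field names on the output side are illustrative):

```python
import json

def parse_docker_log(line: str) -> dict:
    """Parse one line from Docker's json-file log driver into indexable fields."""
    record = json.loads(line)
    return {
        "message": record["log"].rstrip("\n"),  # strip the trailing newline dockerd adds
        "stream": record["stream"],             # stdout or stderr
        "@timestamp": record["time"],           # RFC 3339 timestamp written by dockerd
    }

raw = '{"log":"GET /healthz 200\\n","stream":"stdout","time":"2020-06-01T08:00:00.123456789Z"}'
print(parse_docker_log(raw))
```

In production this parsing is expressed as a Logstash filter rather than application code, but the field mapping is the same.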
Dual‑Active Centers (Two‑Region Active‑Active)
As more systems migrated to the container cloud, version upgrades became risky: the team could no longer experiment freely on a cluster carrying production traffic. A second K8s cluster was built in another data center, enabling seamless traffic switchover if one cluster failed.
A high‑availability Nginx layer distributes traffic, and stateful services (databases, middleware, image registries) are decoupled to keep both clusters consistent.
Network latency between regions is mitigated by reducing external service calls and using Istio multi‑cluster mode for cross‑cluster service discovery and failover.
CaaS Scaling Issues
Resource exhaustion required node addition, but new nodes could not access external services due to a restrictive firewall policy that only allowed traffic from the master node. Adjusting the policy resolved the issue, and a more automated approach was later adopted.
CaaS Network Architecture Optimization
The cluster uses Calico for networking. Initially BGP mode was chosen for efficiency, but when a node in a different subnet was added, its pods could not communicate: in BGP mode the generated routes pointed at a next hop that was not directly reachable across the subnet boundary. Switching to IPIP CrossSubnet mode restored connectivity, encapsulating only the traffic that crosses subnets.
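The CrossSubnet decision can be sketched as a simple routing check: encapsulate only when the peer node is outside the local subnet. This is a simplified model (it assumes all node links share one prefix length), not Calico's actual implementation.

```python
import ipaddress

def needs_encapsulation(local_node: str, peer_node: str, prefix: int = 24) -> bool:
    """IPIP CrossSubnet behavior, simplified: tunnel only when the peer node
    sits in a different subnet than the local node."""
    local_net = ipaddress.ip_interface(f"{local_node}/{prefix}").network
    return ipaddress.ip_address(peer_node) not in local_net

# Same /24: traffic is routed natively, no tunnel overhead.
print(needs_encapsulation("10.0.1.10", "10.0.1.20"))  # False
# Different subnet: traffic is IPIP-encapsulated so it can cross the gateway.
print(needs_encapsulation("10.0.1.10", "10.0.2.20"))  # True
```

This is why CrossSubnet mode keeps BGP-level efficiency inside a rack while still working across heterogeneous subnets.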
Calico also supports IPv6 over VXLAN, which requires version 3.23 or later.
Continuous Optimization of Access Experience
With most production systems already on the platform, the focus shifted to simplifying application publishing and ensuring stable operation after migration.
Current Access Status
Most major production systems have been containerized, and many have achieved 100% migration.
Automatic CI/CD Pipelines
The DevOps platform is integrated with the container cloud, enabling end‑to‑end automated deployment after code push, with per‑application and per‑environment switches for flexible rollout control.
Application Orchestration
Initially users supplied raw K8s YAML files, which proved cumbersome. The team introduced Helm charts to provide a base deployment framework, later adding a visual orchestration UI backed by a Go client to the K8s API.
Application Deployment to Cloud
Users configure pipelines that trigger one‑click deployment to the container cloud, after which the application lifecycle is managed via a centralized application management console.
Business High Availability
Container‑level health probes (liveness and readiness).
Application‑level multi‑replica redundancy.
Horizontal and Vertical Pod Autoscaling (HPA/VPA) for dynamic scaling.
Regional failover across active‑active clusters.
HPA
HPA targets stateless workloads, scaling replica counts based on metrics from Prometheus (custom metrics are preferred over resource metrics for flexibility).
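The scaling decision follows the documented HPA algorithm: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no action when the ratio is within the default 10% tolerance. A minimal sketch:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA scaling rule: ceil(current * metric / target), skipping the change
    when the ratio is within the default 10% tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, avoid flapping
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, current_metric=200, target_metric=100))  # load doubled -> 8
print(desired_replicas(4, current_metric=105, target_metric=100))  # within tolerance -> 4
```

The tolerance band is what prevents constant small oscillations when a custom metric hovers near its target.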
VPA
VPA adjusts resource requests for stateful workloads that typically run a single replica, using historical usage data to recommend new settings and performing pod eviction/recreation when necessary.
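A rough sketch of how a recommendation can be derived from historical usage: take a high percentile of observed consumption and add a safety margin. The real VPA recommender uses a decaying histogram with separate target/lower/upper bounds; the percentile and margin values here are illustrative.

```python
def recommend_request(usage_samples: list[float], percentile: float = 0.9,
                      safety_margin: float = 0.15) -> float:
    """VPA-style target, simplified: a high percentile of observed usage
    plus a safety margin (real VPA uses a decaying histogram)."""
    ordered = sorted(usage_samples)
    idx = min(int(percentile * len(ordered)), len(ordered) - 1)
    return round(ordered[idx] * (1 + safety_margin), 3)

# CPU cores observed for a single-replica stateful workload over time.
cpu_cores_seen = [0.2, 0.25, 0.3, 0.3, 0.35, 0.4, 0.4, 0.45, 0.5, 0.9]
print(recommend_request(cpu_cores_seen))
```

Because applying a new request means evicting and recreating the pod, recommendations like this are only acted on when the gap to the current request is large enough to justify the restart.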
Regional Fault Transfer
Istio multi‑cluster mode enables services in one cluster to discover healthy endpoints in the other cluster, allowing automatic traffic shift during failures.
Application Runtime Monitoring
Beyond the core kube‑prometheus stack, application runtime monitoring draws on four main metric sources, with a ServiceMonitor enabled for the Istio sidecars:
kube‑state‑metrics – application and resource state.
cAdvisor – container performance and resource usage.
Istio sidecar – traffic metrics for golden‑signal analysis.
ServiceMonitor/PodMonitor – custom metric sources.
Application Log Unified Management
Application logs are routed to stdout, collected by Filebeat, and processed through Logstash into Elasticsearch for unified search and visualization, mirroring the cluster‑wide log strategy.
K8s Upgrade Tips
Key lessons from ten-plus version upgrades include handling automatic pod restarts, avoiding multi-major-version jumps, checking API deprecations, verifying third-party compatibility, and replacing Dockershim with cri-dockerd from v1.24 onward.
Pods restart during upgrades; dual‑active clusters mitigate impact.
Do not skip major versions.
Audit API usage for removals.
Verify third‑party software compatibility.
Replace Dockershim post‑v1.24.
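The API-removal audit above can be automated by scanning manifests against the versions at which group/kind pairs were removed. The table below is a small illustrative subset (e.g. the v1.16 and v1.22 removal batches); consult the official deprecated-API migration guide for the full list.

```python
# Illustrative subset of K8s API removals: (apiVersion, kind) -> version removed.
REMOVED_IN = {
    ("extensions/v1beta1", "Deployment"): (1, 16),
    ("apps/v1beta2", "Deployment"): (1, 16),
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
}

def blocked_objects(manifests: list[dict], target_version: tuple) -> list[str]:
    """Return objects whose apiVersion/kind no longer exists at target_version."""
    problems = []
    for m in manifests:
        removed_at = REMOVED_IN.get((m["apiVersion"], m["kind"]))
        if removed_at is not None and target_version >= removed_at:
            problems.append(f'{m["kind"]}/{m["metadata"]["name"]} uses {m["apiVersion"]}')
    return problems

manifests = [
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
    {"apiVersion": "networking.k8s.io/v1beta1", "kind": "Ingress", "metadata": {"name": "web"}},
]
print(blocked_objects(manifests, target_version=(1, 22)))
```

Tools such as kubent and pluto perform this check against a live cluster; running one before every upgrade catches removals before the API server rejects the objects.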
Calico Mode Selection
While BGP offers high efficiency, IPIP CrossSubnet mode is safer for heterogeneous subnets, automatically enabling encapsulation when needed. IPv6 support requires VXLAN mode in newer Calico releases.
K8s and firewalld iptables Conflict
Restarting firewalld removes CNI‑injected DNAT rules and adds default DROP policies, breaking NodePort/HostPort/HostNetwork access. The recommendation is to disable firewalld or manage firewall rules via Calico policies.
Istio Multi‑Cluster Common Issues
With full multi-cluster service data, sidecar resource consumption roughly doubles; limiting each sidecar's visibility via the Sidecar resource or pilot-side filtering reduces the overhead. Mutual TLS across clusters requires a shared root certificate, and every pod must be fronted by a Service to join the mesh and benefit from fault detection.
Conclusion
The journey was challenging, especially the initial decision to start from scratch, but leveraging community experience and open‑source tools enabled Guangdong Mobile to successfully build a production‑grade container cloud. The team hopes the shared lessons help others avoid similar pitfalls.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and hope to accompany you throughout your operations career, growing together.