
How Meizu Built a Scalable Private Cloud with Kubernetes: Lessons and Practices

This article details Meizu’s private cloud platform built on Kubernetes, covering cluster architecture, single-image deployment, master and minion configurations, Calico networking, 4/7‑layer load balancing, monitoring with Prometheus, logging pipelines, automated deployment, multi‑datacenter strategies, and performance optimizations for a robust, low‑cost infrastructure.


Preface

Meizu’s container cloud platform is built on Kubernetes. This article introduces the practice from six angles: a basic introduction, the k8s cluster, the container network, external access via 4/7‑layer load balancing, monitoring/alerting/logging, and business release, images, and multi‑datacenter support.

1. Basic Introduction

The Meizu cloud platform is a private cloud designed to support online services, replacing traditional virtualization. By the end of 2017, three data centers had been built and 90% of services had been migrated within the year. A small team follows the k8s community closely, iterating quickly with low‑cost trial‑and‑error while making localized changes to address non‑functional requirements without blocking core system upgrades.

2. Kubernetes Cluster

2.1 Single Image

The cluster is installed and deployed using a single Docker image that packages all Kubernetes manifests, scripts, and binaries, enabling one‑click installation, rapid deployment, and upgrade.

2.2 Master

Core master components run as static Pods, so the kubelet loads them automatically and restarts them based on health probes. Upgrades are performed by bumping a unified image version. The controller‑manager and scheduler run across three physical master nodes with leader election for high availability; API Server HA can be achieved via a load balancer or DNS.
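As a sketch of the static‑Pod approach (the registry, image tag, and flags below are illustrative, not Meizu’s actual manifests), a kube‑apiserver manifest placed in the kubelet’s manifest directory might look like:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml -- picked up automatically by kubelet
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    # one unified image version; upgrading the cluster means bumping this tag
    image: registry.example.com/k8s/kube-apiserver:v1.9.0
    command:
    - kube-apiserver
    - --etcd-servers=https://10.0.0.1:2379
    - --secure-port=6443
    - --insecure-port=8080        # local-only health endpoint (k8s versions of that era)
    livenessProbe:                # kubelet restarts the component if it stops answering
      httpGet:
        host: 127.0.0.1
        port: 8080
        path: /healthz
```

Because the manifest lives on disk rather than in etcd, the kubelet can bring the control plane up even before the API server itself is available.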

Controller‑manager restarts may cause node‑state desynchronization; therefore alerts must be configured to monitor core component health.

2.3 Minion

Hardware is not fixed; a typical minion configuration is a 24‑core CPU (with hyper‑threading), 128 GB of memory, and a 1 Gbps NIC. Minions run business containers and system Pods. Optimizations cover interrupt handling, the TCP backlog, and swap settings. The OS is CentOS 7, Docker uses the devicemapper storage driver, and logs are stored on an emptyDir volume backed by an external disk without LVM, to avoid metadata issues.

The devicemapper driver triggered kernel issues, so a custom 3.16 kernel was compiled. Nodes are labeled with function, location, and rack information to aid Pod scheduling and node affinity.
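For instance (the label keys and values below are illustrative), a node might carry labels such as `function=business`, `idc=dc1`, and `rack=r03`, and a workload can pin itself to matching nodes via node affinity:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-business-pod      # hypothetical workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: idc              # schedule only into data center dc1
            operator: In
            values: ["dc1"]
          - key: function         # and only onto business-class minions
            operator: In
            values: ["business"]
  containers:
  - name: app
    image: registry.example.com/app:latest
```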

3. Container Network

Calico is used for networking. Hosts connect to core routers via BGP (or through a RouteReflector). The data plane uses pure layer‑3 routing; packets pass through netfilter and conntrack before reaching containers. Calico is deployed as a DaemonSet on k8s.

Optimization focuses on reducing conntrack pressure by preferring headless services and minimizing iptables rules. Conntrack usage is monitored, and containers actively ping their switches to verify connectivity; when Calico fails on a node, its containers are removed from service to preserve reliability.
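A headless Service bypasses kube-proxy’s virtual IP, so its traffic creates no extra iptables rules or conntrack entries; a minimal example (names illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: jetty-headless
spec:
  clusterIP: None       # headless: DNS returns Pod IPs directly, no NAT or conntrack
  selector:
    app: jetty
  ports:
  - port: 8080
```

Clients resolve the Service name in DNS and receive the backing Pod IPs directly, which is what keeps the data path out of conntrack.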

External traffic mainly passes through LVS; bypassing conntrack for LVS traffic reduces record overhead.

4. External Access – 4/7‑Layer Load Balancing

4.1 Layer‑4 Load Balancing

FullNAT LVS, Alibaba’s open‑source extension to LVS, is used for layer‑4 load balancing and supports both TCP and UDP. Client traffic reaches the LVS, which performs FullNAT and forwards to the minions; responses are routed back through the same LVS.

VIPs and ECMP routes are generated automatically; VirtualServer configurations map to endpoint IPs without manual intervention.

The LVS control program was extended to expose metrics for Grafana visualization and alerting. Alerts cover both overall LVS traffic anomalies and high‑latency real servers (Pods).

4.2 Layer‑7 Load Balancing

Layer‑7 load balancing runs nginx plus an ingress controller inside dedicated Pods, acting as a business‑specific reverse proxy that can auto‑scale. It routes traffic to Jetty business containers.
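Routing from the nginx ingress layer to the Jetty containers can be expressed with a standard Ingress resource (hostname and service names are illustrative; `extensions/v1beta1` was the Ingress API group in the k8s versions of that era):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: jetty-app
spec:
  rules:
  - host: app.example.com        # business-specific virtual host
    http:
      paths:
      - path: /
        backend:
          serviceName: jetty-app # Service fronting the Jetty business Pods
          servicePort: 8080
```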

Because Full‑nat LVS hides the client IP, the TOA module is used to retrieve the original IP from TCP options. The TOA module was ported to kernel 3.16 to support this requirement.

After containerizing nginx, high latency was observed due to suboptimal worker count and affinity settings. By default, a 24‑core node configures 24 workers, but CPU limits are often set to 5‑6 cores, causing resource contention. Adjusting worker numbers and affinity reduced latency.
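One way to align workers with the container’s CPU limit is to ship an nginx.conf whose worker count matches the limit rather than the host’s core count; a sketch via a ConfigMap (all values illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    # match worker count to the Pod's CPU limit (e.g. 6 cores),
    # not the 24 cores visible on the host
    worker_processes 6;
    # bind workers to CPUs to reduce contention (nginx >= 1.9.10)
    worker_cpu_affinity auto;
    events {
      worker_connections 10240;
    }
```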

Graceful shutdown of nginx Pods is handled via preStop hooks; scaling up must account for Pod warm‑up time, and probe timeouts should be tuned to avoid premature restarts or removal from service.
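The shutdown hook and probe tuning might be sketched as follows in the Pod spec (timings and paths are illustrative):

```yaml
containers:
- name: nginx
  image: nginx:1.13
  lifecycle:
    preStop:
      exec:
        # drain in-flight requests before the container is killed
        command: ["/bin/sh", "-c", "nginx -s quit; sleep 10"]
  readinessProbe:
    httpGet:
      path: /healthz
      port: 80
    initialDelaySeconds: 15   # allow warm-up before the Pod receives traffic
    timeoutSeconds: 5         # generous timeout so slow starts are not marked failed
    failureThreshold: 3
```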

5. Monitoring, Alerting, and Logging

Prometheus is used for monitoring, deployed as DaemonSets or Deployments on k8s and scheduled to specific node types. Metrics include hard indicators (QPS, HTTP codes, resource consumption) and soft business indicators (JVM metrics, error counters).
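Scheduling the monitoring stack onto dedicated nodes can reuse the function labels mentioned earlier; a Deployment sketch (label key, image, and names are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels: {app: prometheus}
  template:
    metadata:
      labels: {app: prometheus}
    spec:
      nodeSelector:
        function: monitoring    # run only on nodes labeled for monitoring
      containers:
      - name: prometheus
        image: prom/prometheus:v2.0.0
        args: ["--config.file=/etc/prometheus/prometheus.yml"]
```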

Logs are collected into Elasticsearch, which is also deployed as Pods with attention to matching thread counts to CPU allocation. Fluentd was used initially but consumed too many resources; switching to Filebeat cut usage to under 0.1 CPU core and under 100 MB of memory while keeping log transmission reliable.
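A minimal Filebeat configuration for shipping container logs might look like this (paths and output address are illustrative; `filebeat.prospectors` is the section name from the Filebeat 5.x/6.x releases of that period, renamed to `filebeat.inputs` later):

```yaml
# filebeat.yml -- lightweight shipper, typically well under 0.1 CPU core
filebeat.prospectors:
- type: log
  paths:
    - /var/log/containers/*.log   # container stdout/stderr on the node
output.elasticsearch:
  hosts: ["es.example.com:9200"]
```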

6. Business Release, Images, and Multi‑Datacenter

A web UI generates k8s resource manifests and executes create/update/delete actions. JSON schema describes all parameters with defaults; user input produces the final manifests.

Ansible invokes kubectl for automated deployment, providing progress tracking, release history, and templated deployments. By switching k8s contexts, the same pipeline can deploy to multiple clusters across different data centers.
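The Ansible-driven release step can be sketched as a playbook that switches the kubectl context per data center (context names, variables, and paths are illustrative):

```yaml
- hosts: localhost
  vars:
    k8s_context: dc1-cluster      # change this to deploy to another data center
    app_name: jetty-app           # hypothetical application name
  tasks:
    - name: select target cluster
      command: kubectl config use-context {{ k8s_context }}
    - name: apply generated manifests
      command: kubectl apply -f manifests/{{ app_name }}.yaml
```

Because only the context variable changes, the same templated pipeline serves every cluster.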

Images are kept minimal yet functional, with glibc support for compatibility. While Docker recommends a single process per container, many services need multiple processes; S6 is used as a process manager to handle such scenarios.

Overall, this low‑cost private cloud solution leverages Kubernetes benefits to address 4/7‑layer load balancing and various non‑functional challenges, while rapidly building supporting systems such as monitoring, alerting, logging, and release pipelines, greatly improving efficiency and maintainability.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
