Meitu's Container Platform: Architecture, Network, Load Balancing, Logging, Scheduling, and Autoscaling
Meitu's container platform is built on Kubernetes with Calico networking, a custom Nginx-based load balancer, unified logging, refined scheduling, autoscaling, and comprehensive monitoring. It supports multi-cluster, hybrid-cloud operation for services with hundreds of millions of users, provides CI/CD tooling, and leaves room for future extensions such as service mesh and edge computing.
This article is a written version of a talk that shares Meitu's experience in building a container foundation platform, the problems encountered during the migration, and the concrete solutions applied.
1. Meitu Business
Meitu, founded in 2008, offers a wide range of products (MeituPic, Beauty Camera, the short-video community Meipai, and Meitu smartphones). Its user base of hundreds of millions of monthly active users places strict requirements on backend services. The company began exploring container technologies in 2016, adopted Kubernetes in 2017, and had a largely containerized architecture by 2018. The goals of containerization were to improve online support capability, continuous integration and deployment, resource utilization, and service availability.
2. Containerization Construction
2.1 Before Containerization
Services were deployed on physical machines across multiple IDC locations (Beijing, Ningbo, etc.) and partially on public clouds. Problems included low resource utilization, lack of unified automation, large gaps between test and production environments, high online failure rates, and difficulty migrating during data‑center outages.
2.2 Choosing Kubernetes
After the 2017 “container orchestration war,” Kubernetes emerged as the mature solution. Its powerful scheduling, extensibility, and community support made it the backbone of Meitu’s large‑scale platform.
2.3 Platform Construction
2.3.1 Network
The platform needed to solve five core networking problems: intra‑Pod communication, Pod‑to‑Pod communication, Pod‑to‑Service communication, Service‑to‑external communication, and cross‑cluster/network‑segment communication. Meitu evaluated several CNI plugins (Flannel, OpenContrail, Contiv, Weave, Calico, Romana) and selected Calico because of its performance (close to host), maturity, and BGP‑based extensibility.
Calico creates a sandbox network for each Pod, attaches a veth pair, and propagates routes over BGP through its BIRD component. By default it uses plain BGP routing within a subnet and IPIP encapsulation across subnets. IPIP's single-queue design caused performance bottlenecks, so Meitu disabled IPIP and optimized the network instead.
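The subnet rule above (plain routing inside a subnet, encapsulation across subnets) can be illustrated with a minimal sketch. This is not Calico's code; the function name `needs_ipip` and the fixed prefix length are assumptions made for illustration only.

```python
import ipaddress


def needs_ipip(src_node_ip: str, dst_node_ip: str, prefix_len: int = 24) -> bool:
    """Return True when two nodes sit in different subnets.

    Hypothetical helper: in a cross-subnet encapsulation mode, only this
    case would be tunneled; same-subnet traffic is routed directly.
    """
    src_net = ipaddress.ip_interface(f"{src_node_ip}/{prefix_len}").network
    dst_net = ipaddress.ip_interface(f"{dst_node_ip}/{prefix_len}").network
    return src_net != dst_net


if __name__ == "__main__":
    print(needs_ipip("10.0.1.5", "10.0.1.9"))  # same /24
    print(needs_ipip("10.0.1.5", "10.0.2.9"))  # different /24
```

With IPIP disabled, as in Meitu's setup, the cross-subnet case has to be handled by the physical network learning Pod routes, which is what the route-reflector design described below provides.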
Key network improvements:
Multi‑cluster and physical network interconnection.
Removal of IPIP and NAT to boost performance.
Rate‑limiting to protect node networks.
Multi-cluster connectivity is achieved by deploying Calico route reflectors (RR) in each data center and establishing iBGP sessions between each RR and the local gateway; OSPF synchronizes routes between data centers. For hybrid-cloud scenarios, static routes bridge the private-cloud and public-cloud clusters.
2.3.2 Load Balancing
While Kubernetes provides Service and Ingress, Meitu required a more flexible solution for complex scenarios. After evaluating Nginx and Envoy, the team chose a custom Nginx‑based controller because of existing operational expertise and third‑party extensions. The Custom Load Balancer consists of an Nginx controller that watches Kubernetes resources, updates Nginx configurations, and routes traffic directly to Service Endpoints across clusters.
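The core of such a controller is turning a list of Service Endpoints into an Nginx `upstream` block and reloading only when the rendered result changes. The sketch below shows that step in isolation; the function names and the endpoint representation are assumptions, and the watch loop against the Kubernetes API is omitted.

```python
from typing import List, Tuple

Endpoint = Tuple[str, int]  # (pod_ip, port); hypothetical representation


def render_upstream(name: str, endpoints: List[Endpoint]) -> str:
    """Render an Nginx upstream block for the given Pod endpoints.

    Endpoints are sorted so the output is stable regardless of the
    order in which they were observed.
    """
    lines = [f"upstream {name} {{"]
    for ip, port in sorted(endpoints):
        lines.append(f"    server {ip}:{port};")
    lines.append("}")
    return "\n".join(lines)


def needs_reload(current_conf: str, name: str, endpoints: List[Endpoint]) -> bool:
    """Return True only when the rendered config differs from the live one."""
    return render_upstream(name, endpoints) != current_conf


if __name__ == "__main__":
    print(render_upstream("web", [("10.0.0.2", 8080), ("10.0.0.1", 8080)]))
```

Sorting before rendering is a small design choice that avoids spurious reloads when the same endpoints arrive in a different order. A real controller would also need to handle the empty-endpoint case, which this sketch does not.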
2.3.3 Logging
Logging is essential for auditing, troubleshooting, and monitoring. Meitu adopted a cluster-level logging architecture: containers write JSON logs to stdout, Fluentd collects them and forwards them to Kafka, Logstash consumes from Kafka and writes to Elasticsearch, and Kibana provides a unified UI.
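On the application side, the first hop of this pipeline only requires emitting one JSON object per line on stdout. A minimal sketch using Python's standard `logging` module follows; the field names (`time`, `level`, `logger`, `message`) are assumptions, not Meitu's actual log schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or a given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    build_logger("demo").info("request served in %d ms", 12)
```

Keeping the format uniform at the source is what lets a single collector configuration serve every service.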
Challenges included log format inconsistency, container‑to‑host log collection, and varying reliability requirements. Meitu implemented custom adapters for PHP (pipe‑based), big‑data services (direct rootfs collection), and other workloads to standardize log ingestion.
2.3.4 Elastic Scheduling
Kubernetes scheduling follows a two‑stage process (Predicates → Priorities). Meitu refined scheduling by:
Optimizing Pod request values based on historical usage.
Incorporating real‑time node metrics.
Assigning Guaranteed QoS to critical services.
Avoiding pod memory swap.
Enhancing I/O and network isolation.
To mitigate resource fragmentation and OOM scenarios, Meitu also introduced a custom controller for rescheduling, applied the MostRequestedPriority policy, and allowed manual intervention where needed.
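The two-stage process and the most-requested idea can be sketched as a filter step followed by a scoring step. This is a simplified illustration, not the Kubernetes scheduler or Meitu's implementation: the `Node` type, the units, and the two-resource model are assumptions. The score follows the general shape of MostRequestedPriority, favoring nodes that end up more fully packed, which reduces fragmentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_capacity: int   # millicores
    cpu_requested: int  # millicores already requested by scheduled Pods
    mem_capacity: int   # MiB
    mem_requested: int  # MiB already requested by scheduled Pods


def fits(node: Node, cpu: int, mem: int) -> bool:
    """Predicate stage: the Pod's requests must fit in the remaining capacity."""
    return (node.cpu_requested + cpu <= node.cpu_capacity
            and node.mem_requested + mem <= node.mem_capacity)


def most_requested_score(node: Node, cpu: int, mem: int) -> float:
    """Priority stage: score 0..10, higher when the node would be fuller."""
    cpu_frac = (node.cpu_requested + cpu) / node.cpu_capacity
    mem_frac = (node.mem_requested + mem) / node.mem_capacity
    return 10 * (cpu_frac + mem_frac) / 2


def schedule(nodes: List[Node], cpu: int, mem: int) -> Optional[str]:
    """Filter infeasible nodes, then pick the highest-scoring one."""
    feasible = [n for n in nodes if fits(n, cpu, mem)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: most_requested_score(n, cpu, mem)).name


if __name__ == "__main__":
    cluster = [
        Node("a", 4000, 3000, 8192, 6000),
        Node("b", 4000, 500, 8192, 1000),
        Node("c", 4000, 3900, 8192, 8000),
    ]
    print(schedule(cluster, cpu=500, mem=1024))
```

Packing Pods onto already busy nodes keeps whole nodes free for large Pods, at the cost of less headroom per node, which is one reason the request values themselves need to be accurate.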
2.3.5 Autoscaling
The Horizontal Pod Autoscaler (HPA) scales workloads based on custom metrics (QPS, inbound/outbound bandwidth, queue depth). Meitu added a "slow scale-down" mechanism and a sliding window for scale-down decisions to avoid rapid oscillation during traffic spikes, especially for CPU-intensive video transcoding services.
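A minimal sketch of these two ideas follows. `desired_replicas` uses the standard HPA ratio, ceil(current × metric / target). The `SlowScaleDown` class is an assumption about how a slow, windowed scale-down could work, not Meitu's actual mechanism: it scales up immediately, never scales below the highest recommendation seen inside the window, and removes at most `max_step_down` replicas per decision.

```python
import math
from collections import deque


def desired_replicas(current: int, metric_value: float, target_value: float) -> int:
    """Standard HPA ratio: ceil(current * metric / target), at least 1."""
    return max(1, math.ceil(current * metric_value / target_value))


class SlowScaleDown:
    """Hypothetical scale-down damper using a sliding window of recommendations."""

    def __init__(self, window_seconds: float, max_step_down: int = 1):
        self.window_seconds = window_seconds
        self.max_step_down = max_step_down
        self._history = deque()  # (timestamp, recommendation)

    def decide(self, now: float, current: int, recommendation: int) -> int:
        self._history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        if recommendation >= current:
            return recommendation  # scale up (or hold) without delay
        # Never go below the highest recent recommendation, and shrink slowly.
        floor = min(current, max(r for _, r in self._history))
        return max(floor, current - self.max_step_down)


if __name__ == "__main__":
    scaler = SlowScaleDown(window_seconds=300, max_step_down=2)
    print(scaler.decide(0, 10, desired_replicas(10, 120, 100)))
    print(scaler.decide(60, 12, 4))
    print(scaler.decide(400, 12, 4))
```

The asymmetry is deliberate: adding capacity late is costly for a transcoding service, while removing it late only costs some idle resources.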
Peak shaving is handled by time-based policies that shift low-priority batch jobs to off-peak hours, improving overall cluster utilization.
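A time-based policy of this kind reduces to a check on the current time and the job's priority. The sketch below is illustrative only: the off-peak window (01:00 to 07:00) and the priority labels are assumed values, not figures from the talk.

```python
from datetime import time

# Assumed off-peak window; the real hours are not given in the source.
OFF_PEAK_START = time(1, 0)
OFF_PEAK_END = time(7, 0)


def in_off_peak(now: time, start: time = OFF_PEAK_START, end: time = OFF_PEAK_END) -> bool:
    """Return True if `now` falls in [start, end), handling windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def may_start(priority: str, now: time) -> bool:
    """High-priority jobs run any time; low-priority batch jobs wait for off-peak."""
    if priority == "high":
        return True
    return in_off_peak(now)


if __name__ == "__main__":
    print(may_start("low", time(3, 30)))
    print(may_start("low", time(20, 0)))
```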
2.3.6 Monitoring
Monitoring covers physical‑machine metrics (disk, I/O, memory, network), business‑level KPIs, container‑level resource usage, and component health. Data is aggregated across clusters, services, and Pods to produce multi‑dimensional dashboards. Example CPU monitoring charts show spikes that correspond to real incidents.
3. Business Adoption
To lower the entry barrier for services, Meitu provides a unified CI/CD portal, standardized deployment templates, troubleshooting tools, and regular training. Developers push code to GitLab, where GitLab CI builds images and pushes them to a registry; the platform then automatically creates Deployments in the target clusters.
4. Future Outlook
Meitu plans to operate a multi‑cluster, hybrid‑cloud architecture, integrate more public‑cloud providers, further optimize the scheduler, and explore Service Mesh, Serverless, and edge‑computing technologies to continuously improve the container platform.
Meitu Technology
Curating Meitu's technical expertise, valuable case studies, and innovation insights. We deliver quality technical content to foster knowledge sharing between Meitu's tech team and outstanding developers worldwide.
