
Case Study: Migrating Manbang Group’s Microservices to Kubernetes

Manbang Group converted thousands of mixed-deployed Java microservice instances into Docker containers managed by Kubernetes. The migration addressed poor resource isolation, manual fault handling, slow rollbacks, and DEV/QA release friction, while adding load balancing, service discovery, high availability, rolling upgrades, auto-scaling, rapid deployment, and per-container resource limits across a multi-cluster cloud-native architecture.

Manbang Technology Team

Since Manbang Group’s "YunManMan" platform began its microservice transformation, thousands of Java microservice instances have been running on hundreds of cloud servers, most of them deployed in a mixed fashion across shared physical and virtual machines.

The management platform, built in-house on top of open-source tools, provides basic functions such as packaging, deployment, start/stop, and version rollback through a web UI, but several problems remain:

Resource isolation between instances is poor, especially during peak load or failures, leading to CPU and memory contention on the same server.

When an application instance fails, manual intervention is required, resulting in long outage times.

After a large batch of services is upgraded, rolling back each application to a previous version is time‑consuming.

Frequent DEV/QA releases require stopping the old version before deploying the new one, affecting daily testing.

The rapid growth of the business demands higher system stability, prompting an urgent need to solve the above issues.

Initially attracted by Docker’s isolation and horizontal-scaling properties, the team decided to adopt Docker containers but still needed an orchestration system. After evaluating Kubernetes (K8s), Swarm, and Mesos against GitHub activity statistics and a feature comparison, the team selected Kubernetes.

Kubernetes was chosen because it can automatically deploy, scale, and manage containerized applications, solving core problems such as:

Load balancing – multiple identical containers are accessed through a unified Service definition.

Service discovery – combined with Kube‑DNS, services can be reached by fixed Service names without extra discovery components.

High availability – health checks automatically restart unhealthy pods.

Rolling upgrades – containers are upgraded one by one to minimize impact.

Auto‑scaling – policies add containers when resource usage is high and remove them when usage drops.

Rapid deployment – pre‑written deployment scripts enable fast environment provisioning.

Resource limits – enforce maximum resource usage per container to protect underlying services.
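Several of these capabilities come together in a single Deployment plus Service pair. The following is a minimal sketch only; the name `demo-api`, the image path, ports, and thresholds are illustrative assumptions, not Manbang's actual configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-api                    # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels: {app: demo-api}
  template:
    metadata:
      labels: {app: demo-api}
    spec:
      containers:
      - name: demo-api
        image: harbor.example.com/apps/demo-api:1.0.0
        resources:                  # per-container resource limits
          requests: {cpu: 500m, memory: 1Gi}
          limits:   {cpu: "1",  memory: 2Gi}
        livenessProbe:              # health check -> automatic restart
          httpGet: {path: /health, port: 8080}
          initialDelaySeconds: 30
        readinessProbe:             # gate traffic until the app is up
          httpGet: {path: /ready, port: 8080}
---
apiVersion: v1
kind: Service                       # stable name for load balancing
metadata:
  name: demo-api
spec:
  selector: {app: demo-api}
  ports:
  - port: 80
    targetPort: 8080
```

With kube-dns in place, other pods can reach this workload by the fixed name `demo-api` (or `demo-api.<namespace>.svc.cluster.local`), which is the service-discovery behavior described above.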

Further investigation identified the following Kubernetes subsystems and technologies to be used:

Application deployment – K8s Deployments, HPA.

Basic services – K8s DaemonSet, kube‑dns.

External service exposure – K8s Ingress, Traefik, Service.

Network plugin – Flannel.

Monitoring & alerting – Heapster, InfluxDB, Grafana, Prometheus.

Management UI – kubectl, Dashboard, custom elastic‑cloud system.

Image building – Jenkins, Maven, Docker.

Image registry – Harbor.

Log collection – Filebeat, Kafka, ELK stack.
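The Deployments + HPA combination listed above can be sketched as follows. The `autoscaling/v1` API scales on CPU only, which matches the Heapster-era metrics pipeline this article describes; the target name and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: demo-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-api                     # hypothetical Deployment name
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70   # add pods above, remove below
```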

Key migration principles include:

Online services must not be interrupted; traffic is split proportionally and migrated to K8s clusters while ensuring stability.

DEV environments can be batch‑deployed, while QA and Production require careful version dependency handling.

Initially only stateless applications are migrated.

Impact on R&D/QA is minimized.

The Docker‑based release process changed in two major ways:

Previously, WAR/JAR packages were deployed; now Docker images containing those packages are used.

Earlier deployments stopped the old process before starting the new one, causing downtime; now the new container starts first, and the old one is stopped after the new one is healthy, ensuring continuous service.
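In Kubernetes terms, this start-new-before-stopping-old behavior is the Deployment's RollingUpdate strategy. A sketch with illustrative numbers (the fragment belongs under a Deployment's `spec`):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # start one extra new pod first
      maxUnavailable: 0    # never drop below the desired replica count
```

With `maxUnavailable: 0`, an old pod is terminated only after a new pod passes its readiness probe, so the service stays continuously available.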

During migration, the system architecture was divided into internal RPC services (using the Pigeon framework) and external REST APIs, the latter further split into gateway‑connected and non‑gateway services. RPC services and gateway‑connected APIs already have their own service registries, making migration straightforward. Non‑gateway APIs use K8s Ingress for external access.

In the production K8s cluster, a unified external entry is provided by Traefik + Ingress + Nginx, routing HTTP requests based on domain and path to the appropriate service. The online architecture mirrors the offline one but adds full high‑availability using the cloud provider’s SLB.
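Routing by domain and path as described can be expressed as an Ingress resource consumed by Traefik. The host names, paths, and backend services below are assumptions for illustration (the `extensions/v1beta1` API shown was current in that era; it is `networking.k8s.io/v1` today):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: demo-ingress
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /orders              # path-based routing to a Service
        backend:
          serviceName: order-api
          servicePort: 80
      - path: /users
        backend:
          serviceName: user-api
          servicePort: 80
```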

Container initialization scripts read environment variables to determine the runtime environment (DEV, QA, Production), create appropriate symlinks, set log directories, and launch readiness probes before starting the application.
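Such an initialization script might look like the sketch below. The variable names (`RUNTIME_ENV`, `APP_NAME`) and every path are hypothetical, standing in for whatever Manbang's real scripts use:

```shell
#!/bin/sh
# Illustrative container entrypoint: pick per-environment config and
# log paths from injected environment variables before launching the app.
# RUNTIME_ENV, APP_NAME, and all paths are hypothetical.

resolve_env() {
  # Map the injected environment name to a config directory.
  case "$1" in
    DEV)  echo "/etc/app/conf-dev" ;;
    QA)   echo "/etc/app/conf-qa" ;;
    PROD) echo "/etc/app/conf-prod" ;;
    *)    echo "/etc/app/conf-dev" ;;   # default to DEV when unset
  esac
}

main() {
  CONF_DIR=$(resolve_env "${RUNTIME_ENV:-DEV}")
  LOG_DIR="/data/logs/${APP_NAME:-app}"
  mkdir -p "$LOG_DIR"
  # Symlink a stable path to the environment-specific config.
  ln -sfn "$CONF_DIR" /etc/app/conf
  echo "starting with conf=$CONF_DIR logs=$LOG_DIR"
  # The real script would now exec the application, e.g.:
  # exec java -jar /opt/app/app.jar
}
# The real entrypoint would end with: main "$@"
```

The `exec` hand-off matters: it makes the JVM PID 1 in the container, so Kubernetes signals (SIGTERM on pod deletion) reach the application directly.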

Log collection is handled in three ways: (1) SSH service inside containers accessed via a web‑based terminal; (2) Direct download of log files from the elastic‑cloud system when containers fail to start; (3) Filebeat containers on each node forward logs to a Kafka cluster, which are then stored in Elasticsearch for later retrieval.
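Path (3) can be sketched as a Filebeat configuration that ships log files to Kafka, with Elasticsearch populated downstream. The mount path, broker addresses, and topic name are assumptions:

```yaml
filebeat.inputs:
- type: log
  paths:
    - /data/logs/*/*.log          # hypothetical mounted log directory
  fields:
    cluster: k8s-prod             # illustrative metadata for routing
output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092"]
  topic: app-logs                 # consumed downstream into Elasticsearch
```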

Monitoring uses a Heapster‑InfluxDB‑Grafana stack (with future migration to Prometheus). Dashboards allow filtering by namespace, node, or application name and display CPU, memory, network, and disk usage for troubleshooting and optimization.

Harbor serves as the image registry with a master‑slave topology; images uploaded to the master are synchronized to both an offline slave and an online slave.

An “image tree” concept is employed: a base image hierarchy is built, and most application images inherit from the most similar base, reducing the need for custom Dockerfiles in GitLab. Dockerfiles are generated automatically during packaging based on variables.
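A Dockerfile generated from such an image tree might look like the following; the base image name, build arguments, and paths are illustrative, not Manbang's actual template:

```dockerfile
# Generated at packaging time from build variables; all names illustrative.
FROM harbor.example.com/base/java8-tomcat:latest   # nearest base in the image tree
ARG APP_NAME
ARG APP_VERSION
COPY target/${APP_NAME}-${APP_VERSION}.war /opt/tomcat/webapps/ROOT.war
ENV RUNTIME_ENV=DEV                                # overridden by K8s at deploy time
```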

Current status:

Containerization: DEV/QA environments are largely Dockerized; most core production applications are also Dockerized.

Self‑healing: OOM or crash events trigger automatic pod replacement (advanced health checks pending).

Elastic scaling: Critical applications have auto‑scaling enabled and perform well under peak load.

Rolling releases: Applications can be updated in batches; once a batch succeeds, the old versions are terminated.

Fast rollback: Single‑application fast rollback is supported; multi‑application transactional rollback will be enabled via K8s rollout in the future.

Experience and recommendations:

Use CentOS 7.x as the base OS for simplicity.

Migrate from classic Alibaba Cloud ECS to VPC to enable routing to container IPs.

Application-level monitoring agents running inside a container read host metrics (Memory, Load Average) from the underlying OS, not container-scoped values; account for this when interpreting their output.

Be aware of ulimit restrictions inside containers; ulimit is not namespaced, so container processes share the host's limits.

Root inside a container may not see the owners of processes created by other users, which can break legacy scripts that parse process listings.

ZooKeeper's default limit of 60 connections per client IP may be exceeded after migration, since many containers can share a node's source IP.

For high-traffic applications, pre-allocate enough containers up front, then scale down later based on monitoring data.

Perform a load test after deploying applications to gauge cluster performance.
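On the ZooKeeper point above, the relevant knob is `maxClientCnxns` in `zoo.cfg`, whose default of 60 caps concurrent connections per source IP; raising it is one mitigation (the value 300 below is illustrative):

```ini
# zoo.cfg: raise the per-IP connection cap, since many containers
# can appear to ZooKeeper under a single node IP after migration.
maxClientCnxns=300   # 0 disables the limit entirely
```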

Author: Wang Chunlin, a veteran of Tudou, Anjuke, and Manbang Group’s technology assurance department, leading the containerization project from zero to a fully operational K8s cluster.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: cloud native, microservices, Kubernetes, containerization