
How NetEase Media Scaled Its Infrastructure with Containerization and Service Mesh

NetEase Media rebuilt its infrastructure by containerizing services, dividing capacity into dedicated resource pools, adopting a ServiceMesh built on NSF, and isolating its Beta and production environments. The result was higher CPU utilization, automated scaling, and improved stability. The article also covers the lessons learned and the team's future plans.

NetEase Media Technology Team

Aug 13, 2020

1. Problems and Upgrade Goals

The media platform faced several challenges as its business grew:

Resource utilization was only about 20%, with long idle periods.

Scaling during traffic spikes was slow because resource requests went through lengthy manual approval.

Frequent data crawling from outside the platform caused traffic spikes and service instability.

Scaling up and down by hand required significant human effort.

Unclear service dependency topology made migrations risky.

To address these issues, the team set the following upgrade objectives:

Raise overall resource utilization to 50‑60% and better use idle capacity.

Eliminate cumbersome processes so that teams can self‑service resources from a pool.

Strengthen security by adding a unified gateway with circuit‑breaker, audit, and traffic‑control capabilities.

Ensure stability with fast failover, rapid scale‑up during spikes, and automatic scale‑down after traffic subsides.

Provide quick, accurate visibility of service dependencies and URL call graphs.

2. Infrastructure Evolution

The architecture progressed through four stages:

Physical‑machine stage: early deployments on bare metal.

Virtualization stage: migration to a private cloud to cut costs.

Containerization stage: migration to a container cloud, pooling resources for on‑demand usage.

Container‑cloud upgrade stage: further automation of resource management on top of Kubernetes.

3. Resource Pool Division

Based on business requirements, three primary pools were created:

Container resource pool for all stateless services.

Cloud‑host resource pool for stateful services that still rely on rsync, IP binding, etc.

Physical‑machine pool for heavyweight applications (recommendations, algorithms) that remain on bare metal.

The container pool is further split into seven logical pools:

APP – general business applications.

Redis – isolated for sensitive Redis workloads.

PUSH – high‑peak push services, isolated to protect other workloads.

Rec – recommendation services requiring dedicated resources.

Kafka – Kafka Operator services.

ES – Elasticsearch Operator services.

GW – gateway services.
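On Kubernetes, one common way to pin a workload to a logical pool is a node label matched by a `nodeSelector`. The sketch below builds such a Pod spec fragment. The label key `pool` and the image name are hypothetical; the article does not say which label key the team uses.

```python
# Hypothetical label key; the article does not name the real one.
POOL_LABEL_KEY = "pool"

# The seven logical pools listed above.
POOLS = {"APP", "Redis", "PUSH", "Rec", "Kafka", "ES", "GW"}


def pod_spec_for_pool(image: str, pool: str) -> dict:
    """Return a minimal Pod spec fragment that only schedules onto nodes of `pool`."""
    if pool not in POOLS:
        raise ValueError(f"unknown pool: {pool!r}")
    return {
        # Kubernetes only places the Pod on nodes carrying this exact label.
        "nodeSelector": {POOL_LABEL_KEY: pool},
        "containers": [{"name": "main", "image": image}],
    }


spec = pod_spec_for_pool("example/push-service:1.0", "PUSH")
print(spec["nodeSelector"])
```

Rejecting unknown pool names keeps the label set closed, so a typo cannot silently create an eighth pool.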

4. Service Calls Between Containers and VMs

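In this section, "internal" refers to services running inside the container cluster and "external" to services outside it, such as those on cloud hosts.

External‑to‑external services: use the internal NDSF framework with Consul as the service registry.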

Internal‑to‑internal services: direct ServiceName calls.

Internal‑to‑external services: NDSF + Consul.

External‑to‑internal services: all traffic passes through a unified gateway.
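The four rules above amount to a small decision table. A minimal sketch, assuming "internal" means inside the container cluster and "external" means outside it; the `Location` enum and `call_path` function are illustrative names, not part of NDSF or the gateway.

```python
from enum import Enum


class Location(Enum):
    INTERNAL = "inside the container cluster"
    EXTERNAL = "outside the container cluster"


def call_path(caller: Location, callee: Location) -> str:
    """Pick the call mechanism for a caller/callee pair, following the four rules."""
    if callee is Location.INTERNAL:
        if caller is Location.INTERNAL:
            return "direct ServiceName call"
        # Anything entering the cluster goes through the unified gateway.
        return "unified gateway"
    # The callee is external: both caller types use NDSF with Consul.
    return "NDSF + Consul"


for caller in Location:
    for callee in Location:
        print(f"{caller.name} -> {callee.name}: {call_path(caller, callee)}")
```

Written this way, the callee's location decides almost everything: only calls into the cluster depend on where the caller sits.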

5. Beta and Production Environments

A completely isolated Beta environment was created with its own VPC, container cloud, and cloud‑host pools. By default Beta cannot reach production; a whitelist security group allows selective access when needed.
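The default-deny rule with a whitelist can be sketched as a simple check. All CIDR blocks, addresses, and the port below are made up for illustration, and the real security group is configured on the cloud platform rather than in application code.

```python
import ipaddress

# Hypothetical address plan; the article does not publish real CIDRs.
BETA_NET = ipaddress.ip_network("10.10.0.0/16")
PROD_NET = ipaddress.ip_network("10.20.0.0/16")

# Whitelist entries: (Beta source network, production destination network, port).
WHITELIST = [
    (ipaddress.ip_network("10.10.5.0/24"), ipaddress.ip_network("10.20.8.15/32"), 3306),
]


def beta_to_prod_allowed(src: str, dst: str, port: int) -> bool:
    """Default deny: Beta reaches production only through an explicit whitelist entry."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    if s not in BETA_NET or d not in PROD_NET:
        raise ValueError("this check only covers Beta -> production traffic")
    return any(
        s in rule_src and d in rule_dst and port == rule_port
        for rule_src, rule_dst, rule_port in WHITELIST
    )


print(beta_to_prod_allowed("10.10.5.7", "10.20.8.15", 3306))
print(beta_to_prod_allowed("10.10.6.7", "10.20.8.15", 3306))
```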

6. IP Planning

The network is segmented into multiple CIDR blocks:

Container public subnet – internet‑accessible.

Container private subnet – internal only.

Cloud‑host public subnet.

Cloud‑host private subnet.

Gateway subnet for container gateways.

Redis, Kafka, ES subnets.

Control plane subnet for Kubernetes components.
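A segment plan like this can be carved from one parent block and checked for overlaps before anything is provisioned. The parent range `10.0.0.0/16`, the `/20` size, and the segment names are assumptions for illustration; the article does not give the actual CIDR blocks.

```python
import ipaddress
from itertools import combinations

# Segment names follow the list above; sizes and ranges are hypothetical.
SEGMENTS = [
    "container-public", "container-private",
    "cloudhost-public", "cloudhost-private",
    "gateway", "redis", "kafka", "es", "control-plane",
]

parent = ipaddress.ip_network("10.0.0.0/16")

# Hand out consecutive /20 blocks from the parent, one per segment.
plan = dict(zip(SEGMENTS, parent.subnets(new_prefix=20)))

# Sanity check: no two segments may share addresses.
assert not any(a.overlaps(b) for a, b in combinations(plan.values(), 2))

for name, net in plan.items():
    print(f"{name:18} {net}  ({net.num_addresses} addresses)")
```

Keeping each segment in its own contiguous block is what later makes subnet-level firewall and whitelist rules possible.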

7. Technical Stack

The private cloud runs NetEase Hangyan Cloud 1.0 (OpenStack‑based). The container cloud uses the Lightboat PaaS platform built on upstream Kubernetes, enhanced for production stability. Key components include:

NCS – core container service.

NSF – ServiceMesh based on Envoy and Istio, with Dubbo/Thrift support.

Mixed‑workload platform – improves CPU utilization.

Container gateway – unified entry point with security, audit, circuit‑breaker, monitoring.

Redis Operator, Kafka Operator, HPA for elastic scaling.
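For the HPA component, the Kubernetes documentation describes the core scaling rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that rule with replica bounds. The real controller also handles missing metrics, unready Pods, and stabilization windows, and the function name and default bounds here are illustrative.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale proportionally to metric / target, within bounds."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 replicas at 90% CPU with a 60% target: scale out.
print(desired_replicas(4, 90, 60, max_replicas=20))
# 10 replicas at 20% CPU with a 60% target: scale in.
print(desired_replicas(10, 20, 60))
```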

8. Current State

After roughly one year, the platform runs tens of thousands of Pods and over a thousand Services. More than 80% of former physical‑machine capacity has been migrated to the container pool.

9. ServiceMesh Adoption

Previously, the in‑house NDSF framework handled service registration (through Consul) and call routing over HTTP/2. It offered two invocation modes:

Jar‑based calls: Java services import the NDSF Jar.

Proxy‑based calls: non‑Java services run a sidecar Proxy.

Its limitations included upgrade friction, Jar conflicts, and the need for code changes.

To achieve zero‑touch migration, the team switched to a ServiceMesh approach, aiming for:

Business‑agnostic integration – no code changes.

Language‑agnostic support – Java, Go, etc.

Multi‑protocol support – Dubbo, Thrift, HTTP, gRPC.

Dynamic configuration – upgrades without downtime.

Full governance – retries, circuit‑breaker, rate‑limiting, black/white‑list.

10. ServiceMesh Implementation Details

NSF (based on Istio/Envoy) was extended to support Dubbo. The flow combines iptables redirection to Envoy, ZooKeeper‑based service discovery, and custom Galley extensions that publish Dubbo service entries to Pilot:

The business service issues a Dubbo call to the target IP.

iptables intercepts traffic on selected ports and redirects it to Envoy.

ZooKeeper (zk) remains the registration source; Galley pulls Dubbo metadata from it.

Pilot distributes xDS configuration based on the service entries.
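To illustrate the discovery step: Dubbo registers each provider in ZooKeeper as a URL‑encoded child node under `/dubbo/{interface}/providers`. The sketch below decodes one such node name into a flat record. The output shape, the function name, and the sample service are hypothetical; this is not the format NSF's Galley extension actually emits.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit


def parse_provider_znode(znode_name: str) -> dict:
    """Decode a URL-encoded Dubbo provider node name into its key fields."""
    url = urlsplit(unquote(znode_name))
    params = {key: values[0] for key, values in parse_qs(url.query).items()}
    methods = params.get("methods", "")
    return {
        "host": url.hostname,
        "port": url.port,
        # Fall back to the URL path when no interface parameter is present.
        "interface": params.get("interface") or url.path.lstrip("/"),
        "group": params.get("group", ""),
        "version": params.get("version", ""),
        "methods": methods.split(",") if methods else [],
    }


# Build a sample node name the way it would appear in ZooKeeper (fully encoded).
provider_url = (
    "dubbo://10.0.0.8:20880/com.example.NewsService"
    "?interface=com.example.NewsService&group=media&version=1.0.0&methods=getById,list"
)
entry = parse_provider_znode(quote(provider_url, safe=""))
print(entry)
```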

VirtualService definition example:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: {application-name}
  namespace: default
spec:
  hosts:
  - {application-name}.{namespace}.svc.cluster.local
  dubbo:
  - match:
    - interface:
        exact: {interfaceName}
      group: {groupname}
      version: {version}
      methods:
      - name:
          regex: A  # {methodName}
        params:
          0:
            exact_match: "test"
          1:
            range_match:
              start: -10
              end: 0
        attachments:
          key-1:
            prefix: value-1
      - name:
          regex: B  # {methodName}
        attachments:
          key-2:
            prefix: value-1
    route:
    - destination:
        host: {application-name}.{namespace}.svc.cluster.local
        port:
          number: {dubbo-port}
        subset: v1
        weight: 100
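To make the `methods` match block concrete, here is a sketch of how such a rule could be evaluated against one Dubbo invocation. This is not NSF or Envoy code. It assumes the method regex must match the whole name, that `range_match` is half‑open (start inclusive, end exclusive, as in Envoy's integer ranges), and that every listed param and attachment must match.

```python
import re


def value_matches(rule: dict, value) -> bool:
    """Evaluate one matcher: exact_match, range_match, or prefix."""
    if "exact_match" in rule:
        return value == rule["exact_match"]
    if "range_match" in rule:
        bounds = rule["range_match"]
        return isinstance(value, int) and bounds["start"] <= value < bounds["end"]
    if "prefix" in rule:
        return isinstance(value, str) and value.startswith(rule["prefix"])
    return False


def method_matches(rule: dict, method: str, args: list, attachments: dict) -> bool:
    """True when the method name, indexed params, and attachments all match."""
    if not re.fullmatch(rule["name"]["regex"], method):
        return False
    for index, param_rule in rule.get("params", {}).items():
        if index >= len(args) or not value_matches(param_rule, args[index]):
            return False
    for key, attachment_rule in rule.get("attachments", {}).items():
        if key not in attachments or not value_matches(attachment_rule, attachments[key]):
            return False
    return True


# The first method rule from the example, written as a Python dict.
rule_a = {
    "name": {"regex": "A"},
    "params": {0: {"exact_match": "test"},
               1: {"range_match": {"start": -10, "end": 0}}},
    "attachments": {"key-1": {"prefix": "value-1"}},
}

print(method_matches(rule_a, "A", ["test", -3], {"key-1": "value-1-x"}))
```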

11. Mixed Workload (Co‑location) Strategy

To improve CPU utilization, workloads are classified:

Online resources – only schedule latency‑sensitive services.

Offline resources – schedule batch jobs.

Mixed resources – allow both online and offline workloads.

Four business types run in containers:

Job – offline tasks.

Service – online services.

Colocation‑job – offline jobs allowed in mixed pool.

Colocation‑service – online services allowed in mixed pool.

Zeus, the co‑location system from the Lightboat team, enforces these policies while maximizing CPU usage.
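The classification above can be read as a placement table. The sketch below is one interpretation, not Zeus's actual policy: it assumes plain Job and Service workloads stay on their own resource type, while the Colocation variants may additionally use mixed resources.

```python
# Assumed eligibility table; the article does not spell out Zeus's exact rules.
ELIGIBLE_RESOURCES = {
    "job": {"offline"},
    "service": {"online"},
    "colocation-job": {"offline", "mixed"},
    "colocation-service": {"online", "mixed"},
}


def can_schedule(workload_type: str, resource_type: str) -> bool:
    """Return True if a workload of this type may land on this resource type."""
    if workload_type not in ELIGIBLE_RESOURCES:
        raise ValueError(f"unknown workload type: {workload_type!r}")
    return resource_type in ELIGIBLE_RESOURCES[workload_type]


for workload in ELIGIBLE_RESOURCES:
    allowed = sorted(r for r in ("online", "offline", "mixed") if can_schedule(workload, r))
    print(f"{workload:20} -> {', '.join(allowed)}")
```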

12. Pitfalls Encountered

Capacity assessment is mandatory – evaluate network, QPS, CPU, memory, disk, load balancer, and gateway bandwidth before migration.

Too many node labels – excessive labels increase management complexity; keep pool labels minimal (APP, GW, Redis, Kafka, PUSH, Rec, ES).

Service creation must be deliberate – unnecessary Service resources cause frequent Endpoint updates and iptables flushes.

Container IP granularity – avoid fine‑grained IP whitelisting for databases; use broader subnet ranges instead.
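The IP granularity pitfall follows from Pods receiving a new IP whenever they are rescheduled. A small sketch with made‑up addresses shows why a per‑IP database whitelist breaks after a restart while a subnet‑level rule keeps working.

```python
import ipaddress

# Hypothetical container subnet and whitelists.
POD_SUBNET = ipaddress.ip_network("10.0.16.0/20")
PER_IP_WHITELIST = {ipaddress.ip_address("10.0.16.23")}
SUBNET_WHITELIST = [POD_SUBNET]


def allowed_by_ip(client: str) -> bool:
    """Fine-grained rule: only individually listed addresses pass."""
    return ipaddress.ip_address(client) in PER_IP_WHITELIST


def allowed_by_subnet(client: str) -> bool:
    """Coarse rule: any address inside a whitelisted subnet passes."""
    ip = ipaddress.ip_address(client)
    return any(ip in net for net in SUBNET_WHITELIST)


old_ip, new_ip = "10.0.16.23", "10.0.19.104"  # same Pod before and after rescheduling
print(allowed_by_ip(old_ip), allowed_by_ip(new_ip))
print(allowed_by_subnet(old_ip), allowed_by_subnet(new_ip))
```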

13. Future Plans

Fine‑grained resource management to narrow the gap between requested and actual usage.

Automatic resource recommendation based on historical usage.

Enhanced elasticity using HPA and exploring VPA.

Broader ServiceMesh rollout to all business units.

Comprehensive capacity monitoring (CPU, memory, network, disk) for accurate new‑service evaluation.
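As an illustration of recommending resources from historical usage, one simple approach is to take a high percentile of past usage and add headroom. This is a generic sketch, not the team's planned method: the 95th percentile, the 15% headroom, and the sample data are all assumptions.

```python
def recommend_cpu_request(samples_millicores: list[int],
                          percentile: int = 95,
                          headroom_percent: int = 15) -> int:
    """Recommend a CPU request (millicores) from usage samples.

    Uses the nearest-rank percentile, then adds headroom and rounds up.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    if not samples_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_millicores)
    # Nearest-rank percentile: ceil(percentile / 100 * n), 1-based.
    rank = (percentile * len(ordered) + 99) // 100
    base = ordered[rank - 1]
    # Ceiling of base * (100 + headroom) / 100.
    return (base * (100 + headroom_percent) + 99) // 100


usage = list(range(100, 2100, 100))   # 20 hypothetical samples: 100m .. 2000m
requested = 4000                      # what the service currently requests
recommended = recommend_cpu_request(usage)
print(f"recommended={recommended}m, reclaimable={requested - recommended}m")
```

The gap between the current request and the recommendation is the capacity that fine‑grained resource management would aim to reclaim.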
