
Tackling Cloud‑Native Ops Challenges: Real‑World Practices from NetEase

NetEase’s cloud‑native operations team shares how they confront new challenges of Kubernetes adoption—ranging from technical stack shifts and knowledge‑base gaps to capacity planning, automated diagnostics, monitoring, alerting, and cost‑saving strategies—offering practical insights for building efficient, stable, and scalable ops systems.


1. New Ops Challenges

New Technology Stack

NetEase’s container team adopted Kubernetes early; running containers at scale introduced challenges such as selecting network and storage solutions, capacity planning, and working around bugs in early Docker and Kubernetes releases.

Business teams can call any Kubernetes API directly, leading to misuse and an additional support burden for the ops team.

The team runs Debian‑based nodes rather than the more common CentOS, so they must handle newer‑kernel issues themselves.

Recruiting talent for the new stack is costly.

Technical Inertia

Traditional ops platforms clash with Kubernetes‑based release management, creating gaps in mindset, workflow, and implementation.

Developers often resist container adoption and tend to blame containers first when issues arise.

Traditional ops methods are not ready for cloud‑native environments.

Knowledge Base

Documentation is abundant but rarely consulted; teams often bypass docs and rely on ops for troubleshooting, increasing knowledge transfer cost.

Organization and Personnel Structure

In a multi‑BU environment, the classic layered architecture (dev, test, architecture, ops) becomes tangled when containers are introduced, causing overlapping responsibilities and requiring engineers to learn Kubernetes concepts.

Capacity Management

Business teams may request unreasonable resources or experience sudden traffic spikes, while ops often overlook the resource consumption of control‑plane components, leading to capacity shortages.

Example: an API server restart caused a 20% memory surge, triggering alerts and risking cascading failures.
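The surge example above suggests a simple discipline: budget for control‑plane overhead and known surge patterns before declaring a node healthy. A minimal sketch of such a headroom check, with hypothetical function names and illustrative numbers (not NetEase's actual figures):

```python
def node_headroom_ok(capacity_gb, workload_gb, control_plane_gb,
                     surge_ratio=0.20, safety_margin=0.10):
    """Check whether a node can absorb a control-plane memory surge.

    surge_ratio: worst-case surge observed (e.g. ~20% after an API
    server restart); safety_margin: fraction of capacity kept free.
    All figures here are illustrative.
    """
    worst_case = workload_gb + control_plane_gb * (1 + surge_ratio)
    return worst_case <= capacity_gb * (1 - safety_margin)

# A 64 GB node with 40 GB of workload and 12 GB of control-plane
# components: 40 + 12 * 1.2 = 54.4 GB fits under the 57.6 GB ceiling.
print(node_headroom_ok(64, 40, 12))  # → True
```

The point is that the control‑plane term is part of the capacity model, rather than being discovered by an alert after a restart.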

2. Improving Ops Efficiency

Clusters are centrally managed with unified authentication and RBAC. Common troubleshooting steps are automated, and monitoring data is stored in an internal TSDB for further analysis.

Automation is built around CRDs: an Operation represents an atomic task, an OperationSet composes Operations into a pipeline, and a Diagnosis captures the context of a single troubleshooting run.

Triggers include manual requests, chat‑ops bots, and alert‑driven events that collect dumps, upload them, and feed them into analysis pipelines.

Operators can encode legacy scripts as CRDs, making them reusable and version‑controlled.
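The Operation/OperationSet/Diagnosis model can be sketched in plain Python to show how the pieces compose. The CRD names come from the article; everything else below (field names, the example operations) is illustrative — the real objects are Kubernetes custom resources reconciled by a controller:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Operation:
    """An atomic diagnostic task, e.g. 'collect a goroutine dump'."""
    name: str
    run: Callable[[Dict], Dict]  # takes the context, returns new facts

@dataclass
class OperationSet:
    """An ordered pipeline composed of Operations."""
    name: str
    operations: List[Operation]

@dataclass
class Diagnosis:
    """Captures the context of one troubleshooting run."""
    trigger: str                       # manual, chat-ops, or alert
    context: Dict = field(default_factory=dict)

    def execute(self, opset: OperationSet) -> Dict:
        # Each operation sees what earlier operations collected.
        for op in opset.operations:
            self.context.update(op.run(self.context))
        return self.context

# Example: an alert-driven diagnosis that gathers, then analyzes, a dump.
collect = Operation("collect-dump", lambda ctx: {"dump": "goroutines..."})
analyze = Operation("analyze-dump",
                    lambda ctx: {"verdict": "leak" if "dump" in ctx else "n/a"})
pipeline = OperationSet("oom-triage", [collect, analyze])
result = Diagnosis(trigger="alert").execute(pipeline)
print(result["verdict"])  # → leak
```

Wrapping a legacy shell script as an Operation in this model is just another `run` callable, which is what makes the old scripts reusable and version‑controlled.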

3. Monitoring and Alerting

Beyond tracing and logging, the focus is on fine‑grained metric collection using eBPF to attribute issues to infrastructure or applications.

Metrics include memory cgroup reclamation, CPU scheduling latency, VFS delays, and network‑level observations via uprobes.
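The real collection path uses eBPF probes in the kernel; as a minimal stand‑in for one of the listed signals, the sketch below parses cgroup v2 `memory.stat` text and derives a reclaim‑efficiency ratio — the kind of per‑cgroup number that lets an issue be attributed to infrastructure memory pressure rather than the application. The function names and sample data are hypothetical:

```python
def parse_memory_stat(text: str) -> dict:
    """Parse cgroup v2 memory.stat 'key value' lines into ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats

def reclaim_efficiency(stats: dict) -> float:
    """pgsteal / pgscan: a low ratio means the kernel scans many
    pages to reclaim few — a sign of memory pressure in this cgroup."""
    scanned = stats.get("pgscan", 0)
    return stats.get("pgsteal", 0) / scanned if scanned else 1.0

# Illustrative memory.stat excerpt, not real production output:
sample = "pgscan 1000\npgsteal 200\npgfault 50000"
print(reclaim_efficiency(parse_memory_stat(sample)))  # → 0.2
```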

Traditional threshold alerts are being replaced by statistical and lightweight machine‑learning models, with manual feedback loops for tuning and correlation‑based suppression.
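The shift from fixed thresholds to statistical models can be illustrated with a rolling z‑score detector: alert only when the latest sample deviates strongly from the recent baseline. This is a sketch of the idea, not NetEase's model — production systems layer on seasonality handling, feedback‑driven tuning, and correlation‑based suppression:

```python
from collections import deque
from statistics import mean, stdev

class ZScoreAlert:
    """Fire when a sample sits far outside the rolling baseline,
    instead of comparing against a hand-picked fixed threshold."""
    def __init__(self, window=30, z_threshold=3.0, min_samples=10):
        self.samples = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        fired = False
        if len(self.samples) >= self.min_samples:
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                fired = True
        self.samples.append(value)
        return fired

detector = ZScoreAlert()
for v in [50, 52, 49, 51, 50, 53, 48, 50, 51, 49, 52]:
    detector.observe(v)        # baseline hovers around 50, no alerts
print(detector.observe(90))    # → True: far outside the recent baseline
```

A manual feedback loop in this scheme amounts to adjusting `z_threshold` (or the window) per metric when operators mark alerts as noise.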

4. Cost Savings

Resource pooling (Kubeminer) merges isolated Kubernetes clusters into a shared pool, allowing consumers to schedule pods onto virtual nodes backed by other BUs’ clusters.
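The pooling idea can be sketched as first‑fit placement across virtual nodes, each representing another cluster's spare capacity. Kubeminer's actual mechanics aren't detailed here, so the types and cluster names below are hypothetical — the real scheduler operates inside Kubernetes:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VirtualNode:
    """A virtual node exposing another BU cluster's spare capacity."""
    cluster: str
    free_cpu: float  # cores

def schedule(pod_cpu: float, pool: List[VirtualNode]) -> Optional[str]:
    """First-fit placement across the shared pool; returns the
    backing cluster, or None if no cluster has room."""
    for node in pool:
        if node.free_cpu >= pod_cpu:
            node.free_cpu -= pod_cpu
            return node.cluster
    return None

pool = [VirtualNode("bu-a", 2.0), VirtualNode("bu-b", 8.0)]
print(schedule(4.0, pool))  # → bu-b (bu-a lacks capacity)
print(schedule(1.5, pool))  # → bu-a
```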

Hybrid deployment combines straightforward scheduling driven by real‑time utilization data with isolation techniques (CPU shares, hyper‑threading, L3 cache, and page‑cache management) to protect online services.

Current results show average CPU utilization around 55% and elastic capacity for video transcoding workloads without consuming dedicated resources.
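The elastic‑capacity claim follows from a simple admission rule: lend batch jobs only the cores that keep total utilization at the target. A sketch with the article's 55% figure plugged in as the target; the function name and node sizes are illustrative:

```python
def batch_quota(node_cores: float, online_usage: float,
                target_util: float = 0.55) -> float:
    """Cores that can be lent to batch jobs (e.g. video transcoding)
    while holding total utilization at the target. Illustrative of
    the colocation idea, not the production policy."""
    return max(0.0, node_cores * target_util - online_usage)

# 32-core node, online services using 12 cores, 55% target:
print(round(batch_quota(32, 12), 2))  # → 5.6 cores free for batch
```

When online usage rises past the target, the quota goes to zero, which is what lets transcoding run "for free" without dedicated resources.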

Tags: Monitoring, Cloud Native, Automation, Operations, Kubernetes, Cost Optimization
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
