Tackling Cloud‑Native Ops Challenges: Real‑World Practices from NetEase
NetEase’s cloud‑native operations team shares how they confront new challenges of Kubernetes adoption—ranging from technical stack shifts and knowledge‑base gaps to capacity planning, automated diagnostics, monitoring, alerting, and cost‑saving strategies—offering practical insights for building efficient, stable, and scalable ops systems.
1. New Ops Challenges
New Technology Stack
NetEase’s container team started using Kubernetes early; large‑scale container adoption introduced challenges such as selecting network/storage solutions, capacity planning, and handling bugs in early Docker/Kubernetes versions.
Business teams can call any Kubernetes API, leading to misuse and additional support burden for ops.
The team runs Debian‑based nodes, which differ from the more common CentOS, requiring them to handle newer kernel issues themselves.
Recruiting talent for the new stack is costly.
Technical Inertia
Traditional ops platforms clash with Kubernetes‑based release management, creating gaps in mindset, workflow, and implementation.
Developers often resist container adoption, blaming containers for issues.
Traditional ops methods are not yet ready for cloud‑native environments.
Knowledge Base
Documentation is abundant but rarely consulted; teams often bypass docs and rely on ops for troubleshooting, increasing knowledge transfer cost.
Organization and Personnel Structure
In a multi‑BU environment, the classic layered architecture (dev, test, architecture, ops) becomes tangled when containers are introduced, causing overlapping responsibilities and requiring engineers to learn Kubernetes concepts.
Capacity Management
Business teams may request unreasonable resources or experience sudden traffic spikes, while ops often overlook the resource consumption of control‑plane components, leading to capacity shortages.
Example: an API‑server restart once caused a 20% memory surge, triggering alerts and risking cascading failures.
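As a rough sketch of this kind of capacity planning, the check below reserves room for a control‑plane memory surge on top of workload demand. The function name, the 15% headroom figure, and the sample numbers are illustrative assumptions, not NetEase's actual policy; only the ~20% surge factor comes from the incident described above.

```python
def required_capacity(workload_mem_gb, control_plane_mem_gb,
                      surge_factor=1.2, headroom=0.15):
    """Estimate node-pool memory needed, budgeting for a control-plane
    restart surge (e.g. ~20% apiserver memory growth) plus general headroom."""
    surge = control_plane_mem_gb * surge_factor
    return (workload_mem_gb + surge) * (1 + headroom)

# Hypothetical numbers: 800 GiB of workloads, 100 GiB steady-state control plane.
need = required_capacity(800, 100)
```

The point of the sketch is that the control plane is a first‑class consumer in the capacity model, rather than an afterthought folded into "system overhead".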
2. Improving Ops Efficiency
Clusters are centrally managed with unified authentication and RBAC. Common troubleshooting steps are automated, and monitoring data is stored in an internal TSDB for further analysis.
Automation is built around CRDs: an Operation represents an atomic task, an OperationSet composes operations into a pipeline, and a Diagnosis captures the execution context.
Triggers include manual requests, chat‑ops bots, and alert‑driven events that collect dumps, upload them, and feed them into analysis pipelines.
Operators can encode legacy scripts as CRDs, making them reusable and version‑controlled.
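A minimal Python sketch of how these three roles might fit together. The class and field names mirror the CRD names above, but the execution logic is an illustrative assumption, not the actual controller implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Operation:
    """Atomic task, e.g. 'collect a dump' or 'check node pressure'."""
    name: str
    run: Callable[[dict], dict]  # reads context, returns new findings

@dataclass
class OperationSet:
    """Ordered pipeline composed of Operations."""
    operations: List[Operation]

@dataclass
class Diagnosis:
    """Captures the context gathered while a pipeline executes."""
    trigger: str  # manual, chat-ops, or alert-driven
    context: Dict[str, object] = field(default_factory=dict)

def execute(pipeline: OperationSet, diag: Diagnosis) -> Diagnosis:
    # Each operation enriches the shared diagnosis context in turn.
    for op in pipeline.operations:
        diag.context.update(op.run(diag.context))
    return diag

# Example: a legacy shell check wrapped as a reusable Operation.
check_disk = Operation("check-disk", lambda ctx: {"disk_ok": True})
diag = execute(OperationSet([check_disk]), Diagnosis(trigger="alert"))
```

Modeling legacy scripts as Operations is what makes them composable: once wrapped, the same check can appear in many OperationSets and every run leaves an auditable Diagnosis behind.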
3. Monitoring and Alerting
Beyond tracing and logging, the focus is on fine‑grained metric collection using eBPF to attribute issues to infrastructure or applications.
Metrics include memory cgroup reclamation, CPU scheduling latency, VFS delays, and network‑level observations via uprobe.
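For illustration, reclamation‑related counters like these can be read from a cgroup's memory.stat file. The sketch below parses a hypothetical snapshot; the counter names (pgscan, pgsteal, workingset_refault) are real cgroup v2 fields, but the values and the efficiency heuristic are invented for the example:

```python
# Hypothetical excerpt of a cgroup v2 memory.stat file (values are made up).
SAMPLE = """\
pgscan 10500
pgsteal 9800
workingset_refault 340
"""

def parse_memory_stat(text):
    """Parse 'key value' lines from memory.stat into a dict of counters."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

def reclaim_efficiency(stats):
    """pgsteal / pgscan: a low ratio suggests the kernel is scanning hard
    but reclaiming little, one sign of memory pressure in the cgroup."""
    return stats["pgsteal"] / stats["pgscan"]
```

In production these counters would be sampled per cgroup and shipped to the TSDB, where their trends, not single readings, attribute pressure to infrastructure or to a specific application.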
Traditional threshold alerts are being replaced by statistical and lightweight machine‑learning models, with manual feedback loops for tuning and correlation‑based suppression.
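A minimal example of the statistical approach, assuming a simple rolling mean and standard‑deviation baseline rather than NetEase's actual models:

```python
import statistics

def anomalies(series, window=20, k=3.0):
    """Flag points more than k standard deviations from a rolling baseline.
    A sketch of replacing fixed thresholds with a learned-from-data band."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.fmean(hist)
        sigma = statistics.pstdev(hist)
        if sigma and abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged
```

Unlike a fixed threshold, the band adapts to each metric's normal variance, and the manual feedback loop mentioned above would tune window and k per signal.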
4. Cost Savings
Resource pooling (Kubeminer) merges isolated Kubernetes clusters into a shared pool, allowing consumers to schedule pods on virtual nodes backed by other BUs’ clusters.
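A hypothetical sketch of the pooling decision: given the spare capacity of each BU's cluster, pick a provider to back the virtual node. The function and the selection rule (most spare CPU wins) are illustrative assumptions, not Kubeminer's actual scheduler:

```python
def pick_provider(clusters, request_cores):
    """Choose the provider cluster with the most spare CPU to back a
    virtual node; return None if no cluster can fit the request."""
    candidates = [(name, free) for name, free in clusters.items()
                  if free >= request_cores]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]

# Hypothetical spare-CPU snapshot (cores) per BU cluster.
provider = pick_provider({"music": 40, "news": 10, "games": 25}, 20)
```

The consumer cluster only ever sees a virtual node; which BU actually supplies the cores is an implementation detail that can change between scheduling cycles.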
Hybrid deployment combines simple real‑time data‑driven scheduling with isolation techniques (CPU share, hyper‑threading, L3 cache, page‑cache management) to protect online services.
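The "simple real‑time data‑driven scheduling" could be as plain as an admission check that keeps a CPU reserve for the online service before colocating batch work. This is a sketch under assumed numbers; the 25% reserve and the function name are illustrative:

```python
def can_colocate(node_cpu_cores, online_usage_cores, batch_request_cores,
                 online_reserve=0.25):
    """Admit a batch pod only if the node still keeps a fixed share of
    idle CPU in reserve for the latency-sensitive online service."""
    reserve = node_cpu_cores * online_reserve
    free = node_cpu_cores - online_usage_cores - reserve
    return batch_request_cores <= free
```

The isolation techniques listed above (CPU shares, hyper‑threading placement, L3 cache and page‑cache management) then bound the damage if the real‑time estimate turns out to be wrong.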
Current results show average CPU utilization around 55% and elastic capacity for video transcoding workloads without consuming dedicated resources.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your ops career.