Tackling Cloud‑Native Ops Challenges: Real‑World Practices from NetEase
NetEase’s cloud‑native operations team shares how they confront new challenges of Kubernetes adoption—ranging from technical stack shifts and knowledge‑base gaps to capacity planning, automated diagnostics, monitoring, alerting, and cost‑saving strategies—offering practical insights for building efficient, stable, and scalable ops systems.
1. New Ops Challenges
New Technology Stack
NetEase’s container team started using Kubernetes early; large‑scale container adoption introduced challenges such as selecting network/storage solutions, capacity planning, and handling bugs in early Docker/Kubernetes versions.
Business teams can call any Kubernetes API, leading to misuse and additional support burden for ops.
The team runs Debian‑based nodes, which differ from the more common CentOS, requiring them to handle newer kernel issues themselves.
Recruiting talent for the new stack is costly.
Technical Inertia
Traditional ops platforms clash with Kubernetes‑based release management, creating gaps in mindset, workflow, and implementation.
Developers often resist container adoption, blaming containers for issues.
Traditional ops methods are not yet ready for cloud‑native environments.
Knowledge Base
Documentation is abundant but rarely consulted; teams often bypass docs and rely on ops for troubleshooting, increasing knowledge transfer cost.
Organization and Personnel Structure
In a multi‑BU environment, the classic layered architecture (dev, test, architecture, ops) becomes tangled when containers are introduced, causing overlapping responsibilities and requiring engineers to learn Kubernetes concepts.
Capacity Management
Business teams may request unreasonable resources or experience sudden traffic spikes, while ops often overlook the resource consumption of control‑plane components, leading to capacity shortages.
Example: an API‑server restart once caused a 20% memory surge, triggering alerts and risking cascading failures.
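As a rough sketch of this kind of capacity planning, the check below reserves room for a control‑plane memory surge on top of workload demand. The function name, the 15% headroom figure, and the sample numbers are illustrative assumptions, not NetEase's actual policy; only the ~20% surge factor comes from the incident described above.

```python
def required_capacity(workload_mem_gb, control_plane_mem_gb,
                      surge_factor=1.2, headroom=0.15):
    """Estimate node-pool memory needed, budgeting for a control-plane
    restart surge (e.g. ~20% apiserver memory growth) plus general headroom."""
    surge = control_plane_mem_gb * surge_factor
    return (workload_mem_gb + surge) * (1 + headroom)

# Hypothetical numbers: 800 GiB of workloads, 100 GiB steady-state control plane.
need = required_capacity(800, 100)
```

The point of the sketch is that the control plane is a first‑class consumer in the capacity model, rather than an afterthought folded into "system overhead".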
2. Improving Ops Efficiency
Clusters are centrally managed with unified authentication and RBAC. Common troubleshooting steps are automated, and monitoring data is stored in an internal TSDB for further analysis.
Automation is built around CRDs: an Operation represents an atomic task, an OperationSet composes operations into a pipeline, and a Diagnosis captures the execution context.
Triggers include manual requests, chat‑ops bots, and alert‑driven events that collect dumps, upload them, and feed them into analysis pipelines.
Operators can encode legacy scripts as CRDs, making them reusable and version‑controlled.
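A minimal Python sketch of how these three roles might fit together. The class and field names mirror the CRD names above, but the execution logic is an illustrative assumption, not the actual controller implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Operation:
    """Atomic task, e.g. 'collect a dump' or 'check node pressure'."""
    name: str
    run: Callable[[dict], dict]  # reads context, returns new findings

@dataclass
class OperationSet:
    """Ordered pipeline composed of Operations."""
    operations: List[Operation]

@dataclass
class Diagnosis:
    """Captures the context gathered while a pipeline executes."""
    trigger: str  # manual, chat-ops, or alert-driven
    context: Dict[str, object] = field(default_factory=dict)

def execute(pipeline: OperationSet, diag: Diagnosis) -> Diagnosis:
    # Each operation enriches the shared diagnosis context in turn.
    for op in pipeline.operations:
        diag.context.update(op.run(diag.context))
    return diag

# Example: a legacy shell check wrapped as a reusable Operation.
check_disk = Operation("check-disk", lambda ctx: {"disk_ok": True})
diag = execute(OperationSet([check_disk]), Diagnosis(trigger="alert"))
```

Modeling legacy scripts as Operations is what makes them composable: once wrapped, the same check can appear in many OperationSets and every run leaves an auditable Diagnosis behind.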
3. Monitoring and Alerting
Beyond tracing and logging, the focus is on fine‑grained metric collection using eBPF to attribute issues to infrastructure or applications.
Metrics include memory cgroup reclamation, CPU scheduling latency, VFS delays, and network‑level observations via uprobe.
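For illustration, reclamation‑related counters like these can be read from a cgroup's memory.stat file. The sketch below parses a hypothetical snapshot; the counter names (pgscan, pgsteal, workingset_refault) are real cgroup v2 fields, but the values and the efficiency heuristic are invented for the example:

```python
# Hypothetical excerpt of a cgroup v2 memory.stat file (values are made up).
SAMPLE = """\
pgscan 10500
pgsteal 9800
workingset_refault 340
"""

def parse_memory_stat(text):
    """Parse 'key value' lines from memory.stat into a dict of counters."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

def reclaim_efficiency(stats):
    """pgsteal / pgscan: a low ratio suggests the kernel is scanning hard
    but reclaiming little, one sign of memory pressure in the cgroup."""
    return stats["pgsteal"] / stats["pgscan"]
```

In production these counters would be sampled per cgroup and shipped to the TSDB, where their trends, not single readings, attribute pressure to infrastructure or to a specific application.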
Traditional threshold alerts are being replaced by statistical and lightweight machine‑learning models, with manual feedback loops for tuning and correlation‑based suppression.
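A minimal example of the statistical approach, assuming a simple rolling mean and standard‑deviation baseline rather than NetEase's actual models:

```python
import statistics

def anomalies(series, window=20, k=3.0):
    """Flag points more than k standard deviations from a rolling baseline.
    A sketch of replacing fixed thresholds with a learned-from-data band."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.fmean(hist)
        sigma = statistics.pstdev(hist)
        if sigma and abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged
```

Unlike a fixed threshold, the band adapts to each metric's normal variance, and the manual feedback loop mentioned above would tune window and k per signal.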
4. Cost Savings
Resource pooling (Kubeminer) merges isolated Kubernetes clusters into a shared pool, allowing consumers to schedule pods on virtual nodes backed by other BUs’ clusters.
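A hypothetical sketch of the pooling decision: given the spare capacity of each BU's cluster, pick a provider to back the virtual node. The function and the selection rule (most spare CPU wins) are illustrative assumptions, not Kubeminer's actual scheduler:

```python
def pick_provider(clusters, request_cores):
    """Choose the provider cluster with the most spare CPU to back a
    virtual node; return None if no cluster can fit the request."""
    candidates = [(name, free) for name, free in clusters.items()
                  if free >= request_cores]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]

# Hypothetical spare-CPU snapshot (cores) per BU cluster.
provider = pick_provider({"music": 40, "news": 10, "games": 25}, 20)
```

The consumer cluster only ever sees a virtual node; which BU actually supplies the cores is an implementation detail that can change between scheduling cycles.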
Hybrid deployment combines simple real‑time data‑driven scheduling with isolation techniques (CPU share, hyper‑threading, L3 cache, page‑cache management) to protect online services.
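The "simple real‑time data‑driven scheduling" could be as plain as an admission check that keeps a CPU reserve for the online service before colocating batch work. This is a sketch under assumed numbers; the 25% reserve and the function name are illustrative:

```python
def can_colocate(node_cpu_cores, online_usage_cores, batch_request_cores,
                 online_reserve=0.25):
    """Admit a batch pod only if the node still keeps a fixed share of
    idle CPU in reserve for the latency-sensitive online service."""
    reserve = node_cpu_cores * online_reserve
    free = node_cpu_cores - online_usage_cores - reserve
    return batch_request_cores <= free
```

The isolation techniques listed above (CPU shares, hyper‑threading placement, L3 cache and page‑cache management) then bound the damage if the real‑time estimate turns out to be wrong.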
Current results show average CPU utilization around 55% and elastic capacity for video transcoding workloads without consuming dedicated resources.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your ops career.