How NetEase Yanxuan Scaled with Service Mesh: Evolution, Challenges, and Cloud‑Native Solutions
This article details Yanxuan's journey from early exploration to full rollout of Service Mesh, covering architectural choices, hybrid‑cloud deployment, performance comparisons between cNginx and Envoy, ongoing evolution plans, and lessons learned for large‑scale microservice systems.
Background
Yanxuan selected Service Mesh in 2016 as the foundation for its micro‑service transformation and has since supported rapid business growth. The presentation outlines the adoption, evolution, hybrid‑cloud challenges, and the solutions implemented.
Phase 1: Exploration (late 2015 ~ Apr 2016)
Yanxuan began with a small team (~10 people) using a monolithic architecture and a few basic services (push, file storage, message center). The internal mail division employed a mixed SOA approach with both a centralized ESB and decentralized Spring Cloud, exposing typical service‑governance problems. The exploration centered on three questions:
Service governance: RPC framework vs. dedicated governance platform.
Multi‑language support: Java core services coexist with Python recommendation, C++ access, and Node.js applications, raising cross‑language governance costs.
Open‑source vs. self‑built: Should the foundation be built from scratch or extended from mature open‑source projects, and what value does community contribution bring?
Phase 2: Small‑scale trial (Apr 2016 ~ early 2017)
In July 2016 the first‑generation Service Mesh was released and piloted in NetEase Mail, NetEase YouQian, and parts of Yanxuan, delivering solid operational experience and a basic control platform.
Phase 3: Full rollout (2017 ~ present)
Team size grew from 10 to over 200. From early 2017 the first‑generation mesh was fully deployed; in 2019, with the maturation of the container cloud platform “Light Boat,” Yanxuan launched a cloud‑native strategy and began upgrading the mesh architecture.
First‑generation Service Mesh Architecture
The first‑generation mesh was built on Consul (service discovery, registration, routing) and Nginx (high‑performance reverse proxy, load balancing, rate limiting). Consul and Nginx were fused into a local proxy called cNginx, and a management platform exposed the capabilities.
Data plane: cNginx + Consul client form a sidecar using the client‑sidecar model.
Control plane: Provides service registration/discovery, call control, and governance control.
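The client‑sidecar model can be sketched in a few lines. This is a minimal illustration, not cNginx itself: the Registry class is an in‑memory stand‑in for Consul, and LocalProxy, Instance, the service name "order", and the addresses are all hypothetical. It shows a local proxy resolving a service name to its healthy instances and picking one round‑robin.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Instance:
    host: str
    port: int
    healthy: bool = True


class Registry:
    """In-memory stand-in for the service registry (Consul in the article)."""

    def __init__(self):
        self._services = {}

    def register(self, name, instance):
        self._services.setdefault(name, []).append(instance)

    def healthy_instances(self, name):
        return [i for i in self._services.get(name, []) if i.healthy]


class LocalProxy:
    """Hypothetical sidecar: resolves a name and picks an instance round-robin."""

    def __init__(self, registry):
        self._registry = registry
        self._counters = {}

    def pick(self, name):
        instances = self._registry.healthy_instances(name)
        if not instances:
            raise LookupError(f"no healthy instance for {name}")
        counter = self._counters.setdefault(name, itertools.count())
        return instances[next(counter) % len(instances)]


registry = Registry()
registry.register("order", Instance("10.0.0.1", 8080))
registry.register("order", Instance("10.0.0.2", 8080))
registry.register("order", Instance("10.0.0.3", 8080, healthy=False))

proxy = LocalProxy(registry)
picks = [proxy.pick("order").host for _ in range(4)]
print(picks)  # alternates between the two healthy instances
```

Because the application only talks to the local proxy, discovery and load‑balancing logic stays out of business code, which is the point of the sidecar model described above.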
Service Governance Capabilities
The mesh provides registration/discovery, health checks, routing, load balancing, failover, client‑side rate limiting, timeouts, and retries. Through middleware it also integrates access control, resource isolation, monitoring, and fault diagnosis.
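As a rough illustration of how retries and failover combine on the client side, the sketch below tries one instance after another until a call succeeds or an attempt budget is exhausted. The function names, the UpstreamError type, and the addresses are assumptions made for the example; the article does not describe how cNginx implements this.

```python
class UpstreamError(Exception):
    """Raised when a call to one upstream instance fails."""


def call_with_failover(instances, send, max_attempts=3):
    """Try instances in order, moving to the next one on failure."""
    last_error = None
    for attempt, instance in enumerate(instances):
        if attempt >= max_attempts:
            break
        try:
            return send(instance)
        except UpstreamError as exc:
            last_error = exc  # remember the failure and fail over
    raise UpstreamError(f"all attempts failed: {last_error}")


def fake_send(instance):
    # Hypothetical transport: the first instance is down.
    if instance == "10.0.0.1:8080":
        raise UpstreamError("connection refused")
    return f"200 OK from {instance}"


result = call_with_failover(["10.0.0.1:8080", "10.0.0.2:8080"], fake_send)
print(result)
```

A real proxy would also bound each attempt with a timeout and restrict retries to idempotent requests.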
Benefits of Service Mesh for Yanxuan
Eliminates legacy technical debt by adding governance without code changes.
Reduces middleware development and evolution costs; decouples business from middleware.
Allows independent evolution of infrastructure and business architectures.
Provides unified governance for multi‑language stacks, enabling non‑Java services to benefit from the same capabilities.
Continuous Evolution Needs
Future enhancements include richer traffic management (traffic‑splitting, coloring), additional governance features (rate‑limiting, circuit‑breaking, fault injection), broader protocol support, and stronger control‑plane capabilities, as well as full cloud‑native and multi‑cloud support.
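Traffic coloring and traffic splitting can be illustrated with a small routing function. This is a sketch under assumptions: the header name x-traffic-color, the version labels "stable" and "canary", and the hashing scheme are invented for the example and are not taken from Yanxuan's implementation.

```python
import hashlib


def bucket(key, buckets=100):
    """Map a key to a stable bucket in [0, buckets) using a hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def route(headers, user_id, canary_percent):
    """Return the version label that should serve this request."""
    # Colored traffic: an explicit header pins the request to a version.
    color = headers.get("x-traffic-color")
    if color:
        return color
    # Traffic splitting: a fixed share of users goes to the canary.
    return "canary" if bucket(user_id) < canary_percent else "stable"


print(route({"x-traffic-color": "canary"}, "user-1", 0))
print(route({}, "user-1", 0))
```

Hashing the user ID rather than drawing a random number keeps a given user on the same version across requests, which makes gray releases easier to reason about.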
Industry‑wide Service Mesh Evolution
The term Service Mesh was first used publicly in September 2016 by the CEO of Buoyant, the company behind Linkerd, and Linkerd was later contributed to the CNCF. Subsequent projects include Lyft’s Envoy and Nginx’s nginMesh. Istio, introduced later, added powerful control‑plane features and quickly became the de facto standard.
Istio‑based Cloud‑Native Service Mesh
Istio integrates tightly with Kubernetes, supplementing its native traffic management with advanced features such as rate limiting, circuit breaking, and fault injection. Its data plane runs as an Envoy sidecar injected into each pod, making deployment transparent to applications.
Data plane: Envoy supports HTTP/1.1, HTTP/2, and gRPC, and enforces policies from the control plane.
Control plane components: Pilot (service discovery & config distribution), Mixer (access control & telemetry), Citadel (certificate management), Galley (configuration validation).
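Circuit breaking, one of the governance features named above, can be illustrated with the generic consecutive‑failure pattern. Note that this is the textbook pattern and not Envoy's implementation, which works with connection and request thresholds plus outlier detection; the class name, thresholds, and timestamps here are assumptions for the example.

```python
class CircuitBreaker:
    """Generic breaker: opens after N consecutive failures, probes after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=5.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # After the cool-down, let a probe request through (half-open).
        return now - self.opened_at >= self.reset_after

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now


breaker = CircuitBreaker(failure_threshold=3, reset_after=5.0)
for t in (0.0, 1.0, 2.0):
    breaker.record(success=False, now=t)

blocked = breaker.allow(now=3.0)   # still inside the cool-down window
probe = breaker.allow(now=7.0)     # cool-down elapsed, probe allowed
breaker.record(success=True, now=7.0)
closed = breaker.allow(now=7.1)    # a successful probe closes the circuit
print(blocked, probe, closed)
```

Passing the clock in as an argument keeps the sketch deterministic; a production breaker would read a monotonic clock itself.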
Performance Comparison: cNginx vs. Envoy
Initial tests on an 8‑core, 16 GB host at a concurrency of 40 and 1600 RPS showed that, compared with direct calls, cNginx added about 0.4 ms of latency and unoptimized Envoy about 0.6 ms. After optimization (SR‑IOV container networking and back‑porting Envoy’s connection load balancer), Envoy added only 0.2 to 0.6 ms at low concurrency and matched VM‑based direct calls at high concurrency, an overhead the team considered acceptable.
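To put the added latency in proportion, Little's Law gives a back‑of‑the‑envelope estimate of the mean request latency. This rests on an assumption the article does not state: that the load generator was closed‑loop, keeping 40 requests in flight while sustaining 1600 RPS. If the test was set up differently, the percentages below do not apply.

```python
# Assumption: closed-loop load test, 40 requests in flight, 1600 requests/s.
concurrency = 40
throughput_rps = 1600

# Little's Law: in-flight requests = throughput * mean latency.
mean_latency_ms = 1000 * concurrency / throughput_rps


def overhead_percent(added_ms, baseline_ms):
    """Express added latency as a percentage of the baseline latency."""
    return added_ms / baseline_ms * 100


cnginx_pct = overhead_percent(0.4, mean_latency_ms)
envoy_pct = overhead_percent(0.6, mean_latency_ms)
print(f"mean latency ~{mean_latency_ms:.0f} ms")
print(f"cNginx ~{cnginx_pct:.1f}%, unoptimized Envoy ~{envoy_pct:.1f}%")
```

Under that assumption, both proxies add only a few percent to the end‑to‑end latency of a typical request.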
Current Evolution Direction
Yanxuan is moving to an Istio + Envoy solution with:
Envoy as the data‑plane proxy.
Pilot as the core control‑plane component.
Platform extensions via Kubernetes CRDs and Mesh Configuration Protocol (MCP).
High‑availability design based on Kubernetes and Istio mechanisms.
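As an illustration of what driving the mesh through Kubernetes CRDs can look like, the sketch below builds an Istio VirtualService manifest that splits HTTP traffic between two subsets by weight. The apiVersion, kind, and field names follow Istio's networking API; the resource name "order-route", the host "order", and the subsets "v1" and "v2" are hypothetical, and the article does not show Yanxuan's actual resources.

```python
import json


def virtual_service(name, host, weights):
    """Build a VirtualService manifest that splits HTTP traffic by subset weight."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    {"destination": {"host": host, "subset": subset},
                     "weight": weight}
                    for subset, weight in weights.items()
                ]
            }],
        },
    }


# Hypothetical gray release: 10% of traffic to subset v2.
manifest = virtual_service("order-route", "order", {"v1": 90, "v2": 10})
print(json.dumps(manifest, indent=2))
```

The subsets referenced here would need a matching DestinationRule; Pilot translates both resources into Envoy configuration and distributes it to the sidecars.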
Hybrid‑Cloud Deployment Practice
Yanxuan’s cloud‑migration roadmap consists of three stages: IDC (private‑cloud VM deployment), hybrid‑cloud (mixed VM and container workloads with services spanning environments), and multi‑cloud (full container deployment across providers). Key steps include embracing cloud‑native, building a unified service‑governance platform, creating a unified deployment platform, and implementing gray‑release mechanisms.
Quality Assurance System
Establish CI/CD pipelines with unit and integration testing.
Automated performance benchmarks and continuous monitoring.
Comprehensive monitoring and alerting for infrastructure health.
Version‑upgrade mechanisms, including Envoy hot‑updates, gray‑release, and multi‑environment promotion.
Business regression verification processes.
Pitfalls Encountered
An Envoy bug caused crashes under certain load when access‑log configuration was enabled; the community plans to remove the faulty assertion.
Mixer became a performance bottleneck when policy checks were enabled; moving to Istio RBAC mitigated the issue.
Planning and Outlook
Future work focuses on performance (eBPF/XDP or DPDK + F‑Stack optimizations) and richer governance features. Sidecar mode will adopt the eBPF/XDP path, while gateway mode will explore DPDK + F‑Stack.
Conclusion
The talk presented Yanxuan’s Service Mesh evolution, its role in hybrid‑cloud deployment, encountered challenges, and ongoing work on performance and governance. The experience demonstrates that Service Mesh maturity now supports large‑scale production use and offers valuable lessons for the community.
Yanxuan Tech Team
NetEase Yanxuan Tech Team shares e-commerce tech insights and quality finds for mindful living. This is the public portal for NetEase Yanxuan's technology and product teams, featuring weekly tech articles, team activities, and job postings.
