
How NetEase Yanxuan Scaled with Service Mesh: Evolution, Challenges, and Cloud‑Native Solutions

This article details Yanxuan's journey from early exploration to full rollout of Service Mesh, covering architectural choices, hybrid‑cloud deployment, performance comparisons between cNginx and Envoy, ongoing evolution plans, and lessons learned for large‑scale microservice systems.

Yanxuan Tech Team

Background

Yanxuan adopted Service Mesh in 2016 as the foundation of its micro-service transformation, and the mesh has since supported rapid business growth. This presentation outlines the adoption process, the architecture's evolution, hybrid-cloud challenges, and the solutions implemented.

Phase 1: Exploration (late 2015 ~ Apr 2016)

Yanxuan began with a small team (~10 people) using a monolithic architecture and a few basic services (push, file storage, message center). The internal mail division employed a mixed SOA approach with both a centralized ESB and decentralized Spring Cloud, which exposed typical service-governance problems. Three questions shaped the technology selection:

Service governance: RPC framework vs. dedicated governance platform.

Multi‑language support: Java core services coexist with Python recommendation, C++ access, and Node.js applications, raising cross‑language governance costs.

Open‑source vs. self‑built: Should the foundation be built from scratch or extended from mature open‑source projects, and what value does community contribution bring?

Phase 2: Small‑scale trial (Apr 2016 ~ early 2017)

In July 2016 the first‑generation Service Mesh was released and piloted in NetEase Mail, NetEase YouQian, and parts of Yanxuan, delivering solid operational experience and a basic control platform.

Phase 3: Full rollout (2017 ~ present)

Team size grew from 10 to over 200. From early 2017 the first‑generation mesh was fully deployed; in 2019, with the maturation of the container cloud platform “Light Boat,” Yanxuan launched a cloud‑native strategy and began upgrading the mesh architecture.

First‑generation Service Mesh Architecture

Built on Consul (service discovery, registration, routing) and Nginx (high-performance reverse proxy, load balancing, rate limiting). Consul and Nginx were fused into a local proxy called cNginx, and a management platform exposed these capabilities.

Data plane: cNginx plus a Consul client form a sidecar deployed alongside each service instance (the client-sidecar model).

Control plane: Provides service registration/discovery, call control, and governance control.

Service Governance Capabilities

The mesh offers registration/discovery, health checks, routing, load‑balancing, failover, client‑side rate limiting, timeout, retries, and integrates access control, resource isolation, monitoring, and fault diagnosis via middleware.

Benefits of Service Mesh for Yanxuan

Eliminates legacy technical debt by adding governance without code changes.

Reduces middleware development and evolution costs; decouples business from middleware.

Allows independent evolution of infrastructure and business architectures.

Provides unified governance for multi‑language stacks, enabling non‑Java services to benefit from the same capabilities.

Continuous Evolution Needs

Future enhancements include richer traffic management (traffic‑splitting, coloring), additional governance features (rate‑limiting, circuit‑breaking, fault injection), broader protocol support, and stronger control‑plane capabilities, as well as full cloud‑native and multi‑cloud support.
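Traffic-splitting of the kind listed here is expressed declaratively in the Istio-based mesh Yanxuan later moved to. A minimal sketch, assuming a hypothetical `reviews` service whose `v1`/`v2` subsets are already defined in a DestinationRule (names and weights are illustrative, not from the talk):

```yaml
# Weighted traffic split: send 90% of requests to v1, 10% to v2.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews          # in-mesh service name (assumed)
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
```

Shifting the weights gradually (90/10, then 50/50, then 0/100) is the standard way to roll a new version out without touching application code.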

Industry‑wide Service Mesh Evolution

The term "Service Mesh" was first publicly defined in September 2016 by Buoyant, the company behind Linkerd, which was later contributed to the CNCF. Subsequent data-plane projects include Lyft's Envoy and Nginx's nginMesh. Istio, introduced in 2017, added a powerful control plane and quickly became the de-facto standard.

Istio‑based Cloud‑Native Service Mesh

Istio integrates tightly with Kubernetes, supplementing its native traffic management with advanced features such as rate limiting, circuit breaking, and fault injection. Its data plane runs as an Envoy sidecar injected into each pod, making the mesh transparent to applications.

Data plane: Envoy supports HTTP 1.x/2.x, gRPC, and enforces policies from the control plane.

Control plane components: Pilot (service discovery & config distribution), Mixer (access control & telemetry), Citadel (certificate management), Galley (configuration validation).
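The circuit-breaking and fault-injection features mentioned above map to standard Istio resources. A sketch, with a hypothetical `orders` service; the thresholds and percentages are illustrative assumptions, not values from the talk:

```yaml
# Illustrative circuit breaking: cap connections and eject unhealthy hosts.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent TCP connections
      http:
        http1MaxPendingRequests: 50  # queue limit before rejecting
    outlierDetection:
      consecutive5xxErrors: 5        # eject a host after 5 consecutive 5xx
      interval: 10s                  # scan interval
      baseEjectionTime: 30s          # minimum ejection duration
      maxEjectionPercent: 50
---
# Illustrative fault injection: delay 10% of requests by 2 s to test resilience.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
  - orders
  http:
  - fault:
      delay:
        percentage:
          value: 10
        fixedDelay: 2s
    route:
    - destination:
        host: orders
```

Because both policies live in configuration, they can be tuned or removed at runtime without redeploying the service.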

Performance Comparison: cNginx vs. Envoy

Initial tests on an 8-core/16 GB host at 40 concurrent connections and 1,600 RPS showed that cNginx added ~0.4 ms of latency over direct calls, while unoptimized Envoy added ~0.6 ms. After optimizations (SR-IOV container networking and back-porting Envoy's connection load balancer), Envoy's added latency was only 0.2-0.6 ms at low concurrency, and it matched VM-based direct calls at high concurrency, confirming that the performance overhead is acceptable.

Current Evolution Direction

Yanxuan is moving to an Istio + Envoy solution with:

Envoy as the data‑plane proxy.

Pilot as the core control‑plane component.

Platform extensions via Kubernetes CRDs and Mesh Configuration Protocol (MCP).

High‑availability design based on Kubernetes and Istio mechanisms.
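Extending the platform via CRDs, as listed above, means defining custom resources that platform components watch and reconcile. A minimal sketch of a hypothetical CRD (the `MeshPolicy` kind and `yanxuan.example.com` group are invented for illustration, not actual Yanxuan resources):

```yaml
# Hypothetical custom resource type for platform-specific mesh policy.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: meshpolicies.yanxuan.example.com
spec:
  group: yanxuan.example.com
  scope: Namespaced
  names:
    plural: meshpolicies
    singular: meshpolicy
    kind: MeshPolicy
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            # accept arbitrary policy fields; a controller validates them
            x-kubernetes-preserve-unknown-fields: true
```

Instances of such a resource can then be translated into standard Istio configuration by a custom controller, or pushed to Pilot over MCP.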

Hybrid‑Cloud Deployment Practice

Yanxuan’s cloud‑migration roadmap consists of three stages: IDC (private‑cloud VM deployment), hybrid‑cloud (mixed VM and container workloads with services spanning environments), and multi‑cloud (full container deployment across providers). Key steps include embracing cloud‑native, building a unified service‑governance platform, creating a unified deployment platform, and implementing gray‑release mechanisms.
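The gray-release step is commonly implemented in an Istio mesh with header-based routing between a stable and a canary subset. A sketch, assuming a hypothetical `checkout` service and an `x-gray` request header chosen for illustration:

```yaml
# Subsets keyed on the pod "version" label (assumed labeling scheme).
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary
---
# Requests carrying x-gray: "true" go to the canary; everything else to stable.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
  - checkout
  http:
  - match:
    - headers:
        x-gray:
          exact: "true"
    route:
    - destination:
        host: checkout
        subset: canary
  - route:
    - destination:
        host: checkout
        subset: stable
```

The same pattern works across VM and container workloads as long as both register into the mesh, which is what makes it useful in the hybrid-cloud stage.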

Quality Assurance System

Establish CI/CD pipelines with unit and integration testing.

Automated performance benchmarks and continuous monitoring.

Comprehensive monitoring and alerting for infrastructure health.

Version‑upgrade mechanisms, including Envoy hot‑updates, gray‑release, and multi‑environment promotion.

Business regression verification processes.

Pitfalls Encountered

An Envoy bug caused crashes under certain loads when access-log configuration was enabled; the community planned to remove the faulty assertion.

Mixer became a performance bottleneck when policy checks were enabled; moving to Istio RBAC mitigates the issue.

Planning and Outlook

Future work focuses on performance (eBPF/XDP or DPDK + F-Stack optimizations) and richer governance features. Sidecar mode will take the eBPF/XDP path, while gateway mode will explore DPDK + F-Stack.

Conclusion

The talk presented Yanxuan’s Service Mesh evolution, its role in hybrid‑cloud deployment, encountered challenges, and ongoing work on performance and governance. The experience demonstrates that Service Mesh maturity now supports large‑scale production use and offers valuable lessons for the community.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Performance Optimization, Hybrid Cloud
Written by

Yanxuan Tech Team

NetEase Yanxuan Tech Team shares e-commerce tech insights and quality finds for mindful living. This is the public portal for NetEase Yanxuan's technology and product teams, featuring weekly tech articles, team activities, and job postings.