Practical Experience with Envoy in Soul: Cloud‑Native Traffic Management and Service Mesh
This article shares Soul's two years of practice with the cloud‑native Envoy proxy for high‑performance, high‑throughput, and highly available traffic management across north‑south and east‑west flows. It covers architecture, dynamic service discovery, load balancing, health checks, WASM extensions, service‑mesh integration, Redis proxying, and future directions.
1. What is Envoy
Envoy is an open‑source edge and service proxy designed to make service communication in micro‑service architectures more reliable and efficient. Originally developed by Lyft, it is now a CNCF project.
Key features of Envoy include:
Dynamic configuration and service discovery: via the xDS API (EDS, CDS, LDS, RDS), Envoy can dynamically obtain cluster, endpoint, route, and listener information, adapting to Kubernetes scaling.
Load balancing: supports Round Robin, Random, Weighted Least Request, Ring Hash, Maglev, and cluster‑provided strategies.
Health checks: both active and passive health‑checking mechanisms automatically detect service health and adjust traffic.
Service mesh: each service runs an Envoy sidecar to handle inbound and outbound traffic.
Advanced traffic management: rich routing rules based on path, weight, headers, etc.
Observability: detailed statistics and tracing, compatible with many monitoring tools.
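To make the dynamic-configuration model concrete, a minimal Envoy bootstrap that pulls listeners and clusters over ADS from an xDS server might look like the following sketch (node IDs, cluster names, and the control-plane address are illustrative, not Soul's actual setup):

```yaml
node:
  id: gateway-1
  cluster: soul-gateway
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds-server
  lds_config:
    ads: {}
    resource_api_version: V3
  cds_config:
    ads: {}
    resource_api_version: V3
static_resources:
  clusters:
  - name: xds-server            # static cluster pointing at the control plane
    type: STRICT_DNS
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}   # xDS is served over gRPC (HTTP/2)
    load_assignment:
      cluster_name: xds-server
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: xds.internal   # hypothetical control-plane address
                port_value: 18000
```

With this bootstrap, routes, clusters, and endpoints arrive from the control plane at runtime; no Envoy restart or reload is needed when they change.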
2. Envoy Use Cases in Soul
2.1 Soul Container Gateway
Soul built a custom "Soul Envoy Gateway" on top of the xDS control‑plane, handling all HTTP and gRPC traffic for the container cluster. It replaced the earlier ingress‑nginx controller, which suffered performance and reliability issues in large‑scale clusters.
Reasons for abandoning NGINX Ingress:
Data‑plane and control‑plane not separated: failures in the control plane could crash the data plane.
Lack of dynamic configuration: frequent NGINX reloads caused request spikes and long‑connection interruptions.
The Soul Envoy Gateway separates data and control planes, leverages Kubernetes Ingress and Gateway API, and provides dynamic service discovery, load balancing, health checks, circuit breaking, retries, and more for over ten business clusters.
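The health-check, circuit-breaking, and load-balancing features mentioned above all live on the cluster resource. A hedged sketch of such a cluster (thresholds, paths, and names are illustrative assumptions):

```yaml
clusters:
- name: backend
  type: EDS                      # endpoints delivered dynamically via EDS
  eds_cluster_config:
    eds_config:
      ads: {}
      resource_api_version: V3
  lb_policy: LEAST_REQUEST
  circuit_breakers:
    thresholds:
    - max_connections: 10000
      max_pending_requests: 1000
      max_retries: 3
  health_checks:                 # active health checking
  - timeout: 2s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check:
      path: /healthz
  outlier_detection:             # passive health checking
    consecutive_5xx: 5
    base_ejection_time: 30s
```

Active checks probe `/healthz` on a schedule, while outlier detection ejects hosts that return consecutive 5xx responses on live traffic; using both catches failures that either mechanism alone would miss.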
2.1.1 xDS‑based Service Discovery
Envoy updates listeners, filters, routes, clusters, and endpoints via the control plane. Soul uses a bottom‑up data structure and asynchronous notification queues to keep service‑discovery updates as fine‑grained as possible.
Clear hierarchical relationships: the bottom‑up structure defines resource dependencies, reducing errors.
Elastic scaling challenges: sorting endpoint data in Kubernetes reduces update frequency.
Push‑empty protection: prevents pushing empty endpoint lists during scaling or network extremes, avoiding service outages.
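Push‑empty protection boils down to one guard in the control plane before an EDS snapshot is published. A minimal Go sketch of the idea (`shouldPush` is a hypothetical helper, not Soul's actual code):

```go
package main

import "fmt"

// shouldPush decides whether a new EDS endpoint list may be pushed
// to Envoy. As push-empty protection, an update that would wipe out
// a previously non-empty endpoint set is suppressed, so a transient
// discovery glitch cannot blackhole a healthy service.
func shouldPush(prev, next []string) bool {
	// Allow the push when the cluster was already empty, or when
	// the new snapshot still contains at least one endpoint.
	return len(prev) == 0 || len(next) > 0
}

func main() {
	prev := []string{"10.0.0.1:8080", "10.0.0.2:8080"}
	fmt.Println(shouldPush(prev, nil))                       // false: empty push suppressed
	fmt.Println(shouldPush(prev, []string{"10.0.0.3:8080"})) // true: normal update
}
```

A real control plane would pair this with a timeout or operator override, so that an intentionally scaled-to-zero service can eventually converge.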
2.1.2 Traffic Warm‑up and Service Retries
Slow Java service startups can cause traffic spikes and failures. Envoy’s traffic warm‑up feature smooths traffic during startup, and default retry policies handle network errors such as timeouts, unreachable hosts, TCP resets, and DNS failures.
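Both behaviors map onto standard Envoy knobs: slow start on the cluster's load-balancer config, and a retry policy on the route. Illustrative fragments (cluster and route names are assumptions):

```yaml
# Cluster side: ramp traffic to newly added hosts up gradually,
# giving slow-starting Java services time to warm their JIT/caches.
clusters:
- name: java-backend
  lb_policy: ROUND_ROBIN
  round_robin_lb_config:
    slow_start_config:
      slow_start_window: 60s   # warm-up period for each new endpoint

# Route side: retry transient network errors transparently.
routes:
- match: { prefix: "/" }
  route:
    cluster: java-backend
    retry_policy:
      retry_on: connect-failure,refused-stream,reset
      num_retries: 2
      per_try_timeout: 1s
```

During the slow-start window a new endpoint receives a progressively increasing share of traffic rather than its full balanced share from the first second.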
2.1.3 WASM Filter Extensions
Envoy supports WebAssembly (WASM) extensions for flexible filtering. Soul uses a WASM filter to resolve the real client IP (RealIP) behind multiple forwarding layers. TinyGo compiles Go to WASM and Emscripten compiles C++ to WASM, enabling high‑performance custom traffic control.
2.2 Service Mesh
Soul initially deployed a commercial Istio distribution but encountered service‑registration delays, and replaced it with a custom Service Mesh control plane built on Envoy, compatible with istio‑cni.
Control plane: extends Gateway API with VirtualService definitions for east‑west traffic.
Data plane: sidecar Envoy proxies handle inbound and outbound traffic, providing transparent traffic management.
2.2.1 gRPC Load Balancing
Envoy integrates tightly with gRPC (HTTP/2) and supports multiple load‑balancing algorithms, used by Soul’s Triton‑based inference services.
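Because Envoy balances at the HTTP/2 stream level, gRPC calls are spread per request rather than per connection. A cluster sketch that forces HTTP/2 upstream (the cluster name and endpoints are illustrative):

```yaml
clusters:
- name: triton-infer
  type: STRICT_DNS
  lb_policy: LEAST_REQUEST       # per-request balancing across inference pods
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options: {}   # gRPC requires HTTP/2 upstream
  load_assignment:
    cluster_name: triton-infer
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: triton.svc.cluster.local, port_value: 8001 }
```

This avoids the classic problem of naive L4 balancing, where long-lived gRPC connections pin all of a client's requests to a single backend.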
2.2.2 Large‑scale xDS Service Discovery
Incremental update subscription: Envoy receives real‑time incremental updates via xDS.
xDS horizontal scaling: master‑slave replication allows dynamic scaling of xDS servers.
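Incremental (delta) subscription is selected via the ADS `api_type`, so only changed resources cross the wire instead of full state snapshots. A minimal fragment, assuming an `xds-server` cluster is defined elsewhere in the bootstrap:

```yaml
dynamic_resources:
  ads_config:
    api_type: DELTA_GRPC      # incremental xDS: only deltas are pushed
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds-server   # assumed static cluster pointing at the control plane
```

At mesh scale this matters: with thousands of sidecars, re-sending the full endpoint set on every pod churn event would dominate control-plane bandwidth.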
2.2.3 Traffic Monitoring
When using Envoy as a Service Mesh proxy, traffic monitoring and metrics collection are essential for health monitoring, but metric explosion can strain monitoring systems, so pruning Envoy metrics is critical.
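One way to prune metrics at the source is the bootstrap `stats_matcher`, which controls which stats Envoy instantiates at all. An illustrative inclusion list (the prefixes are hypothetical, not Soul's actual selection):

```yaml
stats_config:
  stats_matcher:
    inclusion_list:
      patterns:
      - prefix: "cluster.outbound"   # keep per-upstream traffic stats
      - prefix: "http.ingress"       # keep inbound HTTP stats
      - exact: "server.live"         # keep the liveness gauge
```

Stats excluded here are never allocated, which reduces both Envoy's own memory footprint and the cardinality scraped into the monitoring backend.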
2.3 Microservice Gateway
The gateway converts RPC calls to HTTP, embeds authentication, rate limiting, and monitoring, and supports protocol conversion (e.g., Dubbo), request signing, and authorization via Envoy’s external authorization mechanism.
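Request signing and authorization are delegated through the `ext_authz` HTTP filter, which consults an external service before a request is routed. A hedged fragment (the `auth-service` cluster name is an assumption):

```yaml
http_filters:
- name: envoy.filters.http.ext_authz
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
    transport_api_version: V3
    grpc_service:
      envoy_grpc:
        cluster_name: auth-service   # external gRPC authorization service
      timeout: 0.25s
    failure_mode_allow: false        # deny requests if the auth service is down
```

The external service returns allow/deny per request and can inject or strip headers, which keeps authentication logic out of every business service.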
2.4 Redis Proxy
Envoy is used as a Redis proxy to handle millions of QPS, providing prefix routing, traffic mirroring, auto‑refresh, and read‑strategy controls for horizontal scaling.
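Prefix routing, mirroring, and read strategy all live in the `redis_proxy` network filter. A sketch of the configuration surface (key prefixes and cluster names are illustrative, not Soul's topology):

```yaml
filters:
- name: envoy.filters.network.redis_proxy
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
    stat_prefix: redis
    settings:
      op_timeout: 0.1s
      enable_redirection: true       # follow MOVED/ASK redirects from Redis Cluster
      read_policy: PREFER_REPLICA    # read-strategy control for horizontal scaling
    prefix_routes:
      routes:
      - prefix: "user:"              # keys starting with "user:" go to a dedicated cluster
        cluster: redis-user
        request_mirror_policy:
        - cluster: redis-mirror      # shadow traffic for migration/testing
          exclude_read_commands: true
      catch_all_route:
        cluster: redis-default
```

Prefix routing lets hot key spaces be split onto dedicated Redis clusters, and mirroring allows a new cluster to be validated under real write traffic before cutover.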
3. Challenges and Sharing
3.1 Compatibility with Ingress NGINX Annotations
Soul maps NGINX Ingress annotations (path rewrite, traffic control, timeout settings) to Envoy configurations, enabling seamless migration without full re‑configuration.
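As one example of this mapping, the common `rewrite-target` annotation translates onto a route-level `regex_rewrite` (the hostname, cluster, and patterns below are illustrative):

```yaml
# ingress-nginx annotation being emulated:
#   nginx.ingress.kubernetes.io/rewrite-target: /$2
route_config:
  virtual_hosts:
  - name: app
    domains: ["app.example.com"]
    routes:
    - match:
        safe_regex:
          regex: "^/api(/|$)(.*)"
      route:
        cluster: app-backend
        regex_rewrite:
          pattern:
            regex: "^/api(/|$)(.*)"
          substitution: "/\\2"       # strip the /api prefix, like rewrite-target /$2
        timeout: 5s                  # maps proxy-read-timeout-style annotations
```

Generating such routes automatically from the annotations is what lets existing Ingress manifests migrate without being rewritten by service owners.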
3.2 eBPF and Envoy Collaboration
Offloading L4 processing to eBPF while keeping L7 logic in Envoy improves performance and scalability. Use cases include TCP long‑tail latency reduction, DDoS mitigation, and service‑mesh RT optimization.
4. Outlook
In 2024, large‑model AIGC remains hot; cloud computing will support AI beyond GPU compute, covering big data, vector search, and traffic governance. Soul plans to use Envoy for AI‑native traffic proxying, continue Service Mesh development, and provide unified proxy services for VM‑based scenarios.
Soul Technical Team
Technical practice sharing from Soul