Implementing a Cloud-Native Istio Gateway for 58.com Deep Learning Inference Platform
This article details the evolution of 58.com’s deep learning inference platform from its original SCF‑based architecture to a cloud‑native Istio gateway (Architecture 2.0), covering the design choices, traffic management, adaptive rate limiting, observability, model pre‑warming, and the resulting performance improvements.
The 15th China System Architect Conference (SACC2022) featured a talk by Wei Zhubin on the deep learning inference platform built by 58.com’s AI Lab, focusing on its cloud‑native gateway implementation using Istio.
Background: The platform, named WPAI, provides a unified GPU/CPU resource pool for offline training and online inference, supporting automatic scaling, mixed deployment, and a suite of algorithm application services (NLP, vision, ranking, etc.). Early on, inference services were built on a traditional SCF‑based gateway (Architecture 1.0).
Architecture 1.0: SCF acted as the API gateway, converting HTTP requests to gRPC for the model services. It offered service registration via Kubernetes watch, configuration sync via WConfig, and protocol‑conversion plugins packaged as JARs. While it gave the platform its first unified inference entry point, it suffered from complex onboarding, performance overhead from protocol conversion, memory pressure in Netty buffers, and tight coupling with third‑party libraries.
Motivation for Upgrade: Growing traffic, an increasing model count (1000+ models, 4000+ nodes), and the need for better scalability and observability prompted a redesign. The team chose Istio as the cloud‑native gateway to address these shortcomings.
Architecture 2.0: The new design separates three layers: model service, gateway, and business application. Istio’s control plane (Istiod) provides service discovery, traffic management, and security, while the data‑plane Ingress Gateway (Envoy) handles request routing, load balancing, and rate limiting. Sidecar injection was deliberately avoided for inference workloads to eliminate extra latency and resource overhead.
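Keeping inference pods sidecar-free is typically done with Istio’s standard injection annotation on the workload’s pod template. The Deployment fragment below is an illustrative sketch — the names, image, and port are assumptions, not details from the talk:

```yaml
# Illustrative Deployment fragment: the annotation opts this workload
# out of automatic sidecar injection, so only the shared Ingress
# Gateway (Envoy) sits on the request path.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server                       # hypothetical name
spec:
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
      annotations:
        sidecar.istio.io/inject: "false"   # skip the Envoy sidecar for this pod
    spec:
      containers:
      - name: model-server
        image: wpai/model-server:latest    # placeholder image
        ports:
        - containerPort: 9000              # gRPC serving port (assumed)
```

With injection disabled, the pod avoids the sidecar’s per-hop latency and memory cost; routing and rate limiting are handled entirely at the shared gateway.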
Traffic‑Management Enhancements: Istio’s native traffic‑management features (routing, load balancing, fault injection, etc.) replace the custom SCF logic. Multi‑tenant isolation is achieved by deploying dedicated Gateways per namespace where needed. Adaptive rate limiting is implemented via EnvoyFilters that adjust token‑bucket limits based on real‑time replica counts observed from the platform’s monitoring system.
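As a sketch of what such an EnvoyFilter can look like, the fragment below attaches Envoy’s local rate-limit filter to the ingress gateway with a fixed token bucket. In the adaptive scheme described above, a platform controller would patch the bucket parameters as replica counts change; all names and numbers here are illustrative:

```yaml
# Illustrative EnvoyFilter: inserts Envoy's local rate-limit filter
# into the ingress gateway's HTTP filter chain with a token bucket.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: inference-rate-limit       # hypothetical name
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          stat_prefix: http_local_rate_limiter
          token_bucket:
            max_tokens: 1000       # example ceiling; adjusted per replica count
            tokens_per_fill: 1000
            fill_interval: 1s
          filter_enabled:
            runtime_key: local_rate_limit_enabled
            default_value:
              numerator: 100
              denominator: HUNDRED
          filter_enforced:
            runtime_key: local_rate_limit_enforced
            default_value:
              numerator: 100
              denominator: HUNDRED
```

Making the limit adaptive then reduces to rewriting `max_tokens`/`tokens_per_fill` from the monitoring feedback loop, rather than hard-coding a global quota.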
Model Pre‑warming: To avoid cold‑start latency, the platform leverages Kubernetes startup and readiness probes. The startup probe holds the pod back until the model has finished loading, while the readiness probe controls when the service starts receiving live traffic. Model‑specific pre‑warm clients are generated from configuration files, enabling automatic warm‑up traffic before a node receives real requests.
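A minimal sketch of that probe setup, assuming the model server exposes hypothetical `/startup` and `/ready` HTTP endpoints (both are assumptions for illustration):

```yaml
# Illustrative container spec: the startup probe gates the other probes
# until the model is loaded; the readiness probe gates live traffic.
containers:
- name: model-server
  image: wpai/model-server:latest    # placeholder image
  startupProbe:
    httpGet:
      path: /startup                 # hypothetical: returns 200 once weights are in memory
      port: 8080
    periodSeconds: 5
    failureThreshold: 60             # allow up to 5 minutes for large models to load
  readinessProbe:
    httpGet:
      path: /ready                   # hypothetical: returns 200 once warm-up requests complete
      port: 8080
    periodSeconds: 5
```

Wiring the readiness endpoint to the generated pre-warm client means a replica only joins the gateway’s load-balancing pool after its warm-up traffic has completed.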
Observability: Instead of relying on Istio’s sidecar metrics, the team collects structured JSON access logs from the Envoy gateway and unstructured logs from the inference services, forwards them to Kafka, and processes them with Flink. Metrics, logs, and traces are visualized in Grafana and Elasticsearch, providing fine‑grained monitoring at the department, task, and replica levels.
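Structured gateway access logs of this kind can be enabled through Istio’s mesh configuration. The field selection below is a hedged example built from standard Envoy command operators, not the platform’s actual log schema:

```yaml
# Illustrative mesh-config fragment: emit gateway access logs as JSON
# to stdout, where a log agent can ship them to Kafka.
meshConfig:
  accessLogFile: /dev/stdout
  accessLogEncoding: JSON
  accessLogFormat: |
    {
      "start_time": "%START_TIME%",
      "authority": "%REQ(:AUTHORITY)%",
      "method": "%REQ(:METHOD)%",
      "path": "%REQ(:PATH)%",
      "response_code": "%RESPONSE_CODE%",
      "duration_ms": "%DURATION%",
      "upstream_host": "%UPSTREAM_HOST%"
    }
```

Because each field is a named JSON key, the downstream Flink jobs can aggregate by authority or upstream host to produce the department-, task-, and replica-level views described above.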
Results: The 2.0 architecture cut inference latency by over 50%, improved stability through resource isolation, and simplified traffic governance with Istio’s built‑in capabilities. The platform now serves over 1000 models and handles billions of requests per day at peak.
Conclusion: By adopting a cloud‑native Istio gateway, 58.com’s inference platform achieved significant performance, scalability, and observability gains, and the team plans to continue leveraging emerging Kubernetes and Istio features to further enhance the system.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.