How 58 Tongcheng Built a Cloud‑Native Deep Learning Inference Platform with Istio
This article details the evolution of 58 Tongcheng's deep learning inference platform, from the initial WPAI-based architecture to a cloud-native, Istio-powered design. It covers the background, technical challenges, architectural redesign, traffic-management features, adaptive rate limiting, model warm-up, and observability improvements.
Background
58 Tongcheng’s AI Lab started building the WPAI (Wuba Platform of AI) in 2017 to provide a one‑stop development environment for machine‑learning models, centralizing GPU/CPU/NPU resources and supporting both offline training and online inference with elastic scaling.
WPAI Platform Overview
WPAI consists of a basic compute platform that manages hardware resources and a set of algorithm application platforms (e.g., WubaNLP, Phoenix image platform, ranking learning platform). These platforms expose common models via a web UI, allowing developers to configure training and inference with minimal effort.
Additional subsystems such as the vector‑search service vSearch and the AB‑testing platform SunDial further improve AI engineering efficiency.
Inference Architecture 1.0
The first version of the inference platform was built on the SCF (a Java‑based RPC framework) gateway and Kubernetes. Its design included:
Control plane: service registration via K8s List/Watch, model runtime parameters synced through the WConfig configuration center, and a protocol‑conversion JAR plugin hub built on the WOS object‑storage service.
Data plane: a request‑processing pipeline handling authentication, per‑second rate limiting, JAR hot‑loading, request/response conversion, weighted load balancing, traffic forwarding, and error handling.
While 1.0 solved the lack of a unified inference platform, it revealed several shortcomings as usage grew:
Complex model‑access integration due to required JAR plugins and lack of HTTP support.
Performance penalties from SCF‑to‑gRPC conversion and Netty buffer allocation, leading to GC‑induced latency spikes.
High operational cost because the gateway was tightly coupled to third-party libraries (e.g., Log4j), so any change required a full gateway upgrade.
Motivation for a New Architecture
Increasing tenant count and model‑iteration frequency demanded a more scalable, observable, and maintainable solution. The team therefore decided to adopt a cloud‑native gateway based on Istio.
Inference Architecture 2.0 – Istio Cloud‑Native Gateway
The 2.0 architecture separates the system into three layers: model‑service layer (unchanged from 1.0), Istio‑based gateway layer, and business‑application layer.
Key decisions:
Sidecar injection was avoided because inference traffic is purely end‑to‑end, and sidecars would add unnecessary latency and resource overhead.
Istio's control plane, Istiod, consolidates the roles of Citadel (identity and credential management), Galley (configuration validation), and Pilot (service discovery and traffic management). The data plane consists of an Envoy Ingress Gateway and a Pilot-agent.
The gateway layer handles tenant isolation, dynamic traffic routing, and integrates with a K8s Manager Service that standardizes operations on K8s and Istio resources.
Traffic‑Management Enhancements
Istio enables rich traffic‑management capabilities such as:
Gateway resources for L4‑L6 load‑balancing.
VirtualService for L7 routing, header manipulation, and retries.
DestinationRule for load‑balancing strategies and circuit‑breaker settings.
EnvoyFilter for custom plugins (e.g., metrics, rate limiting).
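To make the VirtualService and DestinationRule roles concrete, here is a minimal sketch that builds both manifests as plain Python dicts. The resource names, hosts, port, weights, and retry and ejection values are hypothetical and not taken from the 58 Tongcheng platform. Only the field names follow the Istio networking API.

```python
def build_virtual_service(name, namespace, gateway, host, routes, attempts=3):
    """Build a VirtualService manifest as a plain dict.

    routes: list of (destination_host, port, weight) tuples.
    """
    # Conservative check: require the weights to add up to 100.
    if sum(weight for _, _, weight in routes) != 100:
        raise ValueError("route weights must sum to 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "gateways": [gateway],
            "http": [{
                # L7 routing: split traffic across destinations by weight.
                "route": [
                    {"destination": {"host": dest, "port": {"number": port}},
                     "weight": weight}
                    for dest, port, weight in routes
                ],
                # Retry policy (values are illustrative assumptions).
                "retries": {"attempts": attempts, "perTryTimeout": "2s",
                            "retryOn": "5xx,connect-failure"},
            }],
        },
    }


def build_destination_rule(name, namespace, host):
    """Build a DestinationRule with a load-balancing policy and outlier ejection."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "loadBalancer": {"simple": "ROUND_ROBIN"},
                # Circuit-breaker style ejection of failing endpoints.
                "outlierDetection": {
                    "consecutive5xxErrors": 5,
                    "interval": "10s",
                    "baseEjectionTime": "30s",
                },
            },
        },
    }


# Hypothetical tenant, gateway, and model service names.
vs = build_virtual_service(
    name="demo-model", namespace="tenant-a", gateway="tenant-a-gateway",
    host="demo-model.example.internal",
    routes=[("demo-model-v1.tenant-a.svc.cluster.local", 8500, 90),
            ("demo-model-v2.tenant-a.svc.cluster.local", 8500, 10)],
)
dr = build_destination_rule(
    "demo-model-v1", "tenant-a", "demo-model-v1.tenant-a.svc.cluster.local")
```

A service such as the K8s Manager Service could serialize dicts like these and submit them to the Kubernetes API. How the real platform does this is not described in the source.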
Multi‑tenant isolation is achieved by deploying separate Gateway instances per namespace where needed, while sharing a gateway for low‑traffic namespaces.
Adaptive Rate Limiting
To avoid the performance drawbacks of global rate limiting, a local token‑bucket limiter is used via EnvoyFilter. The system monitors task replica counts through the platform’s observability stack, debounces changes, computes the total QPS (replicas × per‑replica QPS), and updates the EnvoyFilter configuration via the K8s Manager Service.
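The update loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform's implementation: the `ReplicaDebouncer` class, the 30-second quiet period, and the per-replica QPS of 50 are hypothetical. The `token_bucket` dict mirrors the `max_tokens`, `tokens_per_fill`, and `fill_interval` fields of Envoy's local rate limit filter.

```python
from dataclasses import dataclass
from typing import Optional


def total_qps(replicas: int, per_replica_qps: int) -> int:
    """Total budget for a task: replicas x per-replica QPS."""
    return replicas * per_replica_qps


def token_bucket(qps: int) -> dict:
    """Token-bucket settings that refill `qps` tokens every second."""
    return {"max_tokens": qps, "tokens_per_fill": qps, "fill_interval": "1s"}


@dataclass
class ReplicaDebouncer:
    """Report a new replica count only after it has stayed unchanged
    for `quiet_seconds` (hypothetical helper)."""
    quiet_seconds: float
    applied: Optional[int] = None   # last count pushed to the gateway
    pending: Optional[int] = None   # candidate count waiting to settle
    pending_since: float = 0.0

    def observe(self, replicas: int, now: float) -> Optional[int]:
        if replicas == self.applied:
            self.pending = None      # change reverted, nothing to do
            return None
        if replicas != self.pending:
            self.pending = replicas  # new candidate, restart the timer
            self.pending_since = now
            return None
        if now - self.pending_since >= self.quiet_seconds:
            self.applied = replicas
            self.pending = None
            return replicas
        return None


debouncer = ReplicaDebouncer(quiet_seconds=30)
updates = []
# (timestamp in seconds, observed replica count)
for now, replicas in [(0, 4), (10, 4), (30, 4), (40, 6), (45, 4), (50, 6), (80, 6)]:
    settled = debouncer.observe(replicas, now)
    if settled is not None:
        updates.append(token_bucket(total_qps(settled, per_replica_qps=50)))
```

In this run the brief flap at t=40 to t=45 is ignored and only two updates are produced. Note that a local limiter is enforced by each Envoy instance separately, so a gateway with several replicas would presumably need to divide the budget among them. The source does not say how the platform handles that.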
Model Warm‑Up (Zero‑Loss Deployment)
New inference nodes load model files on the first request, causing latency spikes. The platform introduces a warm-up flow that triggers model loading before the node receives traffic. This is achieved by configuring Kubernetes Startup and Readiness probes: the Readiness probe's initialDelaySeconds is set to the desired warm-up duration, so the container cannot be marked ready, and therefore receives no traffic, until the warm-up window has passed and the model has had time to load.
Model-specific warm-up clients are generated from a declarative configuration file, so algorithm engineers only need to upload the config and the platform sends the warm-up requests.
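As a rough illustration of the probe setup, the sketch below builds the two probe definitions as Python dicts using standard Kubernetes probe fields. The health path, port, probe period, and the choice to give the startup probe twice the warm-up window are assumptions for the example, not values from the platform.

```python
import math


def build_probes(warmup_seconds, port, health_path="/healthz", period_seconds=5):
    """Return startup and readiness probe definitions for a container spec."""
    http_get = {"path": health_path, "port": port}  # hypothetical endpoint
    return {
        "startupProbe": {
            "httpGet": dict(http_get),
            "periodSeconds": period_seconds,
            # Budget before the kubelet restarts the container:
            # failureThreshold x periodSeconds, here twice the warm-up window.
            "failureThreshold": max(1, math.ceil(2 * warmup_seconds / period_seconds)),
        },
        "readinessProbe": {
            "httpGet": dict(http_get),
            # No readiness check runs before the warm-up window has passed.
            "initialDelaySeconds": warmup_seconds,
            "periodSeconds": period_seconds,
            "failureThreshold": 3,
        },
    }


probes = build_probes(warmup_seconds=60, port=8080)
```

The resulting dict can be merged into a container entry of a Deployment manifest. A delay alone does not prove the model is loaded, so a health endpoint that reports readiness only after loading completes is a natural complement.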
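One way such a generated client could work is sketched below. The config schema (`model`, `rounds`, `requests`) and the `send` callback are hypothetical, since the source does not describe the actual file format or transport. The transport is injected so the example runs without a live model service.

```python
import json
from typing import Any, Callable


def make_warmup_client(config_text: str,
                       send: Callable[[dict], Any]) -> Callable[[], int]:
    """Build a warm-up function from a declarative JSON config.

    `send` delivers one request to the model service (for example over gRPC).
    """
    config = json.loads(config_text)
    model = config["model"]
    payloads = config["requests"]
    rounds = int(config.get("rounds", 1))

    def warm_up() -> int:
        sent = 0
        for _ in range(rounds):
            for payload in payloads:
                send({"model": model, "inputs": payload})
                sent += 1
        return sent  # number of warm-up requests issued

    return warm_up


# Hypothetical config an algorithm engineer might upload.
example_config = """
{
  "model": "demo-ranker",
  "rounds": 2,
  "requests": [{"features": [0.1, 0.2]}, {"features": [0.3, 0.4]}]
}
"""

received = []
warm_up = make_warmup_client(example_config, send=received.append)
sent_count = warm_up()
```

Here `received.append` stands in for a real client call. In a deployment the function would run inside the warm-up window, before the Readiness probe lets traffic in.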
Observability Construction
Observability is built on three pillars: Metrics, Logs, and Traces. The platform collects:
Service‑level metrics from structured JSON gateway logs, processed by Flink to produce multi‑dimensional, hierarchical metrics stored in Elasticsearch and Kafka.
Resource metrics from cAdvisor, scraped by Prometheus and forwarded to Kafka via a Prometheus‑Kafka adapter.
Logs (both structured gateway logs and unstructured inference service logs) shipped to Kafka and visualized through the ELK stack.
Grafana dashboards provide real‑time visibility, supporting alerting, traffic‑management decisions, and auto‑scaling.
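The service-level metrics step can be illustrated with a small windowed aggregation over JSON log lines. This is a plain-Python stand-in for the Flink job, not the job itself. The log field names (`ts`, `tenant`, `model`, `status`, `latency_ms`) and the 60-second window are assumptions made for the example.

```python
import json
from collections import defaultdict


def aggregate(log_lines, window_seconds=60):
    """Group gateway log records by (tenant, model, window start) and
    compute request count, error rate, and average latency."""
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "latency_sum": 0.0})
    for line in log_lines:
        record = json.loads(line)
        # Align the timestamp to the start of its tumbling window.
        window_start = int(record["ts"]) // window_seconds * window_seconds
        bucket = buckets[(record["tenant"], record["model"], window_start)]
        bucket["count"] += 1
        if int(record["status"]) >= 500:
            bucket["errors"] += 1
        bucket["latency_sum"] += float(record["latency_ms"])
    return {
        key: {
            "count": b["count"],
            "error_rate": b["errors"] / b["count"],
            "avg_latency_ms": b["latency_sum"] / b["count"],
        }
        for key, b in buckets.items()
    }


sample_logs = [
    json.dumps({"ts": 120, "tenant": "a", "model": "m1", "status": 200, "latency_ms": 10}),
    json.dumps({"ts": 150, "tenant": "a", "model": "m1", "status": 503, "latency_ms": 30}),
    json.dumps({"ts": 185, "tenant": "a", "model": "m1", "status": 200, "latency_ms": 20}),
]
metrics = aggregate(sample_logs)
```

Keying on tenant and model gives the multi-dimensional view. Rolling the same records up by tenant alone would give the next level of the hierarchy.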
Results and Conclusion
The migration to the Istio-based 2.0 architecture reduced inference latency by more than 50%, improved stability through resource isolation, and added richer traffic-management features that simplify operations. Ongoing work focuses on using newer K8s and Istio capabilities to further improve performance and observability.
ITPUB
