How 58 Tongcheng Built a Cloud‑Native Deep Learning Inference Platform with Istio

This article traces the evolution of 58 Tongcheng's deep learning inference platform from the initial WPAI‑based architecture to a cloud‑native, Istio‑powered design. It covers the background, technical challenges, architectural redesign, traffic‑management features, adaptive rate limiting, model warm‑up, and observability improvements.

ITPUB

Dec 22, 2022

Background

58 Tongcheng’s AI Lab started building the WPAI (Wuba Platform of AI) in 2017 to provide a one‑stop development environment for machine‑learning models, centralizing GPU/CPU/NPU resources and supporting both offline training and online inference with elastic scaling.

WPAI Platform Overview

WPAI consists of a basic compute platform that manages hardware resources and a set of algorithm application platforms (e.g., WubaNLP, Phoenix image platform, ranking learning platform). These platforms expose common models via a web UI, allowing developers to configure training and inference with minimal effort.

Additional subsystems such as the vector‑search service vSearch and the AB‑testing platform SunDial further improve AI engineering efficiency.

Inference Architecture 1.0

The first version of the inference platform was built on the SCF (a Java‑based RPC framework) gateway and Kubernetes. Its design included:

Control plane: service registration via K8s List/Watch, model runtime parameters synced through the WConfig configuration center, and a protocol‑conversion JAR plugin hub built on the WOS object‑storage service.

Data plane: a request‑processing pipeline handling authentication, per‑second rate limiting, JAR hot‑loading, request/response conversion, weighted load balancing, traffic forwarding, and error handling.
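
As an illustration of the weighted load‑balancing step in that pipeline, here is a minimal Python sketch of smooth weighted round‑robin. The article does not say which algorithm the SCF gateway used (and that gateway was written in Java), so the algorithm choice, the class name, and the pod names are all assumptions made for the example.

```python
class SmoothWeightedRoundRobin:
    """Pick backends in proportion to their weights while spreading picks evenly.

    Illustrative only: the source does not name the gateway's real algorithm.
    """

    def __init__(self, weights: dict[str, int]) -> None:
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty mapping of positive integers")
        self._weights = dict(weights)
        # Running score per backend; the highest score wins each round.
        self._current = {name: 0 for name in weights}

    def pick(self) -> str:
        total = sum(self._weights.values())
        # Every backend earns its weight each round.
        for name, weight in self._weights.items():
            self._current[name] += weight
        # The winner pays back the total, so it cannot win every round.
        chosen = max(self._current, key=self._current.__getitem__)
        self._current[chosen] -= total
        return chosen


# Hypothetical replicas of one model service, weighted 5:1:1.
balancer = SmoothWeightedRoundRobin({"pod-a": 5, "pod-b": 1, "pod-c": 1})
picks = [balancer.pick() for _ in range(7)]
print(picks)
```

Over one full cycle of seven picks, each backend is chosen exactly as many times as its weight, and the heavier backend's picks are interleaved with the others rather than sent in one burst.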

While 1.0 solved the lack of a unified inference platform, it revealed several shortcomings as usage grew:

Complex model‑access integration due to required JAR plugins and lack of HTTP support.

Performance penalties from SCF‑to‑gRPC conversion and Netty buffer allocation, leading to GC‑induced latency spikes.

High operational cost because the gateway was tightly coupled to third‑party libraries (e.g., Log4j), so any change required a full gateway upgrade.

Motivation for a New Architecture

Increasing tenant count and model‑iteration frequency demanded a more scalable, observable, and maintainable solution. The team therefore decided to adopt a cloud‑native gateway based on Istio.

Inference Architecture 2.0 – Istio Cloud‑Native Gateway

The 2.0 architecture separates the system into three layers: model‑service layer (unchanged from 1.0), Istio‑based gateway layer, and business‑application layer.

Key decisions:

Sidecar injection was avoided because inference traffic is purely end‑to‑end, and sidecars would add unnecessary latency and resource overhead.

Istio’s control plane, Istiod, combines the functions that earlier Istio releases shipped as separate components: Citadel (identity and credential management), Galley (configuration validation), and Pilot (service discovery and traffic management). The data plane consists of an Envoy Ingress Gateway and a Pilot‑agent.

The gateway layer handles tenant isolation, dynamic traffic routing, and integrates with a K8s Manager Service that standardizes operations on K8s and Istio resources.

Traffic‑Management Enhancements

Istio enables rich traffic‑management capabilities such as:

Gateway resources that configure the edge load balancer's L4‑L6 properties, such as exposed ports, protocols, and TLS settings.

VirtualService for L7 routing, header manipulation, and retries.

DestinationRule for load‑balancing strategies and circuit‑breaker settings.

EnvoyFilter for custom plugins (e.g., metrics, rate limiting).
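
To make the roles of these resources more concrete, the sketch below builds a VirtualService and a DestinationRule as plain Python dictionaries that could be serialized and applied to a cluster. The field names follow the Istio networking API, but every value (resource names, hosts, port, the `x-tenant` header, retry and ejection settings) is a hypothetical placeholder, not the platform's actual configuration.

```python
import json


def virtual_service(name: str, namespace: str, gateway: str,
                    host: str, backend: str, port: int, attempts: int = 2) -> dict:
    """L7 routing: match a request, set a header, route it, and retry on failure."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "gateways": [gateway],
            "http": [{
                "match": [{"uri": {"prefix": "/"}}],
                # Header manipulation: tag the request with its tenant (hypothetical header).
                "headers": {"request": {"set": {"x-tenant": namespace}}},
                "route": [{"destination": {"host": backend, "port": {"number": port}}}],
                "retries": {
                    "attempts": attempts,
                    "perTryTimeout": "2s",
                    "retryOn": "5xx,reset,connect-failure",
                },
            }],
        },
    }


def destination_rule(name: str, namespace: str, backend: str) -> dict:
    """Load-balancing strategy plus circuit breaking through outlier detection."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "host": backend,
            "trafficPolicy": {
                "loadBalancer": {"simple": "LEAST_REQUEST"},
                "outlierDetection": {
                    "consecutive5xxErrors": 5,
                    "interval": "10s",
                    "baseEjectionTime": "30s",
                    "maxEjectionPercent": 50,
                },
            },
        },
    }


# Hypothetical tenant "nlp" exposing one model service behind a shared gateway.
vs = virtual_service("demo-model", "nlp", "istio-system/shared-gateway",
                     "demo-model.nlp.example.internal",
                     "demo-model.nlp.svc.cluster.local", 8500)
dr = destination_rule("demo-model", "nlp", "demo-model.nlp.svc.cluster.local")
print(json.dumps(vs, indent=2))
print(json.dumps(dr, indent=2))
```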

Multi‑tenant isolation is achieved by deploying separate Gateway instances per namespace where needed, while sharing a gateway for low‑traffic namespaces.
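
One way to picture this split is through the Gateway resource's `selector`, which binds the resource to a particular ingress gateway deployment. The sketch below assumes a traffic threshold decides whether a namespace gets a dedicated deployment. The threshold value, the label scheme, and the host pattern are hypothetical, and the article does not state the platform's actual criteria.

```python
SHARED_SELECTOR = {"istio": "ingressgateway"}  # assumed label of the shared deployment


def gateway_for(namespace: str, peak_qps: float,
                dedicated_threshold_qps: float = 1000.0) -> dict:
    """Build a Gateway bound to a dedicated or shared ingress deployment.

    The threshold rule is an assumption made for this sketch.
    """
    dedicated = peak_qps >= dedicated_threshold_qps
    selector = ({"istio": f"ingressgateway-{namespace}"} if dedicated
                else dict(SHARED_SELECTOR))
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Gateway",
        "metadata": {"name": f"{namespace}-gateway", "namespace": namespace},
        "spec": {
            # The selector decides which Envoy pods serve this tenant's traffic.
            "selector": selector,
            "servers": [{
                "port": {"number": 80, "name": "http", "protocol": "HTTP"},
                "hosts": [f"*.{namespace}.example.internal"],
            }],
        },
    }


busy = gateway_for("search-ranking", peak_qps=5000)
quiet = gateway_for("sandbox", peak_qps=20)
print(busy["spec"]["selector"], quiet["spec"]["selector"])
```

With a layout like this, a traffic surge in a high‑volume namespace consumes only its own gateway pods, while low‑traffic namespaces avoid the cost of running a separate deployment each.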

Adaptive Rate Limiting

Global rate limiting requires each request to consult a shared rate‑limit service, which adds a network hop on the request path. To avoid that cost, the platform uses a local token‑bucket limiter configured through EnvoyFilter. The system monitors each task's replica count through the platform’s observability stack, debounces changes, computes the total QPS (replicas × per‑replica QPS), and updates the EnvoyFilter configuration via the K8s Manager Service.
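
The loop described above (observe replicas, debounce, recompute, update) can be sketched as follows. The debounce rule, class and function names, labels, and numbers are assumptions. The EnvoyFilter layout follows Envoy's HTTP local rate limit filter, and this version applies a single bucket to the whole gateway listener, which may differ from how the platform scopes its limits.

```python
from typing import Optional


class ReplicaDebouncer:
    """Report a replica count only after it has stayed unchanged for quiet_seconds."""

    def __init__(self, quiet_seconds: float) -> None:
        self.quiet_seconds = quiet_seconds
        self._candidate: Optional[int] = None
        self._since = 0.0
        self._applied: Optional[int] = None

    def observe(self, now: float, replicas: int) -> Optional[int]:
        if replicas != self._candidate:
            # Value changed: restart the quiet period.
            self._candidate, self._since = replicas, now
            return None
        if replicas != self._applied and now - self._since >= self.quiet_seconds:
            self._applied = replicas
            return replicas
        return None


def local_rate_limit_filter(name: str, namespace: str, gateway_labels: dict,
                            replicas: int, per_replica_qps: int) -> dict:
    """Build an EnvoyFilter whose token bucket refills total_qps tokens per second."""
    total_qps = replicas * per_replica_qps  # formula given in the article
    always = {"default_value": {"numerator": 100, "denominator": "HUNDRED"}}
    limiter = {
        "stat_prefix": "http_local_rate_limiter",
        "token_bucket": {"max_tokens": total_qps, "tokens_per_fill": total_qps,
                         "fill_interval": "1s"},
        "filter_enabled": {"runtime_key": "local_rate_limit_enabled", **always},
        "filter_enforced": {"runtime_key": "local_rate_limit_enforced", **always},
    }
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "EnvoyFilter",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "workloadSelector": {"labels": gateway_labels},
            "configPatches": [{
                "applyTo": "HTTP_FILTER",
                "match": {"context": "GATEWAY", "listener": {"filterChain": {"filter": {
                    "name": "envoy.filters.network.http_connection_manager"}}}},
                "patch": {"operation": "INSERT_BEFORE", "value": {
                    "name": "envoy.filters.http.local_ratelimit",
                    "typed_config": {
                        "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
                        "type_url": "type.googleapis.com/envoy.extensions.filters"
                                    ".http.local_ratelimit.v3.LocalRateLimit",
                        "value": limiter,
                    },
                }},
            }],
        },
    }


# Hypothetical stream of (timestamp in seconds, observed replica count).
debouncer = ReplicaDebouncer(quiet_seconds=30)
updates = []
for now, replicas in [(0, 4), (10, 5), (20, 4), (40, 4), (50, 4), (60, 4)]:
    stable = debouncer.observe(now, replicas)
    if stable is not None:
        updates.append(local_rate_limit_filter(
            "demo-ratelimit", "istio-system", {"istio": "ingressgateway"}, stable, 50))
print(len(updates), updates[0]["spec"]["configPatches"][0]["patch"]["value"]
      ["typed_config"]["value"]["token_bucket"])
```

The debounce step matters because a scaling event can change the replica count several times in quick succession, and each change would otherwise push a new configuration to the gateway.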

Model Warm‑Up (Zero‑Loss Deployment)

New inference nodes load model files on the first request, causing latency spikes. The platform introduces a warm‑up flow that triggers model loading before the node receives traffic. This is achieved by configuring Kubernetes Startup and Readiness probes: the Readiness probe’s initialDelaySeconds is set to the desired warm‑up duration, which delays the first readiness check. As long as warm‑up finishes within that window, the container is marked ready, and starts receiving traffic, only after the model is loaded.
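
As a rough sketch of such a probe setup, the function below produces the probe section of a container spec. The probe field names are standard Kubernetes fields, while the health‑check paths, port, periods, and time budgets are hypothetical values chosen for the example.

```python
def warmup_probes(warmup_seconds: int, port: int = 8080,
                  startup_budget_seconds: int = 300) -> dict:
    """Return startup and readiness probes for an inference container.

    The paths /healthz and /ready are placeholders; a real model server
    exposes its own health endpoints.
    """
    period = 10
    return {
        "startupProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": period,
            # Allow up to startup_budget_seconds for the process to come up:
            # budget = failureThreshold * periodSeconds (rounded up).
            "failureThreshold": -(-startup_budget_seconds // period),
        },
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            # Hold off the first readiness check for the warm-up window.
            "initialDelaySeconds": warmup_seconds,
            "periodSeconds": 5,
            "failureThreshold": 3,
        },
    }


probes = warmup_probes(warmup_seconds=60, startup_budget_seconds=95)
print(probes)
```

A fixed delay is only a lower bound on readiness. If the readiness endpoint itself reports success only once warm‑up requests have completed, the pod stays out of rotation even when loading takes longer than expected.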

Model‑specific warm‑up clients are generated from a declarative configuration file, allowing algorithms to simply upload the config and let the platform handle the warm‑up requests.
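
A declarative warm‑up config might look something like the following sketch. The schema (`endpoint`, `rounds`, `requests`) and the example payloads are invented for illustration, since the article does not publish the real format. The transport is passed in as a function so the same logic could sit on top of HTTP or gRPC.

```python
import json
from typing import Callable

# Hypothetical config an algorithm engineer would upload alongside the model.
WARMUP_CONFIG = {
    "endpoint": "/v1/models/demo:predict",
    "rounds": 3,
    "requests": [
        {"inputs": [[0.0, 0.0, 0.0]]},
        {"inputs": [[1.0, 1.0, 1.0]]},
    ],
}


def run_warmup(config: dict, send: Callable[[str, bytes], int]) -> int:
    """Send every configured request `rounds` times; return the number of 200 responses."""
    succeeded = 0
    for _ in range(config.get("rounds", 1)):
        for payload in config["requests"]:
            body = json.dumps(payload).encode("utf-8")
            if send(config["endpoint"], body) == 200:
                succeeded += 1
    return succeeded


# Stand-in transport that records calls instead of touching the network.
sent = []


def fake_send(path: str, body: bytes) -> int:
    sent.append((path, body))
    return 200


ok = run_warmup(WARMUP_CONFIG, fake_send)
print(f"{ok}/{len(sent)} warm-up requests succeeded")
```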

Observability Construction

Observability is built on three pillars: Metrics, Logs, and Traces. The platform collects:

Service‑level metrics from structured JSON gateway logs, processed by Flink to produce multi‑dimensional, hierarchical metrics stored in Elasticsearch and Kafka.

Resource metrics from cAdvisor, scraped by Prometheus and forwarded to Kafka via a Prometheus‑Kafka adapter.

Logs (both structured gateway logs and unstructured inference service logs) shipped to Kafka and visualized through the ELK stack.
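
To show what "multi‑dimensional, hierarchical" aggregation over gateway logs can mean in practice, here is a small plain‑Python stand‑in for the Flink job. It is not Flink code, and the log field names (`ts`, `namespace`, `task`, `status`, `latency_ms`) and the one‑minute window are assumptions about the log schema.

```python
import json
from collections import defaultdict


def aggregate(log_lines: list[str], window_seconds: int = 60) -> dict:
    """Roll JSON gateway logs up per window at two levels: namespace, and namespace + task."""
    raw = defaultdict(lambda: {"count": 0, "errors": 0, "latency_ms_sum": 0.0})
    for line in log_lines:
        event = json.loads(line)
        window = event["ts"] // window_seconds * window_seconds
        # Each event feeds one bucket per level of the hierarchy.
        keys = [
            (window, event["namespace"]),
            (window, event["namespace"], event["task"]),
        ]
        for key in keys:
            bucket = raw[key]
            bucket["count"] += 1
            bucket["errors"] += int(event["status"] >= 500)
            bucket["latency_ms_sum"] += event["latency_ms"]
    return {
        key: {
            "count": b["count"],
            "errors": b["errors"],
            "avg_latency_ms": b["latency_ms_sum"] / b["count"],
        }
        for key, b in raw.items()
    }


# Three hypothetical gateway log records.
logs = [
    json.dumps({"ts": 120, "namespace": "nlp", "task": "a", "status": 200, "latency_ms": 10}),
    json.dumps({"ts": 130, "namespace": "nlp", "task": "b", "status": 500, "latency_ms": 30}),
    json.dumps({"ts": 185, "namespace": "nlp", "task": "a", "status": 200, "latency_ms": 20}),
]
metrics = aggregate(logs)
for key in sorted(metrics, key=str):
    print(key, metrics[key])
```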

Grafana dashboards provide real‑time visibility, supporting alerting, traffic‑management decisions, and auto‑scaling.

Results and Conclusion

The migration to the Istio‑based 2.0 architecture reduced inference latency by more than 50%, improved stability through resource isolation, and added richer traffic‑management features that simplify operations. Ongoing work focuses on leveraging newer K8s and Istio capabilities to further enhance performance and observability.
