Architecture of Tencent Cloud AI Platform (YunZhiTianshu) and AI Practices on Kubernetes

The article details the architecture of Tencent Cloud's YunZhiTianshu AI platform, which spans Docker/Kubernetes infrastructure, storage, six micro-service groups, and API and message gateways. It also explains the core module designs (unified algorithm packaging, device abstraction, and data abstraction) and practical Kubernetes deployment techniques for GPU-accelerated AI workloads, including monitoring, scaling, and security.

Tencent Cloud Developer

Sep 20, 2019

The article reports on a technical salon hosted by Tencent Cloud Community on September 7, where five Tencent Cloud experts presented AI technology principles and practices, focusing on the YunZhiTianshu AI service platform, OCR, NLP, machine learning, and intelligent dialogue services.

Speaker Huang Wencai, who joined Tencent in 2010 and has experience in large‑scale distributed systems, serves as the chief architect of the YunZhiTianshu platform.

The content is divided into three parts: (1) the overall architecture of the YunZhiTianshu platform, (2) the design of each core module, and (3) practical experience of running AI workloads on Kubernetes (K8s).

YunZhiTianshu Platform Architecture

The platform provides a full‑stack AI service environment that supports rapid integration of algorithms, data, and intelligent devices, with visual orchestration tools for service and resource management. It offers standardized APIs and continuous integration of AI service components to accelerate AI application development.

Typical usage scenario: an application calls the API gateway, which creates a face‑structuring task in the task manager. The task fetches images from a device service, invokes a face‑attribute service, stores structured data via the data center, and pushes results through a message component that applications can subscribe to.
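The flow above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: `MessageBus`, `fetch_image`, `face_attributes`, and `run_face_structuring_task` are hypothetical names, the service calls return canned data, and the topic name is invented. None of it reflects the platform's actual APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MessageBus:
    """In-memory stand-in for the message component (hypothetical)."""
    subscribers: Dict[str, List[Callable[[dict], None]]] = field(default_factory=dict)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self.subscribers.get(topic, []):
            handler(message)


def fetch_image(device_id: str) -> bytes:
    # Hypothetical device-service call; returns fake image bytes.
    return f"frame-from-{device_id}".encode()


def face_attributes(image: bytes) -> dict:
    # Hypothetical face-attribute service; returns a canned result.
    return {"faces": 1, "image_size": len(image)}


def run_face_structuring_task(device_id: str, store: list, bus: MessageBus) -> dict:
    """Fetch an image, structure it, store the record, then notify subscribers."""
    image = fetch_image(device_id)
    record = {"device_id": device_id, **face_attributes(image)}
    store.append(record)                     # stand-in for the data center
    bus.publish("face-structuring", record)  # stand-in for the message push
    return record
```

An application would subscribe to the topic before the task runs, for example `bus.subscribe("face-structuring", results.append)`, and then receive each structured record as it is published.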

Three‑layer Architecture

From bottom to top: the infrastructure layer is built on Docker, Kubernetes, and Tencent’s CI/CD pipeline; the storage layer uses MySQL, Kafka, InfluxDB, COS/Ceph, Elasticsearch, etc.; the middle layer consists of six micro‑service groups:

Algorithm repository – self‑service image building for algorithms and models.

Device center – unified device onboarding for cameras, AI cameras, etc.

Data center – data ingestion, transformation, storage, and abstraction of storage media.

AI studio – task scheduling and workflow orchestration (12+ industry applications, 30+ generic components).

Application center – app creation, key management, subscription, and media library.

Management center – accounts, roles, image repository, audit logs, etc. The modules are decoupled from one another through Kafka.

The top layer is the gateway, split into an API gateway (Tencent Cloud API 3.0 standards) and a message gateway (supporting gRPC and HTTP push, with monitoring via Telegraf, InfluxDB, Grafana, and ELK).

Core Module Design

AI studio comprises three parts: platform integration system, workflow engine, and function service. The function service runs user‑provided Python snippets to perform data conversion between services, with security checks that block unsafe packages.
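One way such a security check could work is a static scan of the snippet's imports before it runs. The sketch below is an assumption, not the platform's actual mechanism: the blocklist contents and the name `find_blocked_imports` are hypothetical, and a static import scan alone is not a sandbox (dynamic imports such as `__import__` would slip past it).

```python
import ast

# Hypothetical blocklist; the source does not say which packages are blocked.
BLOCKED_MODULES = {"os", "subprocess", "socket", "ctypes", "sys"}


def find_blocked_imports(source: str, blocked=BLOCKED_MODULES) -> list:
    """Return the sorted top-level module names in `source` that are blocked."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root in blocked:
                found.add(root)
    return sorted(found)
```

A function service could refuse to run any snippet for which this returns a non-empty list, and layer process-level isolation on top.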

Algorithm integration challenges (many vendors, differing protocols, high onboarding cost) are addressed by a unified image-building platform that lets users unfamiliar with Docker create algorithm micro-services through a web UI. Optimizations include pre-built base images with common dependencies such as GCC, CUDA, and Boost; minimal Alpine-based images; and merging RUN commands into a single instruction to reduce the number of image layers.
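Because each RUN instruction in a Dockerfile produces its own image layer, chaining consecutive commands with `&&` keeps the layer count down. The helper below is a minimal sketch of that merge, assuming every instruction sits on a single line with no backslash continuations. The function name `merge_run_instructions` is hypothetical, and the source does not describe how the platform implements this step.

```python
def merge_run_instructions(dockerfile: str) -> str:
    """Collapse runs of consecutive single-line RUN instructions into one."""
    merged = []
    pending = []  # commands collected from consecutive RUN lines

    def flush():
        if pending:
            merged.append("RUN " + " && ".join(pending))
            pending.clear()

    for line in dockerfile.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("RUN "):
            pending.append(stripped[4:].strip())
        else:
            flush()  # any other instruction ends the current group
            merged.append(line)
    flush()
    return "\n".join(merged)
```

Given two adjacent lines `RUN apk add --no-cache gcc` and `RUN apk add --no-cache make`, the helper emits a single `RUN apk add --no-cache gcc && apk add --no-cache make`.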

Device center handles heterogeneous protocols (ONVIF, ISAPI, GB28181, proprietary SDKs) by abstracting them into three micro‑service layers: upper‑level service logic (base image), adaptation logic (SO plugins), and the proprietary SDK itself. HTTP interfaces with varying request/response formats are also handled by the function service.
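The layering can be pictured as a common interface with one adapter per protocol. The platform itself uses SO plugins for the adaptation layer. The Python below is only an analogy of that idea, and every name in it (`DeviceAdapter`, `OnvifAdapter`, `Gb28181Adapter`, `get_snapshot`) is hypothetical, with fake return values in place of real protocol calls.

```python
from abc import ABC, abstractmethod


class DeviceAdapter(ABC):
    """Interface the upper-level service logic programs against."""

    @abstractmethod
    def snapshot(self, device_address: str) -> bytes:
        """Return one frame from the device."""


class OnvifAdapter(DeviceAdapter):
    def snapshot(self, device_address: str) -> bytes:
        # A real adapter would speak ONVIF here; this returns fake bytes.
        return b"onvif:" + device_address.encode()


class Gb28181Adapter(DeviceAdapter):
    def snapshot(self, device_address: str) -> bytes:
        # A real adapter would speak GB28181 here; this returns fake bytes.
        return b"gb28181:" + device_address.encode()


# Registry that plays the role of plugin loading in this sketch.
ADAPTERS = {"onvif": OnvifAdapter, "gb28181": Gb28181Adapter}


def get_snapshot(protocol: str, device_address: str) -> bytes:
    """Upper-level logic: pick the adapter by protocol and delegate to it."""
    try:
        adapter_cls = ADAPTERS[protocol]
    except KeyError:
        raise ValueError(f"unsupported protocol: {protocol}") from None
    return adapter_cls().snapshot(device_address)
```

Supporting another protocol then means registering one more adapter, while callers of `get_snapshot` stay unchanged.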

Data center abstracts storage differences (Ceph, COS, NAS) via a FileAgent sidecar container, exposing a uniform file interface to other modules and decoupling them from underlying storage.
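A rough sketch of the uniform file interface follows, assuming the agent simply delegates to a pluggable backend. The class names and the `put`/`get`/`write`/`read` methods are hypothetical, and the backends are local stand-ins rather than real COS, Ceph, or NAS clients.

```python
import os
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class ObjectStoreBackend(StorageBackend):
    """In-memory stand-in for an object store such as COS or Ceph."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]


class MountedDirBackend(StorageBackend):
    """Stand-in for a NAS share mounted at `root`."""

    def __init__(self, root: str) -> None:
        self._root = root

    def _path(self, key: str) -> str:
        # Flatten the key so the sketch needs no directory handling.
        return os.path.join(self._root, key.replace("/", "_"))

    def put(self, key: str, data: bytes) -> None:
        with open(self._path(key), "wb") as handle:
            handle.write(data)

    def get(self, key: str) -> bytes:
        with open(self._path(key), "rb") as handle:
            return handle.read()


class FileAgent:
    """Uniform interface other modules call, regardless of the backend."""

    def __init__(self, backend: StorageBackend) -> None:
        self._backend = backend

    def write(self, path: str, data: bytes) -> None:
        self._backend.put(path, data)

    def read(self, path: str) -> bytes:
        return self._backend.get(path)
```

Callers only ever see `FileAgent.write` and `FileAgent.read`, so swapping the storage medium becomes a deployment decision rather than a code change.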

Monitoring is built on the open-source stack Telegraf + InfluxDB + Grafana: a monitor_agent DaemonSet runs on every node and pushes metrics to InfluxDB, and Grafana visualizes them. A brief comparison with Prometheus is provided, noting exporter-based collection, UI limitations, and push vs. pull models.
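In a push model, the agent serializes each sample and sends it to the database itself. The sketch below formats one sample in InfluxDB's line protocol, which is the text format InfluxDB accepts for writes. The measurement, tag, and field names are hypothetical, and the source does not show what monitor_agent actually sends.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character in `chars` that occurs in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags sorted by key
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"
```

An agent would batch such lines and send them to InfluxDB's write endpoint. A pull-based system such as Prometheus instead scrapes metrics that exporters expose.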

AI on Kubernetes – Practical Experience

The article reviews the history of GPU computing and the CUDA architecture (libraries, runtime, driver), and explains why the driver and runtime versions must match inside containers. It outlines the steps to expose GPUs to containers (the --device flag or privileged mode), mount the CUDA driver API into the container, and bundle the runtime and libraries into images, and it covers using nvidia-container-runtime and nvidia-device-plugin for GPU scheduling and partitioning.
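Once the NVIDIA device plugin is running on a cluster, a pod requests GPUs through the extended resource name `nvidia.com/gpu` in its resource limits. The helper below builds such a manifest as a plain dictionary, as a minimal sketch. The function name `gpu_pod_manifest` and the pod and image names in the usage note are hypothetical, and the sketch assumes GPUs are requested as whole integer counts.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Return a Pod manifest that asks the scheduler for `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resource advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }
```

Serializing the result with `json.dumps(gpu_pod_manifest("face-attr", "example/face-attr:latest", gpus=1), indent=2)` yields a JSON manifest that kubectl can apply, since kubectl accepts JSON as well as YAML.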

GPU virtualization solutions (GRID and MPS) are discussed, including their trade‑offs in isolation, openness, and granularity of resource slicing.

Additional K8s challenges covered: containerizing stateful components (MySQL, Kafka), service discovery using Consul DNS, distributed tracing with Jaeger (and future Istio OpenTracing), auto‑scaling of GPU resources, load balancing (including consistent hashing), storage containerization, and security concerns (shared kernel, image and function security).
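Consistent hashing, mentioned above for load balancing, maps each request key to a backend so that adding or removing a backend only remaps the keys that backend owned. The ring below is a generic textbook sketch, not the platform's implementation. The class name, the choice of SHA-256, and the default of 100 virtual nodes per backend are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=(), replicas: int = 100) -> None:
        self.replicas = replicas  # virtual nodes per backend
        self._keys = []           # sorted ring positions
        self._owners = {}         # ring position -> backend name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:  # keep the first owner on a collision
                continue
            bisect.insort(self._keys, pos)
            self._owners[pos] = node

    def remove(self, node: str) -> None:
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                del self._keys[bisect.bisect_left(self._keys, pos)]

    def lookup(self, key: str) -> str:
        """Return the backend owning the first position clockwise of the key."""
        if not self._keys:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

With a ring built from a few backend names, repeated calls to `lookup` with the same key return the same backend, and removing one backend leaves the assignments of keys owned by the others untouched.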
