
Design and Architecture of the Weibo Deep Learning Platform

This article presents the design, architecture, and operational experience of Weibo's deep learning platform, covering its machine‑learning workflow, control center, distributed training cluster, and online prediction service, and explains how the platform accelerates development and improves business outcomes.

Architecture Digest

With the maturation of artificial neural network algorithms and the rise of GPU computing power, deep learning has achieved major breakthroughs across many fields. This article introduces Weibo's adoption of deep learning and the experience of building a deep learning platform, focusing on the machine‑learning workflow, control center, training cluster, and online prediction service.

Artificial intelligence endows machines with human‑like capabilities, and deep learning has dramatically expanded what AI can do, matching or even surpassing human performance on specific tasks in natural language understanding, image recognition, and speech recognition.

Deep learning frameworks such as TensorFlow, Caffe, MXNet, and PaddlePaddle provide modular building blocks that lower the barrier to entry, allowing developers to assemble models without implementing low‑level neural‑network primitives from scratch.

A deep learning platform integrates all stages—from data ingestion and processing to model training, evaluation, and deployment—thereby speeding up development cycles, sharing computational resources, and improving both model and business performance.

For context, the article briefly surveys other industry platforms: Tencent's DI‑X, Alibaba's PAI, and Baidu's deep learning platform, each offering multi‑framework support and cloud‑based training services.

Weibo's platform, launched in 2017, supports TensorFlow, Caffe, and other frameworks on a GPU‑enabled cloud infrastructure, providing a one‑stop solution for tasks such as feed CTR prediction, anti‑spam, image classification, and recommendation.

The platform architecture (see Figure 1 in the original article) consists of a machine‑learning workflow (WeiFlow), a control center (WeiCenter), a distributed training cluster, and an online prediction service (WeiServing).

WeiFlow implements a double‑layer DAG design: the outer DAG consists of nodes, each representing an inner DAG that runs on a specific engine (Spark, TensorFlow, Hive, Storm, Flink, etc.). Users define the workflow in XML, and WeiFlow automatically generates the task graph.
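The article does not show WeiFlow's internals, but the double‑layer DAG idea can be illustrated with a minimal Python sketch (all node and engine names below are hypothetical, not from Weibo's system): the outer DAG orders engine‑level stages, and each node stands in for an inner DAG executed by a specific engine.

```python
from collections import defaultdict, deque

class DAG:
    """Minimal DAG with Kahn's-algorithm topological ordering."""
    def __init__(self):
        self.edges = defaultdict(list)
        self.nodes = set()

    def add_edge(self, src, dst):
        self.nodes.update((src, dst))
        self.edges[src].append(dst)

    def topo_order(self):
        indegree = {n: 0 for n in self.nodes}
        for dsts in self.edges.values():
            for d in dsts:
                indegree[d] += 1
        ready = deque(n for n, deg in indegree.items() if deg == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for d in self.edges[node]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    ready.append(d)
        if len(order) != len(self.nodes):
            raise ValueError("cycle detected in workflow")
        return order

# Outer DAG: each node represents an engine-specific inner DAG.
outer = DAG()
outer.add_edge("spark_etl", "tf_train")
outer.add_edge("tf_train", "evaluate")

# Which engine runs each node's inner DAG (illustrative only).
engine_of = {"spark_etl": "Spark", "tf_train": "TensorFlow", "evaluate": "Spark"}

for node in outer.topo_order():
    print(f"run inner DAG '{node}' on engine {engine_of[node]}")
```

In the real platform the outer graph is derived from a user's XML definition rather than built in code, but the execution order follows the same topological‑sort logic.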

WeiCenter offers job management, data management, and scheduling management, simplifying resource allocation across YARN, Mesos, and Kubernetes, and handling job priorities, resource usage, and fault recovery.
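The priority handling WeiCenter performs can be sketched with a toy priority scheduler (the job names and priority scheme here are assumptions for illustration, not WeiCenter's actual API):

```python
import heapq
import itertools

class JobScheduler:
    """Toy priority scheduler: lower number = higher priority;
    the monotonically increasing counter keeps FIFO order among
    jobs of equal priority."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, job, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def next_job(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

sched = JobScheduler()
sched.submit("nightly_ctr_train", priority=5)
sched.submit("online_model_hotfix", priority=1)
sched.submit("adhoc_experiment", priority=5)

print(sched.next_job())  # prints "online_model_hotfix": highest priority wins
```

A production scheduler additionally tracks per‑job resource quotas and restarts failed jobs, which the article attributes to WeiCenter's fault‑recovery layer.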

The training cluster emphasizes single‑machine multi‑GPU servers, distributed training with TensorFlow parameter servers, 10 Gb Ethernet networking, HDFS for shared storage, and a custom scheduler that gracefully terminates parameter‑server processes.
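The parameter‑server pattern the cluster uses for distributed TensorFlow training can be demonstrated in plain Python, stripped of any framework (this is a conceptual sketch, not Weibo's implementation): each worker computes gradients on its own data shard, and the parameter server averages them and applies one SGD step.

```python
# Data-parallel parameter-server pattern: workers compute gradients
# on their shards; the parameter server aggregates and updates weights.

def worker_gradient(w, shard):
    # Gradient of mean squared error for the model y = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def ps_step(w, grads, lr=0.05):
    # Parameter server averages worker gradients and applies one SGD step.
    return w - lr * sum(grads) / len(grads)

# Two workers, each holding a shard of data drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]

w = 0.0
for _ in range(200):
    grads = [worker_gradient(w, s) for s in shards]
    w = ps_step(w, grads)

print(round(w, 3))  # converges to 3.0
```

In the real cluster the "workers" and "parameter server" are separate processes communicating over the network, which is why the article highlights 10 Gb Ethernet and a scheduler that can terminate long‑lived parameter‑server processes cleanly.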

WeiServing provides online model prediction with diverse feature‑processing functions, multi‑model and multi‑version support via Docker/Kubernetes isolation, a distributed parameter service (WeiParam), and integration with both offline file storage and real‑time streaming sources.
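The multi‑model, multi‑version routing WeiServing provides can be sketched as a small registry (the class, model names, and version‑selection rule below are illustrative assumptions, not WeiServing's actual interface):

```python
class ModelRegistry:
    """Toy multi-model, multi-version registry: each (model, version)
    pair maps to a predict function; requests without an explicit
    version fall back to the newest registered version."""
    def __init__(self):
        self._models = {}  # name -> {version: predict_fn}

    def register(self, name, version, predict_fn):
        self._models.setdefault(name, {})[version] = predict_fn

    def predict(self, name, features, version=None):
        versions = self._models[name]
        if version is None:
            version = max(versions)  # default to the newest version
        return versions[version](features)

registry = ModelRegistry()
registry.register("ctr", 1, lambda f: sum(f) + 1)
registry.register("ctr", 2, lambda f: sum(f) + 2)

print(registry.predict("ctr", [1, 2]))             # newest version (v2) -> 5
print(registry.predict("ctr", [1, 2], version=1))  # pinned to v1 -> 4
```

In production each version would run in its own Docker container for isolation, as the article describes, with the registry acting as the routing layer in front of them.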

In summary, the Weibo deep learning platform unifies workflow, resource management, and service deployment, greatly enhancing development efficiency and accelerating business iteration.

Tags: AI, deep learning, platform architecture, distributed training, online prediction, machine learning workflow
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.