
Coeus: Bilibili's Cloud‑Native AI Platform and the PyTorch Training Performance Tuning Handbook

The article introduces Coeus, Bilibili's cloud‑native AI platform built on Kubernetes with Alluxio integration, explains how it solves major data and compute challenges, improves training performance, and promotes a free PyTorch performance‑tuning guide for engineers.

DataFunTalk

Coeus is Bilibili’s self‑developed cloud‑native artificial‑intelligence platform that supports a wide range of scenarios such as advertising, resume analysis, NLP, speech, and e‑commerce.

From a functional perspective, Coeus provides model development, model training, model storage, and model serving capabilities.

The platform runs on Kubernetes and integrates many cloud‑native components, including Volcano, VPA, the in‑house observability system Hawkeye, Alluxio, and Fluid.

Alluxio acts as a bridge between underlying storage systems (OSS and HDFS) and AI applications built on PyTorch or TensorFlow, enabling video and image training jobs.

Using Alluxio as an intermediate layer between compute and storage helped Bilibili overcome four major challenges: container crashes, the need to modify application code to access OSS/HDFS, data size exceeding a single machine’s capacity, and slow repeated remote data fetches.
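The core idea behind that intermediate layer can be illustrated with a minimal, hedged sketch in plain Python. This is not Alluxio's actual API (Alluxio typically exposes cached data transparently, e.g. via a POSIX/FUSE mount); the sketch only mirrors the read‑through caching concept: a remote object from OSS/HDFS is fetched once, and every later read is served from local disk, which is what eliminates the slow repeated remote fetches mentioned above.

```python
import hashlib
import tempfile
from pathlib import Path


class ReadThroughCache:
    """Illustrative read-through cache: fetch a remote object once,
    then serve all subsequent reads from local disk. Alluxio provides
    this behavior transparently; this class only mirrors the concept."""

    def __init__(self, cache_dir, fetch_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_fn = fetch_fn  # hypothetical downloader for OSS/HDFS objects
        self.remote_reads = 0     # counts how often remote storage was hit

    def _local_path(self, key):
        # Hash the remote key so any URI maps to a safe local filename.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.cache_dir / digest

    def read(self, key):
        path = self._local_path(key)
        if not path.exists():
            self.remote_reads += 1
            path.write_bytes(self.fetch_fn(key))  # slow remote fetch, done once
        return path.read_bytes()


# Usage: the second read of the same key never touches remote storage.
with tempfile.TemporaryDirectory() as d:
    cache = ReadThroughCache(d, fetch_fn=lambda k: f"bytes-of-{k}".encode())
    a = cache.read("oss://bucket/sample.jpg")
    b = cache.read("oss://bucket/sample.jpg")
    assert a == b and cache.remote_reads == 1
```

Because the training framework only ever sees local file reads, no application code needs to change to access OSS or HDFS, which is exactly the second challenge the article lists.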

These improvements tripled the performance of Bilibili's machine‑learning workloads, lowered infrastructure costs, and improved model‑training quality. Other leading companies, such as Alipay and Zhihu, have also adopted Alluxio to address low training efficiency, high cost, poor reliability, and limited scalability.

The article also offers a free downloadable “PyTorch Model Training Performance Tuning Handbook,” which covers PyTorch fundamentals, factors affecting training performance, step‑by‑step optimization techniques, code examples that can reduce epoch time to one‑tenth of the original, and a case study of using Alluxio as a data‑access layer in production.
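One recurring optimization in such tuning guides is overlapping data loading with computation, which is what PyTorch's `DataLoader` achieves with worker processes (`num_workers > 0`). The following is a hedged, stdlib‑only sketch of that idea, not the handbook's actual code: while the current batch is being processed, the next batch is prefetched in a background thread, so I/O latency is hidden behind compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_batch(i):
    """Stand-in for a slow, I/O-bound batch load (e.g., decoding images)."""
    time.sleep(0.05)
    return list(range(i * 4, i * 4 + 4))


def train_step(batch):
    """Stand-in for compute on one batch (e.g., a forward/backward pass)."""
    return sum(batch)


def run_pipelined(num_batches):
    """Prefetch the next batch while the current one trains, mirroring
    what DataLoader worker processes do for a PyTorch training loop."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_batch, 0)
        for i in range(num_batches):
            batch = future.result()
            if i + 1 < num_batches:
                # Overlap the next load with this batch's compute.
                future = pool.submit(load_batch, i + 1)
            results.append(train_step(batch))
    return results


# Usage: three pipelined steps over batches [0..3], [4..7], [8..11].
print(run_pipelined(3))  # → [6, 22, 38]
```

In real PyTorch code the same effect comes from `DataLoader(dataset, num_workers=N, pin_memory=True)`; the sketch just makes the overlap explicit.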

The handbook is intended for AI/ML platform engineers, data platform engineers, backend engineers, MLOps engineers, SREs, architects, and machine‑learning engineers who want to master PyTorch performance tuning.

Acknowledgments are given to translators and Alluxio community volunteers.

Tags: cloud native, Kubernetes, Performance Tuning, PyTorch, AI Platform, Alluxio
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
