
Coeus: Bilibili's Cloud‑Native AI Platform and the PyTorch Training Performance Tuning Handbook

The article introduces Coeus, Bilibili's cloud‑native AI platform built on Kubernetes with Alluxio integration, explains how it solves major data and compute challenges, improves training performance, and promotes a free PyTorch performance‑tuning guide for engineers.

DataFunTalk

Coeus is Bilibili’s self‑developed cloud‑native artificial‑intelligence platform that supports a wide range of scenarios such as advertising, resume analysis, NLP, speech, and e‑commerce.

From a functional perspective, Coeus provides model development, model training, model storage, and model serving capabilities.

The platform runs on Kubernetes and integrates many cloud‑native components, including Volcano, VPA, the in‑house observability system Hawkeye, Alluxio, and Fluid.

Alluxio acts as a bridge between underlying storage systems (OSS and HDFS) and AI applications built on PyTorch or TensorFlow, enabling video and image training jobs.

Using Alluxio as an intermediate layer between compute and storage helped Bilibili overcome four major challenges: container crashes, the need to modify application code to access OSS/HDFS, data size exceeding a single machine’s capacity, and slow repeated remote data fetches.
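The core idea behind that intermediate layer can be illustrated with a minimal, hedged sketch in plain Python. This is not Alluxio's actual API (Alluxio typically exposes cached data transparently, e.g. via a POSIX/FUSE mount); the sketch only mirrors the read‑through caching concept: a remote object from OSS/HDFS is fetched once, and every later read is served from local disk, which is what eliminates the slow repeated remote fetches mentioned above.

```python
import hashlib
import tempfile
from pathlib import Path


class ReadThroughCache:
    """Illustrative read-through cache: fetch a remote object once,
    then serve all subsequent reads from local disk. Alluxio provides
    this behavior transparently; this class only mirrors the concept."""

    def __init__(self, cache_dir, fetch_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_fn = fetch_fn  # hypothetical downloader for OSS/HDFS objects
        self.remote_reads = 0     # counts how often remote storage was hit

    def _local_path(self, key):
        # Hash the remote key so any URI maps to a safe local filename.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.cache_dir / digest

    def read(self, key):
        path = self._local_path(key)
        if not path.exists():
            self.remote_reads += 1
            path.write_bytes(self.fetch_fn(key))  # slow remote fetch, done once
        return path.read_bytes()


# Usage: the second read of the same key never touches remote storage.
with tempfile.TemporaryDirectory() as d:
    cache = ReadThroughCache(d, fetch_fn=lambda k: f"bytes-of-{k}".encode())
    a = cache.read("oss://bucket/sample.jpg")
    b = cache.read("oss://bucket/sample.jpg")
    assert a == b and cache.remote_reads == 1
```

Because the training framework only ever sees local file reads, no application code needs to change to access OSS or HDFS, which is exactly the second challenge the article lists.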

These improvements tripled the performance of Bilibili's machine‑learning workloads, lowered infrastructure costs, and improved model‑training quality. Other leading companies, such as Alipay and Zhihu, have also adopted Alluxio to address low training efficiency, high cost, poor reliability, and limited scalability.

The article also offers a free downloadable “PyTorch Model Training Performance Tuning Handbook,” which covers PyTorch fundamentals, factors affecting training performance, step‑by‑step optimization techniques, code examples that can reduce epoch time to one‑tenth of the original, and a case study of using Alluxio as a data‑access layer in production.
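One recurring optimization in such tuning guides is overlapping data loading with computation, which is what PyTorch's `DataLoader` achieves with worker processes (`num_workers > 0`). The following is a hedged, stdlib‑only sketch of that idea, not the handbook's actual code: while the current batch is being processed, the next batch is prefetched in a background thread, so I/O latency is hidden behind compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_batch(i):
    """Stand-in for a slow, I/O-bound batch load (e.g., decoding images)."""
    time.sleep(0.05)
    return list(range(i * 4, i * 4 + 4))


def train_step(batch):
    """Stand-in for compute on one batch (e.g., a forward/backward pass)."""
    return sum(batch)


def run_pipelined(num_batches):
    """Prefetch the next batch while the current one trains, mirroring
    what DataLoader worker processes do for a PyTorch training loop."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_batch, 0)
        for i in range(num_batches):
            batch = future.result()
            if i + 1 < num_batches:
                # Overlap the next load with this batch's compute.
                future = pool.submit(load_batch, i + 1)
            results.append(train_step(batch))
    return results


# Usage: three pipelined steps over batches [0..3], [4..7], [8..11].
print(run_pipelined(3))  # → [6, 22, 38]
```

In real PyTorch code the same effect comes from `DataLoader(dataset, num_workers=N, pin_memory=True)`; the sketch just makes the overlap explicit.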

The handbook is intended for AI/ML platform engineers, data platform engineers, backend engineers, MLOps engineers, SREs, architects, and machine‑learning engineers who want to master PyTorch performance tuning.

Acknowledgments are given to translators and Alluxio community volunteers.

Tags: cloud native, Kubernetes, Performance Tuning, PyTorch, AI Platform, Alluxio
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
