
Practical Experience of Bilibili's Big Data Cluster Mixed Deployment Architecture

This article details Bilibili's offline big‑data cluster challenges, the mixed‑deployment architecture that combines offline and online resources, the Amiya service's over‑commit and eviction mechanisms, performance optimizations, monitoring strategies, and future plans to further improve resource utilization and scheduling.

DataFunSummit

Sep 2, 2023


Background: Bilibili's offline platform faces rapid cluster growth and resource shortage, prompting the need for higher utilization without additional hosts.

Mixed deployment architecture: offline machines host multiple components (compute, transcoding, storage), and offline tasks are also scheduled onto online clusters during off-peak periods. This yields four scenarios: offline-offline mixing, offline-online mixing, online-offline mixing, and tidal mixing.

Implementation and benefits: The Amiya service provides dynamic over‑commit, task eviction, and node monitoring. Its modules include StateStore Manager, CheckPoint Manager, NodeResource Manager, Operator Manager, Inspect Manager, and Audit Manager, enabling accurate resource profiling, intelligent adjustment, and fault tolerance.

NodeResource Manager decides when to raise or lower a node's resource quota based on measured usage, while Inspect Manager monitors node health (CPU, memory, OOM events, disk I/O) and triggers evictions when a node is under pressure. Integration with Kubernetes (K8s), YARN, and Spark allows flexible scheduling and label-based resource pools.
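The two decisions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Amiya's actual logic: the field names, the 10% step, the 2x over-commit cap, and the lowest-priority-first eviction order are all hypothetical. The 0.85 default borrows the utilization target mentioned in the Q&A.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    """Snapshot of one node; all field names are hypothetical."""
    physical_cores: float     # cores physically present on the host
    advertised_cores: float   # cores currently advertised to the scheduler
    cpu_utilization: float    # measured utilization, 0.0 to 1.0


def adjust_quota(node, target=0.85, step=0.1, max_ratio=2.0):
    """Return the new advertised core count for a node.

    Grow the quota while utilization is below the target, shrink it when
    above. The result never drops below physical capacity and never
    exceeds max_ratio times physical capacity (assumed bounds).
    """
    if node.cpu_utilization < target:
        proposed = node.advertised_cores * (1 + step)
    elif node.cpu_utilization > target:
        proposed = node.advertised_cores * (1 - step)
    else:
        proposed = node.advertised_cores
    lower = node.physical_cores
    upper = node.physical_cores * max_ratio
    return min(max(proposed, lower), upper)


@dataclass
class Task:
    name: str
    priority: int      # larger means more important (assumed convention)
    memory_gb: float


def pick_evictions(tasks, memory_to_free_gb):
    """Pick lowest-priority tasks until enough memory would be freed.

    Ties on priority are broken by evicting the larger task first.
    """
    chosen = []
    freed = 0.0
    for task in sorted(tasks, key=lambda t: (t.priority, -t.memory_gb)):
        if freed >= memory_to_free_gb:
            break
        chosen.append(task.name)
        freed += task.memory_gb
    return chosen
```

With these assumptions, an idle node has its quota raised step by step until utilization approaches the target, and a node under memory pressure sheds its least important work first.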

Optimizations such as separating node-update events from the scheduling path and removing scheduling locks improve dispatch performance. Over-commit via Amiya adds more than 600 TB of resource capacity across 5,000+ nodes and raises CPU utilization by about 10%.
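One way to picture separating node-update events from scheduling is a buffer that heartbeats write to and the scheduler drains once per pass. The sketch below is illustrative only: the class and function names are hypothetical, the first-fit placement is a stand-in, and it does not reproduce the scheduler changes made at Bilibili.

```python
from collections import deque


class NodeUpdateBuffer:
    """Buffers node heartbeats so they do not run on the scheduling path."""

    def __init__(self):
        # deque.append and deque.popleft are thread-safe in CPython,
        # so the heartbeat handler needs no scheduler-wide lock.
        self._pending = deque()

    def on_heartbeat(self, node_id, free_cores):
        """Called by the heartbeat handler; only enqueues, never schedules."""
        self._pending.append((node_id, free_cores))

    def drain(self):
        """Return the latest reported state per node and empty the buffer."""
        latest = {}
        while True:
            try:
                node_id, free_cores = self._pending.popleft()
            except IndexError:
                break
            latest[node_id] = free_cores  # later heartbeats overwrite earlier ones
        return latest


def scheduling_pass(cluster_view, buffer, requests):
    """Apply buffered updates, then place each request on the first node that fits."""
    cluster_view.update(buffer.drain())
    placements = {}
    for name, cores in requests:
        for node_id, free in cluster_view.items():
            if free >= cores:
                cluster_view[node_id] = free - cores
                placements[name] = node_id
                break
    return placements
```

Because repeated heartbeats from the same node collapse into one entry, the scheduler processes at most one update per node per pass, however often nodes report.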

Q&A highlights include CPU and memory isolation via cgroups, an 85% utilization target, disk I/O monitoring, and future plans to migrate Spark jobs to K8s and unify scheduling.
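To make the cgroup isolation concrete, the sketch below builds the values one might write into a cgroup v2 directory for an offline task. The source does not say whether cgroup v1 or v2 is used or which knobs are set, so the choice of `cpu.max`, `cpu.weight`, and `memory.max` is an assumption, and the helper name `cgroup_v2_limits` is hypothetical.

```python
def cgroup_v2_limits(cpu_cores, memory_gib, cpu_weight=100, period_us=100_000):
    """Return cgroup v2 interface file contents for a CPU and memory cap.

    cpu.max takes "<quota> <period>" in microseconds, cpu.weight accepts
    1 to 10000 (default 100), and memory.max takes a byte count.
    """
    if cpu_cores <= 0 or memory_gib <= 0:
        raise ValueError("cpu_cores and memory_gib must be positive")
    if not 1 <= cpu_weight <= 10_000:
        raise ValueError("cpu_weight must be between 1 and 10000")
    quota_us = int(cpu_cores * period_us)
    return {
        "cpu.max": f"{quota_us} {period_us}",
        "cpu.weight": str(cpu_weight),
        "memory.max": str(int(memory_gib * 1024 ** 3)),
    }


# Example: cap an offline task at 2.5 cores and 4 GiB, with a low CPU
# weight so co-located online services win under contention.
limits = cgroup_v2_limits(2.5, 4, cpu_weight=20)
```

Each key is the name of a file inside the task's cgroup directory. Writing the values there requires appropriate privileges and is left out of the sketch.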

Conclusion: The mixed-deployment architecture significantly improves resource efficiency, but challenges remain in handling memory-intensive tasks, defining eviction policies for low-priority jobs, and achieving a fully unified scheduler.
