Practical Experience of Bilibili's Big Data Cluster Mixed Deployment Architecture
This article details Bilibili's offline big‑data cluster challenges, the mixed‑deployment architecture that combines offline and online resources, the Amiya service's over‑commit and eviction mechanisms, performance optimizations, monitoring strategies, and future plans to further improve resource utilization and scheduling.
Background: Bilibili's offline platform faces rapid cluster growth and resource shortage, prompting the need for higher utilization without additional hosts.
Mixed deployment architecture: offline machines host multiple components (compute, transcoding, storage), and offline tasks are also scheduled onto online clusters during off‑peak periods. This yields four scenarios: offline‑offline mixing, offline‑online mixing, online‑offline mixing, and tidal mixing.
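The tidal scenario depends on knowing when an online cluster is in its off‑peak period. A minimal sketch of such a check follows. The function name and the 23:00 to 07:00 window are hypothetical; the talk summary does not say how Bilibili detects off‑peak periods, and a real system would more likely derive them from load metrics than from a fixed clock window.

```python
from datetime import time


def in_off_peak(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the half-open window [start, end).

    The window may wrap past midnight (e.g. 23:00 to 07:00).
    All names and window values here are illustrative assumptions.
    """
    if start <= end:
        # Window sits within a single day.
        return start <= now < end
    # Window wraps past midnight: late evening OR early morning.
    return now >= start or now < end


# Hypothetical window during which online nodes may accept offline tasks.
OFF_PEAK_START = time(23, 0)
OFF_PEAK_END = time(7, 0)

if __name__ == "__main__":
    print(in_off_peak(time(2, 30), OFF_PEAK_START, OFF_PEAK_END))
    print(in_off_peak(time(12, 0), OFF_PEAK_START, OFF_PEAK_END))
```

A scheduler could use a check like this to label online nodes as lendable during the window and remove the label, evicting offline tasks, once the window closes.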
Implementation and benefits: The Amiya service provides dynamic over‑commit, task eviction, and node monitoring. Its modules include StateStore Manager, CheckPoint Manager, NodeResource Manager, Operator Manager, Inspect Manager, and Audit Manager, enabling accurate resource profiling, intelligent adjustment, and fault tolerance.
NodeResource Manager decides when to raise or lower a node's resource quota based on observed usage, while Inspect Manager monitors node health (CPU, memory, OOM events, disk I/O) and triggers evictions when a node is under pressure. Integration with K8s, YARN, and Spark allows flexible scheduling and label‑based resource pools.
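The division of labor described above can be sketched as two small functions: one that nudges a node's over‑commit ratio up or down from a usage sample, and one that picks eviction victims when a node must shed load. This is an illustrative sketch, not Amiya's actual logic. Every name, threshold, and step size is an assumption; the high watermark merely mirrors the 85% utilization target mentioned in the Q&A.

```python
from dataclasses import dataclass

# All constants below are hypothetical values chosen for illustration.
HIGH_WATERMARK = 0.85  # shrink the quota above this utilization
LOW_WATERMARK = 0.70   # grow the quota below this utilization
STEP = 0.1             # change in over-commit ratio per adjustment
MIN_RATIO = 1.0        # 1.0 means no over-commit
MAX_RATIO = 2.0        # cap on over-commit


@dataclass
class NodeSample:
    cpu_util: float   # fraction of physical CPU in use, 0.0 to 1.0
    mem_util: float   # fraction of physical memory in use, 0.0 to 1.0
    oom_events: int   # OOM kills seen since the previous sample


def next_overcommit_ratio(current: float, sample: NodeSample) -> float:
    """Return the over-commit ratio to apply after seeing `sample`."""
    busiest = max(sample.cpu_util, sample.mem_util)
    if sample.oom_events > 0 or busiest > HIGH_WATERMARK:
        # Node is under pressure: hand back some over-committed quota.
        return max(MIN_RATIO, round(current - STEP, 2))
    if busiest < LOW_WATERMARK:
        # Node has headroom: advertise more resources to the scheduler.
        return min(MAX_RATIO, round(current + STEP, 2))
    return current  # inside the comfort band: leave the quota alone


@dataclass
class Task:
    name: str
    priority: int   # assumption: a lower number means lower priority
    mem_gb: float


def pick_evictions(tasks: list[Task], mem_to_free_gb: float) -> list[Task]:
    """Choose tasks to evict, lowest priority first, until enough is freed.

    Within one priority level, larger tasks go first so that fewer
    tasks need to be killed.
    """
    chosen: list[Task] = []
    freed = 0.0
    for task in sorted(tasks, key=lambda t: (t.priority, -t.mem_gb)):
        if freed >= mem_to_free_gb:
            break
        chosen.append(task)
        freed += task.mem_gb
    return chosen
```

The gap between the two watermarks acts as a hysteresis band, so a node hovering near the target does not flip between growing and shrinking its quota on every sample.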
Optimizations such as separating node‑update events and removing scheduling locks improve dispatch performance. Over‑commit via Amiya adds over 600 TB of resource capacity across 5,000+ nodes and raises CPU utilization by about 10%.
Q&A highlights include CPU and memory isolation via cgroups, an 85% utilization target, disk I/O monitoring, and future plans to migrate Spark jobs to K8s and unify scheduling.
Conclusion: While the mixed‑deployment architecture significantly improves resource efficiency, challenges remain in handling memory‑intensive tasks, defining eviction policies for low‑priority jobs, and achieving a fully unified scheduler.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.