Design and Implementation of Bilibili's Offline Multi‑Datacenter Solution
This article describes Bilibili's offline multi‑datacenter architecture, explaining why a scale‑out approach was chosen over scale‑up, and detailing the unit‑based design, job placement, data replication, routing, versioning, bandwidth throttling, traffic analysis, and the operational results and future directions.
Bilibili's rapid business growth caused offline cluster capacity to approach data‑center limits, prompting the need for a solution beyond simply moving clusters to larger facilities (scale‑up).
Two main options were evaluated: (1) scale-up, which requires migrating the full cluster to a larger facility without downtime, carries high economic cost, and risks hitting capacity limits again; (2) scale-out, which adds multiple data centers while preserving the user's view of a single cluster, but introduces network bandwidth and stability challenges. Scale-out was selected as the preferred strategy.
The design adopts a unit‑based architecture where each data‑center acts as an independent unit containing all services and data needed for job execution, thereby isolating failures and reducing cross‑datacenter traffic. Job placement is driven by dependency analysis using DAG community detection to group tightly coupled jobs into the same unit, and a DataManager service stores placement and replication metadata.
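The placement idea can be sketched with a toy community-detection pass over the job dependency graph: jobs that exchange the most data end up labeled with the same unit. The job names, edge weights, and the greedy label-propagation scheme below are illustrative assumptions, not Bilibili's actual algorithm.

```python
from collections import defaultdict

def group_jobs(edges, passes=5):
    """Group tightly coupled jobs by repeatedly adopting the neighbour
    label with the highest total edge weight (crude community detection)."""
    nbrs = defaultdict(dict)
    for a, b, w in edges:                    # undirected weighted dependency graph
        nbrs[a][b] = nbrs[a].get(b, 0) + w
        nbrs[b][a] = nbrs[b].get(a, 0) + w
    label = {j: j for j in nbrs}             # each job starts in its own unit
    for _ in range(passes):
        for j in sorted(nbrs):
            score = defaultdict(int)
            for k, w in nbrs[j].items():
                score[label[k]] += w         # weight of data shared with each unit
            label[j] = max(score, key=lambda l: (score[l], l))
    units = defaultdict(set)
    for j, l in label.items():
        units[l].add(j)
    return sorted(map(sorted, units.values()))

# Two tightly coupled pipelines joined by one thin (10-unit) edge:
edges = [("etl_a", "agg_b", 900), ("agg_b", "rpt_c", 800),
         ("etl_x", "agg_y", 700), ("etl_a", "agg_y", 10)]
print(group_jobs(edges))  # → [['agg_b', 'etl_a', 'rpt_c'], ['agg_y', 'etl_x']]
```

The thin cross-pipeline edge is the only traffic that would cross data centers once each group is pinned to its own unit.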
Data replication is built on an enhanced DistCp tool with atomicity, idempotence, flow‑control, multi‑tenant prioritization, and lifecycle management. Rules extracted from job histories guide automatic replication of hot Hive tables, while snapshot‑based copying handles initial bulk data.
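The atomicity and idempotence properties can be sketched as a copy-verify-rename pattern: write to a staging path, verify, then atomically publish; re-running a finished copy is a no-op. The local filesystem stands in for HDFS here, and the staging suffix and checksum check are illustrative assumptions, not the enhanced DistCp's actual mechanics.

```python
import hashlib
import os
import shutil

def checksum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(src, dst):
    # Idempotence: if the destination already matches, do nothing.
    if os.path.exists(dst) and checksum(dst) == checksum(src):
        return "skipped"
    staging = dst + "._COPYING_"             # hypothetical staging suffix
    shutil.copyfile(src, staging)            # write to a temp path first
    if checksum(staging) != checksum(src):   # verify before publishing
        os.remove(staging)
        raise IOError("checksum mismatch during replication")
    os.replace(staging, dst)                 # atomic rename into place
    return "copied"
```

Readers never observe a half-written destination file, and a failed copy leaves only a staging file that the next run overwrites.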
Data routing leverages an HDFS Router with multi‑mount points; client IP determines the nearest data‑center replica, and a special IDC_FOLLOW mount handles temporary tables with dynamic paths.
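A minimal sketch of that routing logic, assuming a mount table with one physical target per data center and client subnets mapped to IDCs; the paths, subnets, and the way IDC_FOLLOW pins temporary paths to the job's own unit are simplified assumptions, not the actual HDFS Router configuration.

```python
import ipaddress

MOUNTS = {
    "/warehouse/dwd": {                      # replicated table: one target per IDC
        "idc_a": "hdfs://idc-a/warehouse/dwd",
        "idc_b": "hdfs://idc-b/warehouse/dwd",
    },
}
SUBNETS = {"10.1.0.0/16": "idc_a", "10.2.0.0/16": "idc_b"}

def resolve(logical_path, client_ip, job_idc=None):
    """Map a logical path to the nearest physical replica."""
    for mount, targets in MOUNTS.items():
        if logical_path.startswith(mount):
            suffix = logical_path[len(mount):]
            if job_idc:                      # IDC_FOLLOW-style: pin to the job's unit
                return targets[job_idc] + suffix
            ip = ipaddress.ip_address(client_ip)
            for net, idc in SUBNETS.items():
                if ip in ipaddress.ip_network(net):
                    return targets[idc] + suffix  # nearest replica by client subnet
    raise ValueError("no mount for " + logical_path)

print(resolve("/warehouse/dwd/t1", "10.2.3.4"))  # → hdfs://idc-b/warehouse/dwd/t1
```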
To maintain consistency across replicas, a version service monitors HDFS edit logs and updates replica versions, allowing jobs to use ready replicas or block until consistency is achieved.
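The gating behaviour can be sketched as follows, assuming the version service tracks the latest edit-log transaction applied to each replica; the class shape and polling loop are illustrative, not the production design.

```python
import time

class VersionService:
    def __init__(self):
        self.versions = {}                   # (table, idc) -> latest applied txid

    def on_edit_applied(self, table, idc, txid):
        """Called as replicated edits land; versions only move forward."""
        key = (table, idc)
        self.versions[key] = max(self.versions.get(key, 0), txid)

    def ready(self, table, idc, required_txid):
        """A replica is usable once it has caught up to the required version."""
        return self.versions.get((table, idc), 0) >= required_txid

    def wait_ready(self, table, idc, required_txid, timeout=60, poll=0.01):
        """Block a job until its input replica is consistent, or time out."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if self.ready(table, idc, required_txid):
                return True
            time.sleep(poll)
        return False
```

A job reading a remote replica either proceeds immediately (replica already at the required version) or blocks in `wait_ready` until replication catches up.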
Because cross-datacenter bandwidth is limited (~4 Tbps) and shared with latency-sensitive services, a token-bucket-based throttling service (ThrottleService) enforces read/write limits, prioritizes high-importance jobs, and degrades gracefully if the throttling service itself fails.
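A token-bucket sketch of that flow control: tokens refill at a fixed rate, and a request for bandwidth succeeds only if enough tokens are available. The rates, the per-priority buckets, and the fail-open fallback are assumptions for illustration, not ThrottleService's real parameters.

```python
import time

class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps                 # refill rate, bytes per second
        self.capacity = burst_bytes          # maximum burst size
        self.tokens = burst_bytes
        self.stamp = time.monotonic()

    def try_acquire(self, nbytes):
        """Non-blocking: succeed only if the bucket holds enough tokens."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

class ThrottleService:
    def __init__(self):
        # Hypothetical split: high-priority jobs get a larger bandwidth share.
        self.buckets = {"high": TokenBucket(3e9, 3e9),
                        "low":  TokenBucket(1e9, 1e9)}

    def permit(self, priority, nbytes):
        bucket = self.buckets.get(priority)
        if bucket is None:
            return True  # degrade open: never block traffic if throttling breaks
        return bucket.try_acquire(nbytes)
```

Failing open on errors matches the article's graceful-degradation goal: a broken throttler must not stall the offline pipeline, only stop shaping it.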
Cross‑datacenter traffic analysis collects per‑job, per‑IP flow logs via instrumented DataNode and engine components, aggregates them in ClickHouse, and visualizes top‑traffic jobs to guide placement optimization and ad‑hoc traffic mitigation.
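A toy version of the rollup that the ClickHouse pipeline performs: sum bytes per job over flow-log records whose source and destination IDCs differ, then rank. The record fields and sample data are invented for illustration.

```python
from collections import Counter

def top_cross_idc_jobs(flow_logs, n=3):
    """Rank jobs by total bytes moved across data centers."""
    totals = Counter()
    for rec in flow_logs:
        if rec["src_idc"] != rec["dst_idc"]:  # only cross-IDC traffic counts
            totals[rec["job_id"]] += rec["bytes"]
    return totals.most_common(n)

logs = [
    {"job_id": "job_1", "src_idc": "a", "dst_idc": "b", "bytes": 500},
    {"job_id": "job_2", "src_idc": "a", "dst_idc": "a", "bytes": 900},
    {"job_id": "job_1", "src_idc": "b", "dst_idc": "a", "bytes": 300},
    {"job_id": "job_3", "src_idc": "a", "dst_idc": "b", "bytes": 200},
]
print(top_cross_idc_jobs(logs))  # → [('job_1', 800), ('job_3', 200)]
```

Note that job_2 moves the most bytes overall but never appears: intra-IDC traffic is free from the placement optimizer's point of view.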
After six months of production, the solution has migrated ~300 PB of data, handling one‑third of offline jobs while alleviating bandwidth and stability issues; future work includes further unit‑based automation and community‑detection‑driven migration planning.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.