
Multi‑Datacenter Architecture for Offline Big Data Processing at Bilibili

To overcome rapid data growth and on‑premise capacity limits, Bilibili adopted a scale‑out, unit‑based multi‑datacenter architecture that isolates failures, intelligently places jobs, replicates data via an enhanced DistCp service, routes reads with an IP‑aware HDFS router, and throttles cross‑site traffic, enabling stable offline big‑data processing of hundreds of petabytes while preserving throughput.

Bilibili Tech

01 Background

Bilibili's rapid business growth has dramatically increased the rate at which business data is produced, driving fast expansion of the offline cluster. The existing on‑premise racks are filling up quickly and will soon hit capacity limits, which would block further business development. Two mainstream solutions exist. Scale‑up: migrate the whole cluster to a larger datacenter (vertical expansion); this requires a full‑scale migration with minimal downtime, incurs a huge economic cost, and the new datacenter may eventually hit capacity limits again. Scale‑out: add datacenters and adapt the architecture so that users still see a single logical cluster; this offers flexible incremental expansion but introduces cross‑datacenter bandwidth and network‑stability challenges. Since there is no short‑term plan to retire the existing datacenter, Bilibili chose the scale‑out approach.

02 Multi‑Datacenter Solution

2.1 Problems

Bandwidth bottleneck: Offline batch jobs process massive amounts of historical data and consume large network bandwidth. If two datacenters are simply merged into one logical cluster, random cross‑datacenter traffic can saturate the limited inter‑datacenter links, affecting both offline jobs and other services that share those links.

Network jitter and connectivity: Cross‑city networks are less stable than intra‑datacenter CLOS networks. Jitter or outages increase read/write latency, delay DataNode incremental block reports (IBRs), and can leave blocks under‑replicated, triggering massive re‑replication.

2.2 Design Choices

The solution adopts a unit‑based architecture: each datacenter becomes a self‑contained unit that provides all the services and data its jobs require. This isolates failures, reduces cross‑datacenter traffic, and simplifies job placement. Data placement is decided by analyzing job dependencies and data sizes, then assigning tightly coupled jobs to the same unit. Data replication is performed by an enhanced DistCp‑based service that creates cross‑datacenter replicas, manages their lifecycle, and supports throttling and multi‑tenant priorities. Data routing uses an extended HDFS Router with mirrored mount points that inspects the client's IP and routes each request to the nearest replica. A version service built on HDFS edit logs tracks data versions to keep primary and replica data consistent.

2.3 Overall Workflow

Figure 5 (described in text) shows the end‑to‑end flow for a Hive job: a periodic dependency analysis determines placement, the DataManager stores placement and replication rules, the scheduler (Archer/Airflow) queries DataManager for the target IDC, checks replica readiness, and tags the application for the appropriate YARN federation queue. The job is submitted to the YARN cluster in the chosen datacenter; HDFS Router routes file reads to the nearest replica based on client IP. If replicas are not ready, submission is blocked until replication completes.
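The scheduler-side gate described above can be sketched as follows. All names here (the Placement record, the queue-tag format) are hypothetical illustrations, not Bilibili's actual Archer/Airflow or DataManager API:

```python
# Sketch of the submission gate: look up the target IDC, check replica
# readiness, and either block or tag the app for the right YARN queue.
from dataclasses import dataclass

@dataclass
class Placement:
    target_idc: str          # datacenter chosen by dependency analysis
    needs_replica: bool      # does the job read data whose primary is elsewhere?
    replica_ready: bool      # has the replication service caught up?

def submit_decision(placement: Placement) -> str:
    """Return the YARN federation queue tag, or block until replicas exist."""
    if placement.needs_replica and not placement.replica_ready:
        return "BLOCKED"                           # wait for replication
    return f"root.{placement.target_idc}.offline"  # queue in the target IDC

submit_decision(Placement("idc-b", needs_replica=True, replica_ready=True))
```

In the real flow, a "BLOCKED" outcome would translate into the scheduler retrying the readiness check rather than failing the job.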

03 Implementation Details

3.1 Job Placement

Dependency analysis builds a DAG of jobs, then applies community‑detection to partition the graph into high‑cohesion sub‑units, minimizing cross‑datacenter data transfer. Placement decisions are refreshed daily or weekly. DataManager acts as a central service exposing placement and replication‑path information to all schedulers.
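The article does not name the exact community-detection algorithm, so here is a minimal label-propagation sketch of the idea: jobs repeatedly adopt the most common community label among the jobs they share data with, so tightly coupled jobs converge into the same group:

```python
# Minimal label propagation over the job dependency graph (illustrative
# only; the production algorithm and edge weights are not specified).
from collections import Counter

def label_propagation(edges, rounds=10):
    """edges: {job: [jobs it shares data with]} -> {job: community label}."""
    labels = {j: j for j in edges}    # start: every job is its own community
    for _ in range(rounds):
        changed = False
        for job, nbrs in edges.items():
            if not nbrs:
                continue
            # adopt the most common label among data-sharing neighbors
            best = Counter(labels[n] for n in nbrs).most_common(1)[0][0]
            if labels[job] != best:
                labels[job], changed = best, True
        if not changed:
            break
    return labels

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "x": ["y"], "y": ["x"]}
communities = label_propagation(graph)
# a, b, c converge to one community; x, y to another
```

Each resulting community then maps to one datacenter unit, minimizing edges (data transfers) that cross units.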

3.2 Data Replication

The replication service enhances DistCp with atomicity, idempotence, and flow‑control. Rules extracted from job histories are stored in DataManager. For Hive tables, MetaStore events are streamed to Kafka, processed by Flink to match rules, and trigger the Data Replication Service (DRS) to copy data across datacenters. Over 100 hot Hive tables have cross‑datacenter replicas with PT90 latency ≤ 1 min and PT99 ≤ 5 min. For large historical datasets, a snapshot‑based initial copy is used.
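The rule-matching step in that pipeline can be sketched as below. The rule format and names are assumptions for illustration; in production the MetaStore events arrive via Kafka and are matched inside Flink:

```python
# Match a Hive MetaStore event (new partition in db.table) against
# replication rules to decide which datacenters need a copy.
import fnmatch

# rule: (db.table glob pattern, target datacenters) -- illustrative values
RULES = [("ods.user_*", ["idc-b"]), ("dwd.play_log", ["idc-b", "idc-c"])]

def targets_for_event(db: str, table: str) -> list:
    """Return the datacenters the new data must be replicated to."""
    name = f"{db}.{table}"
    out = []
    for pattern, idcs in RULES:
        if fnmatch.fnmatch(name, pattern):
            out.extend(i for i in idcs if i not in out)
    return out

targets_for_event("ods", "user_profile")   # matches the ods.user_* rule
```

A non-empty result would trigger a DRS copy task for each target datacenter; an empty result means the table is local-only.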

3.3 Data Routing

Mirrored mount points are ordered: the first mount holds primary data, subsequent mounts hold replicas. The Router inspects the client’s IP to determine its datacenter and redirects the request to the corresponding HDFS namespace. In case of replica‑service failures, a degraded cross‑datacenter read‑throttling path is used.
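The routing decision amounts to: map the client IP to a datacenter via CIDR ranges, then pick the mirrored mount in that datacenter. A sketch with made-up CIDRs and namespace names:

```python
# IP-aware mount selection: first mount is the primary, later mounts are
# replicas; serve each client from the mount in its own datacenter.
import ipaddress

IDC_CIDRS = {"idc-a": "10.1.0.0/16", "idc-b": "10.2.0.0/16"}  # illustrative
# ordered mirrored mounts for one logical path
MOUNTS = [("idc-a", "hdfs://ns-a/warehouse"), ("idc-b", "hdfs://ns-b/warehouse")]

def route(client_ip: str) -> str:
    ip = ipaddress.ip_address(client_ip)
    for idc, cidr in IDC_CIDRS.items():
        if ip in ipaddress.ip_network(cidr):
            for mount_idc, namespace in MOUNTS:
                if mount_idc == idc:
                    return namespace   # nearest replica: the client's own IDC
    return MOUNTS[0][1]                # unknown client: fall back to primary

route("10.2.3.4")   # a client in idc-b is routed to ns-b
```

The primary-first ordering doubles as the degraded path: if a replica mount is removed when its replication service fails, clients fall through to the (throttled) cross-datacenter primary.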

3.4 Version Service

A version service observes HDFS JournalNode edit logs, assigns transaction IDs as version identifiers, and updates replica metadata. Only read/delete operations are allowed on replicas, aligning with the write‑once‑read‑many offline workload.
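The consistency check this enables can be sketched as treating the edit-log transaction ID (txid) as a monotonically increasing version per path; a replica is consistent once it has caught up to the primary's txid. Field names here are assumptions, not the production schema:

```python
# Track the latest txid seen for each path on the primary and on the
# replica; a replica is safe to read when its txid has caught up.
primary_version = {}   # path -> latest txid observed in the edit log
replica_version = {}   # path -> txid the replica's copy reflects

def on_edit_log_entry(path, txid):
    primary_version[path] = max(primary_version.get(path, 0), txid)

def on_replica_synced(path, txid):
    replica_version[path] = max(replica_version.get(path, 0), txid)

def replica_consistent(path):
    return replica_version.get(path, -1) >= primary_version.get(path, 0)

on_edit_log_entry("/warehouse/t1", 1001)
on_replica_synced("/warehouse/t1", 1001)
replica_consistent("/warehouse/t1")   # replica has caught up
```

Restricting replicas to read/delete keeps this simple: versions only ever advance on the primary side, so there is no write conflict to reconcile.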

3.5 Throttling Service

Because cross‑datacenter bandwidth (~4 Tbps) is shared with latency‑sensitive online services, a token‑bucket‑based throttling layer (ThrottledDistributedFileSystem) limits cross‑datacenter reads/writes. Tokens are granted per‑job priority, and the service supports queue priority, weighted fairness, and graceful degradation.
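A minimal token-bucket sketch of that throttling, as a ThrottledDistributedFileSystem might apply it per job: each cross-site read or write must acquire tokens sized to the bytes it transfers. Rates and the priority weighting are illustrative:

```python
# Token bucket: refill at a steady rate up to a burst cap; a transfer
# proceeds only if enough tokens are available, else the caller backs off.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate, self.capacity = rate_bytes_per_s, burst_bytes
        self.tokens, self.last = burst_bytes, time.monotonic()

    def try_acquire(self, nbytes):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False               # caller should back off or queue

bucket = TokenBucket(rate_bytes_per_s=100e6, burst_bytes=10e6)  # 100 MB/s cap
bucket.try_acquire(4e6)            # a 4 MB read fits within the burst
```

Per-priority token grants then reduce to giving each job's bucket a rate proportional to its queue weight.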

3.6 Cross‑Datacenter Traffic Analysis

Instrumentation in the HDFS client and DataNode records per‑job traffic, aggregates it via Flink into ClickHouse, and produces a top‑10 traffic dashboard. This enables targeted optimizations such as re‑placement, urgent kill, or job tuning. For ad‑hoc queries, the system can dynamically schedule the job in the datacenter where the majority of required data resides, or use Presto’s multi‑connector capability to push sub‑queries to remote datacenters.
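The ranking behind that dashboard reduces to a per-job aggregation over the instrumentation records. The production pipeline runs this in Flink into ClickHouse; the logic itself is simple:

```python
# Aggregate cross-datacenter bytes per job and return the top-n offenders.
from collections import defaultdict

def top_traffic(records, n=10):
    """records: iterable of (job_id, cross_dc_bytes) -> top-n [(job, bytes)]."""
    totals = defaultdict(int)
    for job_id, nbytes in records:
        totals[job_id] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

sample = [("etl_play", 500), ("adhoc_q1", 120), ("etl_play", 300), ("rank_job", 900)]
top_traffic(sample, n=2)   # rank_job and etl_play dominate
```

Jobs surfacing here are candidates for re-placement into the datacenter holding their data, urgent kill, or tuning.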

04 Conclusion & Outlook

The multi‑datacenter offline solution has been in production for over six months, migrating ~300 PB of data and handling one‑third of all offline jobs. It effectively mitigates bandwidth and stability issues while maintaining high job throughput. Future work includes further unit‑based automation, smarter DAG community detection for automatic migration, and expanding “dual‑active” capabilities for critical jobs.

