Big Data 26 min read

How Bilibili Scaled Offline Processing Across Multiple Data Centers

This article details Bilibili's multi‑datacenter offline architecture, explaining the capacity challenges, the chosen scale‑out design, and the implementation of job placement, data replication, routing, versioning, throttling, and traffic analysis to efficiently handle massive batch workloads across geographically distributed clusters.

dbaplus Community
dbaplus Community
dbaplus Community
How Bilibili Scaled Offline Processing Across Multiple Data Centers

Background

Bilibili’s rapid growth caused offline batch clusters to exhaust a single‑datacenter capacity, prompting a shift from a scale‑up migration to a scale‑out multi‑datacenter architecture.

Multi‑Datacenter Architecture

Key Challenges

Bandwidth bottleneck: massive cross‑datacenter traffic from offline batch jobs can saturate limited inter‑datacenter links.

Network jitter and connectivity: inter‑datacenter links are less stable than intra‑datacenter networks, leading to latency spikes or outages.

Design Overview

Each datacenter is treated as an independent unit containing a full YARN/HDFS stack, isolating failures and limiting cross‑datacenter traffic. A multi‑cluster storage model (separate HDFS namespaces per site) is adopted. Core components include:

Job placement based on DAG dependency analysis and data size.

Periodic data replication to keep required replicas in remote sites.

Client‑IP‑aware HDFS Router with multi‑mount points for local reads.

Version service that tracks HDFS edit logs to guarantee replica consistency.

Token‑bucket throttling to protect shared inter‑datacenter bandwidth.

Cross‑datacenter traffic analysis for monitoring and optimization.

HDFS architecture diagram
HDFS architecture diagram

Implementation Details

Job Placement

Dependency analysis builds a DAG from the scheduler’s metadata and applies community‑detection clustering to group tightly coupled jobs. The resulting placement information and replication rules are stored in a DataManager service. Scheduling platforms (Archer, Airflow) query DataManager before job submission. YARN is extended with application tags and queue mappings so that jobs are dispatched to the appropriate datacenter’s YARN cluster.

Data Replication

The replication service extends the open‑source DistCp tool with guarantees for correctness, atomicity, idempotence, flow control, multi‑tenant priority, and lifecycle management. Rules derived from job histories define which HDFS paths require replication.

For Hive tables, MetaStore events are streamed to Kafka; a Flink job extracts matching paths and triggers the Data Replication Service (DRS). Measured latency shows 90th‑percentile replica availability within 1 minute and 99th‑percentile within 5 minutes.

Data replication workflow
Data replication workflow

Data Routing

After replication each datacenter holds a local copy. The HDFS Router uses a multi‑mount‑point layout: the first mount stores primary data, subsequent mounts store read‑only replicas. The router detects the client’s datacenter via IP and redirects reads to the local replica; if the replica is unavailable it falls back to a throttled cross‑datacenter read.

Data routing based on router
Data routing based on router

Version Service

A version service observes HDFS edit logs, assigns transaction IDs as version identifiers, and updates replica metadata when primary data changes. Jobs query the version service to ensure they read up‑to‑date replicas.

Version service workflow
Version service workflow

Throttling Service

A token‑bucket mechanism limits cross‑datacenter bandwidth. Tokens represent a fixed traffic quota; jobs must acquire tokens before remote reads/writes. The service supports priority queues, weighted fairness, and degrades to a fixed local bandwidth when unavailable.

Token bucket diagram
Token bucket diagram

Cross‑Datacenter Traffic Analysis

DFS client code injects the JobId into the client name; DataNode code extracts JobId and client IP from the data transfer channel. Logs are streamed to Flink, aggregated into ClickHouse, and visualized in dashboards that highlight top‑traffic jobs.

Traffic log collection pipeline
Traffic log collection pipeline

For ad‑hoc workloads that cannot rely on pre‑placed replicas, runtime SQL scanning selects the optimal datacenter, and Presto’s multi‑connector capability pushes sub‑queries to remote sites when beneficial.

Operational Results

The multi‑datacenter offline solution has been in production for over six months, migrating roughly 300 PB of data and handling about one‑third of all offline jobs. It has effectively mitigated bandwidth saturation and improved job stability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

data replicationHDFSbandwidth optimizationjob placementmulti‑datacenter
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.