How Bilibili Scaled Offline Processing Across Multiple Data Centers
This article details Bilibili's multi‑datacenter offline architecture, explaining the capacity challenges, the chosen scale‑out design, and the implementation of job placement, data replication, routing, versioning, throttling, and traffic analysis to efficiently handle massive batch workloads across geographically distributed clusters.
Background
Bilibili’s rapid growth caused offline batch clusters to exhaust a single‑datacenter capacity, prompting a shift from a scale‑up migration to a scale‑out multi‑datacenter architecture.
Multi‑Datacenter Architecture
Key Challenges
Bandwidth bottleneck: massive cross‑datacenter traffic from offline batch jobs can saturate limited inter‑datacenter links.
Network jitter and connectivity: inter‑datacenter links are less stable than intra‑datacenter networks, leading to latency spikes or outages.
Design Overview
Each datacenter is treated as an independent unit containing a full YARN/HDFS stack, isolating failures and limiting cross‑datacenter traffic. A multi‑cluster storage model (separate HDFS namespaces per site) is adopted. Core components include:
Job placement based on DAG dependency analysis and data size.
Periodic data replication to keep required replicas in remote sites.
Client‑IP‑aware HDFS Router with multi‑mount points for local reads.
Version service that tracks HDFS edit logs to guarantee replica consistency.
Token‑bucket throttling to protect shared inter‑datacenter bandwidth.
Cross‑datacenter traffic analysis for monitoring and optimization.
Implementation Details
Job Placement
Dependency analysis builds a DAG from the scheduler’s metadata and applies community‑detection clustering to group tightly coupled jobs. The resulting placement information and replication rules are stored in a DataManager service. Scheduling platforms (Archer, Airflow) query DataManager before job submission. YARN is extended with application tags and queue mappings so that jobs are dispatched to the appropriate datacenter’s YARN cluster.
Data Replication
The replication service extends the open‑source DistCp tool with guarantees for correctness, atomicity, idempotence, flow control, multi‑tenant priority, and lifecycle management. Rules derived from job histories define which HDFS paths require replication.
For Hive tables, MetaStore events are streamed to Kafka; a Flink job extracts matching paths and triggers the Data Replication Service (DRS). Measured latency shows 90th‑percentile replica availability within 1 minute and 99th‑percentile within 5 minutes.
Data Routing
After replication each datacenter holds a local copy. The HDFS Router uses a multi‑mount‑point layout: the first mount stores primary data, subsequent mounts store read‑only replicas. The router detects the client’s datacenter via IP and redirects reads to the local replica; if the replica is unavailable it falls back to a throttled cross‑datacenter read.
Version Service
A version service observes HDFS edit logs, assigns transaction IDs as version identifiers, and updates replica metadata when primary data changes. Jobs query the version service to ensure they read up‑to‑date replicas.
Throttling Service
A token‑bucket mechanism limits cross‑datacenter bandwidth. Tokens represent a fixed traffic quota; jobs must acquire tokens before remote reads/writes. The service supports priority queues, weighted fairness, and degrades to a fixed local bandwidth when unavailable.
Cross‑Datacenter Traffic Analysis
DFS client code injects the JobId into the client name; DataNode code extracts JobId and client IP from the data transfer channel. Logs are streamed to Flink, aggregated into ClickHouse, and visualized in dashboards that highlight top‑traffic jobs.
For ad‑hoc workloads that cannot rely on pre‑placed replicas, runtime SQL scanning selects the optimal datacenter, and Presto’s multi‑connector capability pushes sub‑queries to remote sites when beneficial.
Operational Results
The multi‑datacenter offline solution has been in production for over six months, migrating roughly 300 PB of data and handling about one‑third of all offline jobs. It has effectively mitigated bandwidth saturation and improved job stability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
