
Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its large offline platform from Hive to Spark using automated SQL rewriting and dual‑run verification, cutting execution time by more than 40% and resource use by more than 30%. Along the way it added small‑file merging, shuffle stability improvements, runtime filters, data skipping, lineage tracking, automatic parameter tuning, and metastore federation to keep large‑scale processing robust.

Bilibili Tech

May 31, 2022

In 2018 Bilibili built its offline computing service on Hadoop, growing from a few hundred nodes to nearly ten thousand across multiple data centers. Production workloads use Hive, Spark and Presto as offline engines, with Hive and Spark running on YARN. Approximately 200,000 batch jobs run daily on Spark and Hive.

Migration from Hive to Spark

At the beginning of 2021 Hive was still the primary engine (over 80% of jobs). Spark 2.4 accounted for about 20% of jobs. After Spark 3.1 was released in March 2021, Bilibili began migrating Hive‑SQL to Spark‑SQL. The migration was done by first manually moving a small set of jobs, then developing an automatic migration tool that rewrites SQL, replaces input/output tables, and performs result comparison to ensure correctness.

SQL Statement Conversion

The SparkSqlParser was modified to replace the input and output tables of SQL collected from the scheduler. For DAG‑level jobs, all dependent SQL statements are rewritten together so that the dependencies between them are preserved. SELECT statements without an output table are turned into CTAS statements so that their results can be compared. DDL statements that consume no compute resources are skipped and marked to keep running on Hive.
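The two rewrites described above, table replacement and SELECT-to-CTAS conversion, can be illustrated with a minimal sketch. The source performs this inside a modified SparkSqlParser; the regex-based version below is only a stand-in, and the function names `replace_tables` and `to_ctas` as well as the table names in the usage note are hypothetical.

```python
import re


def replace_tables(sql, table_map):
    """Swap table names in `sql` according to `table_map` (old -> new).

    Assumption: a regex stands in for the parser-level rewrite; it matches
    whole identifiers only, so "ods.src" does not match "ods.src_ext".
    """
    if not table_map:
        return sql
    alternatives = "|".join(re.escape(name) for name in table_map)
    pattern = re.compile(r"(?<![\w.])(" + alternatives + r")(?!\w)")
    return pattern.sub(lambda m: table_map[m.group(1)], sql)


def to_ctas(sql, result_table):
    """Wrap a bare SELECT in CREATE TABLE ... AS so its result can be compared."""
    stripped = sql.strip().rstrip(";")
    if not re.match(r"(?i)^(select|with)\b", stripped):
        return stripped  # statement already writes somewhere; leave it alone
    return f"CREATE TABLE {result_table} AS {stripped}"
```

For example, `to_ctas("select 1;", "tmp.r")` yields `CREATE TABLE tmp.r AS select 1`, while an INSERT statement passes through unchanged.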

Result Comparison

Dual‑run results are compared in two steps. First, schema compatibility is checked using DESC. If the schemas match, a full data comparison follows: the two result sets are combined and grouped by all columns, so that rows present in both results have cnt = 2 and any row with cnt ≠ 2 is a difference. This works for most cases, but collection types such as LIST, SET, and MAP can cause false mismatches because their toString output is unordered, and non‑deterministic queries (e.g., random()) require manual analysis.
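The GROUP BY comparison can be sketched as follows. SQLite is used here only so the example runs without a cluster; the table names `hive_out` and `spark_out`, the helper `diff_rows`, and the exact query shape are assumptions rather than the tool's actual SQL.

```python
import sqlite3


def diff_rows(conn, left, right, columns):
    """Return rows that do not appear exactly twice across both result sets."""
    cols = ", ".join(columns)
    query = f"""
        SELECT {cols}, COUNT(*) AS cnt
        FROM (SELECT {cols} FROM {left}
              UNION ALL
              SELECT {cols} FROM {right}) AS both_runs
        GROUP BY {cols}
        HAVING COUNT(*) <> 2
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hive_out  (id INTEGER, amount INTEGER);
    CREATE TABLE spark_out (id INTEGER, amount INTEGER);
    INSERT INTO hive_out  VALUES (1, 10), (2, 20);
    INSERT INTO spark_out VALUES (1, 10), (2, 21);
""")
# Row (1, 10) appears in both outputs and is filtered out; the two
# versions of id 2 each appear once and are reported as differences.
mismatches = diff_rows(conn, "hive_out", "spark_out", ["id", "amount"])
```

One caveat of this form: a row duplicated within a single result set and absent from the other also reaches cnt = 2, so a stricter variant would count rows per side before comparing.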

Resource usage was also measured: Spark jobs reduced execution time by over 40% and overall resource consumption by more than 30% compared with Hive.

Migration & Rollback

Each migrated task is run at least three times with dual‑run comparison. Post‑migration monitoring tracks the first three executions of a task and compares time, CPU and memory against the average of the previous seven runs. If performance degrades, the task is rolled back to Hive and an alert is raised.
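A minimal sketch of the rollback check described above. The source does not state how much degradation triggers a rollback, so the `tolerance` factor, the metric keys, and the choice to average the three new runs are all assumptions.

```python
from statistics import mean


def should_roll_back(baseline_runs, new_runs, tolerance=1.2):
    """Compare the first post-migration runs with the pre-migration average.

    baseline_runs: the previous seven runs (on Hive), as dicts of metrics.
    new_runs: the first three runs on Spark.
    tolerance: hypothetical threshold; 1.2 means "20% worse than baseline".
    """
    for metric in ("time", "cpu", "memory"):
        baseline = mean(run[metric] for run in baseline_runs)
        current = mean(run[metric] for run in new_runs)
        if current > baseline * tolerance:
            return True  # degraded on this metric: roll back and alert
    return False
```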

Spark Practices at Bilibili

Small File Issue

Rapid data growth caused many small files, increasing HDFS metadata pressure and read latency. Two solutions were implemented:

Fallback file merging: write to a temporary directory, then after FileFormatWriter.write and before refreshUpdatedPartitions, merge small files by coalescing RDDs and moving the result to the final location.

Repartition‑based merging: leverage Spark 3.x AQE to automatically rebalance partitions, inserting a rebalance hint that triggers a shuffle stage to produce appropriately sized files.
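Both approaches need to decide whether a write produced small files and how many output files to aim for. The sketch below shows one plausible way to make that decision; the 128 MiB target, the `small_ratio` cutoff, and the function name `plan_merge` are assumptions, not values from the source.

```python
import math

MIB = 1024 * 1024


def plan_merge(file_sizes, target_bytes=128 * MIB, small_ratio=0.5):
    """Return the number of files to coalesce into, or None if no merge is needed.

    Assumption: a merge is triggered when the average file size falls
    below `small_ratio` of the target size.
    """
    if not file_sizes:
        return None
    total = sum(file_sizes)
    average = total / len(file_sizes)
    if average >= target_bytes * small_ratio:
        return None  # files are already reasonably sized
    return max(1, math.ceil(total / target_bytes))
```

With one hundred 4 MiB files (400 MiB in total) this suggests merging into 4 files, which would then be the argument to a coalesce on the written data.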

Shuffle Stability

Disk tiering for shuffle: prioritize SSD directories for shuffle data; fall back to HDD when SSD space is low.
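The SSD-first selection with HDD fallback can be sketched as below. The function name, the `min_free_bytes` threshold, and the rule of picking the directory with the most free space within a tier are assumptions made for illustration.

```python
import shutil


def pick_shuffle_dir(ssd_dirs, hdd_dirs, min_free_bytes,
                     free_bytes=lambda path: shutil.disk_usage(path).free):
    """Prefer an SSD directory; fall back to HDD when SSD space is low."""
    for tier in (ssd_dirs, hdd_dirs):
        candidates = [d for d in tier if free_bytes(d) >= min_free_bytes]
        if candidates:
            # Within a tier, take the directory with the most free space.
            return max(candidates, key=free_bytes)
    raise RuntimeError("no shuffle directory has enough free space")
```

The `free_bytes` parameter exists so the policy can be exercised with fake disk sizes; by default it reads real free space through `shutil.disk_usage`.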

Remote Shuffle Service (RSS): adopt the community push‑based shuffle (Spark 3.2) to reduce random I/O and improve stability. A shuffle‑service master node tracks executor health and selects suitable shuffle nodes, reducing fetch failures.

Runtime Filter

Dynamic Bloom filter pruning is added to filter large tables before shuffle when join keys are known, reducing data processed from billions of rows to tens of thousands.
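To show the idea behind Bloom filter pruning, here is a self-contained sketch: a filter is built from the join keys of the small side and applied to the large side before it would be shuffled. The class, its sizing parameters, and the hashing scheme are illustrative choices, not Spark's internal implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


def runtime_filter(small_side_keys, large_side_rows, key_of):
    """Drop large-side rows whose join key cannot match the small side."""
    bloom = BloomFilter()
    for key in small_side_keys:
        bloom.add(key)
    return [row for row in large_side_rows if bloom.might_contain(key_of(row))]
```

A Bloom filter never produces false negatives, so every row that would join survives; a few non-matching rows may slip through and are discarded by the join itself.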

Data Skipping

ORC and Parquet provide statistics (min/max, count, sum) at file/stripe/row‑group levels. Bilibili enhances data skipping by ordering hot columns (e.g., state) to enable effective pruning, dramatically reducing scanned rows.

select count(1) from tpcds.archive_spl_cluster where log_date='20211124' and state = -16
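Why ordering a hot column helps can be seen in a small simulation of min/max pruning. The row-group size and the sample `state` values below are made up for illustration; real ORC and Parquet readers apply the same test against stripe or row-group statistics.

```python
def row_groups(values, group_size):
    """Split values into groups and record (min, max, rows) for each."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g), g) for g in groups]


def scan_equals(groups, target):
    """Count matches for `col = target`, skipping groups by min/max."""
    scanned = 0
    matched = 0
    for lo, hi, rows in groups:
        if target < lo or target > hi:
            continue  # statistics prove no row in this group can match
        scanned += len(rows)
        matched += sum(1 for v in rows if v == target)
    return matched, scanned


states = [0, -16, 1, 0, 1, -16, 0, 1]
unordered = scan_equals(row_groups(states, 2), -16)        # (2, 4)
ordered = scan_equals(row_groups(sorted(states), 2), -16)  # (2, 2)
```

Both layouts find the same two matching rows, but the ordered layout reads half as many rows because the value -16 is confined to a single group.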

Functional Improvements

ZSTD compression support added to Spark 3.2; bugs in ORC filter push‑down were fixed and contributed upstream.

Multi‑format read compatibility: DataSourceScanExec now selects readers based on partition metadata, allowing tables to contain mixed file formats.

Convert and merge table syntax added:

CONVERT TABLE target=tableIdentifier (convertFormat | compressType) partitionClause? #convertTable
MERGE TABLE target=tableIdentifier partitionClause? #mergeTable

Lineage Extraction

A custom LineageQueryListener captures query execution plans, maps expression IDs to source columns, and builds field‑level lineage (PROJECTION/PREDICATE) and hierarchical relations.
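The step of mapping expression IDs back to source columns can be sketched with plain dictionaries. The data layout here (an ID-to-inputs map and an ID-to-source-column map) is a simplified assumption about what a listener could extract from a plan, not the structure LineageQueryListener actually uses.

```python
def resolve_lineage(expr_inputs, source_columns, output_ids):
    """Map each output field to the source columns it is derived from.

    expr_inputs: expression ID -> IDs of the expressions it reads.
    source_columns: expression ID -> "db.table.column" for scan attributes.
    output_ids: output field name -> expression ID.
    """
    def sources(expr_id, ancestors=()):
        if expr_id in source_columns:
            return {source_columns[expr_id]}
        found = set()
        for child in expr_inputs.get(expr_id, ()):
            if child not in ancestors:  # guard against cyclic references
                found |= sources(child, ancestors + (expr_id,))
        return found

    return {name: sorted(sources(eid)) for name, eid in output_ids.items()}
```

For a query computing `price * qty AS total`, with the product as expression 10 and its alias as expression 11, the call resolves `total` to both underlying columns.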

Automatic Parameter Optimization (HBO)

HBO fingerprints SQL jobs, collects execution metrics, and recommends parameter adjustments (memory, parallelism, shuffle strategy, small‑file merging). Recommendations are applied automatically unless overridden by user‑specified settings.
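Two pieces of this flow lend themselves to a short sketch: fingerprinting a SQL job so that daily runs of the same query share a history, and letting user-specified settings win over recommendations. How HBO actually normalizes SQL is not described in the source, so replacing literals with placeholders is an assumption, as are the function names.

```python
import hashlib
import re


def fingerprint(sql):
    """Hash a SQL statement after masking literals and normalizing whitespace.

    Assumption: string and integer literals (such as a date partition)
    are masked so that recurring runs map to the same fingerprint.
    """
    normalized = re.sub(r"'[^']*'|\b\d+\b", "?", sql)
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha1(normalized.encode()).hexdigest()


def effective_conf(recommended, user_conf):
    """Apply recommendations, but never override what the user set."""
    return {**recommended, **user_conf}
```

For instance, a recommendation of `spark.executor.memory=8g` is dropped when the user has already set `spark.executor.memory=4g`, while a recommended `spark.sql.shuffle.partitions` still applies if the user left it unset.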

Smart Data Manager (SDM)

SDM provides asynchronous table format conversion, small‑file merging, data re‑organization (order/Z‑order), statistics collection, Hive index creation, and lineage parsing. Operations are performed with Hive‑style lock management to be transparent to users.

Hive Metastore Optimizations

MetaStore federation was chosen over WaggleDance to enable cross‑data‑center metadata storage with minimal migration impact. Federation adds a StateStore router to direct metadata queries to the appropriate MySQL instance.
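The routing idea can be sketched in a few lines. The source does not say what key the router uses, so routing by database name, the class name `MetastoreRouter`, and the instance names in the usage note are all hypothetical.

```python
class MetastoreRouter:
    """Pick the backing MySQL instance for a metadata request.

    Assumption: requests are routed by database name, with a default
    instance for databases that have not been moved.
    """

    def __init__(self, routes, default_instance):
        self.routes = {db.lower(): inst for db, inst in routes.items()}
        self.default_instance = default_instance

    def resolve(self, database):
        return self.routes.get(database.lower(), self.default_instance)
```

A router built as `MetastoreRouter({"ads": "mysql-dc2"}, "mysql-dc1")` sends requests for the `ads` database to `mysql-dc2` and everything else to `mysql-dc1`, which is what keeps the impact on unmigrated databases minimal.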

Additional improvements include request tracing via CallerContext, traffic control with a TrafficControlListener, and aggressive partition‑count limits to protect HMS memory.

Future Work

Investigate remote shuffle service for better K8s integration.

Adopt vectorized execution to accelerate Spark.

Enhance automated diagnosis systems for improved user experience.

Overall, Bilibili’s offline platform demonstrates a comprehensive migration from Hive to Spark, extensive performance tuning, and robust operational tooling for large‑scale data processing.
