HBase Bulkload Practice at Youzan: From MapReduce to Spark Evolution

Youzan's evolution of HBase bulkload, from manual MapReduce jobs to Hive SQL and finally Spark, shows how generating HFiles on HDFS, partitioning by region, sorting keys, and handling serialization issues let billions of records be loaded efficiently without disrupting production clusters.

Youzan Coder

HBase is a column-oriented, schemaless, high-throughput, highly reliable NoSQL database that supports horizontal scaling. While HBase excels at real-time reading and writing of massive data, it lacks native secondary indexes and has limited support for complex query scenarios. To address batch data import needs, Youzan developed Bulkload technology to efficiently load billions of records into HBase without impacting production cluster stability.

Bulkload works by generating HFile-format files directly on HDFS, then moving them into the appropriate locations for rapid loading. The overall process has three steps: (1) Extract: import heterogeneous data sources into HDFS; (2) Transform: convert the data to HFile format using MapReduce or Spark; (3) Load: use LoadIncrementalHFiles to place the HFiles into the corresponding Region directories on HDFS.
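The three phases can be mimicked end to end in a toy Python sketch. Everything here is illustrative, not Youzan's API: lists and dicts stand in for HDFS, HFiles, and region directories.

```python
import bisect

# Toy end-to-end bulkload: extract rows, transform them into sorted
# per-region "HFiles", then load by adopting each file into its region.
def extract(source):
    # Phase 1: pull heterogeneous rows onto "HDFS" (here, a list).
    return list(source)

def transform(rows, region_starts):
    # Phase 2: route each (rowkey, value) to its region and sort the
    # keys inside every file, as the HFile format requires.
    files = {i: [] for i in range(len(region_starts))}
    for key, value in rows:
        region = bisect.bisect_right(region_starts, key) - 1
        files[region].append((key, value))
    return {r: sorted(kvs) for r, kvs in files.items()}

def load(files, table):
    # Phase 3: the LoadIncrementalHFiles step -- move each finished
    # file into the matching region of the live table.
    for region, kvs in files.items():
        table.setdefault(region, []).extend(kvs)
    return table

# Two regions split at b"m"; the first region's start key is empty.
table = load(transform(extract([(b"z", b"1"), (b"a", b"2")]),
                       [b"", b"m"]), {})
print(table)  # -> {0: [(b'a', b'2')], 1: [(b'z', b'1')]}
```

The point of the sketch is that the expensive work (routing and sorting) happens offline in Transform, so Load reduces to a cheap file adoption that barely touches the serving cluster.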

Youzan evolved through three Bulkload implementations: a MapReduce-based approach that required hand-written Mapper and Reducer code; a Hive SQL approach that enabled direct Hive-to-HBase export but involved complex preprocessing; and Spark Bulkload, which offers faster execution, SQL-based data filtering, and greater flexibility. The Spark implementation requires careful partitioning based on the HBase table's regions and proper sorting of rowkeys, column families, and qualifiers.
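The region-aware partitioning step can be illustrated outside Spark. This is a hedged Python sketch, not Youzan's RegionPartitioner: given the sorted start keys of a table's regions, each rowkey is routed to the partition of the region whose key range contains it, which is exactly the contract a custom Spark partitioner must satisfy so that each output partition maps to one region.

```python
import bisect

def partition_for(rowkey: bytes, region_start_keys: list[bytes]) -> int:
    """Return the index of the region whose [start, next_start) range
    holds rowkey. region_start_keys must be sorted; the first region's
    start key is the empty byte string, as in HBase."""
    # bisect_right finds the first start key strictly greater than
    # rowkey; the owning region is the one just before it.
    return bisect.bisect_right(region_start_keys, rowkey) - 1

# Three regions: (-inf, b"g"), [b"g", b"p"), [b"p", +inf)
starts = [b"", b"g", b"p"]
print(partition_for(b"apple", starts))   # -> 0
print(partition_for(b"grape", starts))   # -> 1
print(partition_for(b"zebra", starts))   # -> 2
```

With this mapping in place, a repartition-and-sort over the rowkeys yields one sorted partition per region, ready for HFile generation.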

Common issues encountered include: "Added a key not lexically larger than previous" exceptions caused by out-of-order keys; ImmutableBytesWritable serialization errors, resolved by configuring KryoSerializer; comparators that must implement the Serializable interface; driver-side object access issues that require broadcast variables; and HBase jar version conflicts, solved by pinning specific jar versions in the Spark launch command.
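The "not lexically larger" exception arises because the HFile writer demands cells in strict (rowkey, family, qualifier) byte order. A minimal Python sketch of that ordering follows; the real job enforces it with a byte-level comparator such as the article's KeyQualifierComparator, but the sort key is the same idea.

```python
# Cells as (rowkey, family, qualifier, value) byte tuples.
cells = [
    (b"row2", b"cf", b"name", b"bob"),
    (b"row1", b"cf", b"name", b"alice"),
    (b"row1", b"cf", b"age",  b"30"),
]

# Sorting on the (rowkey, family, qualifier) prefix gives the strict
# byte order HFile writers demand; emitting cells in any other order
# triggers "Added a key not lexically larger than previous".
cells.sort(key=lambda c: (c[0], c[1], c[2]))
print([c[:3] for c in cells])
# -> [(b'row1', b'cf', b'age'), (b'row1', b'cf', b'name'),
#     (b'row2', b'cf', b'name')]
```

Note that within one row, qualifiers are compared as raw bytes (b"age" sorts before b"name"), so sorting only on rowkey is not enough when a row carries multiple columns.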

The article provides detailed code examples including RegionPartitioner implementation for mapping rowkeys to appropriate partitions, KeyQualifierComparator for multi-level sorting, Spark job configuration with HFileOutputFormat2, and the complete data transformation pipeline from Spark Dataset to HFile generation.

Tags: distributed systems, Big Data, HBase, NoSQL, Spark, Hadoop, Bulkload, data-import
Written by Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
