
Practical Experience with HBase at NetEase: Architecture, Use Cases, and Troubleshooting

This article presents NetEase's extensive use of HBase within its big‑data platform, covering the system’s role, real‑world application scenarios, common RIT issues, HBCK repair methods, and a systematic approach to monitoring and troubleshooting performance problems.

DataFunTalk

HBase is a core component of the Hadoop ecosystem, serving as a high‑performance key‑value store that supports random, low‑latency reads and writes as well as scans, from small slices up to large ranges of data. In the big‑data stack it is positioned as the primary online storage layer, complementing the offline stack built on HDFS, Hive, and Spark.

NetEase operates a massive HBase deployment with over 300 physical machines and more than 3 PB of data, supporting services such as NetEase Kaola, Cloud Music, the News client, and various internal cloud and data‑processing platforms. Data from relational sources (MySQL), log systems, app/web clickstreams, and sensor feeds is ingested via Sqoop, DataStream, Kafka, Spark Streaming, or Flink into storage layers that include offline (HDFS‑based), online (HBase/Phoenix), and time‑series stores (OpenTSDB, Druid, InfluxDB).

Key online use cases include:

Real‑time recommendation for news articles, where feature models are trained offline, bulk‑loaded into HBase, and served instantly to users.

Internal sentinel monitoring of tens of thousands of servers via OpenTSDB, which stores and aggregates its time‑series data on top of HBase.

E‑commerce order history, messaging history, push notifications, low‑latency dashboards, and various audit logs.
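Use cases like order and messaging history usually hinge on careful rowkey design rather than any special HBase feature. The sketch below shows the common "inverted timestamp" pattern in plain Python; the key layout and field names are illustrative assumptions, not NetEase's actual schema.

```python
# Sketch of a newest-first rowkey for per-user order history.
# The layout (user id + "#" + inverted epoch millis) is a common HBase
# pattern, assumed here for illustration -- not NetEase's actual schema.

MAX_TS = 10**13  # any bound larger than current epoch milliseconds

def order_rowkey(user_id: str, ts_millis: int) -> bytes:
    # Inverting the timestamp makes lexicographic order equal newest-first,
    # so "latest N orders" becomes a short forward scan from the user prefix.
    inverted = MAX_TS - ts_millis
    return f"{user_id}#{inverted:013d}".encode()

# HBase stores rows in lexicographic key order; sorting simulates that here.
keys = sorted(
    order_rowkey("u42", ts)
    for ts in (1_700_000_000_000, 1_700_000_100_000, 1_700_000_200_000)
)
```

Because all inverted timestamps are zero-padded to the same width, lexicographic order matches numeric order, and the most recent order sorts first under the user's prefix.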

When HBase encounters region‑in‑transition (RIT) problems, the hbck tool is used for inspection and repair. HBCK performs consistency checks (every region is assigned to exactly one region server, and that assignment agrees with hbase:meta and the files on HDFS) and table‑integrity checks (every possible row key falls into exactly one region, with no holes or overlaps). Common repair commands include:

./bin/hbase hbck                      # report inconsistencies across all tables

./bin/hbase hbck -details             # verbose report with per-region detail

./bin/hbase hbck TableFoo TableBar    # restrict the check to specific tables

./bin/hbase hbck -fixAssignments      # repair unassigned or multiply assigned regions

./bin/hbase hbck -fixMeta             # reconcile hbase:meta with the regions on HDFS

Low‑risk fixes (≈80 % of cases) involve -fixAssignments and -fixMeta to correct missing or duplicate region assignments. High‑risk fixes, such as overlapping region repairs, may require manual HDFS file edits and should be performed with caution.
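The low‑risk/high‑risk split above can be sketched as a small triage helper. The message patterns below are condensed paraphrases of typical hbck report lines, assumed for illustration; real output should be matched against the hbck messages of your specific HBase version.

```python
# Hypothetical triage of hbck report lines into low-risk fixes
# (assignment/meta repairs) versus high-risk ones (overlaps or holes
# that may need manual HDFS surgery). The substrings matched here are
# assumptions, not exact hbck output.

LOW_RISK = ("not deployed on any region server", "not listed in hbase:meta")
HIGH_RISK = ("overlap", "hole in the region chain")

def triage(report_lines):
    # Bucket each reported inconsistency by the repair risk it implies.
    low, high = [], []
    for line in report_lines:
        text = line.lower()
        if any(p in text for p in HIGH_RISK):
            high.append(line)
        elif any(p in text for p in LOW_RISK):
            low.append(line)
    return low, high

sample = [
    "ERROR: Region foo,,123.abc. not deployed on any region server",
    "ERROR: There is an overlap in the region chain for table foo",
]
low, high = triage(sample)
```

Low‑risk findings map to `-fixAssignments`/`-fixMeta`; anything in the high‑risk bucket warrants a human review before any fix command is run.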

General troubleshooting follows a layered approach: first examine monitoring metrics (CPU, I/O, network, GC, compaction, queue lengths), then dive into master and region‑server logs for DDL, balance, snapshot, and read/write operations. If the issue remains unresolved, seek community or internal assistance.

Finally, the article emphasizes the importance of a complete post‑mortem: after resolving a problem, review monitoring data, logs, and source code to understand root causes and prevent recurrence.

Tags: big data, Database, HBase, troubleshooting, HBCK, RIT
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
