Didi HBase Team’s Upgrade from 0.98 to 1.4.8: Challenges, Solutions, and Lessons Learned
Didi's HBase team upgraded eleven production clusters from version 0.98 to 1.4.8. The project tackled mounting maintenance costs and custom-patch divergence, validated RPC and HFile compatibility, ran extensive functional and performance tests, settled on a rolling upgrade, fixed a region-split data-loss bug, merged critical upstream patches, and produced a reusable migration methodology.
Background: Didi’s HBase service runs 11 clusters (domestic and overseas) with a total throughput of over 1 k ops/s, serving most business lines (maps, finance, ride‑hailing, etc.). The production clusters were still on version 0.98 while the community’s latest release was 2.3, creating a large gap.
Key challenges of upgrading:
High cost of introducing new features: 0.98 is the first stable release and is no longer maintained; back‑porting new features is increasingly difficult.
Maintenance cost of custom patches: dozens of in‑house patches (label grouping, ACL, monitoring, audit logs, etc.) either diverge from upstream or cannot be merged due to version gaps.
Upstream component requirements: downstream engines (Kylin, GeoMesa, OpenTSDB, JanusGraph) all depend on newer HBase features; none support 0.98.
Therefore, an upgrade was deemed urgent.
Technical challenges identified:
RPC interface compatibility: the upgrade must ensure old and new RPC calls work seamlessly.
HFile format compatibility: 1.4.8 writes HFile v3 by default, while 0.98 defaults to v2; fortunately the 0.98 code can already read v3 files, so data written by the new version stays readable and a costly back-port was avoided.
Custom patch compatibility: each custom patch needed verification for replacement or migration.
Upstream engine compatibility: all dependent engines must be validated against the new version.
Potential new issues: active community releases may introduce regressions; continuous monitoring is required.
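The HFile-compatibility point above can be sanity-checked by inspecting store files directly with the stock `hbase hfile` tool. A minimal sketch, with an illustrative file path (not from the article):

```shell
# Hedged sketch: dump an HFile's metadata to confirm its format version.
# The path under /hbase/data is illustrative; point it at a real store file.
hbase hfile -m -f /hbase/data/default/mytable/1588230740/cf/storefile0

# The printed trailer includes the file's format version; v3 files written
# by 1.4.8 should remain readable by the 0.98 code paths, per the article.
```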
Preparation work performed:
Release note review
Migration and testing of custom patches
Basic functionality and performance testing
Advanced feature testing (Bulkload, Snapshot, Replication, Coprocessor, etc.)
Tracking upstream community patches (over 100 merged)
Cross‑version compatibility testing, especially RPC compatibility
Full test suite covering HBase, Phoenix, GeoMesa, OpenTSDB, JanusGraph
Packaging and configuration preparation
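The RPC-compatibility item above can be smoke-tested by driving an old client's shell against the new test cluster. A hedged sketch, with illustrative install paths and table names:

```shell
# Hedged sketch of a cross-version RPC smoke test: a 0.98 client issuing
# basic DDL/DML against a 1.4.8 test cluster. The install path and table
# name are illustrative, not from the article.
/opt/hbase-0.98/bin/hbase shell <<'EOF'
create 'compat_check', 'cf'
put 'compat_check', 'row1', 'cf:q', 'v1'
get 'compat_check', 'row1'
scan 'compat_check'
disable 'compat_check'
drop 'compat_check'
EOF
```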
Upgrade options evaluated:
Option 1, new cluster + data migration. Pros: fast rollback, low risk. Cons: requires a user-side switch, an upgrade window of more than six months, and higher operational cost.
Option 2, rolling upgrade. Pros: transparent to users, short upgrade window, low cost. Cons: higher risk, since any failure must be rolled back in place immediately.
The team chose the rolling upgrade due to confidence from extensive preparation.
Rolling‑upgrade steps:
Resolve compatibility issues (create new rsgroup metadata, rewrite data, mount new coprocessors, etc.).
Upgrade Master nodes.
Upgrade the meta region.
Upgrade business regions one by one.
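With stock HBase tooling, the per-node portion of the steps above might look like the following. Didi's actual automation is not described in the article, so the hostnames and command sequence are a hedged sketch:

```shell
# Hedged sketch of a rolling upgrade using stock HBase scripts.
# Hostnames are illustrative; run after new binaries are staged on disk.

# 1) Restart the masters first (backup master, then the active one).
bin/hbase-daemon.sh stop master
bin/hbase-daemon.sh start master

# 2) For each RegionServer: drain its regions, restart it on the new
#    version, then move its regions back (--reload).
for rs in rs-node-01 rs-node-02 rs-node-03; do
  bin/graceful_stop.sh --restart --reload "$rs"
done
```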
Critical incident encountered:
During a region split, a RegionServer crash caused the master's rollback procedure to delete both the parent and daughter regions, leaving a hole in the table's key space and losing data. The fix was tracked upstream as HBASE‑23693.
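Region holes like the one in this incident can be detected with the hbck tool that ships with HBase 1.x. A hedged sketch of the relevant commands:

```shell
# Hedged sketch: hbck (the 1.x version) reports inconsistencies such as
# region holes left behind by a failed split rollback.
hbase hbck -details

# hbck in 1.x can also attempt repairs; use with great care on production.
hbase hbck -fixHdfsHoles -fixMeta -fixAssignments
```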
Other notable patches merged during the upgrade:
HBASE‑22620 – replication znode backlog fix
HBASE‑21964 – throttle‑type quota removal
HBASE‑24401 – fix append failure when hbase.server.keyvalue.maxsize=0
HBASE‑24184 – snapshot ACL fix for simple ACL
HBASE‑24485 – off‑heap memory init optimization
HBASE‑24501 – remove unnecessary lock in ByteBufferArray
HBASE‑24453 – add validation when moving table groups
Summary: The upgrade, spanning nearly a year from planning to completion, successfully aligned Didi’s HBase clusters with the community release, bringing improved stability, usability, and new features. The experience yielded a systematic upgrade methodology that can be reused for future version migrations, delivering greater value to the business.
Didi Tech
Official Didi technology account