Design and Multi‑Tenant Management of HBase at Didi

This article details Didi's use of HBase for various online and offline workloads, covering multi‑language support, data types, rowkey designs for order, trajectory and ETA scenarios, multi‑tenant resource management with DHS and RS Group, and operational best practices.

Big Data Technology & Architecture

May 7, 2019

HBase, built on the Hadoop ecosystem, serves both offline batch processing and online low‑latency services at Didi, handling diverse data types such as statistical reports, raw fact data, intermediate model training data, and backup copies.

To accommodate Didi's heterogeneous development stacks, HBase offers multiple language interfaces including Java native API, Thrift Server (C++, PHP, Python), Phoenix JDBC, Phoenix QueryServer, MapReduce, Spark, and Streaming.

Four primary data categories are stored in HBase: statistical/report data (small volume, flexible SQL queries via Phoenix), raw fact data (large volume, high consistency and availability, real‑time writes), intermediate results for model training (large volume, high throughput), and backup data for disaster recovery.

Use‑case 1: Order Events

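Recent orders are cached in Redis; queries for older orders, or queries issued while Redis is unavailable, fall back to HBase. The order status table uses the rowkey reverse(order_id)+(MAX_LONG-TS), and the order history table uses reverse(passenger_id|driver_id)+(MAX_LONG-TS). Reversing the ID spreads sequentially generated IDs across regions, and subtracting the timestamp from MAX_LONG makes the newest record sort first.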
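The order status rowkey can be sketched as follows. This is a minimal illustration, not Didi's implementation: the function name, the decimal-string encoding of the ID, and the 8-byte big-endian timestamp suffix are all assumptions, and MAX_LONG is taken to be the largest signed 64-bit value (Long.MAX_VALUE in Java).

```python
import struct

MAX_LONG = 2**63 - 1  # assumed to match Java's Long.MAX_VALUE


def order_status_rowkey(order_id: int, ts_millis: int) -> bytes:
    """Build reverse(order_id) + (MAX_LONG - TS) as a byte string (hypothetical helper)."""
    # Reversing the decimal digits puts the fast-changing low-order digits
    # first, so consecutive IDs do not all land in the same key range.
    reversed_id = str(order_id)[::-1].encode("ascii")
    # Big-endian unsigned encoding keeps byte order equal to numeric order,
    # so a later timestamp yields a smaller suffix and sorts earlier.
    inverted_ts = struct.pack(">Q", MAX_LONG - ts_millis)
    return reversed_id + inverted_ts


older = order_status_rowkey(1234567, 1_557_200_000_000)
newer = order_status_rowkey(1234567, 1_557_200_060_000)
```

With keys built this way, `newer` compares lower than `older`, so a forward scan over one order's prefix returns its latest status first. The sketch assumes IDs of equal length; variable-length IDs would need padding or a separator to avoid prefix overlap.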

Use‑case 2: Driver‑Passenger Trajectory

Real-time and batch trajectory queries demand high-throughput storage. GeoHash encodes a latitude/longitude pair into a short string, so points that fall in the same cell share the same code. Two rowkey designs follow from this: reverse(user_id)+(MAX_LONG-TS/1000) for per-user queries, and reverse(geohash)+ts/1000+user_id for region queries. Because GeoHash cells are rectangular, a circular-area query scans the covering cells and then applies a secondary distance filter.
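The sketch below illustrates the three pieces mentioned here: the standard GeoHash encoding, the two rowkey layouts, and a distance filter for circular areas. The function names, zero-padding widths, and the use of the haversine formula for the secondary filter are assumptions made for illustration; the source does not say how Didi computes the distance or serializes the key fields.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet
MAX_LONG = 2**63 - 1  # assumed to match Java's Long.MAX_VALUE


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a point by interleaving longitude and latitude bisection bits."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    chars, bits, value, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = value * 2 + int(bit)
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def user_track_rowkey(user_id: int, ts_millis: int) -> str:
    # reverse(user_id) + (MAX_LONG - TS/1000); zero-padded so that
    # string order matches numeric order (padding width is an assumption).
    return f"{str(user_id)[::-1]}{MAX_LONG - ts_millis // 1000:019d}"


def region_track_rowkey(geohash: str, ts_millis: int, user_id: int) -> str:
    # reverse(geohash) + ts/1000 + user_id
    return f"{geohash[::-1]}{ts_millis // 1000:010d}{user_id}"


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres, using a mean Earth radius."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def within_circle(points, center, radius_m):
    """Secondary filter: keep only (lat, lon) points inside the circle."""
    return [p for p in points if haversine_m(center[0], center[1], p[0], p[1]) <= radius_m]
```

A region query under these assumptions would compute the GeoHash cells covering the circle at a fixed precision, scan each cell's reversed code as a rowkey prefix over the wanted time range, and pass the returned points through `within_circle`.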

Use‑case 3: ETA (Estimated Time of Arrival)

ETA calculations moved from offline to real-time by using HBase as a key-value cache. Model training runs on Spark every 30 minutes, reading per-city data from HBase and writing the results to HDFS.

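Rowkey format: salting+city_id+type0+type1+type2+TS; columns store order and feature data. The salt prefix spreads writes across regions, while the city and type fields keep each city's records contiguous within a salt bucket.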
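A salted key of this shape might be built as in the sketch below. Everything beyond the field order is an assumption: the bucket count, the CRC32 hash, the `|` separator, the field widths, and the reading of the second field as a city ID are illustrative choices, not details given in the source.

```python
import zlib

SALT_BUCKETS = 16  # hypothetical bucket count


def eta_rowkey(city_id: int, type0: str, type1: str, type2: str,
               ts_seconds: int, buckets: int = SALT_BUCKETS) -> str:
    """Build salt + city + type0 + type1 + type2 + TS (hypothetical helper)."""
    logical = f"{city_id:04d}|{type0}|{type1}|{type2}"
    # The salt is derived from the non-time fields only, so every timestamp
    # for the same city/type combination stays in one bucket and can be
    # read with a single prefix scan.
    salt = zlib.crc32(logical.encode("utf-8")) % buckets
    return f"{salt:02d}|{logical}|{ts_seconds:010d}"
```

Hashing only the city and type fields is a deliberate trade-off in this sketch: it spreads different cities across buckets but does not spread a single hot city. Including the timestamp in the hash would spread that load, at the cost of one scan per bucket on reads.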

Use‑case 4: Monitoring Tool DCM

DCM monitors Hadoop cluster resources, ingesting metrics via Phoenix into HBase to provide near‑real‑time dashboards for HDFS usage, file counts, and MapReduce job statistics.

Multi‑Tenant Management in Didi HBase

Didi developed the Didi HBase Service (DHS) to manage project lifecycles, user permissions, cluster resources, and table‑level monitoring, leveraging HBase namespaces and RS Group isolation.

Projects are evaluated for resource needs. Workloads with relaxed latency requirements and low traffic share a common resource pool, while high-SLA online services receive dedicated RS Groups provisioned with 20-30% resource headroom.
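As a rough illustration of the headroom figure, the sketch below sizes a dedicated group so that a chosen fraction of its capacity stays free at peak load. The inputs (peak requests per second, per-server capacity) and the reading of "headroom" as a fraction of total capacity kept idle are assumptions; the source does not describe the sizing method.

```python
import math


def servers_for_group(peak_qps: float, qps_per_server: float,
                      headroom: float = 0.25) -> int:
    """Number of RegionServers needed so `headroom` of capacity stays unused."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_per_server = qps_per_server * (1 - headroom)
    return math.ceil(peak_qps / usable_per_server)


# Hypothetical numbers: 80,000 QPS at peak, 10,000 QPS per server.
needed = servers_for_group(80_000, 10_000, headroom=0.25)
```

With these made-up numbers, eight servers would cover the peak with nothing to spare, and keeping a quarter of capacity free raises the count to eleven.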

RS Group

RS Group partitions a cluster into logical sub‑clusters, assigning specific RegionServers to groups, preventing cross‑group region migration and providing resource isolation without deploying separate physical clusters.
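The isolation rule can be pictured with a toy model. This is not the HBase API; the class and method names are invented for illustration, and the round-robin placement stands in for the real balancer. The one detail borrowed from HBase is that servers and tables not moved elsewhere belong to a group named `default`.

```python
class RSGroupModel:
    """Toy model: regions of a table may only be placed on servers of its group."""

    def __init__(self):
        self.server_group = {}  # server name -> group name
        self.table_group = {}   # table name -> group name

    def move_server(self, server: str, group: str = "default") -> None:
        self.server_group[server] = group

    def move_table(self, table: str, group: str) -> None:
        self.table_group[table] = group

    def candidate_servers(self, table: str) -> list:
        group = self.table_group.get(table, "default")
        return sorted(s for s, g in self.server_group.items() if g == group)

    def assign(self, table: str, regions: list) -> dict:
        servers = self.candidate_servers(table)
        if not servers:
            raise RuntimeError(f"no servers in the group of table {table!r}")
        # Round-robin inside the group; servers of other groups are never used.
        return {r: servers[i % len(servers)] for i, r in enumerate(regions)}


model = RSGroupModel()
for name in ("rs1", "rs2", "rs3"):
    model.move_server(name)                # stays in the shared default group
for name in ("rs4", "rs5"):
    model.move_server(name, "order_group")  # dedicated group (hypothetical name)
model.move_table("order_status", "order_group")
placement = model.assign("order_status", ["r0", "r1", "r2"])
```

In this model a traffic spike on a table in the default group can only load rs1 to rs3, so the dedicated group's servers are unaffected, which is the isolation property described above.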

Overall, effective table design and resource control are critical for scaling HBase across Didi's diverse services, reducing operational overhead and improving user experience.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
