
BitBase: An HBase‑Based Solution for Billion‑Scale User Feature Analysis at Kuaishou

This article describes how Kuaishou built BitBase on HBase to store and analyze user feature logs at the hundred‑billion scale with second‑level query latency, covering business requirements, technical selection, bitmap data modeling, system architecture, device‑ID handling, performance results, and the future roadmap.

DataFunTalk

Kuaishou has been using HBase for about two years in various scenarios such as short‑video storage, IM, and live‑stream comment feeds. This talk focuses on one specific use case: applying HBase to analyze and serve user feature data at the hundred‑billion level.

Business Requirements and Challenges

The goal is to compute retention metrics (7‑ to 90‑day) across any combination of dimensions (city, gender, interests, etc.) on logs at the hundred‑billion scale, with a response time of 1‑2 seconds for analysts.

Massive log volume (hundreds of billions)

Arbitrary multi‑dimensional queries

Second‑level latency requirements (1‑2 s)

Technical Selection

Three alternatives were evaluated:

Hive – easy SQL but minute‑level latency

Elasticsearch – good for inverted indexes but slower for exact deduplication

ClickHouse – fast for analytics but still >10 s on small clusters

Because none satisfied the latency and flexibility needs, a custom solution named BitBase was designed on top of HBase.

BitBase Solution

Data Model

Raw data values are abstracted into bitmaps (e.g., city = "bj" becomes 10100, where bit i indicates whether user i matches). Multi‑dimensional queries are thereby reduced to bitmap logical operations (AND, OR, XOR) followed by a count of set bits, which yields the number of users matching the criteria.
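The model above can be sketched with plain Python integers standing in for bitmaps. All names and bit patterns here are illustrative, not taken from BitBase itself; the point is that both dimension filtering and retention reduce to AND plus a popcount.

```python
# Bit i of a dimension bitmap is 1 iff the user at index i has that value.
city_bj   = 0b10100   # users 2 and 4 are in "bj"
gender_f  = 0b10110   # users 1, 2 and 4 are female
active_d0 = 0b11111   # users active on day 0
active_d7 = 0b10010   # users active on day 7

# Multi-dimensional filter: AND the dimension bitmaps, then count set bits.
matches = city_bj & gender_f
print(bin(matches), bin(matches).count("1"))   # users 2 and 4 -> count 2

# 7-day retention for that segment: intersect with the activity bitmaps.
retained = matches & active_d0 & active_d7
retention = bin(retained).count("1") / bin(matches & active_d0).count("1")
print(retention)   # 0.5: one of the two matching users returned on day 7
```

Because every query is a chain of bitwise operations, adding another dimension only adds one more AND, which is why latency stays flat as dimensions grow.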

Architecture

The system consists of five components:

Data storage – bitmap indexes and dictionary archives

Data conversion – batch (MRJob) or online ingestion

Computation – scheduling and execution, returning results to the client

Client – business‑level APIs

Zookeeper – distributed coordination

Storage Module

Bitmaps are split into meta information (identifying db, table, event, entity, version) and data blocks (the actual bit arrays). Three HBase tables store BitmapMeta, BlockData, and BlockMeta.
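The meta/block split can be sketched as follows. The table names (BitmapMeta, BlockData) follow the article, but the block size, row layout, and helper names are assumptions made for illustration; only non‑empty blocks are written, so sparse bitmaps stay cheap to store and scan.

```python
BLOCK_BITS = 8 * 64 * 1024  # 64 KiB of bits per block (illustrative choice)

def split_into_blocks(bitmap: int, block_bits: int = BLOCK_BITS):
    """Yield (block_index, block_value) for every non-empty block."""
    mask = (1 << block_bits) - 1
    index = 0
    while bitmap:
        block = bitmap & mask
        if block:
            yield index, block   # becomes one BlockData row
        bitmap >>= block_bits
        index += 1

def make_meta(db, table, event, entity, version, blocks):
    # A BitmapMeta record identifies the bitmap and lists which blocks
    # exist, so a query can skip blocks that hold no matching users.
    return {"db": db, "table": table, "event": event,
            "entity": entity, "version": version,
            "blocks": sorted(i for i, _ in blocks)}
```

Keeping the block list in the meta record is what lets the computation layer prune empty blocks before ever touching HBase.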

Computation Module

The workflow involves BitBase Client → BitBase Server → HBase RegionServer. The server parses the bitmap meta, splits the expression into sub‑expressions, routes them (local coprocessor or remote servers), aggregates results, and returns them. Local computation is 3‑5× faster than non‑local.
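The split‑and‑aggregate step can be sketched in a few lines. The expression format below is invented for illustration (the article does not show BitBase's planner): because blocks are independent, each per‑block sub‑expression can run on a different server (or in a local coprocessor), and the final count is simply the sum of per‑block popcounts.

```python
from functools import reduce

def eval_block(op: str, operands: list[int]) -> int:
    """Evaluate one sub-expression over the bitmaps of a single block."""
    if op == "AND":
        return reduce(lambda a, b: a & b, operands)
    if op == "OR":
        return reduce(lambda a, b: a | b, operands)
    if op == "XOR":
        return reduce(lambda a, b: a ^ b, operands)
    raise ValueError(f"unknown operator: {op}")

def evaluate(op: str, bitmaps_by_block: dict[int, list[int]]) -> int:
    """Aggregate per-block results into the final user count."""
    total = 0
    for block_index, operands in bitmaps_by_block.items():
        result = eval_block(op, operands)     # local or remote in reality
        total += bin(result).count("1")
    return total

print(evaluate("AND", {0: [0b1010, 0b1110], 1: [0b1, 0b1]}))  # 2 + 1 = 3
```

The independence of blocks is also why local (coprocessor) execution helps: each sub‑expression can run next to the region holding its blocks, avoiding a network round trip per block.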

DeviceId Problem and Solution

To support DeviceId, a three‑table mapping (meta, index→DeviceId, DeviceId→index) is built using a two‑phase commit in HBase, ensuring continuity, consistency, reversibility, and fast conversion. Archiving and MRJob‑based joins accelerate bulk conversion.
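The mapping's invariants can be illustrated with an in‑memory model, where dicts stand in for the three HBase tables. This is a sketch of the invariants only, not the real two‑phase commit protocol: the ordering of the two writes is an assumption about how orphaned rows would be detected on recovery.

```python
class DeviceIdMapper:
    """Toy model of the index <-> DeviceId mapping tables."""

    def __init__(self):
        self.next_index = 0        # "meta" table: next free index (continuity)
        self.index_to_id = {}      # index -> DeviceId table
        self.id_to_index = {}      # DeviceId -> index table (reversibility)

    def get_or_assign(self, device_id: str) -> int:
        if device_id in self.id_to_index:       # fast conversion path
            return self.id_to_index[device_id]
        index = self.next_index
        # Phase 1: write the reverse mapping first; a crash here leaves an
        # orphan row that recovery can detect and re-use.
        self.index_to_id[index] = device_id
        # Phase 2: commit the forward mapping and advance the counter.
        self.id_to_index[device_id] = index
        self.next_index += 1
        return index

m = DeviceIdMapper()
assert m.get_or_assign("dev-a") == 0
assert m.get_or_assign("dev-b") == 1
assert m.get_or_assign("dev-a") == 0       # idempotent: same index back
assert m.index_to_id[1] == "dev-b"         # reversible lookup
```

Continuity of the assigned indexes matters because the index doubles as the bit position in every bitmap, so gaps would waste bits in every block.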

Business Effect

Benchmarks show that latency does not increase with the number of dimensions because irrelevant bitmap blocks are skipped. BitBase delivers second‑level responses for multi‑dimensional retention analysis over hundred‑billion‑scale logs.

Future Plans

Upcoming work includes real‑time aggregation (<5 min latency), SQL‑style query support, and open‑sourcing the project to foster community contributions.

Tags: Big Data, HBase, bitmap index, scalable storage, user analytics, BitBase
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
