
Design and Implementation of a Unified Data Lake Platform Using HBase, Kafka, and Elasticsearch

This article summarizes the design, architecture, and key modules of “Tianchi”, a company-wide data lake platform built on HBase, Kafka, and Elasticsearch. It covers data ingestion, strategy output, metadata management, index building, platform monitoring, and offline analysis, and closes with lessons learned and future plans.

Big Data Technology Architecture

May 19, 2020


Background Introduction

Over fifteen years, the company accumulated a complex web of business lines and a large number of inter‑service interfaces. The resulting inefficiency and high operational cost prompted a major transformation.

The “Tianchi” system was created to consolidate resources across business lines, replacing a tangled spider‑web of services with a simple, direct data pipeline. This reduces unnecessary calls, meetings, and data‑access latency, which saves product and development time and improves overall efficiency.

Key characteristics of Tianchi: stable, fast, large‑scale, cost‑effective, and clear.

Business Model Overview

After analyzing the company’s various data‑output needs, several common data models were identified:

Key‑Value fast output (simple KV queries, high concurrency, e.g., risk control).

Key‑Map fast output (directional output, e.g., fetching article details by ID).

MultiKey‑Map batch output (e.g., recommendation feed).

C‑List multi‑dimensional query (flexible filters with pagination, e.g., tag‑based product recommendation).

G‑Top ranking output (group‑by ranking, e.g., top‑10 forum posts).

G‑Count statistical analysis (data‑warehouse style analytics).

Multi‑Table mixed output (different tables with different conditions, e.g., mixed content list).

Term tokenized output.

All of these models can be expressed as index + KV, a combination that often outperforms traditional SQL for such access patterns.
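To make the "index + KV" idea concrete, here is a minimal in-memory sketch. It is not the Tianchi implementation: the class `IndexPlusKV` and its method names are hypothetical, a plain dict stands in for the KV store (the role HBase plays), and an inverted index stands in for the search index (the role Elasticsearch plays). It covers only the Key‑Map, MultiKey‑Map, and C‑List models.

```python
from collections import defaultdict


class IndexPlusKV:
    """Toy stand-in: a dict as the KV store, an inverted index for lookups."""

    def __init__(self):
        self.kv = {}                   # rowkey -> full row
        self.index = defaultdict(set)  # (field, value) -> set of rowkeys

    def put(self, rowkey, row, indexed_fields):
        # Write the row, then index the chosen fields.
        # Simplification: updates do not remove stale index entries.
        self.kv[rowkey] = row
        for field in indexed_fields:
            if field in row:
                self.index[(field, row[field])].add(rowkey)

    def get(self, rowkey):
        # Key-Map: fetch one row by key.
        return self.kv.get(rowkey)

    def multi_get(self, rowkeys):
        # MultiKey-Map: batch fetch, silently skipping missing keys.
        return {k: self.kv[k] for k in rowkeys if k in self.kv}

    def c_list(self, filters, offset=0, limit=10):
        # C-List: intersect index entries, paginate, then read rows from KV.
        sets = [self.index.get((f, v), set()) for f, v in filters.items()]
        keys = sorted(set.intersection(*sets)) if sets else sorted(self.kv)
        return [self.kv[k] for k in keys[offset:offset + limit]]


store = IndexPlusKV()
store.put("p1", {"tag": "shoes", "brand": "x"}, ["tag", "brand"])
store.put("p2", {"tag": "shoes", "brand": "y"}, ["tag", "brand"])
store.put("p3", {"tag": "hats", "brand": "x"}, ["tag", "brand"])
page = store.c_list({"tag": "shoes", "brand": "x"})
```

The index answers "which keys match", and the KV store answers "what is stored under this key". The ranking and statistical models (G‑Top, G‑Count) would additionally need aggregation on the index side, which this sketch leaves out.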

Considering long‑term use of Elasticsearch, the final technology stack chosen was HBase + Elasticsearch.

Architecture Design and Module Introduction

The overall architecture consists of six sub‑modules: data ingestion, strategy output, metadata management, index building, platform monitoring, and offline data analysis.

1. Data Ingestion Module

The module wraps the HBase client API and exposes it in two forms: an online RESTful service and an offline SDK package. Both support the native HBase API and bulk load. The RESTful service holds long‑lived HBase connections and provides cross‑language access, permission control, load balancing, failure recovery, dynamic scaling, and monitoring, all of which are backed by Kubernetes.

2. Strategy Output Module

This module implements the business models described earlier. It translates user requests into Elasticsearch DSL, queries ES, and optionally fetches rowkeys to query HBase for final results. Metadata management determines whether fields are covered by indexes or need secondary HBase queries. Users interact via a PolicyID, and the system can cache results when appropriate.
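The request flow described above can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: the policy registry `POLICIES`, the set `INDEXED_FIELDS`, and the functions `build_dsl` and `run_policy` are hypothetical names, the Elasticsearch document `_id` is assumed to equal the HBase rowkey, and the two backends are replaced by stub functions so the example runs on its own. The generated dictionary uses the standard Elasticsearch query DSL (`bool`/`filter`/`term`, `from`, `size`, `_source`).

```python
# Hypothetical metadata: fields the index covers.
INDEXED_FIELDS = {"tag", "title", "views"}

# Hypothetical policy registry keyed by PolicyID.
POLICIES = {
    "titles_by_tag": {"filters": ["tag"], "fields": ["title", "views"]},
    "articles_by_tag": {"filters": ["tag"], "fields": ["title", "body"]},
}


def build_dsl(policy, params, offset=0, limit=10):
    """Translate a policy plus request parameters into an ES query body."""
    unindexed = [f for f in policy["filters"] if f not in INDEXED_FIELDS]
    if unindexed:
        raise ValueError(f"filter fields are not indexed: {unindexed}")
    # If every requested field is indexed, ES alone can answer the request.
    covered = set(policy["fields"]) <= INDEXED_FIELDS
    dsl = {
        "query": {"bool": {"filter": [
            {"term": {f: params[f]}} for f in policy["filters"]]}},
        "from": offset,
        "size": limit,
        # When not covered, skip _source and use the hit ids as rowkeys.
        "_source": policy["fields"] if covered else False,
    }
    return dsl, covered


def run_policy(policy_id, params, search, fetch_rows):
    """search(dsl) -> list of hits; fetch_rows(rowkeys) -> list of rows."""
    policy = POLICIES[policy_id]
    dsl, covered = build_dsl(policy, params)
    hits = search(dsl)
    if covered:
        return [hit["_source"] for hit in hits]
    rows = fetch_rows([hit["_id"] for hit in hits])  # secondary KV lookup
    return [{f: row.get(f) for f in policy["fields"]} for row in rows]


# Stubs standing in for Elasticsearch and HBase.
def fake_search(dsl):
    doc = {"title": "Intro", "views": 7, "tag": "hbase"}
    hit = {"_id": "a1"}
    if dsl["_source"]:
        hit["_source"] = {f: doc[f] for f in dsl["_source"]}
    return [hit]


def fake_fetch(rowkeys):
    table = {"a1": {"title": "Intro", "body": "text", "tag": "hbase"}}
    return [table[k] for k in rowkeys]


covered_result = run_policy("titles_by_tag", {"tag": "hbase"}, fake_search, fake_fetch)
lookup_result = run_policy("articles_by_tag", {"tag": "hbase"}, fake_search, fake_fetch)
```

The first policy is served from the index alone, while the second needs `body`, which is not indexed, so the rowkeys returned by the search drive a second read against the KV store. Result caching is omitted here.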

3. Metadata Management Module

Because HBase is schema‑less, this layer provides a virtual schema for both HBase and Elasticsearch, controlling which fields are indexed and enforcing data‑validation rules during ingestion.
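A virtual schema of this kind could look roughly like the sketch below. The `FieldSpec` class, the example `SCHEMA`, and the functions `validate` and `index_doc` are all hypothetical; the article does not describe the actual metadata format or validation rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    type: type              # expected Python type of the value
    indexed: bool = False   # whether the field is also sent to the index
    required: bool = False  # whether ingestion rejects rows without it


# Hypothetical virtual schema for an "article" table.
SCHEMA = {
    "title": FieldSpec(str, indexed=True, required=True),
    "views": FieldSpec(int, indexed=True),
    "body": FieldSpec(str),
}


def validate(row, schema):
    """Return a list of validation errors; an empty list means the row is valid."""
    errors = []
    for name, spec in schema.items():
        if spec.required and name not in row:
            errors.append(f"missing required field: {name}")
    for name, value in row.items():
        spec = schema.get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, spec.type):
            errors.append(f"{name}: expected {spec.type.__name__}, "
                          f"got {type(value).__name__}")
    return errors


def index_doc(row, schema):
    """Project a row onto the fields the schema marks as indexed."""
    return {k: v for k, v in row.items() if k in schema and schema[k].indexed}


row = {"title": "Intro", "views": 3, "body": "long text"}
errors = validate(row, SCHEMA)
doc_for_index = index_doc(row, SCHEMA)
```

Keeping the `indexed` flag in the same place as the validation rules means one definition drives both what is accepted at ingestion and what the strategy layer can answer from the index alone.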

4. Index Building Module

Instead of HBase + WAL + ES, the design uses HBase + Kafka + ES. After a successful write to HBase, data is asynchronously pushed to a Kafka queue for ES consumption. Kafka consumer threads are dynamically adjustable, and lag monitoring triggers alerts and thread scaling. The system guarantees eventual consistency and handles failures by blacklisting problematic keys in Redis, with automatic retry and a “dead‑letter” queue for unrecoverable cases.
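The failure handling and consumer scaling described above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: a `deque` plays the Kafka queue, a Python `set` plays the Redis blacklist, a list plays the dead-letter queue, and `drain`, `target_threads`, `MAX_RETRIES`, and the per-thread capacity are hypothetical names and values. The reading of the blacklist as "keys whose index entry may be stale" is also an interpretation, not something the article states.

```python
from collections import deque

MAX_RETRIES = 3  # assumed retry budget per message


def drain(queue, index_fn, blacklist, dead_letters, max_retries=MAX_RETRIES):
    """Consume (rowkey, doc, attempts) messages until the queue is empty.

    Returns the number of documents indexed successfully.
    """
    indexed = 0
    while queue:
        rowkey, doc, attempts = queue.popleft()
        try:
            index_fn(rowkey, doc)
        except Exception:
            attempts += 1
            blacklist.add(rowkey)  # mark the key as possibly stale
            if attempts >= max_retries:
                dead_letters.append((rowkey, doc))  # give up, needs review
            else:
                queue.append((rowkey, doc, attempts))  # automatic retry
            continue
        blacklist.discard(rowkey)  # index is consistent again for this key
        indexed += 1
    return indexed


def target_threads(lag, partitions, per_thread_capacity=1000):
    """Pick a consumer thread count from the current lag.

    Capped at the partition count, since extra consumers in a Kafka
    consumer group beyond that would sit idle.
    """
    needed = -(-lag // per_thread_capacity)  # ceiling division
    return max(1, min(partitions, needed))


# Demo: key "b" fails once, key "c" always fails.
calls, es = {}, {}


def flaky_index(rowkey, doc):
    calls[rowkey] = calls.get(rowkey, 0) + 1
    if rowkey == "c" or (rowkey == "b" and calls[rowkey] == 1):
        raise RuntimeError("index failure")
    es[rowkey] = doc


queue = deque([("a", {"v": 1}, 0), ("b", {"v": 2}, 0), ("c", {"v": 3}, 0)])
blacklist, dead_letters = set(), []
indexed = drain(queue, flaky_index, blacklist, dead_letters)
```

After the run, `a` and `b` are indexed, while `c` remains blacklisted and sits in the dead-letter list. A production consumer would also have to commit offsets only after the retry or dead-letter decision is durable, which this sketch does not model.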

5. Platform Monitoring Module

Monitoring covers Hadoop, HBase, and Kubernetes clusters, built on Prometheus, Grafana, and Fluentd.

6. Offline Data Analysis Module

Leveraging HBase replication, data is copied to an offline HBase cluster for warehouse integration, Spark read/write, and large‑scale scans, minimizing impact on the real‑time platform.

Reflections

Elasticsearch and HBase proved to work well together: ES’s indexing makes up for HBase’s lack of native secondary indexes, which makes the combination comparable to SQL for many use cases. The system supports advanced ES features such as term queries, aggregations, and top‑N queries.

Although the stack (HBase + Kafka + Elasticsearch) may seem heavy, it offers a valuable learning opportunity across three mature technologies.

Choosing between Elasticsearch and Solr for secondary indexing depends on the company’s specific context.

Future Work

Full‑link multi‑tenant support.

SQL support at the strategy layer.

Continuous optimization and productization.


