
Kafka-based Real-Time Data Warehouse: Architecture and Practice for Search

The article explains how Kafka serves as the core of a real‑time data warehouse for search, detailing its advantages over traditional databases, integration with Flink for low‑latency stream processing, architectural patterns such as Lambda/Kappa, scaling challenges, and comprehensive monitoring using Kafka Eagle.

vivo Internet Technology

Apache Kafka has matured into a stable message queue and a crucial part of the big-data ecosystem. Its active community continuously contributes code and iterates on the project, making Kafka increasingly feature-rich and reliable, and positioning it as a key building block in enterprise big-data architectures.

This article discusses the practical application of a Kafka‑based real‑time data warehouse for search scenarios.

Why Kafka is needed

Before designing a big-data architecture, teams typically evaluate whether Kafka can meet their requirements. Early data architectures stored simple data types in relational databases (MySQL, Oracle). As the business grew, data types multiplied, requiring big-data clusters and data warehouses for categorized storage, as shown in the first diagram.

Traditional data warehouses have a latency of T+1, which is insufficient for latency‑sensitive services such as IoT, micro‑services, and mobile apps that require real‑time processing.

Kafka’s emergence

Kafka provides a unified storage solution for complex business data and enables data sharding through stream processing. Various data types (video, game, music, etc.) can be stored in Kafka and then routed to downstream systems such as data warehouses or KV stores for real‑time analysis.

Kafka combines the advantages of traditional message queues with the publish/subscribe model, offering scalability, durable storage, and real-time processing. Its efficiency rests on sequential disk I/O, memory-mapped files, and zero-copy transfers, which together make its storage layer both fast and economical.

Simple application scenario

A gaming example illustrates how user purchase events are captured in real time and stored in Kafka, enabling downstream processing and analytics.
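As a sketch of that flow, the snippet below shows how a purchase event might be serialized into the (key, value) byte pair a Kafka producer sends, and how keying by user ID pins each user to one partition so their events stay ordered. The topic name, partition count, and field names are illustrative assumptions, and CRC32 merely stands in for Kafka's default murmur2 partitioner.

```python
import json
import zlib

NUM_PARTITIONS = 12  # assumed partition count for a hypothetical "game_purchases" topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Mimic keyed partitioning: the same user always lands on the same
    partition, preserving per-user ordering. (Kafka's default partitioner
    uses murmur2; crc32 is used here purely for illustration.)"""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def encode_purchase(user_id: str, item: str, price_cents: int) -> tuple[bytes, bytes]:
    """Serialize a purchase event into the (key, value) pair a producer sends."""
    value = json.dumps({"user_id": user_id, "item": item, "price_cents": price_cents})
    return user_id.encode("utf-8"), value.encode("utf-8")

key, value = encode_purchase("u1001", "gold_pack", 499)
```

Downstream consumers can then decode the value and feed it into real-time analytics without ever touching the game's transactional database.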

What problems Kafka solves

Traditional architectures funnel all traffic through a single SQL database, causing bottlenecks. Logs are collected into Hadoop for offline processing, and data is duplicated across caching, search, and reporting systems, leading to complex data synchronization challenges. Kafka decouples producers and consumers, reducing integration complexity from O(N²) to O(N) and simplifying scaling.
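The complexity reduction is easy to quantify: with N sources and M sinks, point-to-point integration needs one link per pair, while a broker needs one link per system. A two-line sketch:

```python
def point_to_point_links(sources: int, sinks: int) -> int:
    # Without a broker, every source integrates with every sink: O(N^2).
    return sources * sinks

def broker_links(sources: int, sinks: int) -> int:
    # With Kafka in the middle, each system integrates once, with the broker: O(N).
    return sources + sinks
```

For example, five producing systems and five consuming systems require 25 point-to-point integrations but only 10 broker connections.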

Real‑time data warehouse practice

The typical real‑time warehouse consists of three modules: message queue (Kafka), computation engine (Flink or Spark), and storage. The architecture integrates Kafka with the BDSP platform for computation and storage.

Stream processing engine selection

Flink and Spark are the two mainstream engines. Flink is chosen for its high throughput, low latency, flexible windowing, lightweight fault tolerance, and unified batch‑stream processing.
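The windowing flexibility can be illustrated with the bucketing rule Flink's tumbling event-time windows apply: align each event's timestamp to the start of its window, then aggregate per (key, window). This is a toy Python simulation of that logic, not Flink code; the one-minute window size and event shapes are assumptions.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows (illustrative)

def window_start(event_time_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Align an event timestamp to the start of its tumbling window —
    the same bucketing TumblingEventTimeWindows performs in Flink."""
    return event_time_ms - (event_time_ms % size_ms)

def aggregate(events):
    """Count events per (key, window): a stand-in for a keyed windowed aggregation."""
    counts = defaultdict(int)
    for key, ts in events:
        counts[(key, window_start(ts))] += 1
    return dict(counts)

events = [("game", 10_000), ("game", 59_999), ("game", 60_000), ("music", 5_000)]
result = aggregate(events)
```

Note how the event at 59,999 ms and the event at 60,000 ms fall into different windows: window boundaries are inclusive of the start and exclusive of the end.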

Challenges in building a real‑time warehouse

Small Kafka clusters with large topics cause high I/O pressure, leading to latency and performance alerts. The solution involves splitting large topics and designing a data distribution flow with Flink.
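The distribution flow amounts to a routing function: each record read from the oversized topic carries a business tag, and a Flink job writes it on to a smaller, business-specific topic. A minimal sketch, with hypothetical tag and topic names:

```python
# Hypothetical mapping from a business tag carried in each record to its own,
# smaller topic; a Flink distribution job would apply this routing before
# producing to the destination topic.
ROUTES = {
    "video": "ods_video_events",
    "game": "ods_game_events",
    "music": "ods_music_events",
}
DEFAULT_TOPIC = "ods_misc_events"

def route(record: dict) -> str:
    """Pick the destination topic for a record from the oversized source topic."""
    return ROUTES.get(record.get("biz_tag"), DEFAULT_TOPIC)
```

Splitting by business tag lets each downstream consumer subscribe only to the traffic it needs, which is what relieves the I/O pressure on the cluster.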

Increasing data volume and consumer tasks further stress the cluster, cause duplicate consumption, and increase data coupling.

Advanced real‑time warehouse designs

Two main architectures are discussed: Lambda (combined batch‑and‑stream) and Kappa (stream‑only). Both are leveraged to achieve high reuse, usability, consistency, and cost‑effective computation.

The layered architecture includes ODS (Kafka topics), DW (Flink processing and enrichment), DIM (dimension storage such as HBase, Redis, MySQL), and DA (aggregated data for KV, BI, OLAP using ClickHouse, HBase, Redis).
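The DW layer's enrichment step is essentially a stream-to-dimension join: each raw ODS event is looked up against a DIM store and widened with dimension attributes. Below, a plain dict stands in for HBase or Redis, and all field names are illustrative assumptions:

```python
# A dict stands in for a DIM store such as HBase, Redis, or MySQL.
DIM_USERS = {
    "u1001": {"region": "CN", "tier": "vip"},
}

def enrich(event: dict, dim: dict = DIM_USERS) -> dict:
    """DW-layer enrichment: join a raw ODS event with its dimension row,
    falling back to defaults when the dimension row is missing."""
    dims = dim.get(event["user_id"], {"region": "unknown", "tier": "free"})
    return {**event, **dims}

out = enrich({"user_id": "u1001", "action": "purchase"})
```

The enriched stream then feeds the DA layer, where aggregates are written to ClickHouse, HBase, or Redis for KV, BI, and OLAP serving.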

Kafka monitoring

Kafka Eagle (now EFAK) is introduced for comprehensive monitoring, offering metrics collection via JMX/API, storage in MySQL/SQLite, dashboards for cluster status, throughput, lag, and KSQL query capabilities, as well as alerting via IM, email, SMS, and phone.

Sample monitoring views include recent 7‑day write volume per topic, KSQL query results, and consumer lag details.
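The lag figure those views surface has a simple definition: per partition, lag is the log-end offset minus the consumer group's committed offset, and the group's total lag is the sum across partitions. A sketch of that arithmetic (offsets here are invented sample values):

```python
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for one partition: messages produced but not yet consumed."""
    return max(log_end_offset - committed_offset, 0)

def group_lag(offsets):
    """Total lag for a consumer group, summed over (log_end, committed)
    pairs — the figure surfaced in lag dashboards such as Kafka Eagle's."""
    return sum(partition_lag(end, committed) for end, committed in offsets)

# Three partitions of one topic: lags of 50, 0, and 20 messages.
lag = group_lag([(1_000, 950), (2_000, 2_000), (500, 480)])
```

A steadily growing total is the usual trigger for the alerting channels mentioned above, since it means consumers are falling behind producers.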

References: Apache Kafka documentation, Kafka Eagle project, and related GitHub repository.

Tags: monitoring, Flink, Streaming, data integration, real-time data warehouse, Spark, Apache Kafka
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
