
Feature Platform Architecture and Stream‑Batch Integrated Solutions

This talk presents Shuhe Technology’s feature platform, detailing its four‑layer architecture, feature storage services, stream‑batch integrated processing, event‑center design, consistency models, and four model‑strategy invocation schemes, illustrating data flows from MySQL through Sqoop, Kafka, Flink, HBase and ClickHouse.

DataFunTalk

Speaker Yang Hanbing, a data development expert at Shuhe Technology, introduced the company's feature platform.

Feature Platform Overview – The platform consists of four layers: data service, storage service, compute engine, and raw storage. The data service layer provides API point queries, range queries, event messages, and synchronous compute. The storage layer's feature row store and column store serve point queries and range queries respectively. The compute layer includes an offline engine and a stream‑batch integrated engine. Raw storage backs offline processing and provides event storage for the stream‑batch path.
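The four layers and the capabilities attributed to each can be summarized in a small lookup table. This is purely illustrative; the layer and capability names below are assumptions, not the platform's actual API:

```python
# Illustrative mapping of the four platform layers to the capabilities the
# talk attributes to each. All names are assumptions for illustration.
PLATFORM_LAYERS = {
    "data_service": ["api_point_query", "range_query",
                     "event_messages", "synchronous_compute"],
    "storage_service": ["feature_row_store", "feature_column_store"],
    "compute_engine": ["offline_engine", "stream_batch_engine"],
    "raw_storage": ["offline_storage", "event_storage"],
}

def layers_supporting(capability: str) -> list[str]:
    """Return which layers expose a given capability."""
    return [layer for layer, caps in PLATFORM_LAYERS.items()
            if capability in caps]
```

For example, `layers_supporting("range_query")` resolves to the data service layer, which fronts the column store that actually serves the query.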

Data flow example using MySQL: offline data is extracted with Sqoop to EMR, processed with Hive, and written to HBase (row store) and ClickHouse (column store); in parallel, the MySQL binlog streams to Kafka, where Flink consumes it for stream‑batch processing and also writes it to the HBase event store.
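The two paths can be sketched as a routing decision on each MySQL-originated record. The function and field names here are hypothetical stand-ins for the actual ingestion components:

```python
# Hypothetical routing of a MySQL-originated record, mirroring the two paths
# described above: offline (Sqoop -> EMR/Hive -> HBase + ClickHouse) and
# streaming (binlog -> Kafka -> Flink -> event-store HBase).
def route(record: dict) -> dict:
    if record.get("source") == "binlog":
        # Change-data-capture events flow through Kafka to Flink, which both
        # runs stream-batch processing and persists to event storage.
        return {"transport": "kafka", "consumer": "flink",
                "sinks": ["event_store_hbase"]}
    # Full-table extracts are batch-loaded to EMR and processed with Hive.
    return {"transport": "sqoop", "consumer": "hive_on_emr",
            "sinks": ["hbase_row_store", "clickhouse_column_store"]}
```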

Feature Storage Service – Features are classified as synchronous, real‑time, offline, or instant‑compute, each with its own write and correction mechanism. In every case, offline pipelines are needed to correct errors and ensure consistency.
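A minimal sketch of that classification, assuming write paths that match each class's name (the talk names the four classes and the shared offline correction path; the `written_by` values are illustrative guesses):

```python
# Illustrative summary of the four feature classes. The offline pipeline is
# the common correction mechanism for all of them; write paths are assumed.
FEATURE_CLASSES = {
    "synchronous":     {"written_by": "api_call",         "corrected_by": "offline_pipeline"},
    "real_time":       {"written_by": "flink_stream",     "corrected_by": "offline_pipeline"},
    "offline":         {"written_by": "offline_pipeline", "corrected_by": "offline_pipeline"},
    "instant_compute": {"written_by": "computed_on_read", "corrected_by": "offline_pipeline"},
}

def correction_path(feature_class: str) -> str:
    """Return the correction mechanism for a feature class."""
    return FEATURE_CLASSES[feature_class]["corrected_by"]
```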

Four data flow patterns are illustrated: real‑time feature flow, offline feature flow, synchronous feature flow, and instant‑compute feature flow, each showing how MySQL, Kafka, Flink, HBase and ClickHouse interact.

Stream‑Batch Integrated Solution – Addresses challenges such as missing data, complex offline logic, and the need for unified batch and point‑query capabilities. Proposed solutions include unified data storage, consistent logic via stream‑first processing, and self‑service model deployment.

The solution combines Lambda and Kappa architectures: the first half follows Lambda with an HBase event store, the second half follows Kappa for unified stream‑batch processing.

Event Center – Implements a Lambda‑style event store for all change data, supports hot and cold storage, watermark mechanisms, and asynchronous‑to‑synchronous conversion. Data from MySQL, Kafka, and other sources are forwarded to Kafka, consumed by Flink, and written to hot HBase; offline corrections are also written via EMR.
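The hot/cold split implies a layered read path: try the hot store first, then fall back to cold storage. A minimal sketch with in-memory dicts standing in for hot HBase and the cold store:

```python
# Hypothetical read path for the event center's hot/cold split: recent
# events live in "hot" HBase, older ones in cold storage.
def read_event(key: str, hot: dict, cold: dict):
    if key in hot:
        return hot[key]        # recent event, served from hot storage
    return cold.get(key)       # fall back to cold storage; None if absent
```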

Consistency levels are defined as final consistency, trigger‑stream strong consistency (with possible delay), query‑strong consistency (with possible delay), and query‑strong consistency without delay, each with corresponding watermark or source‑pull strategies.
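The pairing of consistency levels with wait strategies can be written as a dispatch table. The level and strategy identifiers are illustrative labels for the four levels named above:

```python
# Illustrative mapping from the four consistency levels to the wait strategy
# paired with each (watermark-based waiting vs. pulling the source directly).
CONSISTENCY_STRATEGY = {
    "final": "no_wait",                         # eventual consistency, read as-is
    "trigger_stream_strong": "watermark_wait",  # wait on the trigger stream's watermark
    "query_strong_delayed": "watermark_wait",   # wait on the watermark at query time
    "query_strong_no_delay": "source_pull",     # bypass the store, pull from the source
}

def strategy_for(level: str) -> str:
    """Return the wait strategy for a given consistency level."""
    return CONSISTENCY_STRATEGY[level]
```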

Stream‑Batch Jobs – Implemented with PyFlink so that developers familiar with Python can write the trigger, main‑logic, and output components, reusing the platform's encapsulated code for triggering, querying, and output.
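The trigger / main-logic / output decomposition can be sketched in plain Python. The real jobs run on PyFlink; the callables below are stand-ins for the platform's encapsulated components, with only the main logic written per job:

```python
from typing import Callable, Iterable

def run_job(trigger: Callable[[], Iterable[dict]],
            main_logic: Callable[[dict], dict],
            output: Callable[[dict], None]) -> int:
    """Skeleton of the trigger -> main logic -> output structure.
    Returns the number of events processed."""
    n = 0
    for event in trigger():          # reusable trigger component
        result = main_logic(event)   # developer-written feature logic
        output(result)               # reusable output component
        n += 1
    return n
```

A developer would supply only `main_logic`; for example, doubling a transaction amount into a feature while the trigger and output come from the shared library.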

Model Strategy Invocation Schemes – Four approaches: (1) Feature storage service, where Flink pre‑computes results stored in HBase/ClickHouse; (2) Interface‑triggered polling, where a request triggers Kafka‑based Flink computation and polls for results; (3) Interface‑triggered message reception, where the caller receives result messages; (4) Direct message reception, a pure streaming path from Kafka to Flink to a message queue.
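Scheme (2), interface-triggered polling, can be sketched as a caller-side loop against a result store. Here an in-memory dict stands in for the store that Flink writes results into; names and timings are illustrative:

```python
import time

# Sketch of interface-triggered polling: the request kicks off Kafka-based
# Flink computation, and the caller polls a result store until the result
# lands or a timeout expires. The dict is a stand-in for the real store.
def poll_result(store: dict, request_id: str,
                timeout_s: float = 2.0, interval_s: float = 0.05):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if request_id in store:    # Flink has written the result
            return store[request_id]
        time.sleep(interval_s)     # back off before polling again
    return None                    # timed out; caller falls back or retries
```

The other schemes trade this polling loop for push delivery: (3) returns results as messages to the caller, and (4) skips the request interface entirely and consumes a pure Kafka-to-Flink-to-queue stream.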

Each scheme’s workflow and timing diagrams are presented, highlighting when data becomes available and how latency is handled.

In summary, the platform provides unified feature storage, consistent stream‑batch processing, flexible model invocation, and robust event‑center mechanisms to support real‑time, near‑real‑time, and offline analytics.

Tags: big data, data pipeline, Flink, stream processing, Kafka, ClickHouse, HBase, feature platform
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
