
Offline and Real-Time User Profile Fusion Architecture

The architecture combines a nightly batch job, which generates offline user profiles and stores them in HBase, with a Flink-based stream layer that lazily loads those profiles on app start and produces real-time updates. Both streams are then fused into a unified, timestamp-ordered profile in Redis, forming a Lambda-style pipeline.

DeWu Technology

Oct 10, 2022

User profiling, i.e., labeling user information, is a modeling technique that helps enterprises locate precise user groups and capture diverse feedback. This article describes the data pipeline for offline (batch) and real-time profiling and explains how to fuse them online.

The system consists of three layers: batch processing, stream processing, and data fusion.

The Batch Processing Layer runs as a scheduled job (DataWorks) that computes daily offline profiles (T+1) from historical behavior data. The results are stored in HBase and loaded lazily, the first time a user launches the app. The job performs two main steps:

Generate daily offline profiles and import them into HBase.

Write a log entry to HBase indicating the profile version, size, and completion time, which serves as a flag for later lazy‑loading.
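The two batch steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original implementation: plain dicts stand in for the HBase tables, and the row-key layout, the function names `write_profiles` and `write_completion_log`, and the log field names are all hypothetical.

```python
import json
import time

# In-memory stand-ins for HBase tables (assumption; the real system uses HBase).
profile_table = {}
log_table = {}


def write_profiles(version, profiles):
    """Step 1: import the daily offline profiles, keyed by version and user id."""
    for user_id, profile in profiles.items():
        profile_table[f"{version}:{user_id}"] = json.dumps(profile)
    return len(profiles)


def write_completion_log(version, size, now_ms=None):
    """Step 2: write the flag that tells the stream layer the import finished."""
    entry = {
        "version": version,
        "size": size,
        "completed_at": now_ms if now_ms is not None else int(time.time() * 1000),
    }
    log_table[version] = entry
    return entry


profiles = {"u1": {"clicks": [{"cspuId": 111, "et": 1663234014003, "channel": 1}]}}
count = write_profiles("20220915", profiles)
log = write_completion_log("20220915", count, now_ms=1663290000000)
```

Writing the log entry only after the import completes means the stream layer never sees a half-written profile version.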

The Stream Processing Layer handles two tasks: real-time profiling via Flink on user behavior streams, and lazy loading of offline profiles when a user launches the app.

The lazy‑loading workflow includes:

Subscribe to the APPSTART event (user login).

Check the HBase log to decide whether to load the T+1 offline profile. If the log entry exists, load the profile at most once per day and record the load state in Flink so that HBase is not queried repeatedly.

Convert profile formats according to tag configuration.

Wrap the converted profile into a unified Action format and push it to a message queue for downstream fusion.
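The lazy-loading workflow above might look like the following sketch. It is illustrative only: dicts and a list stand in for HBase, the Flink load state, and the message queue, and the `on_app_start` function, the `tag_config` layout, and the Action field names (`userId`, `tag`, `op`, `value`, `ts`) are assumptions rather than the article's actual schema.

```python
# In-memory stand-ins (assumptions): HBase log and profile tables,
# the load state kept in Flink, and the downstream message queue.
hbase_log = {"20220915": {"version": "20220915", "size": 1}}
hbase_profiles = {
    "20220915:u1": {"clicks": [{"cspuId": 111, "et": 1663234014003, "channel": 1}]}
}
tag_config = {"clicks": "list.rpushl"}  # tag name -> operator (assumed layout)
loaded_state = {}  # user_id -> last loaded profile version
queue = []         # Actions waiting for the fusion layer


def on_app_start(user_id, version, now_ms):
    """Handle one APPSTART event and return the number of Actions emitted."""
    if version not in hbase_log:
        return 0  # batch job has not finished; check again on the next launch
    if loaded_state.get(user_id) == version:
        return 0  # this version was already loaded for the user
    profile = hbase_profiles.get(f"{version}:{user_id}", {})
    loaded_state[user_id] = version  # remember the load to skip repeat reads
    emitted = 0
    for tag, value in profile.items():
        op = tag_config.get(tag)
        if op is None:
            continue  # tag is not configured, so it is not converted
        queue.append(
            {"userId": user_id, "tag": tag, "op": op, "value": value, "ts": now_ms}
        )
        emitted += 1
    return emitted
```

Calling `on_app_start("u1", "20220915", 1663290000000)` twice emits the offline Action only on the first call; the second call returns early from the recorded state without touching the profile table.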

Real‑time profiling steps:

Flink subscribes to user behavior streams and processes them according to business rules.

Construct Action messages (containing tag name, value, operator, timestamp, etc.) and send them to Kafka.

The profiling framework consumes the Actions and applies the configured operators (e.g., map, list, string).

Write the resulting real‑time profile to Redis.
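As a rough sketch of the real-time steps above, the snippet below builds an Action and applies it through an operator dispatch. A dict stands in for Redis, and the Action field names are assumptions. Only `list.rpush` is named in the article; `string.set` and `map.put` are hypothetical operator names used to illustrate the string and map operator families.

```python
# A dict stands in for Redis (assumption).
store = {}


def build_action(user_id, tag, op, value, ts):
    """Wrap one profile update in the unified Action format (assumed fields)."""
    return {"userId": user_id, "tag": tag, "op": op, "value": value, "ts": ts}


def apply_action(action):
    """Apply the operator carried by the Action to the stored profile."""
    key = f"{action['userId']}:{action['tag']}"
    op = action["op"]
    if op == "string.set":      # hypothetical operator name
        store[key] = action["value"]
    elif op == "map.put":       # hypothetical operator name
        store.setdefault(key, {}).update(action["value"])
    elif op == "list.rpush":    # operator named in the article
        store.setdefault(key, []).append(action["value"])
    else:
        raise ValueError(f"unknown operator: {op}")


click = {"cspuId": 111, "et": 1663234014003, "channel": 1}
apply_action(build_action("u1", "clicks", "list.rpush", click, click["et"]))
```

Carrying the operator inside the message lets the consumer stay generic: supporting a new tag only requires configuration that maps the tag to an existing operator.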

The Profile Fusion Layer merges offline and real-time profiles. Both sides must use a uniform data format; otherwise fusion is impossible. For example, the tag value [{"cspuId":111, "et":1663234014003, "channel":1}, ...] is a list of click behaviors. Offline profiles are written with the list.rpushl operator (addAll), while real-time updates use list.rpush (add). The fusion framework consumes Actions from the queue, de-duplicates entries, sorts them by timestamp, enforces size limits, and updates Redis accordingly, so that user profiles stay complete and up to date.
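The fusion step for a list tag can be sketched as below. This is an illustration under assumptions, not the framework's actual logic: the `fuse` function and its `max_size` parameter are hypothetical, entries are treated as duplicates when `cspuId`, `et`, and `channel` all match, and the size limit is assumed to keep the newest entries.

```python
def fuse(current, action, max_size=100):
    """Merge one Action into a click list: de-duplicate, sort, and truncate."""
    # list.rpushl carries a whole list (addAll); list.rpush carries one item (add).
    incoming = action["value"] if action["op"] == "list.rpushl" else [action["value"]]
    seen = set()
    merged = []
    for item in current + incoming:
        key = (item["cspuId"], item["et"], item["channel"])  # assumed identity
        if key not in seen:
            seen.add(key)
            merged.append(item)
    merged.sort(key=lambda item: item["et"])  # oldest first
    return merged[-max_size:] if max_size > 0 else []


offline = {
    "op": "list.rpushl",
    "value": [
        {"cspuId": 111, "et": 1000, "channel": 1},
        {"cspuId": 222, "et": 3000, "channel": 1},
    ],
}
realtime = {"op": "list.rpush", "value": {"cspuId": 333, "et": 2000, "channel": 2}}

profile = fuse([], realtime)       # real-time click arrives first
profile = fuse(profile, offline)   # offline profile is lazily loaded later
```

Because the result is sorted by `et` and de-duplicated on every merge, the order in which offline and real-time Actions arrive does not change the fused list.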

The overall pipeline resembles a Lambda architecture, combining batch-computed offline profiles with low-latency stream updates to deliver a comprehensive user profile.
