How to Build a ClickHouse‑Powered Retention Analysis Model for User Behavior
This article explains the concepts, formulas, and step‑by‑step implementation of a user‑retention analysis model, covering both Hive‑based offline processing and ClickHouse‑accelerated real‑time queries, complete with SQL examples, architecture diagrams, and practical optimization tips.
Background and Motivation
China's internet user base has reached 1.079 billion, and the market has entered a saturation phase where retaining existing users is more valuable than acquiring new ones. Retention analysis helps identify loyal users, diagnose churn, and evaluate product changes.
Retention Model Overview
The retention model measures the proportion of users who trigger a start event and later trigger a revisit event within a defined time window. Start and revisit events can be identical (e.g., repeated sign‑ins) or different (e.g., order → payment).
Analysis Logic
Retention rate is calculated for each day (day 0, day 1, …) by intersecting the set of users who performed the start event with those who performed the revisit event after the specified lag. The daily retention curve shows how quickly the user base decays.
Step‑by‑Step Implementation
1. Choose start and revisit events
Example: start = “open browser”, revisit = “close browser”.
2. Set retention days
In the example, a 3‑day retention window is used.
3. Define the analysis date range
For a range 2023‑01‑06 ~ 2023‑01‑08, the system computes day‑0, day‑1, day‑2, and day‑3 retention for each start date.
4. Compute retention metrics
Start users = users who triggered the start event on the calculation date.
Day‑0 retention = intersection of start‑event users and same‑day revisit users.
Day‑1 retention = intersection of start‑event users and revisit users on the next day.
Day‑2 retention = …
Day‑3 retention = …
Retention rate = retention count / start users × 100%.
Sample tables (Table 1, Table 2) illustrate raw counts and percentages for the example dates.
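The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the event log, the user IDs, and the helper names `users_on` and `retention_table` are made up for the example. Only the event names, the date range, and the 3-day window come from the text.

```python
from datetime import date, timedelta

# Hypothetical event log: (user id, event name, event date). Made-up data.
EVENTS = [
    ("u1", "open browser", date(2023, 1, 6)),
    ("u1", "close browser", date(2023, 1, 6)),
    ("u1", "close browser", date(2023, 1, 7)),
    ("u2", "open browser", date(2023, 1, 6)),
    ("u2", "close browser", date(2023, 1, 8)),
    ("u3", "open browser", date(2023, 1, 7)),
    ("u3", "close browser", date(2023, 1, 10)),
]


def users_on(events, event_name, day):
    """Set of users who triggered event_name on the given day."""
    return {uid for uid, name, d in events if name == event_name and d == day}


def retention_table(events, start_event, revisit_event, first_day, last_day, retention_days):
    """For each start date, return the start-user count plus day-0..day-N counts and rates."""
    table = {}
    day = first_day
    while day <= last_day:
        start_users = users_on(events, start_event, day)
        counts = []
        for lag in range(retention_days + 1):
            revisit_users = users_on(events, revisit_event, day + timedelta(days=lag))
            # Retained users = start users who also revisited `lag` days later.
            counts.append(len(start_users & revisit_users))
        # Retention rate = retention count / start users x 100%; 0.0 when nobody started.
        rates = [100.0 * c / len(start_users) if start_users else 0.0 for c in counts]
        table[day] = {"start_users": len(start_users), "counts": counts, "rates": rates}
        day += timedelta(days=1)
    return table


TABLE = retention_table(EVENTS, "open browser", "close browser",
                        date(2023, 1, 6), date(2023, 1, 8), retention_days=3)
for start_day, row in TABLE.items():
    print(start_day, row["start_users"], row["counts"], row["rates"])
```

With this made-up log, the two users who started on 2023-01-06 give one retained user each on day 0, day 1, and day 2 (a 50% rate) and none on day 3. Treating a start date with no start users as a 0% rate is a choice of this sketch; a real report might show it as empty instead.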
Offline Architecture (Hive)
The offline pipeline consists of four stages: configuration, computation, storage, and visualization. Configuration is handled by a backend service that assembles Hive SQL tasks based on user‑defined events and filters. Spark executes the Hive queries, and results are persisted in MySQL for downstream display.
SQL for Offline Retention
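-- Quoted names ('origin_day', 'start_time', 'end_time', 'retention_days',
-- 'start_event', 'visit_event') are placeholders filled in by the backend service.
SELECT
    'origin_day' AS origin_day,                  -- day on which revisits are counted
    a.day AS day,                                -- day of the start event
    datediff('origin_day', a.day) AS diff,       -- lag in days between start and revisit
    count(DISTINCT a.uid) AS user,               -- start users on a.day
    count(DISTINCT b.uid) AS retention           -- of those, users who revisited on origin_day
FROM (
    -- Start-event users, limited to the analysis range and the retention window
    SELECT day, uid
    FROM abcd.test
    WHERE day >= if('start_time' >= date_sub('origin_day', 'retention_days'), 'start_time', date_sub('origin_day', 'retention_days'))
      AND day <= if('end_time' <= 'origin_day', 'end_time', 'origin_day')
      AND event_id = 'start_event'
) a
LEFT JOIN (
    -- Distinct users who triggered the revisit event on origin_day
    SELECT s.uid
    FROM abcd.test s
    WHERE s.day = 'origin_day'
      AND s.event_id = 'visit_event'
    GROUP BY s.uid
) b ON a.uid = b.uid
WHERE a.day >= if('start_time' >= date_sub('origin_day', 'retention_days'), 'start_time', date_sub('origin_day', 'retention_days'))
  AND a.day <= if('end_time' <= 'origin_day', 'end_time', 'origin_day')
GROUP BY a.day;

The query returns origin_day, day, diff, the start-user count, and the retention count for each start day in the selected interval. Because count(DISTINCT ...) ignores NULLs, start users with no match in the LEFT JOIN are excluded from the retention count.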
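To make the shape of the result concrete, the query's logic can be mirrored in plain Python. This is only a sketch for reasoning about the output, assuming a tiny in-memory table; the rows and the function name `offline_retention` are hypothetical, and the real computation runs as Hive SQL on Spark.

```python
from datetime import date, timedelta

# Hypothetical rows of the event table: (day, uid, event_id). Made-up data.
ROWS = [
    (date(2023, 1, 6), "u1", "start_event"),
    (date(2023, 1, 6), "u2", "start_event"),
    (date(2023, 1, 7), "u2", "start_event"),
    (date(2023, 1, 8), "u1", "visit_event"),
    (date(2023, 1, 8), "u2", "visit_event"),
    (date(2023, 1, 8), "u3", "start_event"),
]


def offline_retention(rows, origin_day, start_time, end_time, retention_days):
    """Mirror the query: one (origin_day, day, diff, user, retention) tuple per start day."""
    # Same bounds as the WHERE clause: clamp to the analysis range and the retention window.
    lower = max(start_time, origin_day - timedelta(days=retention_days))
    upper = min(end_time, origin_day)
    # Subquery b: distinct users who triggered the revisit event on origin_day.
    revisit_users = {uid for day, uid, ev in rows if day == origin_day and ev == "visit_event"}
    # Subquery a: start-event users grouped by day within [lower, upper].
    start_by_day = {}
    for day, uid, ev in rows:
        if ev == "start_event" and lower <= day <= upper:
            start_by_day.setdefault(day, set()).add(uid)
    return [
        (origin_day, day, (origin_day - day).days, len(users), len(users & revisit_users))
        for day, users in sorted(start_by_day.items())
    ]


RESULT = offline_retention(ROWS, origin_day=date(2023, 1, 8), start_time=date(2023, 1, 6),
                           end_time=date(2023, 1, 8), retention_days=3)
for row in RESULT:
    print(row)
```

Note the direction of the calculation: the query fixes the revisit day (origin_day) and looks back over earlier start days, so one run fills in one diagonal of the retention table rather than one row.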
Optimization with ClickHouse
ClickHouse’s column‑oriented storage and distributed query engine dramatically reduce latency for large‑scale retention calculations. The workflow is:
Ingest raw event data into ClickHouse.
Generate retention‑specific SQL based on the configured start/visit events.
Execute the query in ClickHouse for near‑real‑time results.
Visualization tools can then connect to ClickHouse to render retention curves.
ClickHouse Retention Function
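The retention function is an aggregate function that accepts 1 to 32 conditions of type UInt8 and returns an array of 0/1 values of the same length. The first element is 1 if the first condition was met; each later element is 1 only if both the first condition and the corresponding condition were met. Example syntax:
retention(cond1, cond2, ..., cond32)
It is used in the ClickHouse query to compute daily retention flags.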
Practical Takeaways
The retention model is a core component of a data‑analysis toolbox. It can be combined with funnel, path, and attribution analyses to provide a comprehensive view of user behavior, guide product improvements, and support data‑driven decision making.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
