How to Build a ClickHouse‑Powered Retention Analysis Model for User Behavior

This article explains the concepts, formulas, and step‑by‑step implementation of a user‑retention analysis model, covering both Hive‑based offline processing and ClickHouse‑accelerated real‑time queries, complete with SQL examples, architecture diagrams, and practical optimization tips.

Sohu Tech Products

Background and Motivation

China's internet user base has reached 1.079 billion, and the market has entered a saturation phase where retaining existing users is more valuable than acquiring new ones. Retention analysis helps identify loyal users, diagnose churn, and evaluate product changes.

Retention Model Overview

The retention model measures the proportion of users who trigger a start event and later trigger a revisit event within a defined time window. Start and revisit events can be identical (e.g., repeated sign‑ins) or different (e.g., order → payment).

Retention model diagram

Analysis Logic

Retention rate is calculated for each day offset (day 0, day 1, …): the set of users who performed the start event on a given date is intersected with the set of users who performed the revisit event that many days later. Plotting the rate against the offset gives the daily retention curve, which shows how quickly the user base decays.

Retention comparison between versions

Step‑by‑Step Implementation

1. Choose start and revisit events

Example: start = “open browser”, revisit = “close browser”.

2. Set retention days

In the example, a 3‑day retention window is used.

3. Define the analysis date range

For a range 2023‑01‑06 ~ 2023‑01‑08, the system computes day‑0, day‑1, day‑2, and day‑3 retention for each start date.
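The combinations this produces can be enumerated with a short Python sketch. The function name `retention_schedule` is hypothetical, and the sketch assumes that revisit dates may fall after the end of the analysis range (for example, day 3 of the last start date).

```python
from datetime import date, timedelta

def retention_schedule(start, end, retention_days):
    """Yield (start_date, offset, revisit_date) for every cell of the retention table."""
    n_days = (end - start).days + 1
    for i in range(n_days):
        start_date = start + timedelta(days=i)
        for offset in range(retention_days + 1):  # day 0 .. day N
            yield start_date, offset, start_date + timedelta(days=offset)

# Range and window taken from the example in the text.
schedule = list(retention_schedule(date(2023, 1, 6), date(2023, 1, 8), 3))
for start_date, offset, revisit_date in schedule:
    print(f"start {start_date}  day-{offset}  revisit on {revisit_date}")
```

Three start dates with four offsets each give twelve cells to compute.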

4. Compute retention metrics

Start users = users who triggered the start event on the calculation date.

Day‑0 retention = intersection of start‑event users and same‑day revisit users.

Day‑1 retention = intersection of start‑event users and revisit users on the next day.

Day‑2 retention = …

Day‑3 retention = …

Retention rate = retention count / start users × 100%.

Sample tables (Table 1, Table 2) illustrate raw counts and percentages for the example dates.
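The metric definitions in step 4 map directly onto set operations. The following Python sketch is illustrative only: the event log is made-up sample data, and the helper names `users` and `retention_row` are hypothetical, not part of the system described here.

```python
from datetime import date, timedelta

# Hypothetical event log: (uid, event name, event date). The data is invented.
events = [
    ("u1", "open_browser", date(2023, 1, 6)),
    ("u2", "open_browser", date(2023, 1, 6)),
    ("u3", "open_browser", date(2023, 1, 6)),
    ("u1", "close_browser", date(2023, 1, 6)),
    ("u1", "close_browser", date(2023, 1, 7)),
    ("u2", "close_browser", date(2023, 1, 7)),
    ("u3", "close_browser", date(2023, 1, 9)),
]

def users(event, day):
    """Set of users who triggered `event` on `day`."""
    return {uid for uid, name, d in events if name == event and d == day}

def retention_row(start_event, revisit_event, start_day, retention_days):
    """Return (start user count, [(offset, retained count, rate in %), ...])."""
    start_users = users(start_event, start_day)
    row = []
    for offset in range(retention_days + 1):
        revisit_day = start_day + timedelta(days=offset)
        retained = start_users & users(revisit_event, revisit_day)  # set intersection
        rate = 100.0 * len(retained) / len(start_users) if start_users else 0.0
        row.append((offset, len(retained), rate))
    return len(start_users), row

start_count, row = retention_row("open_browser", "close_browser", date(2023, 1, 6), 3)
print("start users:", start_count)
for offset, count, rate in row:
    print(f"day-{offset}: {count} retained ({rate:.2f}%)")
```

Each offset is evaluated independently, so a user who skips day 2 can still count toward day 3.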

Offline Architecture (Hive)

The offline pipeline consists of four stages: configuration, computation, storage, and visualization. Configuration is handled by a backend service that assembles Hive SQL tasks based on user‑defined events and filters. Spark executes the Hive queries, and results are persisted in MySQL for downstream display.

Hive architecture diagram

SQL for Offline Retention

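-- Quoted names such as 'origin_day', 'start_time', 'end_time', 'retention_days',
-- 'start_event' and 'visit_event' are template placeholders, presumably filled in
-- by the configuration service before the task is submitted.
SELECT
    'origin_day' AS origin_day,                 -- date on which revisits are counted
    a.day AS day,                               -- date of the start event
    datediff('origin_day', a.day) AS diff,      -- day offset: origin_day minus start day
    count(DISTINCT a.uid) AS `user`,            -- start users (alias quoted in case user is reserved)
    count(DISTINCT b.uid) AS retention          -- start users who revisited; NULLs are ignored
FROM (
    -- Start-event users whose start day lies within retention_days of origin_day,
    -- clipped to the configured [start_time, end_time] range.
    SELECT day, uid
    FROM abcd.test
    WHERE day >= if('start_time' >= date_sub('origin_day','retention_days'),'start_time',date_sub('origin_day','retention_days'))
      AND day <= if('end_time' <= 'origin_day','end_time','origin_day')
      AND event_id = 'start_event'
) a
LEFT JOIN (
    -- Distinct users who triggered the revisit event on origin_day.
    SELECT s.uid
    FROM abcd.test s
    WHERE s.day = 'origin_day'
      AND s.event_id = 'visit_event'
    GROUP BY s.uid
) b ON a.uid = b.uid
-- The date filter is already applied inside subquery a, so it is not repeated here.
GROUP BY a.day;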

The query returns origin_day, day, diff, user count, and retention count for each day in the selected interval.
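The nested `if(...)` expressions in the query are easier to read as a max/min pair. The Python sketch below restates that date arithmetic under the assumption that the quoted names are substituted with concrete dates and an integer day count. The function names are hypothetical.

```python
from datetime import date, timedelta

def start_day_window(origin_day, start_time, end_time, retention_days):
    """Range of start days that one origin_day contributes to.

    lower mirrors if(start_time >= origin_day - N, start_time, origin_day - N)
    upper mirrors if(end_time <= origin_day, end_time, origin_day)
    """
    lower = max(start_time, origin_day - timedelta(days=retention_days))
    upper = min(end_time, origin_day)
    return lower, upper

def diffs(origin_day, lower, upper):
    """Map each start day in [lower, upper] to its offset, like datediff(origin_day, day)."""
    out = {}
    day = lower
    while day <= upper:
        out[day] = (origin_day - day).days
        day += timedelta(days=1)
    return out

lower, upper = start_day_window(
    origin_day=date(2023, 1, 8),
    start_time=date(2023, 1, 6),
    end_time=date(2023, 1, 8),
    retention_days=3,
)
offsets = diffs(date(2023, 1, 8), lower, upper)
print(lower, upper, offsets)
```

With these inputs, one run for origin day 2023-01-08 fills the day-2 cell for start day 2023-01-06, the day-1 cell for 2023-01-07, and the day-0 cell for 2023-01-08.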

Optimization with ClickHouse

ClickHouse’s column‑oriented storage and distributed query engine dramatically reduce latency for large‑scale retention calculations. The workflow is:

Ingest raw event data into ClickHouse.

Generate retention‑specific SQL based on the configured start/visit events.

Execute the query in ClickHouse for near‑real‑time results.

Visualization tools can then connect to ClickHouse to render retention curves.
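As a rough illustration of the SQL-generation step, the Python sketch below assembles a ClickHouse query for a single start day using the `retention` aggregate function covered later in this article. It is not the system's actual generator: the builder function, the table name `events`, and the assumption that the ClickHouse table has the same `uid`, `day`, and `event_id` columns as the Hive table are all hypothetical. Values are interpolated directly for readability; production code should use bound parameters.

```python
from datetime import date, timedelta

def build_retention_sql(table, start_event, visit_event, start_day, retention_days):
    """Return a ClickHouse query string for one start day (illustrative only)."""
    if retention_days + 2 > 32:
        raise ValueError("retention() accepts at most 32 conditions")
    revisit_days = [start_day + timedelta(days=i) for i in range(retention_days + 1)]
    # cond1 marks the start event; the remaining conditions mark revisits on day 0..N.
    conds = [f"day = '{start_day}' AND event_id = '{start_event}'"]
    conds += [f"day = '{d}' AND event_id = '{visit_event}'" for d in revisit_days]
    # r[1] counts start users; r[i + 2] counts users retained on day i.
    sums = ["sum(r[1]) AS start_users"]
    sums += [f"sum(r[{i + 2}]) AS day_{i}" for i in range(retention_days + 1)]
    cond_sql = ",\n            ".join(conds)
    sum_sql = ",\n    ".join(sums)
    return (
        f"SELECT\n    {sum_sql}\n"
        f"FROM (\n"
        f"    SELECT uid,\n"
        f"        retention(\n            {cond_sql}\n        ) AS r\n"
        f"    FROM {table}\n"
        f"    WHERE day BETWEEN '{start_day}' AND '{revisit_days[-1]}'\n"
        f"      AND event_id IN ('{start_event}', '{visit_event}')\n"
        f"    GROUP BY uid\n"
        f")"
    )

sql = build_retention_sql("events", "open_browser", "close_browser", date(2023, 1, 6), 3)
print(sql)
```

The inner query produces one flag array per user, and the outer query sums each array position to get the start-user count and the retained count per day.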

Combined Hive‑ClickHouse architecture

ClickHouse Retention Function

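The retention function is an aggregate function that accepts 1 to 32 conditions of type UInt8 and returns an array of 0/1 values of the same length. The first element is 1 if cond1 was met; each later element is 1 only if both cond1 and the corresponding condition were met. Example syntax: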

retention(cond1, cond2, ..., cond32);

It is used in the ClickHouse query to compute daily retention flags.
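To make the flag semantics concrete, here is a small Python emulation of what `retention` computes for one group (one user). It models the documented behavior, where a condition counts as met if any row in the group satisfies it and every later flag also requires the first condition. It is not ClickHouse code, and the sample rows are invented.

```python
def retention(rows, *conds):
    """Emulate ClickHouse retention(cond1, ..., condN) for one GROUP BY group.

    rows  -- the events of a single user
    conds -- predicates over a row, standing in for the UInt8 conditions
    """
    if not 1 <= len(conds) <= 32:
        raise ValueError("retention takes between 1 and 32 conditions")
    met = [any(cond(row) for row in rows) for cond in conds]
    # First flag is cond1 itself; every later flag also requires cond1.
    return [int(met[0])] + [int(met[0] and m) for m in met[1:]]

# Hypothetical per-user rows: (event, day).
rows_by_user = {
    "u1": [("open", "2023-01-06"), ("close", "2023-01-07")],
    "u2": [("close", "2023-01-07")],  # revisited but never triggered the start event
}
conds = (
    lambda r: r == ("open", "2023-01-06"),   # cond1: start event
    lambda r: r == ("close", "2023-01-06"),  # cond2: day-0 revisit
    lambda r: r == ("close", "2023-01-07"),  # cond3: day-1 revisit
)
flags = {uid: retention(rows, *conds) for uid, rows in rows_by_user.items()}
totals = [sum(col) for col in zip(*flags.values())]
print(flags, totals)
```

Because every flag depends on the first condition, users who never triggered the start event contribute nothing to any retention count, which matches the set-intersection definition used earlier.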

Practical Takeaways

The retention model is a core component of a data‑analysis toolbox. It can be combined with funnel, path, and attribution analyses to provide a comprehensive view of user behavior, guide product improvements, and support data‑driven decision making.
