Advanced ClickHouse Path Analysis, Funnel, Retention, and Session Statistics
This article demonstrates how to leverage ClickHouse’s parametric aggregate and higher‑order functions to perform path matching, intelligent path detection, ordered funnel conversion, retention calculation, and session statistics for user behavior analysis in a big‑data environment.
ClickHouse provides rich parametric aggregate functions and higher‑order functions based on arrays and Lambda expressions, which can be combined to achieve powerful user‑behavior analyses such as path matching, intelligent path detection, ordered funnel conversion, retention measurement, and session statistics.
Path Matching
The built‑in sequenceMatch function checks whether an event chain satisfies a given pattern, while sequenceCount counts the number of matching chains. Example query:
SELECT
site_id,
sequenceMatch('(?1)(?t<=15)(?2).*(?3)')(
ts_date_time,
event_type = 'shtLogon',
event_type = 'shtKkclick' AND column_type = 'homePage',
event_type = 'shtAddCart'
) AS is_match
FROM ods.analytics_access_log_all
WHERE ts_date >= '2020-07-01'
AND site_id IN (10266,10022,10339,10030)
GROUP BY site_id;Similarly, sequenceCount returns the number of matching sequences.
The pattern syntax supports three constructs: (?N) – the N‑th event in the sequence (starting from 1). (?t op secs) – a time constraint between two events (e.g., (?1)(?t<=15)(?2) means the interval between event 1 and event 2 must be ≤ 15 seconds). .* – any non‑specified events.
Intelligent Path Detection
When the built‑in functions are insufficient, a more complex requirement can be satisfied by combining arrays and higher‑order functions. The following long query extracts detailed paths that meet a given endpoint, intermediate points, and maximum time gap, and ranks them by user count:
SELECT
result_chain,
uniqCombined(user_id) AS user_count
FROM (
WITH
toUInt32(maxIf(ts_date_time, event_type = 'shtOrderDone')) AS end_event_maxt,
arrayCompact(arraySort(
x -> x.1,
arrayFilter(
x -> x.1 <= end_event_maxt,
groupArray((toUInt32(ts_date_time), (event_type, column_type)))
)
)) AS sorted_events,
arrayEnumerate(sorted_events) AS event_idxs,
arrayFilter(
(x, y, z) -> z.1 <= end_event_maxt AND (z.2.1 = 'shtOrderDone' OR y > 600),
event_idxs,
arrayDifference(sorted_events.1),
sorted_events
) AS gap_idxs,
arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events
SELECT
user_id,
arrayJoin(split_events) AS event_chain_,
arrayCompact(event_chain_.2) AS event_chain,
hasAll(event_chain, [('shtKkClick','homePage')]) AS has_midway_hit,
arrayStringConcat(arrayMap(
x -> concat(x.1, '#', x.2),
event_chain
), ' -> ') AS result_chain
FROM (
SELECT ts_date, ts_date_time, event_type, column_type, user_id
FROM ods.analytics_access_log_all
WHERE ts_date >= '2020-06-30' AND ts_date <= '2020-07-02'
AND site_id IN (10266,10022,10339,10030)
)
GROUP BY user_id
HAVING length(event_chain) > 1
)
WHERE event_chain[length(event_chain)].1 = 'shtOrderDone'
AND has_midway_hit = 1
GROUP BY result_chain
ORDER BY user_count DESC
LIMIT 20;The query works by:
Aggregating each user’s events into (timestamp, (event_type, column_type)) tuples and sorting them.
Generating index arrays and identifying split points based on the endpoint event or large time gaps.
Creating a mask to split the original chain into separate sessions.
Flattening the split chains, removing consecutive duplicates, and checking for required intermediate events.
Formatting the final path string and counting distinct users.
Ordered Funnel Conversion
ClickHouse’s windowFunnel function implements a sliding‑window funnel, returning the deepest step reached within the specified time window. Example:
SELECT
level,
user_count,
conv_rate_percent
FROM (
SELECT
level,
uniqCombined(user_id) AS user_count,
neighbor(user_count, -1) AS prev_user_count,
if(prev_user_count = 0, -1, round(user_count / prev_user_count * 100, 3)) AS conv_rate_percent
FROM (
SELECT
user_id,
windowFunnel(900)(
ts_date_time,
event_type = 'shtLogon',
event_type = 'shtKkClick' AND column_type = 'homePage',
event_type = 'shtOpenGoodsDetail',
event_type = 'shtAddCart',
event_type = 'shtOrderDone'
) AS level
FROM (
SELECT ts_date, ts_date_time, event_type, column_type, user_id
FROM ods.analytics_access_log_all
WHERE ts_date >= '2020-06-30' AND ts_date <= '2020-07-02'
AND site_id IN (10266,10022,10339,10030)
)
GROUP BY user_id
)
WHERE level > 0
GROUP BY level
ORDER BY level ASC
);To incorporate field‑level correlation (e.g., product ID across steps), the community project xFunnel can be used after modifying and recompiling ClickHouse.
User Retention
The retention function evaluates a series of conditions and returns an array of 0/1 indicating whether each condition is satisfied, enabling easy calculation of retention rates. Example calculating next‑day and seven‑day repeat order rates:
SELECT
sum(ret[1]) AS original,
sum(ret[2]) AS next_day_ret,
round(next_day_ret / original * 100, 3) AS next_day_ratio,
sum(ret[3]) AS seven_day_ret,
round(seven_day_ret / original * 100, 3) AS seven_day_ratio
FROM (
WITH toDate('2020-06-24') AS first_date
SELECT
user_id,
retention(
ts_date = first_date,
ts_date = first_date + INTERVAL 1 DAY,
ts_date = first_date + INTERVAL 7 DAY
) AS ret
FROM ods.ms_order_done_all
WHERE ts_date >= first_date AND ts_date <= first_date + INTERVAL 7 DAY
GROUP BY user_id
);Session Statistics
A session is a series of user actions within a defined inactivity gap (e.g., 30 minutes). The following query counts daily sessions per user using array functions:
SELECT
ts_date,
sum(length(session_gaps)) AS session_cnt
FROM (
WITH
arraySort(groupArray(toUInt32(ts_date_time))) AS times,
arrayDifference(times) AS times_diff
SELECT
ts_date,
arrayFilter(x -> x > 1800, times_diff) AS session_gaps
FROM ods.analytics_access_log_all
WHERE ts_date >= '2020-06-30'
GROUP BY ts_date, user_id
)
GROUP BY ts_date;This approach mirrors the intelligent path detection logic: sorting timestamps, computing differences, filtering gaps exceeding the timeout, and aggregating the count.
Overall, the article showcases how ClickHouse’s advanced functions and array‑based processing enable sophisticated behavioral analytics—path matching, custom funneling, retention, and sessionization—without requiring external tooling.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
