Big Data 11 min read

Advanced ClickHouse Path Analysis, Funnel, Retention, and Session Statistics

This article demonstrates how to leverage ClickHouse’s parametric aggregate and higher‑order functions to perform path matching, intelligent path detection, ordered funnel conversion, retention calculation, and session statistics for user behavior analysis in a big‑data environment.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Advanced ClickHouse Path Analysis, Funnel, Retention, and Session Statistics

ClickHouse provides rich parametric aggregate functions and higher‑order functions based on arrays and Lambda expressions, which can be combined to achieve powerful user‑behavior analyses such as path matching, intelligent path detection, ordered funnel conversion, retention measurement, and session statistics.

Path Matching

The built‑in sequenceMatch function checks whether an event chain satisfies a given pattern, while sequenceCount counts the number of matching chains. Example query:

SELECT
  site_id,
  sequenceMatch('(?1)(?t<=15)(?2).*(?3)')(
    ts_date_time,
    event_type = 'shtLogon',
    event_type = 'shtKkclick' AND column_type = 'homePage',
    event_type = 'shtAddCart'
  ) AS is_match
FROM ods.analytics_access_log_all
WHERE ts_date >= '2020-07-01'
  AND site_id IN (10266,10022,10339,10030)
GROUP BY site_id;

Similarly, sequenceCount returns the number of matching sequences.

The pattern syntax supports three constructs: (?N) – the N‑th event in the sequence (starting from 1). (?t op secs) – a time constraint between two events (e.g., (?1)(?t<=15)(?2) means the interval between event 1 and event 2 must be ≤ 15 seconds). .* – any non‑specified events.

Intelligent Path Detection

When the built‑in functions are insufficient, a more complex requirement can be satisfied by combining arrays and higher‑order functions. The following long query extracts detailed paths that meet a given endpoint, intermediate points, and maximum time gap, and ranks them by user count:

SELECT
  result_chain,
  uniqCombined(user_id) AS user_count
FROM (
  WITH
    toUInt32(maxIf(ts_date_time, event_type = 'shtOrderDone')) AS end_event_maxt,
    arrayCompact(arraySort(
      x -> x.1,
      arrayFilter(
        x -> x.1 <= end_event_maxt,
        groupArray((toUInt32(ts_date_time), (event_type, column_type)))
      )
    )) AS sorted_events,
    arrayEnumerate(sorted_events) AS event_idxs,
    arrayFilter(
      (x, y, z) -> z.1 <= end_event_maxt AND (z.2.1 = 'shtOrderDone' OR y > 600),
      event_idxs,
      arrayDifference(sorted_events.1),
      sorted_events
    ) AS gap_idxs,
    arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
    arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
    arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events
  SELECT
    user_id,
    arrayJoin(split_events) AS event_chain_,
    arrayCompact(event_chain_.2) AS event_chain,
    hasAll(event_chain, [('shtKkClick','homePage')]) AS has_midway_hit,
    arrayStringConcat(arrayMap(
      x -> concat(x.1, '#', x.2),
      event_chain
    ), ' -> ') AS result_chain
  FROM (
    SELECT ts_date, ts_date_time, event_type, column_type, user_id
    FROM ods.analytics_access_log_all
    WHERE ts_date >= '2020-06-30' AND ts_date <= '2020-07-02'
      AND site_id IN (10266,10022,10339,10030)
  )
  GROUP BY user_id
  HAVING length(event_chain) > 1
)
WHERE event_chain[length(event_chain)].1 = 'shtOrderDone'
  AND has_midway_hit = 1
GROUP BY result_chain
ORDER BY user_count DESC
LIMIT 20;

The query works by:

Aggregating each user’s events into (timestamp, (event_type, column_type)) tuples and sorting them.

Generating index arrays and identifying split points based on the endpoint event or large time gaps.

Creating a mask to split the original chain into separate sessions.

Flattening the split chains, removing consecutive duplicates, and checking for required intermediate events.

Formatting the final path string and counting distinct users.

Ordered Funnel Conversion

ClickHouse’s windowFunnel function implements a sliding‑window funnel, returning the deepest step reached within the specified time window. Example:

SELECT
  level,
  user_count,
  conv_rate_percent
FROM (
  SELECT
    level,
    uniqCombined(user_id) AS user_count,
    neighbor(user_count, -1) AS prev_user_count,
    if(prev_user_count = 0, -1, round(user_count / prev_user_count * 100, 3)) AS conv_rate_percent
  FROM (
    SELECT
      user_id,
      windowFunnel(900)(
        ts_date_time,
        event_type = 'shtLogon',
        event_type = 'shtKkClick' AND column_type = 'homePage',
        event_type = 'shtOpenGoodsDetail',
        event_type = 'shtAddCart',
        event_type = 'shtOrderDone'
      ) AS level
    FROM (
      SELECT ts_date, ts_date_time, event_type, column_type, user_id
      FROM ods.analytics_access_log_all
      WHERE ts_date >= '2020-06-30' AND ts_date <= '2020-07-02'
        AND site_id IN (10266,10022,10339,10030)
    )
    GROUP BY user_id
  )
  WHERE level > 0
  GROUP BY level
  ORDER BY level ASC
);

To incorporate field‑level correlation (e.g., product ID across steps), the community project xFunnel can be used after modifying and recompiling ClickHouse.

User Retention

The retention function evaluates a series of conditions and returns an array of 0/1 indicating whether each condition is satisfied, enabling easy calculation of retention rates. Example calculating next‑day and seven‑day repeat order rates:

SELECT
  sum(ret[1]) AS original,
  sum(ret[2]) AS next_day_ret,
  round(next_day_ret / original * 100, 3) AS next_day_ratio,
  sum(ret[3]) AS seven_day_ret,
  round(seven_day_ret / original * 100, 3) AS seven_day_ratio
FROM (
  WITH toDate('2020-06-24') AS first_date
  SELECT
    user_id,
    retention(
      ts_date = first_date,
      ts_date = first_date + INTERVAL 1 DAY,
      ts_date = first_date + INTERVAL 7 DAY
    ) AS ret
  FROM ods.ms_order_done_all
  WHERE ts_date >= first_date AND ts_date <= first_date + INTERVAL 7 DAY
  GROUP BY user_id
);

Session Statistics

A session is a series of user actions within a defined inactivity gap (e.g., 30 minutes). The following query counts daily sessions per user using array functions:

SELECT
  ts_date,
  sum(length(session_gaps)) AS session_cnt
FROM (
  WITH
    arraySort(groupArray(toUInt32(ts_date_time))) AS times,
    arrayDifference(times) AS times_diff
  SELECT
    ts_date,
    arrayFilter(x -> x > 1800, times_diff) AS session_gaps
  FROM ods.analytics_access_log_all
  WHERE ts_date >= '2020-06-30'
  GROUP BY ts_date, user_id
)
GROUP BY ts_date;

This approach mirrors the intelligent path detection logic: sorting timestamps, computing differences, filtering gaps exceeding the timeout, and aggregating the count.

Overall, the article showcases how ClickHouse’s advanced functions and array‑based processing enable sophisticated behavioral analytics—path matching, custom funneling, retention, and sessionization—without requiring external tooling.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

AnalyticsBig DatasqlclickhousePath Analysis
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.