Big Data 21 min read

User Path Analysis in the Hunyi System: Design, Computation Logic, and StarRocks Implementation

This article explains user path analysis as a method to visualize and optimize user flow, describes its productization in the Hunyi analytics platform, details the underlying computation logic, presents a complex StarRocks SQL solution, discusses performance challenges, and suggests future improvements and recruitment opportunities.

政采云技术
政采云技术
政采云技术
User Path Analysis in the Hunyi System: Design, Computation Logic, and StarRocks Implementation

Understanding user traffic behavior is essential, and path analysis provides a direct and efficient way to achieve this.

What is User Path Analysis? It examines the sequence of pages a user visits within an application, helping identify drop‑off points and bottlenecks, such as users adding items to a cart but not completing payment.

Product Design in Hunyi System – The Hunyi visual analytics platform integrates event tracking and data analysis. Its path analysis module adopts concepts from Google Analytics and Sensors Data, visualizing flows with Sankey diagrams and offering multi‑dimensional filters (event range, start/end events, user segmentation, session interval, time range, etc.).

Computation Logic – Paths are built around a target object, expanding forward or backward based on start or end events. Rules ensure independence and completeness, handling repeated events, session ordering, and step numbering.

Technical Challenges & Solution – Real‑time OLAP queries on tens of millions of daily events require fast, complex calculations. After discarding offline batch processing and in‑memory Java solutions, the team selected StarRocks for its MPP capabilities. The core SQL (shown below) uses window functions (LAG, ROW_NUMBER, MAX) to segment sessions, flag start events, and aggregate step counts.

SELECT *
FROM (
    SELECT last_page_code, page_code, step, COUNT(*) AS count,
           ROW_NUMBER() OVER (PARTITION BY step ORDER BY count DESC) AS count_rank
    FROM (
        SELECT last_page_code, page_code, step, COUNT(1) AS count
        FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id, group_session_id ORDER BY create_time ASC, last_page_code ASC) AS step
            FROM (
                SELECT *, MAX(is_begin_event) OVER (PARTITION BY user_id, group_session_id ORDER BY create_time ASC, last_page_code ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS is_follow_begin_event
                FROM (
                    SELECT *, IFNULL(flag, MY_WINDOW_LAG(flag) OVER (PARTITION BY user_id ORDER BY create_time, last_create_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS group_session_id
                    FROM (
                        SELECT *, IFNULL(UNIX_TIMESTAMP(create_time) - UNIX_TIMESTAMP(LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time ASC)),0) AS session_interval,
                               LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time ASC) AS last_create_time,
                               IF(UNIX_TIMESTAMP(create_time)-UNIX_TIMESTAMP(LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time ASC)) < ${session_interval},1,0) AS is_begin_event,
                               ...
                        FROM hive.dwd.dwd_hunyi_sign_evt_basic_inc_d
                        WHERE pt >= ${start_time} AND pt <= ${end_time}
                          AND evt_code IN (${event_range})
                    ) t
                ) t2
            ) t3
        ) t4
        WHERE step < ${path_depth}
        GROUP BY last_page_code, page_code, step
        ORDER BY step, last_page_code, page_code
    ) t5
) t6
WHERE count_rank < ${top_n};

An additional example demonstrates using LAG() to compute session intervals and previous page codes.

SELECT IFNULL(UNIX_TIMESTAMP(create_time)-UNIX_TIMESTAMP(LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time ASC)),0) AS session_interval,
       create_time,
       LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time ASC) AS last_create_time,
       user_id,
       evt_code,
       IF(IFNULL(UNIX_TIMESTAMP(create_time)-UNIX_TIMESTAMP(LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time ASC)),0) < ${session_interval},1,0) AS is_begin_event
FROM hive.dwd.dwd_hunyi_sign_evt_basic_inc_d
WHERE pt >= ${start_time} AND pt <= ${end_time}
  AND evt_code IN (${event_range});

Performance & Accuracy Issues – The current SQL, while functional, shows latency (10‑20 s on 300 M rows) due to complex window operations. Potential improvements include pre‑computing adjacent rows, creating materialized views, or developing custom UDF/UDWF in StarRocks for more efficient path calculations.

Future Direction – A native path analysis function could be built directly into StarRocks, similar to its window_funnel feature, to achieve higher efficiency.

Recruitment Notice – The Zero technical team in Hangzhou is hiring passionate engineers across cloud‑native, blockchain, AI, low‑code, middleware, big data, and more. Interested candidates should email zcy‑tc@cai‑inc.com.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Performance OptimizationBig DataSQLStarRockssession segmentationuser path analysis
政采云技术
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.