Path Analysis Model Design and Engineering Implementation for Internet Data Operations
The article details the design and engineering of a high‑performance path analysis model for internet data operations, explaining session handling, Sankey visualizations, adjacency‑table storage, multi‑granular session partitioning, Spark‑to‑ClickHouse pipelines, and optimizations that enable billion‑scale user‑path queries in about one second.
This article introduces the design and implementation of a path analysis model for internet data operations. Path analysis is a data analysis method characteristic of the internet industry, used to visualize and analyze the paths users take through a product.
Application Scenarios: The model addresses questions such as: what are the main user paths, ranked by conversion rate; where do users deviate from expected paths; and how do behavioral paths differ across user segments. A practical business scenario demonstrates analyzing the main behavior paths by which "active users" reach a target landing page (the small video page), against billions of records per day, with query results returned in approximately 1 second.
Core Concepts: The article explains key concepts including Session (a time-bounded series of interactions), Sankey diagrams (flow diagrams in which the width of each branch represents the volume of data flowing through it), adjacency tables (a compact storage structure for graphs), tree pruning (removing insignificant nodes), and PV/SV metrics (Page View and Session View counts).
Data Model Design: The data flows from a unified data warehouse through Spark computation into ClickHouse, with Hive used for cold backup. The model uses flexible session partitioning that supports multiple time granularities (5, 10, 15, 30, 60 minutes). Key processing steps include: obtaining page information and partitioning sessions, deduplicating adjacent pages, extracting the 4 levels of forward and backward pages for each page, calculating PV/SV for forward and backward paths, and computing conversion rates at each level.
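The processing steps above can be illustrated with a minimal, single-machine Python sketch. It is not the article's Spark job: the function names, the gap-based session rule, and the reading of SV as "number of sessions in which a path appears" are assumptions made for illustration. Backward paths can be obtained the same way by reversing each session's page list first.

```python
from collections import Counter
from itertools import groupby


def split_sessions(events, gap_minutes=30):
    """Split one user's (timestamp_seconds, page) events into sessions.

    Assumed rule: a new session starts when the gap between two
    consecutive events exceeds gap_minutes (5, 10, 15, 30 or 60).
    """
    sessions, current, last_ts = [], [], None
    for ts, page in sorted(events):
        if last_ts is not None and ts - last_ts > gap_minutes * 60:
            sessions.append(current)
            current = []
        current.append(page)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def dedupe_adjacent(pages):
    """Collapse runs of the same page, e.g. A, A, B -> A, B."""
    return [page for page, _ in groupby(pages)]


def path_metrics(sessions, depth=4):
    """Count PV and SV for every path of up to `depth` forward steps.

    PV counts every occurrence of a path; SV counts each path at most
    once per session (assumed definition).
    """
    pv, sv = Counter(), Counter()
    for pages in sessions:
        pages = dedupe_adjacent(pages)
        seen = set()
        for i in range(len(pages)):
            for k in range(1, depth + 2):  # the page itself plus up to `depth` successors
                if i + k > len(pages):
                    break
                path = tuple(pages[i:i + k])
                pv[path] += 1
                seen.add(path)
        sv.update(seen)
    return pv, sv


def conversion_rate(pv, path):
    """Share of visits to the path's prefix that continue to its last page."""
    return pv[path] / pv[path[:-1]]
```

For example, with the sessions `["home", "video", "home", "video"]` and `["home", "profile"]`, the path `("home", "video")` has a PV of 2 but an SV of 1, and its conversion rate from `home` is 2/3.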
Engineering Architecture: The backend constructs Sankey diagrams by building weighted path trees using adjacency tables organized by level. The implementation includes reading data layer by layer, constructing bidirectional edge relationships (parent-child and child-parent), pruning to remove isolated nodes and incomplete paths, and finally constructing the adjacency table for visualization.
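A rough Python sketch of this backend step follows. The row layout `(level, parent, child, pv)`, the `min_pv` threshold, and the function names are hypothetical; the article does not specify its schema or its exact pruning criteria. Here pruning drops low-weight edges and then any node that can no longer be reached from the root page.

```python
from collections import defaultdict


def build_adjacency(rows, root, min_pv=1):
    """Build a level-organized adjacency table from path rows.

    rows: iterable of (level, parent, child, pv), where `level` is the
    level of the child page and the root page sits at level 0.
    Returns (adjacency, parent_index), both keyed by (level, page).
    """
    children = defaultdict(dict)  # parent -> {child: weight}
    parents = defaultdict(dict)   # child -> {parent: weight}
    for level, parent, child, pv in sorted(rows):  # read layer by layer
        if pv < min_pv:
            continue  # prune insignificant edges
        src, dst = (level - 1, parent), (level, child)
        children[src][dst] = children[src].get(dst, 0) + pv
        parents[dst][src] = children[src][dst]

    # Prune isolated nodes and incomplete paths: keep only what is
    # reachable from the root by following parent -> child edges.
    start = (0, root)
    reachable, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in children.get(node, {}):
            if nxt not in reachable:
                reachable.add(nxt)
                stack.append(nxt)

    adjacency = {node: dict(children.get(node, {})) for node in reachable}
    parent_index = {
        node: {p: w for p, w in parents[node].items() if p in reachable}
        for node in reachable
        if node in parents
    }
    return adjacency, parent_index


def to_sankey_links(adjacency):
    """Flatten the adjacency table into source/target/value links."""
    return [
        {"source": f"L{sl}:{sp}", "target": f"L{tl}:{tp}", "value": weight}
        for (sl, sp), edges in sorted(adjacency.items())
        for (tl, tp), weight in sorted(edges.items())
    ]
```

Keying nodes by `(level, page)` rather than by page name alone keeps the structure a tree-like layered graph even when the same page appears at several depths, which is what a Sankey layout needs.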
Technical Implementation: The system uses ClickHouse for its columnar storage and fast query performance. The article also describes an optimization for writes: rather than writing through the distributed table, the pipeline writes directly to local tables on hosts selected by DNS polling (round-robin), which reduced the number of waiting TCP connections by over 72% and peak inbound traffic by over 88%.
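The effect of rotating writes across local tables can be illustrated with a small in-process simulation. In the setup the article describes, the rotation is done by DNS; the sketch below instead cycles over an explicit host list, and the host names and function name are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_batches(batches, hosts):
    """Pair each write batch with the next host in round-robin order."""
    if not hosts:
        raise ValueError("at least one host is required")
    rotation = cycle(hosts)
    return [(next(rotation), batch) for batch in batches]


# Hypothetical ClickHouse nodes that each hold a local table.
plan = assign_batches(["b1", "b2", "b3", "b4", "b5"], ["ck-node-1", "ck-node-2"])
load = Counter(host for host, _ in plan)  # batches per host
```

Because each batch goes straight to one node's local table, no single node has to receive the full stream and forward it to the others, which is consistent with the drop in connection backlog and inbound traffic peaks reported above.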
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.