Comprehensive Guide to User Crowd Analysis: Distribution, Metrics, Drill‑down, Cross, and Comparative Methods with Implementation Details
This article explains the concepts, analytical methods, visualizations, and SQL implementation of user crowd analysis—including distribution, metric, drill‑down, cross, and comparative analyses—while also covering trend monitoring, TGI calculation, and handling of array‑type tags in ClickHouse and Hive.
Audience profiling (crowd analysis) aims to deepen the understanding of a pre‑defined user group by examining various dimensions such as distribution, metrics, drill‑down, cross‑analysis, and comparative analysis.
1. Distribution analysis calculates the proportion of label values (e.g., gender, province, interests) and is best visualized with pie, ring, or bar charts. Only enumerable labels with limited distinct values are suitable for this analysis.
2. Metric analysis aggregates quantifiable tags (e.g., online time, follower count, recharge amount) using functions like SUM, AVG, MIN, MAX. Results can be shown as numeric dashboards or time‑series line charts for trend monitoring and alerting.
3. Drill‑down analysis adds a secondary dimension to a primary distribution (e.g., province distribution within male users), revealing deeper insights that single‑layer distribution cannot provide.
4. Cross analysis combines multiple dimensions (e.g., gender × province) to compute aggregated metrics for each combination, often visualized with color‑coded tables.
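5. Comparative analysis contrasts two crowds (e.g., A vs. B) using distribution data and the TGI (Target Group Index), where TGI = (share of a label value in the target crowd ÷ share of the same value in the reference crowd) × 100. A TGI of 100 means the two shares are equal; values above 100 mean the label value is over‑represented in the target crowd, values below 100 mean it is under‑represented, and the further a value is from 100, the stronger the difference.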
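As a rough illustration of the TGI formula in item 5, the sketch below computes a per-label TGI from two lists of label values. The function names and the sample crowds are hypothetical and only serve the example; skipping labels that are missing from the reference crowd is an assumption, not something the source specifies.

```python
from collections import Counter


def distribution(values):
    """Return {label: share}, where the shares sum to 1.0."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def tgi(target_values, reference_values):
    """TGI per label = (share in target crowd / share in reference crowd) * 100.

    Assumption: labels absent from the reference crowd are skipped
    to avoid dividing by zero.
    """
    target = distribution(target_values)
    reference = distribution(reference_values)
    return {
        label: round(share / reference[label] * 100, 1)
        for label, share in target.items()
        if reference.get(label)
    }


# Hypothetical crowds: A is 60% male, B is 50% male.
crowd_a = ["male"] * 60 + ["female"] * 40
crowd_b = ["male"] * 50 + ["female"] * 50

result = tgi(crowd_a, crowd_b)
print(result)
```

With these sample crowds, "male" comes out above 100 and "female" below 100, which matches the reading of TGI as over- or under-representation relative to the reference crowd.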
Implementation details involve configuring analysis tags in the platform, storing configurations in demo_userprofile_crowd, and persisting results in demo_userprofile_crowd_overview. The analysis engine joins crowd result tables with a wide user‑profile table, then applies GROUP BY and aggregation functions. Example SQL for gender distribution:
SELECT
    gender, count(1) AS cnt
FROM (
    -- users that belong to crowd 100
    SELECT user_id FROM userprofile_demo.crowd_result_table_ch WHERE crowd_id = 100
) t1
INNER JOIN (
    -- one day's partition of the wide user-profile table
    SELECT user_id, gender FROM userprofile_demo.userprofile_wide_table_ch WHERE p_date = '2022-08-26'
) t2 ON (t1.user_id = t2.user_id)
GROUP BY gender;
Metric (average online time) example:
SELECT avg(online_time) AS avgValue
FROM (
    -- users that belong to crowd 100
    SELECT user_id FROM userprofile_demo.crowd_result_table_ch WHERE crowd_id = 100
) t1
INNER JOIN (
    -- metric tag to aggregate
    SELECT user_id, online_time FROM userprofile_demo.userprofile_wide_table_ch WHERE p_date = '2022-08-26'
) t2 ON (t1.user_id = t2.user_id);
Drill‑down (province distribution for male users) example:
SELECT province, count(1) AS cnt
FROM (
    -- users that belong to crowd 100
    SELECT user_id FROM userprofile_demo.crowd_result_table_ch WHERE crowd_id = 100
) t1
INNER JOIN (
    -- drill-down filter: '男' is the stored tag value for "male"
    SELECT user_id, province FROM userprofile_demo.userprofile_wide_table_ch WHERE p_date = '2022-08-26' AND gender = '男'
) t2 ON (t1.user_id = t2.user_id)
GROUP BY province;
Cross analysis (gender × province average online time) example:
SELECT gender, province, avg(online_time) AS avgValue
FROM (
    -- users that belong to crowd 100
    SELECT user_id FROM userprofile_demo.crowd_result_table_ch WHERE crowd_id = 100
) t1
INNER JOIN (
    -- both dimensions plus the metric to aggregate
    SELECT user_id, gender, province, online_time FROM userprofile_demo.userprofile_wide_table_ch WHERE p_date = '2022-08-26'
) t2 ON (t1.user_id = t2.user_id)
GROUP BY gender, province;
For array‑type tags such as interests, the array must be flattened into one row per element before aggregation: Hive uses LATERAL VIEW EXPLODE and ClickHouse uses arrayJoin. Example for interests distribution in ClickHouse:
SELECT item, count(1) AS cnt
FROM (
    -- arrayJoin emits one row per element of the interests array
    SELECT arrayJoin(interests) AS item
    FROM (
        SELECT user_id FROM userprofile_demo.crowd_result_table_ch WHERE crowd_id = 100
    ) t1
    INNER JOIN (
        SELECT user_id, interests FROM userprofile_demo.userprofile_wide_table_ch WHERE p_date = '2022-08-26'
    ) t2 ON (t1.user_id = t2.user_id)
) GROUP BY item;
The platform also supports trend charts for automatically updated crowds, alarm thresholds for key metrics, and the ability to generate new crowds based on analysis results (e.g., users interested in "military").
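The alarm thresholds for key metrics can be pictured as a simple range check over a metric's daily values. The sketch below is only an assumption about how such a check might look; the function name, the threshold values, and the sample series are hypothetical and not taken from the platform.

```python
def check_thresholds(history, lower=None, upper=None):
    """Return (date, value, reason) for every point outside [lower, upper].

    `history` is a list of (date_string, metric_value) pairs, e.g. the daily
    average online time of an automatically updated crowd.
    """
    alerts = []
    for day, value in history:
        if lower is not None and value < lower:
            alerts.append((day, value, "below lower threshold"))
        elif upper is not None and value > upper:
            alerts.append((day, value, "above upper threshold"))
    return alerts


# Hypothetical daily averages of online_time for one crowd.
avg_online_time = [
    ("2022-08-24", 31.5),
    ("2022-08-25", 30.8),
    ("2022-08-26", 22.1),
]

alerts = check_thresholds(avg_online_time, lower=25.0)
print(alerts)
```

A real alerting rule would more likely compare against a moving baseline than a fixed bound; the fixed bound is used here only to keep the example short.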
Overall, the crowd‑analysis workflow consists of configuring label dimensions, executing SQL‑based calculations via ClickHouse/Hive, storing results, and visualizing them through dashboards, charts, and TGI reports.
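To try the compute-and-store part of this workflow in miniature, the following sketch runs the same join plus GROUP BY pattern against an in-memory SQLite database instead of ClickHouse or Hive. The table names are shortened stand-ins, and the columns of the result table (`crowd_id`, `tag`, `tag_value`, `cnt`) are an assumption, since the source does not give the schema of demo_userprofile_crowd_overview.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crowd_result (crowd_id INTEGER, user_id INTEGER);
    CREATE TABLE userprofile_wide (user_id INTEGER, gender TEXT, p_date TEXT);
    -- hypothetical layout for persisted analysis results
    CREATE TABLE crowd_overview (crowd_id INTEGER, tag TEXT, tag_value TEXT, cnt INTEGER);
""")

# Hypothetical sample data: crowd 100 has three users, crowd 200 has one.
conn.executemany(
    "INSERT INTO crowd_result VALUES (?, ?)",
    [(100, 1), (100, 2), (100, 3), (200, 4)],
)
conn.executemany(
    "INSERT INTO userprofile_wide VALUES (?, ?, ?)",
    [
        (1, "male", "2022-08-26"),
        (2, "female", "2022-08-26"),
        (3, "male", "2022-08-26"),
        (4, "female", "2022-08-26"),
    ],
)

# Distribution analysis: join the crowd with the wide table, then group by the tag.
rows = conn.execute(
    """
    SELECT t2.gender, COUNT(1) AS cnt
    FROM (SELECT user_id FROM crowd_result WHERE crowd_id = ?) t1
    INNER JOIN (SELECT user_id, gender FROM userprofile_wide WHERE p_date = ?) t2
      ON t1.user_id = t2.user_id
    GROUP BY t2.gender
    ORDER BY t2.gender
    """,
    (100, "2022-08-26"),
).fetchall()

# Persist the result so a dashboard could read it without recomputing.
conn.executemany("INSERT INTO crowd_overview VALUES (100, 'gender', ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT tag_value, cnt FROM crowd_overview ORDER BY tag_value"
).fetchall()
print(stored)
```

Only users of crowd 100 are counted, because the inner join drops every wide-table row whose user_id is not in the crowd result subquery.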
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
