Big Data 17 min read

Retention Analysis Model Practice Based on ClickHouse

The article explains retention analysis models, their importance for user loyalty, outlines offline Hive architecture, then shows how ClickHouse’s retention() function and columnar storage dramatically speed up multi‑day retention calculations, providing SQL examples and practical guidance for product analytics.

vivo Internet Technology
vivo Internet Technology
vivo Internet Technology
Retention Analysis Model Practice Based on ClickHouse

This article is the 4th in the vivo Internet Big Data team's "User Behavior Analysis Model Practice" series, focusing on retention analysis models.

The article provides a detailed introduction to the concept and basic principles of retention analysis models, and explains their implementation in products. Addressing practical usage challenges, it explores a ClickHouse-based retention analysis model solution.

Background: With China's internet users reaching 1.079 billion and internet penetration at 79.4%, the market has entered an era of stock users. User retention has become more important than user acquisition. Retention analysis helps identify loyal users, understand retention performance, analyze user churn, and evaluate whether target users complete expected behaviors.

Core Concepts: Retention analysis examines the ratio of users who triggered a starting event and subsequently triggered a return visit event within a subsequent time period. The starting event and return visit event can be the same or different. Key metrics include same-day retention, Day 1 retention, Day 2 retention, and Day 3 retention rates.

Implementation: The article details the offline architecture using Hive for retention calculations, including configuration, computation, storage, and display layers. It provides SQL implementation examples showing how to calculate retention data across multiple days.

ClickHouse Optimization: To address issues of slow report generation and high resource consumption, the article proposes using ClickHouse for faster queries. Key features utilized include efficient data import and compression, columnar storage and distributed computing for incremental calculations, caching mechanisms for improved query efficiency, and SQL support for flexible analysis.

The ClickHouse implementation uses the retention() function which takes conditions as parameters (1-32 UInt8 type arguments) to indicate whether events meet specific criteria. The article provides complete SQL examples demonstrating how to calculate 3-day retention data for specified time ranges in a single query.

Conclusion: Retention analysis is a core model in the data analysis toolkit, widely used for evaluating product user experience and attractiveness. It can be combined with other analysis methods like path analysis, funnel analysis, and event analysis for deeper insights into user behavior patterns.

user behaviorData ModelingClickHouseHiveSQL Optimizationretention analysisbig data analyticscohort analysis
vivo Internet Technology
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.