Implementing Daily User Metrics (DNU, DAU, Retention) with Shell Scripts and MySQL Partitioned Tables
This guide explains how to collect, store, clean, and aggregate daily user metrics—including new users, active users, and multi‑day retention—using shell scripts, cron jobs, MySQL partitioned tables, and stored procedures, suitable for small‑to‑medium sized products.
Background
After a product launch, operations need to track user data and calculate daily new users (DNU), daily active users (DAU), and retention rates for each cohort over subsequent days.
Scope of Use
The solution fits projects with daily active users under one million, ideal for small teams or solo developers who handle both backend development and operations.
Tools and Technical Points
Shell Scripts
Basic knowledge of shell scripting and crontab is required.
MySQL
Proficiency with SQL, stored procedures, and partitioned tables is needed.
Implementation Process
1. Goal
Aggregate raw logs into a single table that can be queried with one SELECT statement to obtain daily new users, active users, and retention percentages, while allowing re‑runs without affecting existing data.
2. Data Collection
Assume three dimensions: date, country, and version. Log entries are written as '||'-delimited fields (matching the LOAD DATA statement below) to a file that MySQL can load:
2022-02-09 13:14:15||aaaaaa||CN||100
2022-02-09 13:14:16||bbbbbb||US||100
2022-02-09 13:14:17||cccccc||NL||100

When data volume grows, split the logs by hour and load each part separately. Use a partitioned table so old data can be deleted efficiently.
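The article never shows the raw table itself. A minimal sketch of st_base that is consistent with the LOAD DATA and ALTER TABLE ... PARTITION statements below might look like this; the column types and the sample partition name are assumptions:

```sql
-- Assumed definition of the raw log table, LIST-partitioned on the
-- integer day key dd (e.g. 20220209) so whole days can be dropped cheaply.
CREATE TABLE `st_base` (
  `ts` DATETIME NOT NULL,
  `device_id` VARCHAR(64) NOT NULL,
  `country` VARCHAR(8),
  `version` INT NOT NULL,
  `dd` INT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY LIST (`dd`) (
  PARTITION p20220209 VALUES IN (20220209)
);
```

LIST partitioning keeps each day in its own partition, so expiring a day is a metadata-only DROP PARTITION rather than a slow DELETE.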
3. Data Import
Shell script steps (run daily via cron):
Delete the previous day's partition.
Delete the partition for the current execution date (useful for re‑runs).
Create a new partition for the execution date.
Load the log file into MySQL with the LOAD DATA LOCAL INFILE command.
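The script below leaves the partition names and day keys as placeholders (pxxxxxxxx). Assuming partitions are named p<YYYYMMDD> after the day they hold, the values for a run could be derived with GNU date, as in this sketch:

```shell
#!/bin/sh
# Sketch (assumes GNU date and p<YYYYMMDD> partition naming, which the
# article leaves as pxxxxxxxx). The job runs just after midnight, so the
# log being loaded belongs to the previous calendar day.
dd=$(date -d "1 day ago" +%Y%m%d)        # day key being (re)loaded
expired=$(date -d "2 days ago" +%Y%m%d)  # oldest partition, no longer needed
drop_old_sql="alter table st_base drop partition p${expired}"
drop_redo_sql="alter table st_base drop partition p${dd}"  # enables re-runs
add_sql="alter table st_base add partition (partition p${dd} values in (${dd}) engine=innodb)"
echo "$drop_old_sql"
echo "$drop_redo_sql"
echo "$add_sql"
```

Dropping and recreating the execution date's partition before loading is what makes the job safe to re-run for the same day.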
drop_sql="alter table st_base drop partition pxxxxxxxx"
add_sql="alter table st_base add partition (partition pxxxxxxxx values in (xxxxxxxx) engine=innodb)"
mysql -u${username} -p${password} -D${database} -e "${drop_sql}"
mysql -u${username} -p${password} -D${database} -e "${add_sql}"
log_file=xxxx
load_sql="load data local infile '${log_file}' ignore into table st_base fields terminated by '||' lines terminated by '\n' (ts,device_id,country,version) set dd='${dd}'"
mysql -u${username} -p${password} -D${database} -e "${load_sql}"

Schedule the script at 00:10 each day with crontab:
10 0 * * * /path/to/job/st_base.sh

4. Data Cleaning
Use temporary tables and stored procedures to transform raw data, deduplicate, and populate three main tables:
tmp_base – temporary table for type conversion.
total_base – one row per unique device.
flow_base – daily flow table carrying each device's registration date (rdt) for retention calculations.
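Note that INSERT IGNORE only deduplicates when a unique key exists, so total_base presumably carries one on device_id. The article omits both table definitions; the sketches below are assumptions (types, key names, and the dd day key mirror the rest of the article):

```sql
-- Assumed sketch of total_base: one row per device, first-seen date in dt.
CREATE TABLE `total_base` (
  `dt` DATE NOT NULL,               -- registration (first-seen) date
  `device_id` VARCHAR(64) NOT NULL,
  `country` VARCHAR(8),
  `version` INT NOT NULL,
  UNIQUE KEY `uk_device` (`device_id`)  -- makes INSERT IGNORE a dedup
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Assumed sketch of flow_base: one row per device per active day. Since it
-- is partitioned by the integer day key dd (like st_base), any unique key
-- must include dd.
CREATE TABLE `flow_base` (
  `dt` DATE NOT NULL,               -- activity date
  `device_id` VARCHAR(64) NOT NULL,
  `country` VARCHAR(8),
  `version` INT NOT NULL,
  `rdt` DATE NOT NULL,              -- registration date from total_base
  `dd` INT NOT NULL,                -- day key, e.g. 20220209
  UNIQUE KEY `uk_day_device` (`dd`,`device_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY LIST (`dd`) (
  PARTITION p20220209 VALUES IN (20220209)
);
```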
CREATE PROCEDURE `sp_data_cleaning`(IN v_dt VARCHAR(10))
BEGIN
TRUNCATE TABLE tmp_base;
INSERT INTO tmp_base(`dt`,`device_id`,`country`,`version`)
SELECT v_dt, `device_id`, `country`, `version`
FROM `st_base` WHERE `dd` = REPLACE(v_dt,'-','');
INSERT IGNORE INTO total_base(`dt`,`device_id`,`country`,`version`)
SELECT `dt`,`device_id`,`country`,`version` FROM tmp_base;
/* partition handling omitted for brevity */
INSERT IGNORE INTO flow_base(`dt`,`device_id`,`country`,`version`,`rdt`,`dd`)
SELECT v_dt, t1.`device_id`, t1.`country`, t1.`version`, t2.`dt`, REPLACE(v_dt,'-','')
FROM tmp_base t1 LEFT OUTER JOIN total_base t2 ON t1.`device_id`=t2.`device_id`;
END

5. Data Aggregation
Create two reporting tables:
rpt_base_active – stores daily new users and active users.
rpt_base – stores cohort retention counts d0 through d7, where d0 is the cohort's new-user count and d1-d7 count the users returning one to seven days later.
CREATE TABLE `rpt_base_active` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`dt` DATE NOT NULL,
`country` VARCHAR(8),
`version` INT NOT NULL,
`new_users` INT NOT NULL DEFAULT 0,
`active_users` INT NOT NULL DEFAULT 0,
PRIMARY KEY (`id`),
KEY `index1` (`dt`),
KEY `index3` (`country`,`version`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Daily new and active users';
CREATE TABLE `rpt_base` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`dt` DATE NOT NULL,
`country` VARCHAR(8),
`version` INT NOT NULL,
`d0` INT NOT NULL DEFAULT 0,
`d1` INT NOT NULL DEFAULT 0,
`d2` INT NOT NULL DEFAULT 0,
`d3` INT NOT NULL DEFAULT 0,
`d4` INT NOT NULL DEFAULT 0,
`d5` INT NOT NULL DEFAULT 0,
`d6` INT NOT NULL DEFAULT 0,
`d7` INT NOT NULL DEFAULT 0,
PRIMARY KEY (`id`),
KEY `index1` (`dt`),
KEY `index3` (`country`,`version`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='User retention';

Aggregate data with stored procedures that compute the required dates, delete previous results, and insert fresh aggregates.
CREATE PROCEDURE `sp_rpt_base`(IN v_dt CHAR(10))
BEGIN
/* d0..d7: DATE variables holding v_dt back through v_dt minus 7 days
   (declarations and SET statements elided in the original) */
DELETE FROM rpt_base_active WHERE `dt` = v_dt;
INSERT INTO rpt_base_active(`dt`,`country`,`version`,`new_users`,`active_users`)
SELECT `dt`,`country`,`version`,
SUM(IF(`dt`=`rdt`,1,0)) AS new_users,
COUNT(*) AS active_users
FROM flow_base WHERE dt=v_dt
GROUP BY `dt`,`country`,`version`;
DELETE FROM rpt_base WHERE `dt` IN (d0,d1,d2,d3,d4,d5,d6,d7);
INSERT INTO rpt_base(`dt`,`country`,`version`,`d0`,`d1`,`d2`,`d3`,`d4`,`d5`,`d6`,`d7`)
SELECT t1.`rdt`, t1.`country`, t1.`version`,
SUM(CASE WHEN `rdays`=0 THEN `users` ELSE 0 END) AS d0,
SUM(CASE WHEN `rdays`=1 THEN `users` ELSE 0 END) AS d1,
SUM(CASE WHEN `rdays`=2 THEN `users` ELSE 0 END) AS d2,
SUM(CASE WHEN `rdays`=3 THEN `users` ELSE 0 END) AS d3,
SUM(CASE WHEN `rdays`=4 THEN `users` ELSE 0 END) AS d4,
SUM(CASE WHEN `rdays`=5 THEN `users` ELSE 0 END) AS d5,
SUM(CASE WHEN `rdays`=6 THEN `users` ELSE 0 END) AS d6,
SUM(CASE WHEN `rdays`=7 THEN `users` ELSE 0 END) AS d7
FROM a_flow_base t1 /* intermediate table (built by sp_a_flow_base, not shown) with `users` counted per rdt, rdays, country, version */
GROUP BY t1.`rdt`, t1.`country`, t1.`version`;
END

Full Process Wrapper
Combine all steps into a single procedure that can be called daily:
CREATE PROCEDURE `sp_user`(IN v_dt CHAR(10))
BEGIN
CALL sp_tmp_base(v_dt);
CALL sp_data_cleaning(v_dt);
CALL sp_a_flow_base(v_dt);
CALL sp_rpt_base(v_dt);
END

Final Notes
The flow_base table should retain enough days to cover the longest retention window (e.g., keep 8 days for 7‑day retention). If processing takes longer than a couple of hours, consider a Hadoop‑based solution.
Note: Index design for the reporting tables should be tuned to the specific queries used by the operations team.
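With both reporting tables populated, the single query promised in the Goal section might look like the sketch below; the percentage columns and the null-safe country comparison are choices of this example, not from the article:

```sql
-- Daily new users, active users, and day-1/day-7 retention in one SELECT.
SELECT a.`dt`, a.`country`, a.`version`,
       a.`new_users`, a.`active_users`,
       ROUND(100 * r.`d1` / NULLIF(r.`d0`, 0), 2) AS day1_retention_pct,
       ROUND(100 * r.`d7` / NULLIF(r.`d0`, 0), 2) AS day7_retention_pct
FROM rpt_base_active a
JOIN rpt_base r
  ON a.`dt` = r.`dt`
 AND a.`country` <=> r.`country`   -- null-safe: country is nullable
 AND a.`version` = r.`version`
ORDER BY a.`dt` DESC;
```

NULLIF guards against division by zero for cohorts with no new users on a given day.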