
Unlocking E‑Commerce Insights: How Python & SQL Reveal User Behavior and Boost Sales

This article analyzes a JD e‑commerce dataset using Python and MySQL to calculate key metrics such as PV, UV, conversion rates, attrition, daily activity, hourly trends, user‑behavior funnels, purchase intervals, retention rates, product sales, and RFM segmentation, and then offers data‑driven recommendations to improve traffic, conversion, and user loyalty.


Project Background

The project analyzes JD e‑commerce operational data to understand user shopping behavior and provide decision‑making support. Both MySQL and Python are used for metric calculations to suit different data‑analysis environments.

Dataset Introduction

The dataset contains five files covering user data from 2018‑02‑01 to 2018‑04‑15, already anonymized. The behavior data table includes five fields, as shown in the figure.

[Figure: dataset fields]

Data Cleaning

# Import Python modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
plt.style.use('ggplot')
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Read data (use reduce_mem_usage for large files)
user_action = pd.read_csv('jdata_action.csv')
# Filter data from 2018‑03‑30 to 2018‑04‑15 (April data contains add‑to‑cart actions)
user_data = user_action[(user_action['action_time'] >= '2018-03-30') & (user_action['action_time'] < '2018-04-15')]
user_data.to_csv('user_data.csv', sep=',')
# Inspect data types
behavior = pd.read_csv('user_data.csv', index_col=0)
print(behavior.info())
# Check for missing values
print(behavior.isnull().sum())
# Parse the timestamp, then split it into date, hour, and weekday
behavior['action_time'] = pd.to_datetime(behavior['action_time'])
behavior['date'] = behavior['action_time'].dt.date
behavior['hour'] = behavior['action_time'].dt.hour
behavior['weekday'] = behavior['action_time'].dt.day_name()
# Drop unrelated column
behavior = behavior.drop('module_id', axis=1)
# Map numeric type to descriptive label
behavior_type = {1:'pv',2:'pay',3:'fav',4:'comm',5:'cart'}
behavior['type'] = behavior['type'].map(behavior_type)
behavior.reset_index(drop=True, inplace=True)

Metric Construction and Analysis

1. Traffic Metrics

# Total page views (pv)
pv = behavior[behavior['type'] == 'pv']['user_id'].count()
# Unique visitors (uv)
uv = behavior['user_id'].nunique()
# Users who made a purchase
user_pay = behavior[behavior['type'] == 'pay']['user_id'].unique()
# Daily average page views
pv_per_day = pv / behavior['date'].nunique()
# Average page views per user
pv_per_user = pv / uv
# Purchase page views
pv_pay = behavior[behavior['user_id'].isin(user_pay)]['type'].value_counts().pv
# Purchase user proportion
user_pay_rate = len(user_pay) / uv
# Purchase page view proportion
pv_pay_rate = pv_pay / pv
# Average purchase page views per paying user
pv_per_buy_user = pv_pay / len(user_pay)
print(f'Total page views (PV): {pv}')
print(f'Unique visitors (UV): {uv}')
print(f'Paying users: {len(user_pay)}')
print(f'Page views by paying users: {pv_pay}')
print(f'Daily average page views: {pv_per_day:.3f}')
print(f'Average page views per user: {pv_per_user:.3f}')
print(f'Average page views per paying user: {pv_per_buy_user:.3f}')
print(f'Paying-user share of visitors: {user_pay_rate*100:.3f}%')
print(f'Paying-user share of page views: {pv_pay_rate*100:.3f}%')

Output:

Total page views (PV): 6229177
Unique visitors (UV): 728959
Paying users: 395874
Page views by paying users: 3918000
Daily average page views: 389323.562
Average page views per user: 8.545
Average page views per paying user: 9.897
Paying-user share of visitors: 54.307%
Paying-user share of page views: 62.898%

Interpretation: Paying users spend more time on the site, indicating a decent shopping experience. New users need clearer product guidance to increase conversion.

2. User Purchase Frequency

# Total purchase count per user
total_buy_count = behavior[behavior['type']=='pay'].groupby('user_id')['type'].count().to_frame(name='total')
# Top 10 buyers
topbuyer10 = total_buy_count.sort_values(by='total', ascending=False).head(10)
# Repurchase rate (share of buyers with >= 2 purchases)
re_buy_rate = (total_buy_count['total'] >= 2).sum() / len(total_buy_count)
print(f'Repurchase rate: {re_buy_rate*100:.3f}%')

Output: Repurchase rate: 13.419% Interpretation: Most users purchase six times or fewer; targeted promotions and a smoother shopping experience can raise purchase frequency.
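As a sanity check on claims like "most users purchase six times or fewer", the share of users at or below a purchase-count threshold can be computed directly. A minimal sketch with toy data (hypothetical, not the JD dataset):

```python
import pandas as pd

# Toy purchase log: one row per payment event (hypothetical data)
pay_log = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4]})

# Purchases per user, mirroring the groupby-count step above
buy_count = pay_log.groupby('user_id').size()

# Share of users with at most 6 purchases
share_le6 = (buy_count <= 6).mean()
print(f'Share of users with <=6 purchases: {share_le6:.0%}')  # 3 of 4 users
```

On the real dataset the same expression applied to `total_buy_count['total']` would quantify the claim exactly.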

3. Temporal Distribution of User Behavior

# Daily active users (any action)
daily_active_user = behavior.groupby('date')['user_id'].nunique()
# Daily paying users
daily_buy_user = behavior[behavior['type']=='pay'].groupby('date')['user_id'].nunique()
# Daily paying user proportion
proportion_of_buyer = daily_buy_user / daily_active_user
# Daily total purchase count
daily_buy_count = behavior[behavior['type']=='pay'].groupby('date')['type'].count()
# Daily average purchases per paying user
consumption_per_buyer = daily_buy_count / daily_buy_user
# Hourly page views and visitors
pv_hourly = behavior[behavior['type']=='pv'].groupby('hour')['user_id'].count()
uv_hourly = behavior.groupby('hour')['user_id'].nunique()
# Hourly distribution of user operations (excluding page views, to match the comment)
type_detail_hour = pd.pivot_table(columns='type', index='hour', data=behavior[behavior['type'] != 'pv'], aggfunc='count', values='user_id')
# Weekday distribution of user operations (excluding page views)
type_detail_weekday = pd.pivot_table(columns='type', index='weekday', data=behavior[behavior['type'] != 'pv'], aggfunc='count', values='user_id')
type_detail_weekday = type_detail_weekday.reindex(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])

4. User Behavior Conversion Funnel

# Overall behavior distribution (explicit column names, robust across pandas versions)
type_dis = behavior['type'].value_counts().rename_axis('type').reset_index(name='count')
type_dis['rate'] = round(type_dis['count'] / type_dis['count'].sum(), 3)

Result: 82.6% of actions are page views, only 6.4% are payments.

# Per-action subsets used by all funnels below
df_pv = behavior[behavior['type'] == 'pv']
df_cart = behavior[behavior['type'] == 'cart']
df_fav = behavior[behavior['type'] == 'fav']
df_pay = behavior[behavior['type'] == 'pay']
df_cart_uid = df_cart['user_id'].unique()
df_fav_uid = df_fav['user_id'].unique()
# Funnel: PV → Pay (excluding users who added to cart or favorited)
fav_cart_list = set(df_fav_uid) | set(df_cart_uid)
pv_pay_df = pd.merge(df_pv, df_pay, on=['user_id','sku_id'], suffixes=('_pv','_pay'))
pv_pay_df = pv_pay_df[~pv_pay_df['user_id'].isin(fav_cart_list) & (pv_pay_df['action_time_pv'] < pv_pay_df['action_time_pay'])]
uv = behavior['user_id'].nunique()
pv_pay_num = pv_pay_df['user_id'].nunique()
pv_pay_data = pd.DataFrame({'type':['view','pay'], 'num':[uv, pv_pay_num]})
pv_pay_data['conversion_rates'] = round(pv_pay_data['num'] / pv_pay_data['num'][0], 4) * 100
# Funnel: PV → Cart → Pay
pv_cart_df = pd.merge(df_pv, df_cart, on=['user_id','sku_id'], suffixes=('_pv','_cart'))
pv_cart_df = pv_cart_df[pv_cart_df['action_time_pv'] < pv_cart_df['action_time_cart']]
pv_cart_df = pv_cart_df[~pv_cart_df['user_id'].isin(df_fav_uid)]
pv_cart_pay_df = pd.merge(pv_cart_df, df_pay, on=['user_id','sku_id'])
pv_cart_pay_df = pv_cart_pay_df[pv_cart_pay_df['action_time_cart'] < pv_cart_pay_df['action_time']]
uv = behavior['user_id'].nunique()
pv_cart_num = pv_cart_df['user_id'].nunique()
pv_cart_pay_num = pv_cart_pay_df['user_id'].nunique()
pv_cart_pay_data = pd.DataFrame({'type':['view','cart','pay'], 'num':[uv, pv_cart_num, pv_cart_pay_num]})
pv_cart_pay_data['conversion_rates'] = round(pv_cart_pay_data['num'] / pv_cart_pay_data['num'][0], 4) * 100
# Funnel: PV → Fav → Pay
pv_fav_df = pd.merge(df_pv, df_fav, on=['user_id','sku_id'], suffixes=('_pv','_fav'))
pv_fav_df = pv_fav_df[pv_fav_df['action_time_pv'] < pv_fav_df['action_time_fav']]
pv_fav_df = pv_fav_df[~pv_fav_df['user_id'].isin(df_cart_uid)]
pv_fav_pay_df = pd.merge(pv_fav_df, df_pay, on=['user_id','sku_id'])
pv_fav_pay_df = pv_fav_pay_df[pv_fav_pay_df['action_time_fav'] < pv_fav_pay_df['action_time']]
uv = behavior['user_id'].nunique()
pv_fav_num = pv_fav_df['user_id'].nunique()
pv_fav_pay_num = pv_fav_pay_df['user_id'].nunique()
pv_fav_pay_data = pd.DataFrame({'type':['view','fav','pay'], 'num':[uv, pv_fav_num, pv_fav_pay_num]})
pv_fav_pay_data['conversion_rates'] = round(pv_fav_pay_data['num'] / pv_fav_pay_data['num'][0], 4) * 100
# Funnel: PV → Fav → Cart → Pay
pv_fav = pd.merge(df_pv, df_fav, on=['user_id','sku_id'], suffixes=('_pv','_fav'))
pv_fav = pv_fav[pv_fav['action_time_pv'] < pv_fav['action_time_fav']]
pv_fav_cart = pd.merge(pv_fav, df_cart, on=['user_id','sku_id'])
pv_fav_cart = pv_fav_cart[pv_fav_cart['action_time_fav'] < pv_fav_cart['action_time']]
pv_fav_cart_pay = pd.merge(pv_fav_cart, df_pay, on=['user_id','sku_id'], suffixes=('_cart','_pay'))
pv_fav_cart_pay = pv_fav_cart_pay[pv_fav_cart_pay['action_time_cart'] < pv_fav_cart_pay['action_time_pay']]
uv = behavior['user_id'].nunique()
pv_fav_n = pv_fav['user_id'].nunique()
pv_fav_cart_n = pv_fav_cart['user_id'].nunique()
pv_fav_cart_pay_n = pv_fav_cart_pay['user_id'].nunique()
pv_fav_cart_pay_data = pd.DataFrame({'type':['view','fav','cart','pay'], 'num':[uv, pv_fav_n, pv_fav_cart_n, pv_fav_cart_pay_n]})
pv_fav_cart_pay_data['conversion_rates'] = round(pv_fav_cart_pay_data['num'] / pv_fav_cart_pay_data['num'][0], 4) * 100

5. Purchase Time Interval by Path

import math  # needed for math.ceil below
# PV → Cart → Pay interval (hours)
pcp_interval = pv_cart_pay_df.groupby(['user_id','sku_id']).apply(lambda x: (x['action_time'].min() - x['action_time_cart'].min())).reset_index(name='interval')
pcp_interval['interval'] = (pcp_interval['interval'].dt.total_seconds() / 3600).apply(math.ceil)
# PV → Fav → Pay interval (hours)
pfp_interval = pv_fav_pay_df.groupby(['user_id','sku_id']).apply(lambda x: (x['action_time'].min() - x['action_time_fav'].min())).reset_index(name='interval')
pfp_interval['interval'] = (pfp_interval['interval'].dt.total_seconds() / 3600).apply(math.ceil)

Both paths show most users complete payment within 4 hours, indicating an efficient checkout flow.
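The "within 4 hours" claim follows from the share of interval values at or below that threshold. A minimal sketch with hypothetical intervals (not the real `pcp_interval`/`pfp_interval` data):

```python
import pandas as pd

# Hypothetical cart-to-pay intervals (illustrative values only)
intervals = pd.to_timedelta(['30min', '2h', '3h', '5h', '30h'])

# Round each interval up to whole hours, as in the analysis above
hours = pd.Series(intervals.ceil('h').total_seconds() / 3600)

# Share of purchases completed within 4 hours of adding to cart
within_4h = (hours <= 4).mean()
print(f'Paid within 4 hours: {within_4h:.0%}')  # 3 of 5 toy purchases
```

Applied to the real `pcp_interval['interval']` column, the same expression yields the quoted share.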

6. User Retention Analysis

# Define key dates
first_day = datetime.strptime('2018-03-30', '%Y-%m-%d').date()
fifth_day = datetime.strptime('2018-04-03', '%Y-%m-%d').date()
tenth_day = datetime.strptime('2018-04-08', '%Y-%m-%d').date()
fifteenth_day = datetime.strptime('2018-04-13', '%Y-%m-%d').date()
# Users active on the first day (treated as the cohort)
user_num_first = behavior[behavior['date']==first_day]['user_id'].unique()
# Users active on later days
user_num_fifth = behavior[behavior['date']==fifth_day]['user_id'].unique()
user_num_tenth = behavior[behavior['date']==tenth_day]['user_id'].unique()
user_num_fifteenth = behavior[behavior['date']==fifteenth_day]['user_id'].unique()
# Retention rates
fifth_day_retention_rate = len(set(user_num_first) & set(user_num_fifth)) / len(user_num_first)
tenth_day_retention_rate = len(set(user_num_first) & set(user_num_tenth)) / len(user_num_first)
fifteenth_day_retention_rate = len(set(user_num_first) & set(user_num_fifteenth)) / len(user_num_first)
retention_rate = pd.DataFrame({
    'n_day_after': [5,10,15],
    'Rate': [fifth_day_retention_rate, tenth_day_retention_rate, fifteenth_day_retention_rate]
})

Result: 5-day retention is 22.81% and 15-day retention 17.44%. The curve is relatively flat, suggesting a sticky product, though targeted messaging and periodic incentives could lift it further.
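The repeated intersect-and-divide steps above generalize to a small helper. A sketch (the helper name `n_day_retention` and the toy log are illustrative, not from the original code):

```python
import pandas as pd

def n_day_retention(df, cohort_date, n):
    """Share of users active on cohort_date who are also active n days later."""
    cohort = set(df.loc[df['date'] == cohort_date, 'user_id'])
    later = set(df.loc[df['date'] == cohort_date + pd.Timedelta(days=n), 'user_id'])
    return len(cohort & later) / len(cohort)

# Toy activity log (hypothetical, not the JD data)
log = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 3],
    'date': pd.to_datetime(['2018-03-30'] * 4 + ['2018-04-04'] * 2),
})
print(n_day_retention(log, pd.Timestamp('2018-03-30'), 5))  # 2 of 4 cohort users return
```

This avoids hand-writing one variable per checkpoint and makes it cheap to compute retention for every n in a range.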

7. Product Sales Analysis

# Total number of SKUs
total_skus = behavior['sku_id'].nunique()
# Average operations before purchase per SKU
sku_df = behavior[behavior['sku_id'].isin(behavior[behavior['type']=='pay']['sku_id'].unique())].groupby('sku_id')['type'].value_counts().unstack(fill_value=0)
sku_df['total'] = sku_df.sum(axis=1)
sku_df['avg_beha'] = round(sku_df['total'] / sku_df['pay'], 2)
# SKU sales ranking (sales > 1000)
sku_num = behavior[behavior['type']=='pay'].groupby('sku_id')['type'].count().reset_index(name='total')
top_skus = sku_num[sku_num['total']>1000].sort_values(by='total', ascending=False)

Interpretation: In an operations-vs-sales scatter, products in the lower-right quadrant attract many operations but few sales (typically high-price items), while the upper-right quadrant holds high-frequency, high-sales items (fast-moving consumer goods). Promoting top-selling SKUs can increase the average number of items per user.
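The quadrant reading above can be made explicit by splitting SKUs on the medians of operations and sales. A sketch with toy numbers (the quadrant labels and data are illustrative assumptions, not from the article):

```python
import pandas as pd

# Toy per-SKU stats: total operations vs. completed sales (hypothetical numbers)
sku = pd.DataFrame({'sku_id': [101, 102, 103, 104],
                    'ops':   [900, 120, 850, 100],
                    'sales': [400,  20,  15, 300]})

# Split on the medians to form the four quadrants described above
ops_med, sales_med = sku['ops'].median(), sku['sales'].median()

def quadrant(row):
    if row['ops'] >= ops_med:
        return 'fast-mover' if row['sales'] >= sales_med else 'high-interest/low-sales'
    return 'impulse buy' if row['sales'] >= sales_med else 'long tail'

sku['quadrant'] = sku.apply(quadrant, axis=1)
print(sku[['sku_id', 'quadrant']])
```

Run against the real `sku_df`, such labels would let each quadrant be counted and targeted (e.g. price promotions for the high-interest/low-sales group).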

8. RFM User Segmentation

# RFM segmentation (the dataset has no order amounts, so only Recency and Frequency are used)
buy_group = behavior[behavior['type']=='pay'].groupby('user_id')['date']
final_day = datetime.strptime('2018-04-14', '%Y-%m-%d').date()
# Recency (days since last purchase)
recent_buy_time = buy_group.apply(lambda x: (final_day - x.max()).days).reset_index(name='recent')
# Frequency (purchase count in period)
buy_freq = buy_group.count().reset_index(name='freq')
RFM = pd.merge(recent_buy_time, buy_freq, on='user_id')
# Quantile split into 2 groups for each dimension
RFM['R'] = pd.qcut(RFM['recent'], 2, labels=[1,0])  # lower recency = 1
RFM['F'] = pd.qcut(RFM['freq'].rank(method='first'), 2, labels=[0,1])  # higher freq = 1
RFM['RFM'] = RFM['R'].astype(int).astype(str) + RFM['F'].astype(int).astype(str)
segment_map = {'01':'important retention customers','11':'important value customers','10':'important win-back customers','00':'general development customers'}
RFM['User_Segment'] = RFM['RFM'].map(segment_map)
segment_counts = RFM['User_Segment'].value_counts().reset_index(name='count')
segment_counts['Rate'] = segment_counts['count'] / segment_counts['count'].sum()

Result: The segments are relatively balanced; the focus should be on growing the share of important value customers and shrinking that of general development customers. Tailored strategies include premium benefits for high-value users, re-engagement campaigns for dormant users, and loyalty incentives for regular shoppers.

Conclusion

The analysis reveals that:

Increasing channel promotion and precise audience targeting can raise PV/UV and conversion.

Optimizing product search, recommendation, and checkout flow will improve purchase frequency and reduce attrition.

Retention is stable but can be boosted with periodic incentives, sign‑in rewards, and personalized communication.

RFM segmentation enables focused marketing, allocating resources to the most valuable user groups.

Tags: e-commerce, Python, SQL, user behavior, data analysis, retention, RFM, conversion funnel
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
