How to Build E‑Commerce User Profiles with Python: A Step‑by‑Step Guide
This tutorial walks you through a real‑world example of constructing e‑commerce user profiles using Python, pandas, and matplotlib, covering data loading, preprocessing, feature engineering, behavior tagging, activity analysis, and RFM segmentation with clear code snippets and visualizations.
导读: 本文以真实案例,手把手教你搭建电商系统的用户画像。
1. 导入库
# 导入所需的库
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from datetime import datetime%matplotlib inline: 魔法函数,使绘图自动显示。
datetime: 用于时间处理。
2. 数据准备
读取原始数据并抽取 20% 样本以提升运行效率:
# 导入数据集
df_orginal = pd.read_csv('./taobao_persona.csv')
# 抽取部分数据
df = df_orginal.sample(frac=0.2, random_state=None)3. 数据预处理
检查缺失值并删除缺失严重的 user_geohash 列;将 time 拆分为 date 与 hour ,并将 hour 分段为凌晨、上午、中午、下午、晚上:
# 检查缺失值
df.isnull().any().sum()
# 删除 user_geohash 列
df.drop('user_geohash', axis=1, inplace=True)
# 拆分时间
df['date'] = df['time'].str[0:10]
df['time'] = df['time'].str[11:]
df['time'] = df['time'].astype(int)
# 分段
df['hour'] = pd.cut(df['time'], bins=[-1,5,10,13,18,24],
labels=['凌晨','上午','中午','下午','晚上'])4. 构建用户标签表
创建仅包含唯一 user_id 的标签表:
users = df['user_id'].unique()
labels = pd.DataFrame(users, columns=['user_id'])5. 构建用户行为标签
5.1 浏览时间段分析
统计每个用户在各时间段的浏览次数,取最大次数对应的时间段作为标签:
# 统计浏览次数
time_browse = df[df['behavior_type']==1].groupby(['user_id','hour']).item_id.count().reset_index()
time_browse.rename(columns={'item_id':'hour_counts'}, inplace=True)
# 取最大浏览次数的时间段
time_browse_max = time_browse.groupby('user_id')['hour_counts'].max().reset_index()
time_browse_max.rename(columns={'hour_counts':'read_counts_max'}, inplace=True)
time_browse = pd.merge(time_browse, time_browse_max, how='left', on='user_id')
time_browse_hour = time_browse.loc[time_browse['hour_counts']==time_browse['read_counts_max'], 'hour']
.time_browse_hour = time_browse_hour.groupby('user_id').agg(lambda x: ','.join(x)).reset_index()
labels = pd.merge(labels, time_browse_hour, how='left', on='user_id')
labels.rename(columns={'hour':'time_browse'}, inplace=True)5.2 类目行为分析
分别统计浏览、收藏、加购、购买四种行为在各商品类目下的次数,并取每个用户最多的类目:
# 浏览类目
df_browse = df[df['behavior_type']==1][['user_id','item_id','item_category']]
# 收藏类目
df_collect = df[df['behavior_type']==2][['user_id','item_id','item_category']]
# 加购类目
df_cart = df[df['behavior_type']==3][['user_id','item_id','item_category']]
# 购买类目
df_buy = df[df['behavior_type']==4][['user_id','item_id','item_category']]
# 以浏览为例统计最多类目
df_cate_most_browse = df_browse.groupby(['user_id','item_category']).item_id.count().reset_index()
df_cate_most_browse.rename(columns={'item_id':'item_category_counts'}, inplace=True)
df_cate_most_browse_max = df_cate_most_browse.groupby('user_id')['item_category_counts'].max().reset_index()
df_cate_most_browse_max.rename(columns={'item_category_counts':'item_category_counts_max'}, inplace=True)
df_cate_most_browse = pd.merge(df_cate_most_browse, df_cate_most_browse_max, how='left', on='user_id')
# 合并到标签表
labels = pd.merge(labels, df_cate_most_browse[['user_id','item_category']], how='left', on='user_id')
labels.rename(columns={'item_category':'cate_most_browse'}, inplace=True)5.3 最近 30 天行为统计
统计近 30 天内的浏览、收藏、加购、购买次数以及活跃天数:
# 近30天购买次数
df_counts_30_buy = df[df['behavior_type']==4].groupby('user_id').item_id.count().reset_index()
labels = pd.merge(labels, df_counts_30_buy, how='left', on='user_id')
labels.rename(columns={'item_id':'counts_30_buy'}, inplace=True)
# 近30天加购次数
df_counts_30_cart = df[df['behavior_type']==3].groupby('user_id').item_id.count().reset_index()
labels = pd.merge(labels, df_counts_30_cart, how='left', on='user_id')
labels.rename(columns={'item_id':'counts_30_cart'}, inplace=True)
# 近30天活跃天数(去重 date)
counts_30_active = df.groupby('user_id')['date'].nunique().reset_index()
labels = pd.merge(labels, counts_30_active, how='left', on='user_id')
labels.rename(columns={'date':'counts_30_active'}, inplace=True)5.4 最近一次行为距今天数
计算浏览、加购、购买等最近一次行为距离当前日期的天数:
# 示例:最近一次浏览距今天数
days_browse = df[df['behavior_type']==1].groupby('user_id')['date'].max().apply(lambda x: (datetime.strptime('2014-12-19','%Y-%m-%d')-x).days)
labels = pd.merge(labels, days_browse.rename('days_browse'), how='left', on='user_id')5.5 最近两次购买间隔天数
# 计算两次购买之间的天数间隔
df_interval_buy = df[df['behavior_type']==4].groupby(['user_id','date']).item_id.count().reset_index()
interval_buy = df_interval_buy.groupby('user_id')['date'].apply(lambda x: x.sort_values().diff(1).dropna().head(1)).reset_index()
interval_buy['date'] = interval_buy['date'].apply(lambda x: x.days)
labels = pd.merge(labels, interval_buy.rename(columns={'date':'interval_buy'}), how='left', on='user_id')5.6 浏览未下单、加购未下单标记
# 浏览未下单
df_browse_buy = df[(df['behavior_type']==1)|(df['behavior_type']==4)][['user_id','item_id','behavior_type','time']]
browse_not_buy = pd.pivot_table(df_browse_buy, index=['user_id','item_id'], columns=['behavior_type'], values='time', aggfunc='count')
browse_not_buy.columns = ['browse','buy']
browse_not_buy.fillna(0, inplace=True)
browse_not_buy['browse_not_buy'] = 0
browse_not_buy.loc[(browse_not_buy['browse']>0) & (browse_not_buy['buy']==0), 'browse_not_buy'] = 1
browse_not_buy = browse_not_buy.groupby('user_id')['browse_not_buy'].sum().reset_index()
labels = pd.merge(labels, browse_not_buy, how='left', on='user_id')
labels['browse_not_buy'] = labels['browse_not_buy'].apply(lambda x: '是' if x>0 else '否')
# 加购未下单(同理)6. 用户属性标签构建
6.1 是否复购用户
buy_again = df[df['behavior_type']==4].groupby('user_id').item_id.count().reset_index()
buy_again.rename(columns={'item_id':'buy_again'}, inplace=True)
labels = pd.merge(labels, buy_again, how='left', on='user_id')
labels['buy_again'] = labels['buy_again'].fillna(-1)
labels['buy_again'] = labels['buy_again'].apply(lambda x: '是' if x>1 else ('否' if x==1 else '未购买'))6.2 访问活跃度划分
user_active_level = labels['counts_30_active'].value_counts().sort_index(ascending=False)
labels['user_active_level'] = '高'
labels.loc[labels['counts_30_active']<=16, 'user_active_level'] = '低'6.3 购买活跃度划分
buy_active_level = labels['counts_30_buy'].value_counts().sort_index(ascending=False)
labels['buy_active_level'] = '高'
labels.loc[labels['counts_30_buy']<=14, 'buy_active_level'] = '低'6.4 购买品类是否单一
buy_single = df[df['behavior_type']==4].groupby('user_id').item_category.nunique().reset_index()
buy_single.rename(columns={'item_category':'buy_single'}, inplace=True)
labels = pd.merge(labels, buy_single, how='left', on='user_id')
labels['buy_single'] = labels['buy_single'].fillna(-1)
labels['buy_single'] = labels['buy_single'].apply(lambda x: '是' if x>1 else ('否' if x==1 else '未购买'))6.5 RFM 模型分层
# 购买天数层级(最近一次购买距今天数)
labels['buy_days_level'] = '高'
labels.loc[labels['days_buy']>8, 'buy_days_level'] = '低'
# 合并层级
labels['rfm_value'] = labels['buy_active_level'].str.cat(labels['buy_days_level'])
def trans_value(x):
if x == '高高':
return '重要价值客户'
elif x == '低高':
return '重要深耕客户'
elif x == '高低':
return '重要唤回客户'
else:
return '即将流失客户'
labels['rfm'] = labels['rfm_value'].apply(trans_value)
labels.drop(['buy_days_level','rfm_value'], axis=1, inplace=True)通过上述步骤,得到每位用户的时间标签、类目偏好、近期行为统计、活跃度分层、复购情况以及 RFM 分层,为后续精准营销提供数据支撑。
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
