Unlocking Powerful Features: A Deep Dive into Tianchi’s Repeat Purchase Prediction
This tutorial walks through the complete feature‑engineering pipeline for the Alibaba Tianchi “Tmall User Repeat Purchase Prediction” competition, covering data acquisition, memory‑efficient preprocessing, multi‑entity feature construction, statistical aggregations, text vectorisation, embedding generation and stacking‑based model features, all illustrated with Python code and diagrams.
Introduction
There is a widely quoted saying in the industry: “Data and features determine the upper bound of machine learning, while models and algorithms only approximate that bound.” This highlights the critical role of feature engineering in machine learning. In this article we explore feature engineering in a real‑world commercial scenario using the “Tmall User Repeat Purchase Prediction” case from the book *Alibaba Cloud Tianchi Competition Analysis – Machine Learning*.
Learning Prerequisites
(1) The feature‑engineering portion of this article is based on the second competition in the referenced book.
(2) The related data can be downloaded for free after registering for the Tmall User Repeat Purchase Prediction competition on the Alibaba Cloud Tianchi platform.
(3) Alibaba Cloud Tianchi provides a free Jupyter Notebook environment with a P100 GPU and unlimited CPU/GPU usage (5‑hour sessions, repeatable).
1. Dataset Introduction
The downloaded dataset contains the following files:
train_format1.csv contains the training data; the last column is label.
test_format1.csv contains the test data; the last column is prob (the prediction target).
user_log_format1.csv stores user behavior logs.
user_info_format1.csv stores basic user information.
2. Feature Construction
The competition data revolves around three entities: users, shops, and merchants. Feature construction therefore focuses on these entities and on the interactions between them.
Examples include user‑shop interaction features, shop‑level attributes, and user‑level purchase patterns.
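As a rough illustration of these feature families, the sketch below computes one feature of each kind from a toy behaviour log. The toy rows, the feature names (user_action_cnt, seller_user_nunique, user_seller_action_cnt) and the choice of aggregations are assumptions made for illustration; only the column names user_id, seller_id and action_type follow the competition log.

```python
import pandas as pd

# Toy behaviour log (hypothetical rows; column names follow user_log_format1.csv).
toy_log = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2, 3],
    "seller_id":   [10, 10, 20, 10, 30, 20],
    "action_type": [0, 2, 0, 0, 2, 0],
})

# User-level feature: total number of logged actions per user.
user_feat = (toy_log.groupby("user_id").size()
             .rename("user_action_cnt").reset_index())

# Shop-level feature: number of distinct users seen by each seller.
seller_feat = (toy_log.groupby("seller_id")["user_id"].nunique()
               .rename("seller_user_nunique").reset_index())

# User-shop interaction feature: number of actions per (user, seller) pair.
pair_feat = (toy_log.groupby(["user_id", "seller_id"]).size()
             .rename("user_seller_action_cnt").reset_index())

# Attach all three feature families to the (user, seller) pairs.
features = (pair_feat
            .merge(user_feat, on="user_id", how="left")
            .merge(seller_feat, on="seller_id", how="left"))
print(features)
```

Each row of features describes one (user, seller) pair, which matches the granularity of the training samples; user-level and shop-level values are simply repeated across the pairs they belong to.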
3. Feature Extraction Steps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import gc
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
We then read the data:
test_data = pd.read_csv('./data_format1/test_format1.csv')
train_data = pd.read_csv('./data_format1/train_format1.csv')
user_info = pd.read_csv('./data_format1/user_info_format1.csv')
user_log = pd.read_csv('./data_format1/user_log_format1.csv')
To reduce memory usage we define a helper function:
def reduce_mem_usage(df, verbose=True):
    # Downcast each numeric column to the smallest dtype that can hold its value range
    start_mem = df.memory_usage().sum() / 1024**2
    numerics = ['int16','int32','int64','float16','float32','float64']
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Note: float16 fits the range but keeps only about 3 significant digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Memory usage before optimization is: {:.2f} MB'.format(start_mem))
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    return df
We concatenate train and test, merge user information, and free unused objects:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
all_data = pd.concat([train_data, test_data])
all_data = all_data.merge(user_info, on=['user_id'], how='left')
del train_data, test_data, user_info
gc.collect()
Next we collapse each user's log into one space‑separated string per column:
# Sort logs
user_log = user_log.sort_values(['user_id','time_stamp'])
# Function to join list elements with a space
list_join_func = lambda x: " ".join([str(i) for i in x])
agg_dict = {
    'item_id': list_join_func,
    'cat_id': list_join_func,
    'seller_id': list_join_func,
    'brand_id': list_join_func,
    'time_stamp': list_join_func,
    'action_type': list_join_func
}
rename_dict = {
    'item_id': 'item_path',
    'cat_id': 'cat_path',
    'seller_id': 'seller_path',
    'brand_id': 'brand_path',
    'time_stamp': 'time_stamp_path',
    'action_type': 'action_type_path'
}
def merge_list(df_ID, join_columns, df_data, agg_dict, rename_dict):
    df_data = df_data.groupby(join_columns).agg(agg_dict).reset_index().rename(columns=rename_dict)
    df_ID = df_ID.merge(df_data, on=join_columns, how='left')
    return df_ID
all_data = merge_list(all_data, 'user_id', user_log, agg_dict, rename_dict)
del user_log
gc.collect()
We then define a suite of statistical aggregation functions:
# Each helper takes a space-separated path string and returns -1 when the
# value cannot be processed (for example, NaN for a user without log entries).
def cnt_(x):
    try:
        return len(x.split(' '))
    except Exception:
        return -1
def nunique_(x):
    try:
        return len(set(x.split(' ')))
    except Exception:
        return -1
def max_(x):
    try:
        return np.max([float(i) for i in x.split(' ')])
    except Exception:
        return -1
def min_(x):
    try:
        return np.min([float(i) for i in x.split(' ')])
    except Exception:
        return -1
def std_(x):
    try:
        return np.std([float(i) for i in x.split(' ')])
    except Exception:
        return -1
# n-th most frequent token in the path
def most_n(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][0]
    except Exception:
        return -1
# Occurrence count of the n-th most frequent token
def most_n_cnt(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][1]
    except Exception:
        return -1
Using these helpers we generate concrete features, for example on the seller_path column:
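# The wrappers user_cnt, user_nunique and user_most_n are not shown in this article;
# the definitions below are an assumed minimal form that applies the helpers column-wise.
def user_cnt(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(cnt_)
    return df_data
def user_nunique(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(nunique_)
    return df_data
def user_most_n(df_data, single_col, name, n=1):
    df_data[name] = df_data[single_col].apply(lambda x: most_n(x, n))
    return df_data
# Work on a copy of the first 2000 rows to keep the demonstration fast
all_data_test = all_data.head(2000).copy()
all_data_test = user_cnt(all_data_test, 'seller_path', 'user_cnt')
all_data_test = user_nunique(all_data_test, 'seller_path', 'seller_nunique')
all_data_test = user_nunique(all_data_test, 'cat_path', 'cat_nunique')
all_data_test = user_nunique(all_data_test, 'brand_path', 'brand_nunique')
all_data_test = user_nunique(all_data_test, 'item_path', 'item_nunique')
all_data_test = user_nunique(all_data_test, 'time_stamp_path', 'time_stamp_nunique')
all_data_test = user_nunique(all_data_test, 'action_type_path', 'action_type_nunique')
# Most frequent entities
all_data_test = user_most_n(all_data_test, 'seller_path', 'seller_most_1', n=1)
all_data_test = user_most_n(all_data_test, 'cat_path', 'cat_most_1', n=1)
all_data_test = user_most_n(all_data_test, 'brand_path', 'brand_most_1', n=1)
all_data_test = user_most_n(all_data_test, 'action_type_path', 'action_type_1', n=1)
4. Text Vectorisation with CountVectorizer and TF‑IDF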
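Before applying this to the real paths, a toy example may help show what the vectoriser does with space-separated ID strings. The sample strings and the token_pattern override are assumptions for illustration: scikit-learn's default pattern ignores single-character tokens, which would silently drop one-digit IDs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical seller paths: each string is one user's sequence of seller IDs.
toy_paths = ["10 10 20", "10 30", "20 20 20 7"]

# token_pattern=r"\S+" keeps every whitespace-separated ID, including one-digit IDs.
vec = TfidfVectorizer(token_pattern=r"\S+", ngram_range=(1, 1), max_features=100)
matrix = vec.fit_transform(toy_paths)

# Each row is one user, each column one seller ID from the learned vocabulary.
print(sorted(vec.vocabulary_))
print(matrix.shape)

# With the default norm="l2", every non-empty row has unit length.
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
print(row_norms)
```

The result is a sparse matrix, so it can be stacked next to other sparse feature blocks without materialising a dense array.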
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy import sparse
# stop_words='english' selects scikit-learn's built-in English stop-word list
tfidfVec = TfidfVectorizer(stop_words='english', ngram_range=(1,1), max_features=100)
columns_list = ['seller_path']
for i, col in enumerate(columns_list):
    tfidfVec.fit(all_data_test[col])
    data_ = tfidfVec.transform(all_data_test[col])
    # Stack the TF-IDF blocks of all columns side by side
    if i == 0:
        data_cat = data_
    else:
        data_cat = sparse.hstack((data_cat, data_))
5. Embedding Features
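The idea behind the features in this section is mean pooling: each ID in a user's path is mapped to a vector, and the vectors are averaged into one fixed-length row per user. The sketch below shows that step on its own, with a hand-written lookup table standing in for a trained Word2Vec model; the table toy_vectors and the helper mean_pool are hypothetical names, not part of the original pipeline.

```python
import numpy as np

# Stand-in for trained embeddings: seller ID -> 3-dimensional vector (made-up values).
toy_vectors = {
    "10": np.array([1.0, 0.0, 0.0]),
    "20": np.array([0.0, 1.0, 0.0]),
}

def mean_pool(path, vectors, size=3):
    """Average the vectors of the known IDs in a space-separated path."""
    known = [vectors[tok] for tok in path.split(" ") if tok in vectors]
    if not known:
        # No ID has a vector: fall back to all zeros.
        return np.zeros(size)
    return np.mean(known, axis=0)

pooled = mean_pool("10 20 99", toy_vectors)   # "99" is unknown and is skipped
empty = mean_pool("99", toy_vectors)          # nothing known: zero vector
print(pooled)
print(empty)
```

Averaging discards the order of the IDs, which is a deliberate simplification: it yields the same number of columns for every user regardless of path length.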
import gensim
# gensim >= 4.0 uses vector_size and wv.key_to_index (gensim 3.x used size and wv.vocab)
model = gensim.models.Word2Vec(
    all_data_test['seller_path'].apply(lambda x: x.split(' ')),
    vector_size=100,
    window=5,
    min_count=5,
    workers=4)
def mean_w2v_(x, model, size=100):
    # Average the vectors of all in-vocabulary tokens; all zeros if none match
    try:
        vec = np.zeros(size)
        i = 0
        for word in x.split(' '):
            if word in model.wv.key_to_index:
                i += 1
                vec += model.wv[word]
        return vec / i if i > 0 else vec
    except Exception:
        return np.zeros(size)
def get_mean_w2v(df_data, column, model, size):
    data_array = []
    for _, row in df_data.iterrows():
        w2v = mean_w2v_(row[column], model, size)
        data_array.append(w2v)
    return pd.DataFrame(data_array)
df_embedding = get_mean_w2v(all_data_test, 'seller_path', model, 100)
df_embedding.columns = ['embedding_' + str(i) for i in df_embedding.columns]
6. Stacking Classification Features
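The code in this section relies on wrapper functions that are not shown in the article. As a hedged sketch of the general technique, out-of-fold stacking can be written as follows; the synthetic data, the use of LogisticRegression as a stand-in base model, and the helper name oof_features are all assumptions, not the book's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def oof_features(make_model, X_train, y_train, X_test, kf):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    train_feat = np.zeros((X_train.shape[0], 1))
    test_feat = np.zeros((X_test.shape[0], 1))
    for fit_idx, hold_idx in kf.split(X_train):
        model = make_model()
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by a model that never saw it.
        train_feat[hold_idx, 0] = model.predict_proba(X_train[hold_idx])[:, 1]
        # Test predictions are averaged over the folds.
        test_feat[:, 0] += model.predict_proba(X_test)[:, 1] / kf.get_n_splits()
    return train_feat, test_feat

# Synthetic binary-classification data (stand-in for the engineered features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, y_tr, X_te = X[:200], y[:200], X[200:]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Two differently regularised models play the role of distinct base learners.
base_models = [
    lambda: LogisticRegression(C=1.0, max_iter=1000),
    lambda: LogisticRegression(C=0.01, max_iter=1000),
]
pairs = [oof_features(m, X_tr, y_tr, X_te, kf) for m in base_models]
train_stacking = np.concatenate([p[0] for p in pairs], axis=1)
test_stacking = np.concatenate([p[1] for p in pairs], axis=1)
print(train_stacking.shape, test_stacking.shape)
```

Predicting each training row only with a model fitted on the other folds keeps the label from leaking into the stacked feature, which is the reason for routing everything through the KFold splitter.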
from sklearn.model_selection import KFold
folds = 5
seed = 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# lgb_clf and xgb_clf are assumed to be wrapper functions (LightGBM and XGBoost), defined elsewhere,
# that take (x_train, y_train, x_valid, kf) and return (train_feat, test_feat, clf_name).
# x_train, y_train and x_valid are likewise assumed to be prepared beforehand.
clf_list = [lgb_clf, xgb_clf]
train_data_list = []
test_data_list = []
for clf in clf_list:
    train_feat, test_feat, clf_name = clf(x_train, y_train, x_valid, kf)
    train_data_list.append(train_feat)
    test_data_list.append(test_feat)
# One column block per base model
train_stacking = np.concatenate(train_data_list, axis=1)
test_stacking = np.concatenate(test_data_list, axis=1)
Sample output of the stacking process shows validation multi‑logloss values and final averaged scores:
[1] valid_0's multi_logloss: 0.240875
[2] valid_0's multi_logloss: 0.240675
[226] train-mlogloss:0.123211 eval-mlogloss:0.226966
Stopping. Best iteration:
[126] train-mlogloss:0.172219 eval-mlogloss:0.218029
xgb now score is: [2.4208301225770263, 2.2433633135072886, 2.5190920314658434, 2.4902898448798805, 2.5797977298125625]
xgb_score_mean: 2.4506746084485203
Conclusion
The feature‑engineering pipeline presented above demonstrates how to transform raw e‑commerce logs into rich, high‑dimensional representations suitable for machine‑learning models. For deeper insights, additional algorithms, and more detailed explanations, refer to the book *Alibaba Cloud Tianchi Competition Analysis – Machine Learning*.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
