Practical Feature Engineering and Optimization Tips for a Data Mining Competition
The author shares hands‑on experiences from a data‑mining contest, covering categorical feature handling, sliding‑window statistics, feature combination strategies, pandas performance tricks, file organization, and using nohup for remote Linux execution to improve model training efficiency.
Hello everyone, I am Fang Kexing (hang) from Jinan University. This competition was my first exposure to data mining and machine learning. I learned Python for it; my earlier submissions were produced with Matlab.
About Features
A) Most of the provided features are categorical. Replacing them with one-hot encodings or per-category statistics gives better results, and the original categorical columns can then be dropped.
B) Some fields can be broken down further. For example, province and city can be extracted from the hometown field (the dataset encodes hometown at the city level), and clickTime can be binned.
C) Time features are tricky; applying a sliding‑window statistic effectively incorporates temporal information.
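Items A and B above can be sketched in pandas as follows. This is a minimal illustration, not the author's code: the column connectionType, the toy values, the assumption that clickTime is an integer laid out as DDHHMM, and the four period labels are all chosen here for the example.

```python
import pandas as pd

# Toy data; connectionType and label are illustrative column names.
df = pd.DataFrame({
    "clickTime": [170005, 171230, 182359, 180730],  # assumed DDHHMM layout
    "connectionType": [1, 2, 1, 0],
    "label": [0, 1, 0, 0],
})

# Item A: one-hot encode a categorical column, then drop the original.
dummies = pd.get_dummies(df["connectionType"], prefix="connectionType")
df = pd.concat([df.drop(columns="connectionType"), dummies], axis=1)

# Item B: split clickTime into day and hour, then bin the hour
# into coarse periods (bin edges are an arbitrary choice).
df["clickDay"] = df["clickTime"] // 10000
df["clickHour"] = (df["clickTime"] // 100) % 100
df["clickPeriod"] = pd.cut(
    df["clickHour"],
    bins=[0, 6, 12, 18, 24],
    right=False,  # intervals are [0, 6), [6, 12), ...
    labels=["night", "morning", "afternoon", "evening"],
)
```

The binned column is itself categorical, so it can be one-hot encoded in the same way.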
Sliding‑Window Approach
I start from day 26 and, for each day, aggregate the previous 7 days to compute ClickNum, ChangeNum, and the historical conversion rate (the conversion rate in particular has to be computed over a sliding window). Days 16‑17 show abnormal spikes, which is why I start as late as day 26. For the final stage I use only days 26, 27, 28, and 31.
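A minimal sketch of such a 7-day window, under stated assumptions: the log has an integer day column, a 0/1 label column, and an appID key; ChangeNum is taken to mean the number of conversions; and the helper name window_stats is hypothetical. The article does not say which keys the author aggregated by.

```python
import pandas as pd

def window_stats(log, key, target_day, window=7):
    """Aggregate the `window` days strictly before `target_day` per `key`."""
    past = log[(log["day"] >= target_day - window) & (log["day"] < target_day)]
    stats = (
        past.groupby(key)["label"]
        .agg(ClickNum="count", ChangeNum="sum")  # ChangeNum: assumed conversion count
        .reset_index()
    )
    stats["ConvRate"] = stats["ChangeNum"] / stats["ClickNum"]
    stats["day"] = target_day  # the day these features will be attached to
    return stats

# Toy click log.
log = pd.DataFrame({
    "day":   [19, 20, 25, 25, 26, 26, 27],
    "appID": [1, 1, 1, 2, 1, 2, 1],
    "label": [1, 0, 1, 0, 0, 1, 0],
})

# Build features for each target day, then join them onto that day's rows.
feats = pd.concat(
    [window_stats(log, "appID", d) for d in (26, 27)], ignore_index=True
)
train = log[log["day"].isin([26, 27])].merge(feats, on=["day", "appID"], how="left")
```

Because each window ends the day before the target day, the features for a row never include that row's own label.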
D) Duplicate clicks are a strong feature. Because the 31 days are continuous, clicks early in the test set may be repeats of clicks from the last training day, so append the test set to the training set before extracting duplicate‑click features.
E) Exhaustive combination of all features is unnecessary and impractical.
Feature Grouping
I split features into four groups and combine them with identifiers: userID with groups 2‑4, appID with groups 1,3,4, and positionID with groups 1,2,4, achieving decent results.
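The pairing described above could be sketched like this. The article does not list what each group contains, so the group contents below are purely hypothetical, as are the helper name add_cross_features and the string-concatenation way of building a combined feature; only the identifier-to-group pairing comes from the text.

```python
import pandas as pd

# Hypothetical group contents; the article does not specify them.
groups = {
    1: ["age", "gender"],
    2: ["appCategory"],
    3: ["sitesetID"],
    4: ["connectionType"],
}
# Identifier -> groups it is combined with, as stated in the article.
pairing = {"userID": [2, 3, 4], "appID": [1, 3, 4], "positionID": [1, 2, 4]}

def add_cross_features(df, pairing, groups):
    """Add one combined categorical column per (identifier, feature) pair."""
    out = df.copy()
    for id_col, group_ids in pairing.items():
        for g in group_ids:
            for col in groups[g]:
                out[f"{id_col}_x_{col}"] = (
                    out[id_col].astype(str) + "_" + out[col].astype(str)
                )
    return out

sample = pd.DataFrame({
    "userID": [7, 8], "appID": [1, 2], "positionID": [100, 200],
    "age": [20, 35], "gender": [1, 2], "appCategory": [3, 5],
    "sitesetID": [0, 1], "connectionType": [1, 2],
})
crossed = add_cross_features(sample, pairing, groups)
```

The combined columns are ordinary categorical features, so they can be encoded or summarized like any other.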
F) In the installedapps table, appPlatform=2 has zero records, leading to many missing values; I have not yet figured out how to exploit this.
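Returning to item D, the duplicate-click idea can be sketched as below. The definition of a duplicate as "same userID and creativeID", the toy values, and the feature names dupCount, dupRank, and isRepeat are assumptions for illustration; the article does not give the author's exact definition.

```python
import pandas as pd

train = pd.DataFrame({
    "userID": [1, 1, 2], "creativeID": [10, 10, 20],
    "clickTime": [302350, 302358, 302200],
})
test = pd.DataFrame({
    "userID": [1, 3], "creativeID": [10, 30],
    "clickTime": [310002, 310100],
})

# Append test to train first, so repeats that span the boundary are visible.
full = pd.concat(
    [train.assign(is_test=0), test.assign(is_test=1)], ignore_index=True
)
full = full.sort_values("clickTime", kind="stable").reset_index(drop=True)

keys = ["userID", "creativeID"]  # assumed definition of "the same click"
grp = full.groupby(keys)
full["dupCount"] = grp["clickTime"].transform("size")  # clicks sharing the key
full["dupRank"] = grp.cumcount()                       # 0 for the first click
full["isRepeat"] = (full["dupRank"] > 0).astype(int)

test_feats = full[full["is_test"] == 1]
```

In this toy data the first test click by user 1 is flagged as a repeat only because the training rows were appended; computed on the test set alone it would look like a first click.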
Algorithmic Issues
A) The final training set is large, so runs take a long time, but most of the delay comes from unoptimized code. Prefer whole‑column (vectorized) operations in pandas, which handle tens of millions of rows efficiently.
B) Iterating with iterrows() is slow; converting the DataFrame to a dictionary can reduce a 10‑hour iteration to a few minutes.
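The three styles mentioned in A and B can be compared on a toy frame. This is only an illustrative sketch with made-up column names; actual speedups depend on the data and the work done per row.

```python
import pandas as pd

df = pd.DataFrame({
    "userID": [1, 2, 1, 3],
    "clicks": [5, 0, 2, 4],
    "convs":  [1, 0, 0, 2],
})

# Slow: iterrows() builds a pandas Series for every row.
rate_slow = []
for _, row in df.iterrows():
    rate_slow.append(row["convs"] / row["clicks"] if row["clicks"] else 0.0)

# Faster loop: convert once to plain Python dicts, then iterate over those.
rate_dict = [
    r["convs"] / r["clicks"] if r["clicks"] else 0.0
    for r in df.to_dict("records")
]

# Usually fastest: whole-column arithmetic with no Python-level loop.
# where() turns zero click counts into NaN so the division is safe.
rate_vec = (df["convs"] / df["clicks"].where(df["clicks"] > 0)).fillna(0.0)
```

When the per-row logic cannot be expressed as column arithmetic, the dictionary form is a reasonable fallback.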
File Management
I organize the project into three parts: the final dataset, made read‑only with chmod -R 555 to protect it; an intermediate data folder for temporary results; and a code folder containing scripts such as ad.py that preprocess each table.
Running on Remote Linux
When training runs for many hours, use nohup python ***.py & to keep the process alive after disconnecting. Monitor progress by opening nohup.out with vim nohup.out or tail nohup.out.
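One detail that matters when watching nohup.out: Python usually buffers standard output when it is redirected to a file, so progress messages can appear late. A minimal sketch of a logging helper that flushes after every line is shown below; the name log_progress and the three-step loop are hypothetical stand-ins for a real training script.

```python
import sys
import time

def log_progress(message, stream=sys.stdout):
    """Write a timestamped line and flush it so it reaches nohup.out at once."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{stamp}] {message}", file=stream, flush=True)

for epoch in range(3):  # stand-in for a long training loop
    log_progress(f"finished epoch {epoch}")
```

With output flushed like this, tail -f nohup.out follows the log as it grows.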
That is all for this short write‑up. Good luck to everyone!
For more details, visit the official contest site: http://algo.tpai.qq.com and follow the TSA‑Contest public account for updates and gifts.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
