
Tencent Social Ads Data Mining Expert Q&A: Feature Engineering, Modeling, and Competition Insights

In a Q&A session, a Tencent social ads data mining expert addressed competition participants' questions on data delays, full‑set versus sliding‑window statistics, dataset authenticity, Bayesian smoothing, feature selection, handling missing values, large‑scale training, feature interactions, model stacking, online mini‑batch training, and provided reference resources.

Tencent Advertising Technology

Today, a data mining expert from Tencent's Social Ads department held a Q&A session for competition participants in their study group (ID: 150522270), aiming to share knowledge and experience.

Q1: Why does the data for the 16th contain only a few rows? A: Those log entries were delayed and landed in the 17th day's directory; you can choose whether to use them.

Q2: Why do both full‑set and sliding‑window historical conversion‑rate statistics show positive effects, and isn't the full‑set calculation data leakage? A: Full‑set statistics use both past and future data, which are correlated with each other, so both methods can appear to help; however, using future data is leakage and overestimates offline results.
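To make the leak‑free alternative concrete, here is a minimal sketch of sliding‑window conversion‑rate statistics with pandas. It assumes a hypothetical DataFrame with `day`, a key column (e.g., `adid`), and a 0/1 `label`; for each day it aggregates only the preceding `window` days, so no future data leaks into the feature.

```python
import pandas as pd

def sliding_window_cvr(df, key, window=3):
    """For each day, compute the conversion rate of `key` over the
    previous `window` days only (past data only -> no leakage)."""
    # Per-key, per-day click and conversion counts.
    daily = (df.groupby([key, "day"])["label"]
               .agg(clicks="count", convs="sum")
               .reset_index())
    rows = []
    for day in sorted(df["day"].unique()):
        # Restrict to the window strictly before `day`.
        past = daily[(daily["day"] >= day - window) & (daily["day"] < day)]
        stats = past.groupby(key)[["clicks", "convs"]].sum()
        stats["cvr"] = stats["convs"] / stats["clicks"]
        stats["day"] = day
        rows.append(stats.reset_index())
    return pd.concat(rows, ignore_index=True)[[key, "day", "cvr"]]
```

A full‑set version would simply drop the `daily["day"] < day` condition, which is exactly where the leakage comes from.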

Q3: Is the competition dataset completely real or sampled by user ID? A: It is real data, sampled randomly by appid and userid.

Q4: For Bayesian smoothing of conversion rates, should we use windowed processing or full‑set processing? Does full‑set cause data leakage? A: Consider whether the test set contains that information; using full‑set statistics from the training set is acceptable if the same feature extraction logic is applied locally.
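The Bayesian smoothing discussed in Q4 is typically Beta‑Binomial smoothing of the conversion rate. A minimal sketch, assuming the standard method‑of‑moments prior fit; the function names are illustrative, not from the competition code:

```python
import statistics

def beta_smooth(clicks, convs, alpha=1.0, beta=20.0):
    """Shrink a raw conversion rate convs/clicks toward the Beta prior
    mean alpha / (alpha + beta); sparse keys are pulled hardest."""
    return (convs + alpha) / (clicks + alpha + beta)

def moment_prior(rates):
    """Method-of-moments estimate of the Beta(alpha, beta) prior from a
    sample of observed per-key conversion rates."""
    mean = statistics.fmean(rates)
    var = statistics.variance(rates)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common
```

A key with zero clicks falls back to the prior mean, while a key with many clicks stays close to its raw rate, which is the behavior the smoothing is meant to provide.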

Q5: Highly correlated features like appid and adid have almost identical statistics; should both be used or handled specially? A: This is a normal feature selection task; rely on local performance.

Q6: How to handle missing values for combined statistical features, e.g., using the mean? A: Refer to previous Kaggle competition experience; choose imputation methods based on feature distribution, and treat missing values as a special category when using models like XGBoost or Random Forest.
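A small sketch of the imputation choices the answer mentions, using pandas; the column name and sentinel value are assumptions for illustration. For tree models such as XGBoost or Random Forest, a sentinel marks missingness as its own category that the trees can split on:

```python
import pandas as pd

def impute(df, col, strategy="mean", sentinel=-1):
    """Fill missing values: 'mean' for roughly symmetric numeric
    features, 'median' for skewed ones, 'sentinel' to encode
    missingness as a distinct category for tree models."""
    s = df[col]
    if strategy == "mean":
        return s.fillna(s.mean())
    if strategy == "median":
        return s.fillna(s.median())
    return s.fillna(sentinel)
```

Which strategy wins is an empirical question; as the answer suggests, inspect the feature's distribution before choosing.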

Q7: For large‑scale training sets with long training times, what are efficient ways to build a smaller local training/validation set? A: See the public article "Data Mining Competition Big Data Processing and Modeling Experience" for guidance.
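One common way to build a smaller local set, sketched here as an assumption rather than the article's specific method, is to keep all positives and downsample negatives, then correct the predicted probabilities back to the original class prior with the standard correction formula:

```python
import random

def downsample(rows, labels, neg_rate=0.1, seed=0):
    """Keep every positive example; keep each negative with
    probability neg_rate. Shrinks a heavily imbalanced CTR/CVR set."""
    rng = random.Random(seed)
    return [(x, y) for x, y in zip(rows, labels)
            if y == 1 or rng.random() < neg_rate]

def calibrate(p, neg_rate):
    """Map a probability predicted on the downsampled data back to the
    original distribution (prior-correction formula)."""
    return p / (p + (1.0 - p) / neg_rate)
```

Without the calibration step, predictions trained on the downsampled data would be biased upward, which matters for log‑loss‑style metrics.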

Q8: If predicting day 31 data, is full‑set statistics feasible, or will it be rejected during defense for using unavailable data? A: This method does not trigger the "unavailable data" issue because day 31 data is not used during prediction; it may still cause potential leakage during training but can be acceptable if results are satisfactory.

Q9: How to discover effective feature cross combinations without exhaustive groupby? A: Use common statistical metrics or visual tools like boxplots.

Q10: Can Factorization Machines (FM) replace manual feature crossing and achieve the same effect as Logistic Regression with many crossed features? A: Simple cross features (e.g., x1 * x2) need not be handcrafted, but complex features (e.g., userID statistics under specific appID & positionID) may still require manual extraction.
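For reference, the second‑order FM the answer alludes to (Rendle, 2010) models every pairwise cross through latent vectors, which is why simple products like x1 * x2 need no handcrafting; the right‑hand identity is the standard O(kn) reformulation:

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
\qquad
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k} \left[ \Big( \sum_{i=1}^{n} v_{i,f} x_i \Big)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2} x_i^{2} \right]
```

Aggregate features such as "userID statistics under a specific appID and positionID" are not a product of raw inputs, so they fall outside what this interaction term can express; that is the case the answer says still needs manual extraction.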

Q11: How to select features effectively when low‑importance features may help and high‑importance ones may be redundant? A: Feature importance is hard to measure; consider wrapper or embedded methods that select features based on model performance.
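A minimal sketch of the wrapper approach the answer suggests: greedy forward selection driven purely by a validation‑score callback. The `score_fn` interface is an assumption for illustration; in practice it would train a model on the candidate feature set and return its validation metric:

```python
def forward_select(features, score_fn, min_gain=1e-4):
    """Greedy wrapper method: repeatedly add the feature that most
    improves the validation score; stop when no candidate gains at
    least min_gain. Selects on model performance, not raw importance."""
    selected = []
    best = score_fn(selected)
    remaining = list(features)
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_f = max(scored)
        if top_score - best < min_gain:
            break  # no remaining feature helps enough
        selected.append(top_f)
        remaining.remove(top_f)
        best = top_score
    return selected
```

This directly sidesteps the problem in the question: a feature survives only if it improves the actual model, regardless of its standalone importance score.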


Q12: Is prediction typically real‑time or offline? If real‑time, can we use day‑31 data before the event? A: Prediction is real‑time; the model for day 31 is trained before day 31 arrives, so day‑31 data is unavailable during training, matching the competition setup.

Q13: Resources for Vowpal Wabbit supporting LR + high‑order feature interactions and online learning? A: The Vowpal Wabbit GitHub, and a third‑place solution using VW with high‑order features: GitHub repo.

Q14: References for model ensembling methods and parameter selection? A: Refer to the Kaggle experience‑sharing article and the comprehensive Kaggle Ensembling Guide.

Q15: Does stacking work if some models perform poorly? A: Successful stacking requires diverse models with relatively balanced performance; large performance gaps can reduce ensemble benefit.
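A sketch of the out‑of‑fold step at the heart of stacking, assuming base models exposed through a generic fit/predict interface (the factory-style `models` argument is an illustrative choice, not a specific library API):

```python
import numpy as np

def oof_predictions(models, X, y, n_folds=5):
    """Build out-of-fold predictions: for each fold, train every base
    model on the remaining folds and predict the held-out part. The
    result (one column per model) is the meta-model's training input,
    so the meta-model never sees predictions made on training data."""
    n = len(y)
    meta = np.zeros((n, len(models)))
    folds = np.array_split(np.arange(n), n_folds)
    for hold in folds:
        train = np.setdiff1d(np.arange(n), hold)
        for j, make_model in enumerate(models):
            model = make_model()  # fresh instance per fold
            model.fit(X[train], y[train])
            meta[hold, j] = model.predict(X[hold])
    return meta
```

The answer's caveat shows up here concretely: if one column of `meta` is far weaker than the others, the meta‑model mostly ignores it, and the ensemble gains little over the best single model.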

Q16: Does Tencent currently use online training? A: Yes, online mini‑batch training is used.
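To illustrate the idea (this is a generic sketch, not Tencent's production system), online mini‑batch training updates the model on each small batch of fresh data instead of retraining from scratch, which is also the adaptation advantage mentioned in Q18:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=1, w=None):
    """Mini-batch SGD for logistic regression. Passing the previous
    weights back in as `w` continues training on newly arrived data,
    so the model keeps tracking the latest distribution."""
    n, d = X.shape
    if w is None:
        w = np.zeros(d)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(xb @ w)))        # sigmoid
            w -= lr * (xb.T @ (p - yb)) / len(yb)      # gradient step
    return w
```

In a streaming setting, each new batch of logs would call this once with `epochs=1` and the running weights, rather than looping over a fixed dataset.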

Q17: Why do some records show two conversions for the same event, and how to handle it? A: The train.csv was inspected and no such case was found; the issue may stem from data merging problems.

Q18: Why does using the full dataset for baseline in the final round perform worse than using only a few days, unlike the preliminary round? A: The effect is not directly tied to data volume; recent data often carries more importance, highlighting the advantage of online learning that continuously adapts to new distributions.

For more details, visit the official competition website: http://algo.tpai.qq.com. Official WeChat account for contest updates and gifts: TSA-Contest.

Reference material: Kaggle experience sharing.

Tags: machine learning, data mining, feature engineering, online learning, competition, model stacking, Vowpal Wabbit
Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
