Tencent Social Ads Algorithm Competition: Expert Q&A FAQ and Technical Insights

This FAQ records the live expert Q&A session held for Tencent's Social Ads algorithm competition. It covers data labeling, feature handling, model selection, class-imbalance mitigation, and practical engineering tips for participants and researchers alike.

Tencent Advertising Technology

May 19, 2017

Yesterday, the Tencent Social Ads algorithm competition held a live expert Q&A session that drew both on-site and online participants. The experts answered a wide range of technical questions; the full FAQ is reproduced below.

1. Is yi provided with the test data, and how are submissions evaluated? No. yi is the label for the test set and is withheld. Submissions must contain instanceID and the predicted probability; the platform computes log-loss against the hidden labels (see https://www.kaggle.com/wiki/LogLoss).
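As a rough illustration of the metric, the sketch below computes mean binary log-loss for a handful of predictions. The clipping constant `eps` is an assumption made here to avoid `log(0)`; the platform's exact implementation is not described in the Q&A.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # eps is an assumed safeguard
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy example: three hidden labels and the three submitted probabilities.
score = log_loss([1, 0, 0], [0.9, 0.1, 0.2])
print(score)
```

Lower is better. Confident wrong predictions are penalized heavily, which is why well-calibrated probabilities matter more here than a good ranking alone.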

2. What do the identifiers for 2G, 3G, 4G, and WiFi users represent? All feature values are encrypted, and the mapping from IDs to real values is withheld for data-security reasons. This does not affect solving the problem.

3. What does appID mean and how does it relate to appCategory? appID is an encrypted identifier for a specific app; each appID has an associated category listed in the file app_categories.csv.
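The appID-to-category relationship can be used as a simple lookup join. The sketch below assumes `app_categories.csv` has two columns named `appID` and `appCategory`; the sample values and the `"unknown"` placeholder are invented for illustration.

```python
import csv
import io

# Stand-in for app_categories.csv; the column layout and values are assumed.
APP_CATEGORIES_CSV = """appID,appCategory
14,2
25,203
68,104
"""


def load_app_categories(handle):
    """Build an appID -> appCategory lookup from a CSV file handle."""
    return {row["appID"]: row["appCategory"] for row in csv.DictReader(handle)}


def add_category(rows, lookup, missing="unknown"):
    """Attach appCategory to each row; unseen appIDs get a placeholder."""
    return [dict(row, appCategory=lookup.get(row["appID"], missing)) for row in rows]


lookup = load_app_categories(io.StringIO(APP_CATEGORIES_CSV))
samples = [{"instanceID": "1", "appID": "25"}, {"instanceID": "2", "appID": "999"}]
enriched = add_category(samples, lookup)
```

With the real file, `io.StringIO(...)` would be replaced by `open("app_categories.csv", newline="")`.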

4. How to handle the large dataset on a modest machine? A machine with 16 GB RAM and 8‑core CPU is sufficient for the preliminary data; for the final round Tencent Cloud machines will be offered. Feature selection or more efficient algorithms can also reduce resource usage.

5. How to process multi‑value categorical variables such as creativeID? Use one‑hot encoding (e.g., gender → [1,0,0], [0,1,0], [0,0,1]) or other methods described in the baseline article.
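A minimal sketch of one-hot encoding, using the three-valued gender example from the answer. The helper names are hypothetical, and mapping unseen categories to an all-zero vector is one possible convention, not something the experts specified.

```python
def fit_one_hot(values):
    """Map each distinct category to a column index, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Return a 0/1 vector; categories unseen during fitting map to all zeros."""
    vec = [0] * len(index)
    if value in index:
        vec[index[value]] = 1
    return vec


genders = [0, 1, 2, 1]
index = fit_one_hot(genders)
encoded = [one_hot(g, index) for g in genders]
# encoded == [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

For a field with many distinct values, such as creativeID, dense vectors like these become very wide; storing only the active column index (a sparse representation) is the usual workaround.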

6. How to deal with severe class imbalance and feature combinations? Down-sample the majority (negative) class, optionally repeating with several random seeds and ensembling the resulting models. Calibrate the predictions afterwards, because a model trained on down-sampled data overestimates the positive rate. For feature combinations, use the Cartesian product for ID features or arithmetic transformations for numeric features, and consider models such as XGBoost or DNN.
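The down-sample-then-calibrate idea can be sketched as follows. The `label` field name and the helper names are assumptions. The correction formula is the standard odds rescaling for negative down-sampling, not necessarily the exact method the experts had in mind.

```python
import random


def downsample_negatives(rows, keep_rate, seed):
    """Keep every positive row and each negative row with probability keep_rate."""
    rng = random.Random(seed)
    return [r for r in rows if r["label"] == 1 or rng.random() < keep_rate]


def calibrate(p, keep_rate):
    """Map a probability learned on down-sampled data back to the original scale."""
    return p / (p + (1.0 - p) / keep_rate)


# Toy data: 20 positives and 980 negatives (invented numbers).
rows = [{"label": 1}] * 20 + [{"label": 0}] * 980

# One down-sampled training set per seed; a model would be trained on each.
samples = [downsample_negatives(rows, keep_rate=0.1, seed=s) for s in (0, 1, 2)]

# A raw prediction of 0.5 from such a model corresponds to about 0.09
# once the 10x negative down-sampling is undone.
corrected = calibrate(0.5, keep_rate=0.1)
```

Skipping the calibration step leaves probabilities systematically too high, which log-loss punishes even when the ranking of samples is good.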

7. Why does XGBoost still overfit despite regularization? The cause may be a shift in the data distribution over time or an improper train/validation split. Make sure the validation set is representative, and avoid data leakage in feature engineering.

8. What is the overall advertising workflow? Advertisers create creative assets, select ad slots, define target audiences and budgets, submit for review, and then the ads are displayed. Conversion tracking via SDK or API is encouraged to improve ROI.

9. Why do some users have multiple clicks but only the last click is linked to conversion? When multiple clicks occur, the conversion is attributed to the last click; such cases are rare.
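The last-click rule can be illustrated with a small labeling function. The `clickTime` field, the integer timestamps, and the choice to count only clicks at or before the conversion time are assumptions made for this sketch; the Q&A states only that the conversion goes to the last click.

```python
def attribute_last_click(clicks, conversion_time):
    """Label one user's clicks on an app: the latest click at or before the
    conversion gets label 1, every other click gets label 0."""
    eligible = [c for c in clicks if c["clickTime"] <= conversion_time]
    last = max(eligible, key=lambda c: c["clickTime"]) if eligible else None
    return [dict(c, label=int(c is last)) for c in clicks]


clicks = [
    {"instanceID": 1, "clickTime": 100},
    {"instanceID": 2, "clickTime": 250},
    {"instanceID": 3, "clickTime": 400},
]
labeled = attribute_last_click(clicks, conversion_time=300)
```

Here only the click at time 250 is marked as converting; the earlier click and the click after the conversion both stay negative.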

10. Any classic data‑cleaning steps for CTR problems? Merge the provided files, apply one‑hot encoding to ID features, and handle the variable‑length app list feature with appropriate transformations; refer to TalkingData competitions for examples.

11. Common feature engineering for CTR tasks? Use ID cross features, ID statistical features, and study open‑source solutions from Criteo, Avazu, and Avito competitions.
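A minimal sketch of an ID cross feature and a count statistic built on top of it. The `connectionType` field name and all values are hypothetical; any pair of ID columns could be crossed the same way.

```python
from collections import Counter


def cross(row, left, right):
    """Combine two ID fields into a single categorical value."""
    return f"{row[left]}_{row[right]}"


def count_feature(rows, key):
    """Count how often each value produced by the key function appears."""
    return Counter(key(r) for r in rows)


rows = [
    {"appID": "a1", "connectionType": "wifi"},
    {"appID": "a1", "connectionType": "wifi"},
    {"appID": "a1", "connectionType": "4g"},
    {"appID": "a2", "connectionType": "wifi"},
]
pair_counts = count_feature(rows, lambda r: cross(r, "appID", "connectionType"))
app_counts = count_feature(rows, lambda r: r["appID"])
```

The crossed value can be one-hot encoded like any other ID, while the counts can be attached to each row as numeric features.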

12. What does “APP” refer to in this competition? It refers to mobile applications on smartphones.

13. How to break a performance plateau? Model ensembling, deeper data analysis, richer feature engineering, and careful hyper‑parameter tuning can yield significant gains.
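One of the simplest forms of model ensembling is a weighted average of the probabilities from several models, sketched below. The function name and the weights are illustrative assumptions; stacking or rank averaging are other common choices.

```python
def blend(model_predictions, weights=None):
    """Weighted average of per-row probabilities from several models."""
    n_models = len(model_predictions)
    weights = weights or [1.0 / n_models] * n_models
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*model_predictions)
    ]


# Two hypothetical models scoring the same two test rows.
model_a = [0.2, 0.4]
model_b = [0.4, 0.8]
equal_blend = blend([model_a, model_b])
weighted_blend = blend([model_a, model_b], weights=[3, 1])
```

Weights would normally be chosen on a held-out validation set rather than on the leaderboard.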

14. When to choose FFM vs. XGBoost? FFM handles sparse ID features well but needs strong feature engineering; XGBoost is fast and works off‑the‑shelf. Their relative performance depends on data and task specifics.

15. Advice for beginners struggling with feature extraction? Treat feature engineering as an art; study existing solutions, perform thorough data analysis, and iterate.

16. Why does adding userID improve model performance? userID is a unique encrypted identifier, so it lets the model capture individual behavior. Clustering users is optional.

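17. Does a tiny gain from one-hot encoding mean it's unnecessary? It depends on the model. Models that treat inputs as numeric magnitudes (LR, DNN) need one-hot or a similar encoding for ID features; tree-based models (RF, XGBoost) can ingest raw categorical IDs directly, so the gain there may be small.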

18. Why are conversion‑rate and count features ineffective? Check for leakage and time‑based issues; refer to Criteo/Avazu/Avito for proven feature sets.
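One common source of leakage is computing a conversion rate over data that includes the row being predicted, or later rows. The sketch below builds the feature from strictly earlier days only. The `day` and `label` field names and the default value for unseen IDs are assumptions.

```python
from collections import defaultdict
from itertools import groupby


def historical_cvr(rows, key, default=0.0):
    """For each row, conversion rate of row[key] over strictly earlier days."""
    clicks = defaultdict(int)
    conversions = defaultdict(int)
    features = {}
    ordered = sorted(rows, key=lambda r: r["day"])
    for _, day_rows in groupby(ordered, key=lambda r: r["day"]):
        day_rows = list(day_rows)
        # Read the feature first, using only statistics from earlier days.
        for r in day_rows:
            k = r[key]
            features[r["instanceID"]] = conversions[k] / clicks[k] if clicks[k] else default
        # Only then fold this day's outcomes into the running totals.
        for r in day_rows:
            clicks[r[key]] += 1
            conversions[r[key]] += r["label"]
    return features


rows = [
    {"instanceID": 1, "day": 17, "appID": "a", "label": 1},
    {"instanceID": 2, "day": 17, "appID": "a", "label": 0},
    {"instanceID": 3, "day": 18, "appID": "a", "label": 0},
    {"instanceID": 4, "day": 19, "appID": "a", "label": 1},
    {"instanceID": 5, "day": 19, "appID": "b", "label": 0},
]
cvr = historical_cvr(rows, key="appID")
```

Test rows can be scored the same way because their feature depends only on past labels, which keeps the training and test distributions of the feature comparable.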

19. How to represent relationships such as apps frequently installed by a user? Treat the app list as a document and apply bag‑of‑words, or compute statistical aggregates per user.
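Treating each user's installed-app list as a document can be sketched as follows. The `min_users` cutoff for rare apps and all IDs are illustrative assumptions.

```python
from collections import Counter


def build_vocab(app_lists, min_users=1):
    """Index the apps that are installed by at least min_users users."""
    user_counts = Counter(app for apps in app_lists.values() for app in set(apps))
    kept = sorted(app for app, n in user_counts.items() if n >= min_users)
    return {app: i for i, app in enumerate(kept)}


def bag_of_words(apps, vocab):
    """Count vector over the vocabulary; apps outside it are ignored."""
    vec = [0] * len(vocab)
    for app in apps:
        if app in vocab:
            vec[vocab[app]] += 1
    return vec


installed = {"u1": ["a1", "a2"], "u2": ["a2", "a3"], "u3": ["a2"]}
vocab = build_vocab(installed)
u1_vector = bag_of_words(installed["u1"], vocab)
# u1_vector == [1, 1, 0]
```

Raising `min_users` trims the long tail of rarely installed apps and keeps the vector width manageable.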

20. How to build a reliable offline validation set with time‑series data? Split by chronological order, account for delayed conversion (backflow) when labeling, and compare validation distribution with real conversion patterns.
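A chronological split might look like the sketch below. Using an integer day as `clickTime`, and excluding the most recent day because its conversions may not all have been reported yet, are assumptions made for illustration.

```python
def time_split(rows, valid_start, valid_end):
    """Train on clicks before valid_start; validate on [valid_start, valid_end).
    Rows at or after valid_end are dropped from both sets."""
    train = [r for r in rows if r["clickTime"] < valid_start]
    valid = [r for r in rows if valid_start <= r["clickTime"] < valid_end]
    return train, valid


def positive_rate(rows):
    """Share of converting rows, used to compare the two sets."""
    return sum(r["label"] for r in rows) / len(rows) if rows else 0.0


rows = [
    {"clickTime": 17, "label": 0},
    {"clickTime": 17, "label": 1},
    {"clickTime": 18, "label": 0},
    {"clickTime": 29, "label": 0},
    {"clickTime": 30, "label": 0},  # latest day: labels may still be incomplete
]
train, valid = time_split(rows, valid_start=29, valid_end=30)
train_rate, valid_rate = positive_rate(train), positive_rate(valid)
```

A large gap between `train_rate` and `valid_rate` on real data is a hint that delayed conversions or drift are distorting the validation set.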

21. Are the other six files complete? Yes, they contain all user, ad, and placement data needed for the competition.

22. How to treat duplicate samples in train.csv? Duplicates are genuine behavior records and should be kept as they carry information.

For more details, visit the official competition website and follow the provided links.
