Large‑Scale Machine Learning and AutoML Techniques for Search Advertising CTR Prediction
The article explains how large‑scale machine learning and AutoML are applied to search advertising click‑through‑rate (CTR) prediction, covering problem definition, feature generation, model training, optimization methods, distributed systems, and recent advances in AutoML with practical case studies.
This report, based on a talk by Summer Xia, CEO and chief scientist of Zhizhou Technology, introduces machine learning in search advertising, describing the CTR estimation problem, the advertising scenario, and the four‑step pipeline of feature generation, probability modeling, model training, and online prediction.
The article details how features are vectorized, how discrete features are one‑hot encoded, and how high‑dimensional sparse vectors are created for billions of ads, users, and queries, illustrating feature cross‑products and the resulting massive feature space.
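The encoding step described above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the tiny vocabularies, the helper names `build_vocab` and `encode`, and the layout of the concatenated index space are all assumptions made for the example. Each sample is stored as the list of its active column indices, since materializing a dense vector is infeasible at billions of dimensions.

```python
def build_vocab(values):
    """Map each distinct categorical value to a column index (hypothetical helper)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode(query, ad, query_vocab, ad_vocab):
    """Return the active indices of the sparse one-hot vector for one sample.

    Assumed layout of the concatenated feature space:
      [ query block | ad block | query x ad cross block ]
    """
    n_q, n_a = len(query_vocab), len(ad_vocab)
    q, a = query_vocab[query], ad_vocab[ad]
    query_index = q                        # one-hot position of the query
    ad_index = n_q + a                     # one-hot position of the ad
    cross_index = n_q + n_a + q * n_a + a  # one-hot position of the (query, ad) pair
    return [query_index, ad_index, cross_index]


query_vocab = build_vocab(["shoes", "phone", "flowers"])
ad_vocab = build_vocab(["ad_1", "ad_2"])

# The cross block alone has |queries| * |ads| columns, which is why the
# feature space explodes once the vocabularies reach billions of entries.
total_dim = len(query_vocab) + len(ad_vocab) + len(query_vocab) * len(ad_vocab)

active = encode("phone", "ad_2", query_vocab, ad_vocab)
print(total_dim, active)  # 11 [1, 4, 8]
```

With 3 queries and 2 ads the cross block already accounts for 6 of the 11 columns, and it grows multiplicatively while the base blocks grow additively.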
Various dimensionality‑reduction techniques such as hashing, statistical aggregation, and embedding are presented, followed by the use of sigmoid probability models and shallow versus deep networks for CTR estimation.
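As a rough sketch of how hashing and the sigmoid model fit together, the example below maps each (field, value) pair into a fixed number of buckets and scores a sample with a linear model passed through a sigmoid. The bucket count, the use of `zlib.crc32` as the hash function, and the names `hashed_index` and `predict_ctr` are assumptions chosen for illustration; the talk does not specify them.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 20  # assumed size of the hashed feature space


def hashed_index(field, value, num_buckets=NUM_BUCKETS):
    """Hashing trick: map a (field, value) pair to a bucket without a vocabulary."""
    key = f"{field}={value}".encode("utf-8")
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key) % num_buckets


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_ctr(weights, bias, features):
    """Estimated click probability: sigmoid(bias + sum of weights of active buckets)."""
    z = bias + sum(weights.get(hashed_index(f, v), 0.0) for f, v in features)
    return sigmoid(z)


sample = [("query", "shoes"), ("ad", "ad_1"), ("user", "u_42")]
weights = {hashed_index("query", "shoes"): 2.0}  # toy weights, not learned
print(predict_ctr(weights, bias=-1.0, features=sample))
```

Hashing trades a bounded memory footprint for occasional collisions, where two unrelated features share a weight; whether that trade is acceptable depends on the bucket count relative to the number of distinct features.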
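Model training is formulated as a maximum‑likelihood optimization problem, usually with a regularization term to limit over‑fitting. Because the feature dimensionality is enormous, methods that need the full Hessian are impractical, so the problem is solved with optimizers that rely on gradient information: L‑BFGS (a quasi‑Newton method that approximates curvature from recent gradients), coordinate descent, or stochastic gradient descent.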
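To make the training step concrete, here is a minimal sketch of stochastic gradient descent on the L2‑regularized negative log‑likelihood of a logistic model over sparse samples. The toy data, the learning rate, the regularization strength, and the function name `sgd_train` are assumptions for illustration; a production system would differ in almost every detail.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_train(samples, dim, epochs=50, lr=0.1, l2=1e-4, seed=0):
    """Fit weights by SGD on: sum of log-loss + (l2 / 2) * ||w||^2.

    Each sample is (active_indices, label) with label in {0, 1}.
    Regularization is applied only to the weights a sample touches,
    a common shortcut for sparse data (an assumption of this sketch).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    b = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for active, label in data:
            p = sigmoid(b + sum(w[i] for i in active))
            g = p - label  # gradient of the log-loss with respect to the logit
            b -= lr * g
            for i in active:
                w[i] -= lr * (g + l2 * w[i])
    return w, b


# Toy data: feature 0 always co-occurs with a click, feature 1 never does.
toy = [([0], 1), ([1], 0)] * 20
w, b = sgd_train(toy, dim=2)
p_click = sigmoid(b + w[0])
p_no_click = sigmoid(b + w[1])
print(round(p_click, 3), round(p_no_click, 3))
```

Each update touches only the weights of the features active in one sample, which is what keeps the per-step cost independent of the total feature dimensionality.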
In production, a distributed parameter‑server architecture is used to train and serve models, supporting data parallelism, model parallelism, or hybrid parallelism for advertising workloads.
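The pull/push pattern behind a parameter server can be simulated in a single process. The sketch below is a simplified illustration under several assumptions: the class `ShardedParameterServer`, the modulo sharding rule, and the synchronous worker loop are hypothetical and do not describe the system in the talk. Sharding the weights across servers stands in for model parallelism, and giving each worker its own batch stands in for data parallelism.

```python
import math
from collections import defaultdict


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class ShardedParameterServer:
    """Weights partitioned by key across shards (hypothetical sketch)."""

    def __init__(self, num_shards, lr=0.5):
        self.shards = [dict() for _ in range(num_shards)]
        self.lr = lr

    def _shard(self, key):
        return self.shards[key % len(self.shards)]

    def pull(self, keys):
        """Return only the weights a worker needs for its batch."""
        return {k: self._shard(k).get(k, 0.0) for k in keys}

    def push(self, grads):
        """Apply a worker's sparse gradient to the owning shards."""
        for k, g in grads.items():
            shard = self._shard(k)
            shard[k] = shard.get(k, 0.0) - self.lr * g


def worker_step(server, batch):
    """One worker iteration: pull, compute the batch gradient, push."""
    keys = {i for active, _ in batch for i in active}
    w = server.pull(keys)
    grads = defaultdict(float)
    for active, label in batch:
        p = sigmoid(sum(w[i] for i in active))
        for i in active:
            grads[i] += (p - label) / len(batch)
    server.push(grads)


server = ShardedParameterServer(num_shards=2)
worker_batches = [
    [([0], 1), ([3], 0)],  # data shard held by worker A
    [([0], 1), ([2], 1)],  # data shard held by worker B
]
for _ in range(10):
    for batch in worker_batches:  # real workers would run concurrently
        worker_step(server, batch)

print(server.pull([0, 2, 3]))
```

Because each worker pulls and pushes only the keys present in its batch, network traffic scales with the number of active features per batch rather than with the full model size.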
The second part reviews the evolution of machine learning—from rule‑based systems to large‑scale linear models, and finally to AutoML, which automates data‑feature‑algorithm‑hyperparameter‑evaluation loops.
AutoML is defined as the automatic search for optimal hyper‑parameters and model architectures, facing challenges such as complex hyper‑parameter spaces, non‑differentiable objectives, and high evaluation cost.
Two main solution families are described: (1) search‑based methods, including grid search, random search, and genetic algorithms; and (2) AI‑for‑AI approaches, which model the hyper‑parameter performance surface and use Bayesian optimization or meta‑learning to guide the search.
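The two simplest search‑based methods can be compared on a toy problem. In the sketch below, `validation_loss` is a made‑up stand‑in for the expensive train‑and‑evaluate step, with its minimum placed at a learning rate of 1e-2 and an L2 strength of 1e-4; the function names and search ranges are likewise assumptions for illustration.

```python
import itertools
import math
import random


def validation_loss(params):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    return (math.log10(params["lr"]) + 2) ** 2 + (math.log10(params["l2"]) + 4) ** 2


def grid_search(objective, grid):
    """Evaluate every combination of the listed values; return (best_score, best_params)."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


def random_search(objective, space, n_trials, seed=0):
    """Sample each hyper-parameter independently; return (best_score, best_params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


grid = {"lr": [1e-3, 1e-2, 1e-1], "l2": [1e-5, 1e-4, 1e-3]}
space = {
    "lr": lambda rng: 10 ** rng.uniform(-5, 0),  # log-uniform over [1e-5, 1]
    "l2": lambda rng: 10 ** rng.uniform(-6, -2),  # log-uniform over [1e-6, 1e-2]
}

grid_best = grid_search(validation_loss, grid)
random_best = random_search(validation_loss, space, n_trials=9)
print(grid_best)
print(random_best)
```

Both methods treat the objective as a black box, so they work with non‑differentiable metrics, but every trial costs a full training run. Model‑based approaches such as Bayesian optimization aim to reduce the number of trials by using past results to choose the next point.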
A practical case study shows three teams solving a large‑scale credit‑risk prediction task with AutoML, a custom neural network, and a traditional LR+GBDT pipeline, highlighting AutoML’s efficiency and comparable accuracy.
The article concludes with author information: Summer Xia, Ph.D., former senior scientist at Baidu, leader of large‑scale machine learning platforms, and a prominent contributor to CTR prediction systems.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.