Elastic Feature Scaling: Boosting Alibaba’s Online Recommendation CTR by 4%

This article describes how Ant Financial’s AI team redesigned TensorFlow to enable elastic feature scaling, introduced a Group‑Lasso optimizer and streaming frequency filtering, compressed models by 90%, and achieved significant CTR and efficiency gains in Alipay’s online recommendation system.

Alibaba Cloud Developer

Dec 28, 2018

0. Overview

Online learning can capture dynamic user behavior and quickly adapt models, but it imposes strict stability and performance requirements on the recommendation pipeline. The Ant Financial AI team identified three core challenges in native TensorFlow‑based online recommendation: massive long‑tail features requiring aggressive feature‑map truncation, unpredictable feature space growth during training, and poor model sparsity leading to multi‑gigabyte model sizes.

1. Elastic Refactoring and Benefits

The team replaced TensorFlow's fixed-shape variables with a hash-based HashVariable that creates features on demand, which removes the fixed-dimension limitation. This enables elastic feature scaling for billions of parameters. On top of that foundation, a Group-Lasso optimizer and frequency-based filtering improve sparsity, and model size is compressed by 90% while the efficiency of the end-to-end pipeline improves.

Key capabilities:

Elastic feature scaling supporting hundred‑billion‑parameter training.

Group‑Lasso optimizer and frequency filtering that enhance sparsity and online CTR.

90% model‑size reduction with comprehensive feature management and stability monitoring.

Declaring a HashVariable takes only one additional line; the rest of the training code remains unchanged.
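The declaration itself is not shown here, so the following is only a minimal pure-Python sketch of the idea behind a hash-based variable: a row is created the first time a feature ID is seen instead of being preallocated in a fixed-size matrix. The class name `ElasticEmbedding` and its methods are hypothetical illustrations, not the team's API and not a TensorFlow API.

```python
import random


class ElasticEmbedding:
    """Hypothetical hash-backed embedding table that grows on demand."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rows = {}                  # feature id -> embedding vector
        self._rng = random.Random(seed)

    def lookup(self, feature_id):
        # Create the row lazily the first time this feature id appears.
        if feature_id not in self._rows:
            self._rows[feature_id] = [
                self._rng.gauss(0.0, 0.01) for _ in range(self.dim)
            ]
        return self._rows[feature_id]

    def drop(self, feature_id):
        # Removing a feature frees its row; unknown ids are ignored.
        self._rows.pop(feature_id, None)

    def __len__(self):
        return len(self._rows)


table = ElasticEmbedding(dim=4)
table.lookup("user:42")
table.lookup("item:new_sku")   # unseen feature: a row is added on demand
table.lookup("user:42")        # seen feature: the existing row is reused
table.drop("item:new_sku")     # dropped feature: its row is released
```

Because the table is keyed by feature ID, its size follows the set of live features rather than a dimension chosen up front.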

2. Dynamic Feature Add/Drop Techniques

The elastic architecture implements two core techniques: a streaming frequency filter that decides whether a feature should enter training, and a Group-Lasso optimizer that can delete entire embedding groups. Group-Lasso adds an L2,1 regularization term to the loss (an L2 norm within each group, summed across groups), allowing whole-group pruning while preserving discriminative power.
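As a sketch of the regularized objective, using notation assumed here for illustration: let $w_g$ be the embedding vector of feature $g$, $G$ the number of features, $\mathcal{L}$ the training loss, and $\lambda$ the regularization strength.

```latex
\min_{W} \; \mathcal{L}(W) + \lambda \sum_{g=1}^{G} \lVert w_g \rVert_2
```

The penalty is not squared, so it behaves like an L1 penalty on the group norms: a group whose contribution to the loss is small is pushed to exactly zero as a whole.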

2.1 Group Lasso Optimizer

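Traditional L1-based sparsity methods zero out individual weights, so they do not work well for sparse DNN embeddings, where a feature can be removed only when its entire embedding vector is zero. By applying L2,1 regularization inside the embedding layer, the optimizer can zero out all parameters of a feature at once, effectively removing it from the model.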
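One standard way to obtain this group-level zeroing is the proximal (group soft-thresholding) step for the L2,1 penalty, sketched below in plain Python. This is the textbook operator and not necessarily the exact update used by the team's optimizer; the function name `group_soft_threshold` and the `threshold` parameter (for example, learning rate times regularization strength) are illustrative.

```python
import math


def group_soft_threshold(vector, threshold):
    """Shrink a whole embedding vector toward zero by `threshold` in L2 norm.

    Returns the all-zero vector when the norm is at or below the threshold,
    which is what removes the feature as a unit.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= threshold:
        return [0.0] * len(vector)
    scale = 1.0 - threshold / norm
    return [scale * x for x in vector]


# A weak embedding (norm about 0.05) is zeroed entirely.
weak = group_soft_threshold([0.03, -0.04], threshold=0.1)
# A strong embedding (norm 5.0) is only shrunk, keeping its direction.
strong = group_soft_threshold([3.0, 4.0], threshold=1.0)
```

Unlike element-wise L1 shrinkage, every coordinate of a vector is scaled by the same factor, so a vector is either kept in full or dropped in full.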

2.2 Streaming Frequency Filtering

Feature occurrences are modeled as a Poisson process; the system estimates the probability that a feature should be admitted based on its observed count n and the current step t. A Bernoulli sample with that probability decides admission, giving filtering quality close to offline frequency counting without pre-allocating space.
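The exact admission formula is not given here, so the sketch below shows one way such a filter could work under stated assumptions: the arrival rate is estimated as n / t, the feature's remaining occurrences over a planned run of `total_steps` steps are modeled as Poisson, and the admission probability is the chance that the final count reaches a minimum `threshold`. The names `admission_probability`, `admit`, `total_steps`, and `threshold` are hypothetical.

```python
import math
import random


def admission_probability(n, t, total_steps, threshold):
    """P(final count >= threshold) under a Poisson model with rate n / t."""
    needed = threshold - n
    if needed <= 0:
        return 1.0                       # already frequent enough
    remaining = total_steps - t
    if t <= 0 or remaining <= 0:
        return 0.0
    mean = (n / t) * remaining           # expected future occurrences
    # P(X >= needed) = 1 - sum_{k < needed} e^{-mean} * mean^k / k!
    # Direct summation is adequate for the small means used in this sketch.
    term = math.exp(-mean)
    cdf = 0.0
    for k in range(needed):
        cdf += term
        term *= mean / (k + 1)
    return max(0.0, 1.0 - cdf)


def admit(n, t, total_steps, threshold, rng=random):
    # Bernoulli draw: admit the feature with the estimated probability.
    return rng.random() < admission_probability(n, t, total_steps, threshold)


# Seen once in 1000 steps: very unlikely to reach 10 occurrences by step 2000.
p_rare = admission_probability(n=1, t=1000, total_steps=2000, threshold=10)
# Seen 8 times in 1000 steps: very likely to reach 10 occurrences.
p_hot = admission_probability(n=8, t=1000, total_steps=2000, threshold=10)
```

Sampling instead of applying a hard cutoff means a feature can enter training before its count is final, while persistently rare features are still mostly kept out.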

Dynamic L1 regularization further adjusts the L1 coefficient according to feature frequency, reducing low‑frequency noise while preserving high‑frequency useful features.
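The scaling rule is not stated here. As one possible form, assumed purely for illustration, the coefficient can be inflated for features seen fewer than `min_freq` times and left at its base value otherwise; `base_l1`, `penalty`, and `min_freq` are hypothetical parameters.

```python
def dynamic_l1(base_l1, freq, min_freq, penalty):
    """Return an L1 coefficient that grows as `freq` falls below `min_freq`."""
    shortfall = max(min_freq - freq, 0) / min_freq   # 0 for frequent features
    return base_l1 * (1.0 + penalty * shortfall)


# A feature seen twice is penalized more heavily than one seen 500 times.
rare = dynamic_l1(base_l1=0.01, freq=2, min_freq=10, penalty=5.0)
common = dynamic_l1(base_l1=0.01, freq=500, min_freq=10, penalty=5.0)
```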

3. Model Compression and Stability

After training, many zero vectors remain; a graph‑cut tool removes non‑essential ops and converts remaining variables to native TensorFlow mutable hash tables, shrinking an 8 GB model to a few hundred megabytes without changing inference results.
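The effect of dropping zero vectors can be illustrated with a toy table in plain Python. This is not the team's graph-cut tool; it only shows why pruning all-zero rows leaves lookups unchanged, provided a missing key falls back to the zero vector. The helper names `prune_zero_rows` and `lookup` are hypothetical.

```python
def prune_zero_rows(table):
    """Keep only the rows that contain at least one nonzero weight."""
    return {key: row for key, row in table.items() if any(row)}


def lookup(table, key, dim):
    # A missing key yields the zero vector, matching a pruned all-zero row.
    return table.get(key, [0.0] * dim)


trained = {
    "user:1": [0.2, -0.1],
    "user:2": [0.0, 0.0],     # zeroed out during training
    "item:9": [0.0, 0.0],     # zeroed out during training
    "item:3": [0.0, 0.7],
}
compact = prune_zero_rows(trained)
```

Here `compact` holds half as many rows as `trained`, yet `lookup` returns the same vector for every key in either table.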

Stability monitoring tracks sample distribution, training loss, AUC, feature growth, and business metrics (uvCTR, pvCTR). Alerts are triggered via HTTP when anomalies appear.
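A minimal sketch of threshold-based checks over such metrics follows. The metric names, bounds, and the `find_anomalies` helper are hypothetical, and the HTTP call that would deliver the alert is deliberately left out because the endpoint is not described here.

```python
def find_anomalies(metrics, bounds):
    """Return alert messages for metrics outside their (low, high) bounds."""
    alerts = []
    for name, (low, high) in bounds.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: missing")
        elif not low <= value <= high:
            alerts.append(f"{name}: {value} outside [{low}, {high}]")
    return alerts


# Hypothetical bounds and one monitoring snapshot.
bounds = {"auc": (0.70, 1.0), "loss": (0.0, 0.8), "feature_count": (0, 5_000_000)}
snapshot = {"auc": 0.64, "loss": 0.55, "feature_count": 7_200_000}
alerts = find_anomalies(snapshot, bounds)
```

In this snapshot the AUC drop and the feature-count growth are flagged, while the loss stays within bounds.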

4. Engineering Implementation and Results

The solution is deployed on multiple recommendation slots on Alipay's homepage. Using a Wide-&-Deep architecture with group embeddings, the online-learning bucket achieved a 4.23% uplift over the best multi-model fusion bucket and a 34.67% gain over a random control. An information-flow recommendation task saw +0.77% uvCTR and +4.78% pvCTR improvements.

5. Future Work

Future directions include sub‑minute latency optimization, online importance sampling, automated feature learning, and joint optimizer decisions for linear programming and DNNs.

