Turning Ad Click Sequences into Age & Gender Predictions with Transformers
This article shares a top-5 contestant's step-by-step solution for predicting user age and gender from ad-click sequences: IDs are treated as words and embedded with word2vec, then fed to a custom transformer-LSTM model trained with a dual-task loss and refined with weight-search post-processing.
Problem Overview
The Tencent Advertising Algorithm Competition asks participants to predict a user's age and gender from the sequence of ads they click. The author, a high-scoring contestant, treats each ad ID as a token, turning the task into a text-classification problem in a privacy-preserving setting.
Solution Idea
All of a user's clicked IDs are concatenated into a sentence, enabling the use of natural-language techniques. Word embeddings (e.g., word2vec skip-gram) are trained on these sequences, with the window size tuned carefully for large datasets.
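As a minimal sketch of this step, the skip-gram embeddings could be trained with gensim; the toy corpus, the 128-dimensional vectors, and the window of 10 below are illustrative assumptions, not the author's exact settings:

from gensim.models import Word2Vec

# Each user's click history becomes one "sentence" of ad-ID tokens
sentences = [
    ["ad_102", "ad_57", "ad_88", "ad_102"],   # user 1
    ["ad_9", "ad_57", "ad_311"],              # user 2
]

model = Word2Vec(
    sentences,
    vector_size=128,  # embedding dimension
    window=10,        # context window; worth tuning on large corpora
    sg=1,             # skip-gram
    min_count=1,      # keep every ID, even rare ones
    workers=4,
)
vec = model.wv["ad_57"]  # 128-dim vector for one ad ID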
Model Architecture
The final architecture consists of five input features followed by a single‑layer transformer, then an LSTM, and finally a dual‑task output head for age and gender. BERT was considered but discarded because the custom ID vocabulary exceeds 3 million tokens, making pre‑training prohibitively expensive and yielding poorer results in experiments.
Implementation Details
Key implementation points include:
Freezing the embedding layer due to the massive vocabulary.
Feeding the click‑time sequence as an attention mask to the transformer.
Using HuggingFace's transformer modules directly.
Relevant code snippets:
from transformers.modeling_bert import BertConfig, BertEncoder, BertAttention, BertSelfAttention, BertLayer, BertPooler

After the transformer, a single LSTM layer is added, followed by max-pooling (average pooling was tested but did not improve scores). The model then splits into two branches for the two prediction tasks.
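A rough PyTorch sketch of the stack described above. The single shared embedding, the hidden sizes, and the plain 0/1 padding mask are simplifying assumptions (the author uses five input features and feeds the click-time sequence as the mask), and the import path assumes a pre-4.x transformers release:

import torch
import torch.nn as nn
from transformers.modeling_bert import BertConfig, BertEncoder  # transformers < 4.x path

class DualTaskModel(nn.Module):
    """Frozen embeddings -> 1-layer transformer -> LSTM -> max-pool -> two heads."""
    def __init__(self, emb_weights):  # emb_weights: FloatTensor (vocab_size, 128) from word2vec
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(emb_weights, freeze=True)  # frozen: huge vocabulary
        dim = emb_weights.size(1)
        config = BertConfig(hidden_size=dim, num_hidden_layers=1,
                            num_attention_heads=4, intermediate_size=4 * dim)
        self.encoder = BertEncoder(config)
        self.lstm = nn.LSTM(dim, 128, batch_first=True)
        self.age_head = nn.Linear(128, 10)    # 10 age buckets
        self.gender_head = nn.Linear(128, 2)  # 2 gender classes

    def forward(self, ids, mask):  # mask: (batch, seq), 1 for real clicks, 0 for padding
        x = self.emb(ids)
        # BertEncoder expects an additive mask of shape (batch, 1, 1, seq)
        ext_mask = (1.0 - mask[:, None, None, :].float()) * -10000.0
        x = self.encoder(x, attention_mask=ext_mask,
                         head_mask=[None])[0]  # one None per transformer layer
        x, _ = self.lstm(x)
        x = x.max(dim=1).values  # max-pooling over the sequence
        return self.age_head(x), self.gender_head(x)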
Loss Function
A custom loss combines the cross‑entropy losses of both tasks equally:
import torch.nn as nn

def custom_loss(data1, targets1, data2, targets2):
    # Equal-weight sum of the two tasks' cross-entropy losses
    loss1 = nn.CrossEntropyLoss()(data1, targets1)
    loss2 = nn.CrossEntropyLoss()(data2, targets2)
    return loss1 * 0.5 + loss2 * 0.5

The weighting can be adjusted to favor one task over the other.
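For illustration, one training step tying this loss to the dual-head model sketched earlier (the tensor names here are assumptions, not from the original):

# Forward pass yields one logit tensor per task; combine them in a single loss
age_logits, gender_logits = model(ids, mask)
loss = custom_loss(age_logits, age_labels, gender_logits, gender_labels)
loss.backward()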
Post‑Processing
Since metrics like accuracy and F1 depend on decision thresholds, class‑specific weights are applied to the softmax outputs before taking argmax. A simple weight‑search algorithm iteratively adjusts class weights to maximize validation accuracy:
import numpy as np
from sklearn.metrics import accuracy_score

class_num = 10

def search_weight(valid_y, raw_prob, init_weight=[1.0] * class_num, step=0.001):
    weight = init_weight.copy()
    f_best = accuracy_score(y_true=valid_y, y_pred=raw_prob.argmax(axis=1))
    flag_score = 0
    round_num = 1
    # Keep sweeping until a full pass over all classes brings no improvement
    while flag_score != f_best:
        print("round: ", round_num)
        round_num += 1
        flag_score = f_best
        for c in range(class_num):
            # Candidate weights for class c: 0.00 to 1.99 in steps of 0.01
            for n_w in range(0, 2000, 10):
                num = n_w * step
                new_weight = weight.copy()
                new_weight[c] = num
                prob_df = raw_prob.copy() * np.array(new_weight)
                f = accuracy_score(y_true=valid_y, y_pred=prob_df.argmax(axis=1))
                if f > f_best:
                    weight = new_weight.copy()
                    f_best = f
                    print(f)
    return weight

This search can be extended to explore combinations of transformer, LSTM, and CNN feature extractors.
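A brief usage sketch (valid_y, valid_prob, and test_prob are assumed arrays of validation labels and softmax probabilities, not names from the original):

# Tune per-class weights on validation predictions, then apply them at test time
best_weight = search_weight(valid_y, valid_prob)
test_pred = (test_prob * np.array(best_weight)).argmax(axis=1)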
Conclusion
The presented pipeline of ID-as-word tokenization, word2vec embeddings, a lightweight transformer-LSTM model, a dual-task loss, and weight-search post-processing achieved a top-5 score (1.4516) in the competition. The author encourages further experimentation with CNNs and other feature-combination strategies.