What Makes a Good CTR Benchmark? Lessons from Huawei’s FuxiCTR
The article analyzes the shortcomings of current click‑through‑rate benchmarks, explains why leaderboards are valuable, and proposes concrete criteria—including online evaluation, sequential test data, leakage prevention, and read‑only submissions—to build a more realistic and robust CTR benchmarking platform.
The discussion is motivated by Huawei’s open‑source framework FuxiCTR and the accompanying paper “Open Benchmarking for Click‑Through Rate Prediction”, and asks what makes a benchmark for click‑through‑rate (CTR) prediction effective.
Why Leaderboards Matter
Uniform test‑set split with hidden labels prevents participants from over‑fitting to custom validation splits.
Allows engineering tricks for every entrant, so models are compared under the same conditions rather than being separated by undisclosed optimisations.
Enables thousands of teams to verify reproducibility, openness, and generalisation of reported results.
Leaderboard datasets are usually recent and large (often millions of instances), reducing the risk of label leakage or over‑optimization.
Evaluating on multiple datasets captures model variance across different domains.
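The hidden‑label idea in the list above can be sketched as a small scorer that keeps the test labels private and returns only aggregate metrics. This is a minimal illustration, not FuxiCTR's or any leaderboard's actual code: the class name HiddenLabelScorer is hypothetical, and AUC and log loss are assumed as the metrics because they are common for CTR, although the text above does not name any.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def auc(labels, probs):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. The pairwise loop is quadratic, which is
    fine for a sketch but not for a test set with millions of rows.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


class HiddenLabelScorer:
    """Hypothetical leaderboard scorer: labels stay private, only metrics leave."""

    def __init__(self, hidden_labels):
        self._labels = list(hidden_labels)

    def score(self, predictions):
        if len(predictions) != len(self._labels):
            raise ValueError("submission length does not match the test set")
        return {
            "auc": auc(self._labels, predictions),
            "logloss": log_loss(self._labels, predictions),
        }
```

Because participants only ever see the returned metrics, they cannot tune against individual test labels, which is the property the first bullet relies on.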
CTR vs. CV/NLP Benchmarks
Academic CTR research typically relies on static offline datasets, whereas industrial CTR systems process billions of rows from dozens of tables and serve predictions online. Consequently, practitioners often find data quality and feature engineering more decisive than sophisticated model architectures, while many papers focus on adding attention or transformer components that provide limited practical gain.
Limitations of Existing CTR Benchmarks
Huawei’s benchmark evaluates several models on the Criteo and Avazu datasets. These datasets originate from competitions held many years ago and no longer reflect the scale, feature diversity, or freshness of modern advertising data, limiting their persuasive power.
Desired Characteristics of a Good CTR Benchmark
Provide an online evaluation environment together with extensive offline training data (multiple tables, heterogeneous features, and large sample size).
Deliver test data sequentially by time slice, mimicking real‑world traffic.
Within each time slice, enforce strict leakage prevention (e.g., filter consecutive user actions or allocate them to separate inference windows).
Make each test‑batch submission read‑only; participants cannot modify previously submitted results.
Recent advertising competitions, such as Tencent’s yearly ad contests, supply richer, more up‑to‑date data than the legacy Criteo/Avazu sets.
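The time‑slice delivery and leakage‑prevention points above can be sketched as follows. This is one possible reading under stated assumptions: events are plain dicts, the field names ts and user_id are invented for the example, and the filter implements only the "filter consecutive user actions" option by keeping each user's first event per slice.

```python
from collections import defaultdict


def split_into_time_slices(events, slice_seconds):
    """Group events into consecutive time slices, oldest slice first.

    Each event is a dict with 'ts' (seconds) and 'user_id' keys; both
    field names are assumptions made for this sketch.
    """
    slices = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        slices[event["ts"] // slice_seconds].append(event)
    return [slices[key] for key in sorted(slices)]


def one_event_per_user(time_slice):
    """Keep only each user's first event in a slice.

    Later events from the same user could reveal the outcome of earlier
    ones, so they are dropped from this slice rather than served together.
    """
    seen = set()
    kept = []
    for event in time_slice:
        if event["user_id"] not in seen:
            seen.add(event["user_id"])
            kept.append(event)
    return kept
```

The alternative mentioned above, allocating a user's consecutive actions to separate inference windows, would keep the dropped events and push them into later slices instead of discarding them.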
Illustrative Evaluation Loop
for test_batch in all_batch_test:
    # build features for the current batch
    test_feature = get_feature(test_batch)
    result = model.predict(test_feature)
    # submit the predictions for this batch
    env.commit(result)
Benefits of This Setup
Simulates industrial online inference scenarios, including latency constraints and streaming data.
Prevents models from exploiting feature or temporal leakage.
Keeps the test set hidden, discouraging test‑set tuning.
Creates a shared benchmark that reduces self‑validation bias and facilitates fair comparison.
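To make the read‑only and hidden‑test properties concrete, here is a toy environment that the evaluation loop could run against. It is a sketch under assumptions, not the API of any real platform: the class name ReadOnlyEnv, the iter_batches method, and the submissions property are hypothetical, and only commit mirrors the env.commit call in the loop.

```python
class ReadOnlyEnv:
    """Toy evaluation environment (hypothetical).

    Serves test batches in order and accepts exactly one commit per batch.
    Committed predictions are stored as tuples and cannot be changed.
    """

    def __init__(self, batches):
        self._batches = [list(batch) for batch in batches]
        self._submissions = []
        self._open = False  # True while a served batch awaits its commit

    def iter_batches(self):
        for batch in self._batches:
            self._open = True
            yield list(batch)  # hand out a copy, never the stored batch
            if self._open:
                raise RuntimeError("commit predictions before requesting the next batch")

    def commit(self, predictions):
        if not self._open:
            raise RuntimeError("no batch is open; earlier submissions are read-only")
        expected = len(self._batches[len(self._submissions)])
        if len(predictions) != expected:
            raise ValueError("one prediction per test row is required")
        self._submissions.append(tuple(predictions))
        self._open = False

    @property
    def submissions(self):
        return tuple(self._submissions)


# Usage: a constant prediction stands in for model.predict(get_feature(batch)).
env = ReadOnlyEnv([[{"ad": 1}, {"ad": 2}], [{"ad": 3}]])
for test_batch in env.iter_batches():
    result = [0.5 for _ in test_batch]
    env.commit(result)
```

A second commit for the same batch raises an error, so a participant cannot revise predictions after seeing later data.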
A concrete example of an online‑style competition is Kaggle’s “Riiid Answer Correctness Prediction” challenge, which provides a simulated online prediction environment.
References:
Open Benchmarking for Click‑Through Rate Prediction (paper): https://arxiv.org/pdf/2009.05794.pdf
FuxiCTR (code): https://github.com/xue-pai/FuxiCTR
Kaggle Riiid competition: https://www.kaggle.com/c/riiid-test-answer-prediction
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
