Why a Simple Frequency Count Can Outperform Complex Models for Categorical Features
This article explains how a five‑line pandas script can generate frequency‑based encodings for categorical columns, why such encodings help low‑frequency categories, and how they integrate with feature crossing and industry use cases like ad exposure, PV/UV counting, and fraud detection.
In many industrial tabular datasets, gradient‑boosted decision trees (GBDT) such as LightGBM, XGBoost, and CatBoost dominate because they handle categorical features well. One of the most effective tricks is to replace raw categories with their frequency counts (value_counts), which can be implemented in just five lines of pandas.
First Layer – Frequency Encoding
Frequency encoding treats each category as a hash that maps low‑frequency values to a common numeric representation, reducing over‑fitting on rare categories. The following code demonstrates the technique on a small three‑column dataframe:
import pandas as pd
df = pd.DataFrame({
'Region': ['Xi\'an', 'Taiyuan', 'Xi\'an', 'Taiyuan', 'Zhengzhou', 'Taiyuan'],
'Oct_Sales': ['0.477468', '0.195046', '0.015964', '0.259654', '0.856412', '0.259644'],
'Sep_Sales': ['0.347705', '0.151220', '0.895599', '0.236547', '0.569841', '0.254784']
})
# Count frequencies of the 'Region' column
df_counts = df['Region'].value_counts().reset_index()
df_counts.columns = ['Region', 'Region_Freq']
# Merge the frequency back into the original dataframe
df = df.merge(df_counts, on='Region', how='left')The resulting Region_Freq column provides a numeric feature that treats all rare regions uniformly, acting like a smart hash encoding.
Second Layer – Business Meaning of Frequencies
In offline datasets, frequency counts carry concrete business semantics:
Ad ID frequency → number of exposures.
Page ID frequency → page views (PV).
User ID frequency → unique visitors (UV).
These counts can reveal patterns such as unusually high click rates for a user (potential click farms) or a surge of orders from a single user (possible fraud).
Third Layer – Interaction with Feature Crossing
When combined with feature crossing, frequency‑encoded columns enrich interaction terms. For example, crossing region with product category yields a more granular view of sales performance, while still benefiting from the low‑frequency smoothing provided by the frequency count.
However, feature crossing increases the tail of the distribution, making learning harder; the frequency encoding helps mitigate this by normalising rare interactions.
Note that the described "one‑stroke" frequency encoding works best on offline training data where train and test share the same distribution without bias. In biased or online scenarios, additional handling (e.g., smoothing or target encoding) may be required.
Overall, a simple frequency count is a powerful, low‑cost feature engineering technique that improves model robustness, especially for GBDT models on categorical data.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
