Why a Simple Frequency Count Can Outperform Complex Models for Categorical Features

This article explains how a five‑line pandas script can generate frequency‑based encodings for categorical columns, why such encodings help low‑frequency categories, and how they integrate with feature crossing and industry use cases like ad exposure, PV/UV counting, and fraud detection.

Baobao Algorithm Notes

Jan 10, 2022

On industrial tabular data, gradient‑boosted decision tree (GBDT) libraries such as LightGBM, XGBoost, and CatBoost dominate, in part because they cope well with categorical features. One of the most effective tricks for such features is to replace each raw category with its frequency count, as returned by pandas value_counts(). It takes about five lines of pandas.

First Layer – Frequency Encoding

Frequency encoding replaces each category with the number of times it occurs. It behaves like a hash: categories with the same count, which in practice means most rare ones, share one numeric value, so the model is less likely to over‑fit on individual rare categories. The following code demonstrates the technique on a small three‑column dataframe:

import pandas as pd

df = pd.DataFrame({
    'Region': ['Xi\'an', 'Taiyuan', 'Xi\'an', 'Taiyuan', 'Zhengzhou', 'Taiyuan'],
    'Oct_Sales': [0.477468, 0.195046, 0.015964, 0.259654, 0.856412, 0.259644],
    'Sep_Sales': [0.347705, 0.151220, 0.895599, 0.236547, 0.569841, 0.254784]
})
# Count frequencies of the 'Region' column
df_counts = df['Region'].value_counts().reset_index()
df_counts.columns = ['Region', 'Region_Freq']
# Merge the frequency back into the original dataframe
df = df.merge(df_counts, on='Region', how='left')

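The resulting Region_Freq column is a numeric feature: here Taiyuan maps to 3, Xi'an to 2, and Zhengzhou to 1. Regions with equal counts share a value, so rare regions are grouped together rather than learned individually, which is why the encoding acts like a smart hash.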
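The same feature can also be built without a merge. The following is a minimal sketch, assuming only a per‑row lookup is needed; it reuses the Region column from the example above and relies on Series.map accepting the Series returned by value_counts().

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["Xi'an", "Taiyuan", "Xi'an", "Taiyuan", "Zhengzhou", "Taiyuan"],
})

# value_counts() returns a Series indexed by category, holding each category's row count
region_counts = df["Region"].value_counts()

# map() looks up every row's category in that Series
df["Region_Freq"] = df["Region"].map(region_counts)

print(df["Region_Freq"].tolist())  # [2, 3, 2, 3, 1, 3]
```

Because the mapping keeps the original row order and index, it avoids the column renaming step that the merge version needs.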

Second Layer – Business Meaning of Frequencies

In offline datasets, frequency counts carry concrete business semantics:

Ad ID frequency → number of exposures.

Page ID frequency → page views (PV).

User ID frequency → visits per user; the number of distinct user IDs gives unique visitors (UV).

These counts can reveal patterns such as unusually high click rates for a user (potential click farms) or a surge of orders from a single user (possible fraud).
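These business counts can be sketched on a toy log. The log below, its column names (user_id, page_id, ad_id), and the "more than half of all rows" rule are all hypothetical and chosen only for illustration; a real fraud rule would be tuned on actual traffic.

```python
import pandas as pd

# Hypothetical log: one row per ad impression
log = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u1", "u1", "u2", "u1"],
    "page_id": ["p1", "p1", "p2", "p1", "p2", "p3", "p2", "p1"],
    "ad_id":   ["a1", "a2", "a1", "a1", "a3", "a1", "a2", "a1"],
})

# Frequency of an ID = number of log rows in which it appears
log["ad_exposures"] = log["ad_id"].map(log["ad_id"].value_counts())
log["page_pv"] = log["page_id"].map(log["page_id"].value_counts())
log["user_events"] = log["user_id"].map(log["user_id"].value_counts())

# Number of distinct users seen in the log
uv = log["user_id"].nunique()

# Illustrative rule (an assumption): flag users who account for more than half of all rows
user_counts = log["user_id"].value_counts()
suspicious = user_counts[user_counts > 0.5 * len(log)].index.tolist()

print(uv, suspicious)  # 3 ['u1']
```

The per‑row columns can be fed to the model directly, while the flagged list is the kind of signal a manual review might start from.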

Third Layer – Interaction with Feature Crossing

When combined with feature crossing, frequency‑encoded columns enrich interaction terms. For example, crossing region with product category yields a more granular view of sales performance, while still benefiting from the low‑frequency smoothing provided by the frequency count.

However, feature crossing multiplies the number of distinct values and lengthens the tail of rare combinations, which makes learning harder; frequency encoding mitigates this because rare combinations all map to the same small counts.
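A crossed feature can be frequency‑encoded in the same way as a single column. This is a minimal sketch: the Category column and its values are hypothetical, added here only so that there is something to cross Region with.

```python
import pandas as pd

df = pd.DataFrame({
    "Region":   ["Xi'an", "Taiyuan", "Xi'an", "Taiyuan", "Zhengzhou", "Taiyuan"],
    "Category": ["Phone", "Phone", "Phone", "Laptop", "Phone", "Phone"],  # hypothetical column
})

# Cross the two columns into a single combined key
df["Region_x_Category"] = df["Region"] + "_" + df["Category"]

# Frequency-encode the crossed key
cross_counts = df["Region_x_Category"].value_counts()
df["Cross_Freq"] = df["Region_x_Category"].map(cross_counts)

# Same result without building the string key
df["Cross_Freq_gb"] = df.groupby(["Region", "Category"])["Region"].transform("size")

print(df["Region"].nunique(), df["Region_x_Category"].nunique())  # 3 4
print(df["Cross_Freq"].tolist())  # [2, 2, 2, 1, 1, 2]
```

Even on six rows the cross has more distinct values than Region alone, and two of the four combinations occur only once; both of those land on the same encoded value, 1.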

Note that the described "one‑stroke" frequency encoding works best on offline training data where train and test share the same distribution without bias. In biased or online scenarios, additional handling (e.g., smoothing or target encoding) may be required.
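When train and test are separate, one possible form of additional handling is to learn the mapping on the training split only and reuse it on the test split. The sketch below is an assumption rather than the article's method: it uses relative frequencies (value_counts(normalize=True)) so that the feature does not depend on split size, and it fills categories never seen in training with 0. The test rows, including the unseen region "Luoyang", are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "Region": ["Xi'an", "Taiyuan", "Xi'an", "Taiyuan", "Zhengzhou", "Taiyuan"],
})
test = pd.DataFrame({
    "Region": ["Taiyuan", "Luoyang", "Xi'an"],  # "Luoyang" never appears in train
})

# Learn relative frequencies from the training split only
freq = train["Region"].value_counts(normalize=True)

# Apply the same mapping to both splits
train["Region_Freq"] = train["Region"].map(freq)

# Unseen categories come back as NaN; filling with 0 is an assumed convention
test["Region_Freq"] = test["Region"].map(freq).fillna(0.0)

print(test)
```

Filling with 0 treats an unseen category as the rarest possible one, which is consistent with how frequency encoding already groups rare values.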

Overall, a simple frequency count is a powerful, low‑cost feature engineering technique that improves model robustness, especially for GBDT models on categorical data.
