Mastering Machine Learning Feature Engineering: Scaling, Encoding, Aggregation, Embedding, and Automation

This article explains why good features matter more than fancy algorithms, then walks through practical techniques (scaling, log transforms, binning, interaction features, encoding schemes, datetime extraction, text statistics, geospatial distances, aggregation, feature selection, and automated feature generation), each illustrated with concrete pandas and scikit‑learn code.


Why Features Matter

Good models rely more on high‑quality features than on sophisticated algorithms: better features, not flashier models, are the key to performance.

1. Numerical Features

1.1 Scaling

Most ML algorithms are sensitive to feature scale: a column ranging from 0 to 1,000,000 will dominate one ranging from 0 to 1. Three common scalers:

StandardScaler: best for approximately normal data.

MinMaxScaler: compresses values to [0, 1], suitable for neural networks.

RobustScaler: uses the median and IQR, robust to outliers.

from sklearn.preprocessing import RobustScaler

df['salary_scaled'] = RobustScaler().fit_transform(df[['salary']])

⚠️ Scalers must be fitted on the training set only to avoid data leakage.
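
For example, a leakage‑safe pattern (a minimal sketch, assuming hypothetical X_train/X_test splits already exist) fits the scaler on the training set and reuses its statistics everywhere else:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled  = scaler.transform(X_test)       # reuse the training statistics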

1.2 Log Transform

Right‑skewed numeric columns (e.g., income, price, revenue) benefit from a log transform to flatten the distribution.

import numpy as np

df['revenue_log'] = np.log1p(df['revenue'])  # log1p safely handles zeros
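
If the model later predicts on the log scale, convert back with the matching inverse (predictions_log here is a hypothetical array of log‑scale predictions):

predictions = np.expm1(predictions_log)  # expm1 exactly inverts log1p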

1.3 Binning

Converting continuous values to categorical bins can be useful. pd.cut() creates equal‑width bins, while pd.qcut() creates quantile‑based bins with equal sample counts.

df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 55, 100],
                         labels=['teen', 'young_adult', 'adult', 'senior'])
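
pd.qcut() follows the same pattern; a minimal sketch, assuming a hypothetical income column:

df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['q1', 'q2', 'q3', 'q4'])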

1.4 Interaction Features

Combining two features often yields more expressive power than using them separately. Examples include price per square foot and debt‑to‑income ratio.

df['price_per_sqft'] = df['price'] / df['sqft']
df['debt_to_income'] = df['debt'] / df['income']

For linear models, polynomial features can capture non‑linear relationships.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['age', 'salary']])
# Creates: age, salary, age², salary², age × salary
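
To keep readable column names (assuming scikit‑learn ≥ 1.0), wrap the result in a DataFrame:

poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(['age', 'salary']))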

1.5 Outlier Clipping

Instead of dropping outliers, truncate them to a reasonable percentile range.

lower = df['salary'].quantile(0.01)
upper = df['salary'].quantile(0.99)
df['salary_clipped'] = df['salary'].clip(lower=lower, upper=upper)

2. Categorical Features

2.1 One‑Hot Encoding

Expands each nominal category into a separate 0/1 column. Suitable when the column has few unique values.

df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)

⚠️ A column with 500 unique categories would create 500 new columns; in such cases target encoding is preferable.

2.2 Label Encoding

Assigns an integer to each category; appropriate only when the categories have an inherent order.

df['education'] = df['education'].map({
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
})

Do not label‑encode purely nominal data like city names, as the model may infer false ordinal relationships.

2.3 Target Encoding

Replaces each category with the mean of the target variable for that group, effective for high‑cardinality columns.

from category_encoders import TargetEncoder

df['city_encoded'] = TargetEncoder().fit_transform(df['city'], df['churn'])

⚠️ Risk of target leakage: the naive version lets each row see its own label. Use cross‑fold (out‑of‑fold) target encoding in production.
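
A minimal out‑of‑fold sketch using plain pandas and scikit‑learn's KFold (column names follow the example above; the global‑mean fallback covers categories unseen in a fold):

import numpy as np
from sklearn.model_selection import KFold

df['city_encoded_cv'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Compute category means on the training folds only
    fold_means = df.iloc[train_idx].groupby('city')['churn'].mean()
    df.iloc[val_idx, df.columns.get_loc('city_encoded_cv')] = (
        df['city'].iloc[val_idx].map(fold_means).to_numpy()
    )
df['city_encoded_cv'] = df['city_encoded_cv'].fillna(df['churn'].mean())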

2.4 Frequency Encoding

Replaces each category with its occurrence frequency. Simple yet often surprisingly effective for tree‑based models.

freq_map = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq_map)

2.5 Binary Encoding

A compromise between one‑hot and label encoding, producing fewer columns for high‑cardinality features.

from category_encoders import BinaryEncoder

df_encoded = BinaryEncoder().fit_transform(df[['city']])
# 100 categories → only 7 binary columns

3. Datetime Features

Raw dates are rarely useful directly; extract informative components.

3.1 Standard Extraction

df['order_date'] = pd.to_datetime(df['order_date'])
df['month']       = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.dayofweek
df['is_weekend']  = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter']     = df['order_date'].dt.quarter
df['days_since']  = (df['order_date'] - pd.Timestamp('2024-01-01')).dt.days

3.2 Cyclical Encoding

Months (or hours) are cyclical; encode them with sine and cosine to preserve adjacency.

import numpy as np

df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

3.3 Calendar Features

import holidays
indian_holidays = holidays.India(years=2025)

df['is_holiday']   = df['order_date'].apply(lambda d: d in indian_holidays).astype(int)
df['is_month_end'] = df['order_date'].dt.is_month_end.astype(int)

4. Text Features

4.1 Basic Statistics

df['word_count']      = df['review'].str.split().str.len()
df['avg_word_len']    = df['review'].str.len() / df['word_count']
df['has_question']    = df['review'].str.contains(r'\?', na=False).astype(int)
df['uppercase_ratio'] = df['review'].apply(
    lambda x: sum(c.isupper() for c in str(x)) / max(len(str(x)), 1)
)

4.2 TF‑IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2), stop_words='english')
X_tfidf = tfidf.fit_transform(df['review'])
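
X_tfidf is a sparse matrix; to attach it to df as named columns (reasonable at max_features=100, assuming scikit‑learn ≥ 1.0 for get_feature_names_out):

tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
df = pd.concat([df.reset_index(drop=True), tfidf_df], axis=1)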

4.3 Sentiment Score

from textblob import TextBlob

df['sentiment'] = df['review'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
# Range: -1 (very negative) to 1 (very positive)

4.4 Sentence Embeddings

A modern alternative: use a pre‑trained model to obtain dense vectors that capture semantics; these often outperform TF‑IDF in deep‑learning pipelines.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['review'].tolist())
# Shape: (n_rows, 384) – each row becomes 384 numeric features
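
A minimal sketch for attaching the vectors as model‑ready columns:

emb_df = pd.DataFrame(embeddings, columns=[f'emb_{i}' for i in range(embeddings.shape[1])])
df = pd.concat([df.reset_index(drop=True), emb_df], axis=1)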

5. Geospatial Features

5.1 Distance to a Landmark

from math import radians, sin, cos, sqrt, atan2

def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth's mean radius in km
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))

city_centre = (28.6139, 77.2090)

df['dist_to_centre_km'] = df.apply(
    lambda r: haversine(r['lat'], r['lon'], *city_centre), axis=1)

5.2 Geohash

Encodes latitude/longitude into a short string; each prefix corresponds to a geographic region, useful for spatial aggregation.

import pygeohash as pgh

df['geohash_5'] = df.apply(lambda r: pgh.encode(r['lat'], r['lon'], precision=5), axis=1)
# precision 5 ≈ 5 km area
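
Spatial aggregation then reduces to an ordinary groupby; a minimal sketch, assuming a hypothetical price column:

region_stats = (
    df.groupby('geohash_5')['price']
      .agg(region_avg_price='mean', region_listings='count')
      .reset_index()
)
df = df.merge(region_stats, on='geohash_5', how='left')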

6. Aggregation Features

Aggregated features are especially valuable in production ML systems for customer‑behavior and transaction data.

6.1 Group Aggregations

stats = df.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    total_spent=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    max_order=('amount', 'max')
).reset_index()

df = df.merge(stats, on='customer_id', how='left')

6.2 Lag and Rolling Features

For time‑series data, the past N periods often provide the strongest predictive signal.

df = df.sort_values(['customer_id', 'order_date'])

df['prev_order_amount'] = df.groupby('customer_id')['amount'].shift(1)
df['amount_change'] = df['amount'] - df['prev_order_amount']

df['rolling_3_order_spend'] = (
    df.groupby('customer_id')['amount']
      .transform(lambda x: x.rolling(3).sum())  # sum over each customer's last 3 orders
)
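
If a true 30‑day window is needed rather than the last three orders, pandas supports time‑based rolling on a datetime index; a minimal sketch that relies on the sort order established above:

rolled = (
    df.set_index('order_date')
      .groupby('customer_id')['amount']
      .rolling('30D')
      .sum()
)
# rolled is indexed by (customer_id, order_date) in the same order as the
# sorted df, so positional assignment lines up
df['rolling_30d_spend'] = rolled.to_numpy()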

7. Feature Selection

Creating features is only half the work; the other half is discarding the useless ones.

7.1 Remove Low‑Variance Features

from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

7.2 Remove Highly Correlated Features

corr = df.corr(numeric_only=True).abs()  # numeric_only avoids errors on non-numeric columns
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
df.drop(columns=to_drop, inplace=True)

7.3 Feature Importance (Tree Models)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
importance = pd.Series(model.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(20))

7.4 SHAP Values

SHAP explains not only which features are important but also the direction and magnitude of each feature’s impact on individual predictions.

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)

8. Automated Feature Engineering

When the combinatorial space of candidate features becomes huge, manual construction is impractical. Use a program to generate features in bulk and then let the selection pipeline filter them.

import featuretools as ft
es = ft.EntitySet(id='orders')
es = es.add_dataframe(dataframe_name='orders', dataframe=df,
                      index='order_id', time_index='order_date')
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='orders',
    agg_primitives=['sum', 'mean', 'count', 'max', 'std'],
    trans_primitives=['month', 'weekday', 'is_weekend'],
    max_depth=2
)
print(f"Generated {len(feature_defs)} features automatically")

After generation, apply variance filtering, correlation filtering, and then inspect feature‑importance scores; the remaining features are the ones worth keeping.
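
A minimal filtering sketch over the generated matrix, reusing the illustrative thresholds from Section 7:

import numpy as np

fm = feature_matrix.select_dtypes('number').fillna(0)
fm = fm.loc[:, fm.var() > 0.01]  # drop near-constant features
corr = fm.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
fm = fm.drop(columns=[c for c in upper.columns if any(upper[c] > 0.95)])
print(f"{fm.shape[1]} features survive filtering")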

Conclusion

Feature engineering sits at the intersection of domain knowledge and technical skill. No algorithm can compensate for poor features. Engineers who continuously produce high‑quality models are those who understand the data deeply, start with simple features, quantify incremental gains, and only add complexity when necessary.
