Mastering Machine Learning Feature Engineering: Scaling, Encoding, Aggregation, Embedding, and Automation
The article explains why good features matter more than fancy algorithms and walks through practical techniques—scaling, log transforms, binning, interaction features, various encoding schemes, datetime extraction, text statistics, geospatial distances, aggregation, feature selection, and automated feature generation—illustrated with concrete pandas and scikit‑learn code examples.
Why Features Matter
Good models rely more on high‑quality features than on sophisticated algorithms. The author emphasizes that better features, not flashier models, are the key to performance.
1. Numerical Features
1.1 Scaling
Most ML algorithms are sensitive to feature scale. A column ranging from 0‑1,000,000 will dominate a column ranging from 0‑1. Three common scalers are described:
StandardScaler: best for approximately normal data.
MinMaxScaler: compresses values to [0, 1], suitable for neural networks.
RobustScaler: uses median and IQR, robust to outliers.
from sklearn.preprocessing import RobustScaler
df['salary_scaled'] = RobustScaler().fit_transform(df[['salary']])

⚠️ Scalers must be fitted on the training set only to avoid data leakage.
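A minimal sketch of that leakage‑safe pattern, assuming the data has already been split into hypothetical train and test DataFrames; the same fit‑on‑train, transform‑on‑test flow applies to MinMaxScaler and RobustScaler as well.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit only on the training split, then reuse those statistics for the test split
train['salary_scaled'] = scaler.fit_transform(train[['salary']])
test['salary_scaled'] = scaler.transform(test[['salary']])
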
1.2 Log Transform
Right‑skewed numeric columns (e.g., income, price, revenue) benefit from a log transform to flatten the distribution.
import numpy as np
df['revenue_log'] = np.log1p(df['revenue'])  # log1p safely handles zeros

1.3 Binning
Converting continuous values to categorical bins can be useful. pd.cut() creates equal‑width bins, while pd.qcut() creates quantile‑based bins with equal sample counts.
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 55, 100],
                         labels=['teen', 'young_adult', 'adult', 'senior'])
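For comparison, a minimal pd.qcut() sketch on the salary column from the scaling example, with illustrative quartile labels; each bin receives roughly the same number of rows.

df['salary_quartile'] = pd.qcut(df['salary'], q=4,
                                labels=['q1', 'q2', 'q3', 'q4'])
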
1.4 Interaction Features
Combining two features often yields more expressive power than using them separately. Examples include price per square foot and debt‑to‑income ratio.
df['price_per_sqft'] = df['price'] / df['sqft']
df['debt_to_income'] = df['debt'] / df['income']

For linear models, polynomial features can capture non‑linear relationships.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['age', 'salary']])
# Creates: age, salary, age², salary², age × salary

1.5 Outlier Clipping
Instead of dropping outliers, truncate them to a reasonable percentile range.
lower = df['salary'].quantile(0.01)
upper = df['salary'].quantile(0.99)
df['salary_clipped'] = df['salary'].clip(lower=lower, upper=upper)

2. Categorical Features
2.1 One‑Hot Encoding
Expands each nominal category into a separate 0/1 column. Suitable when the column has few unique values.
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)

⚠️ A column with 500 unique categories would create 500 new columns; in such cases target encoding is preferable.
2.2 Label Encoding
Assigns an integer to each category; appropriate only when the categories have an inherent order.
df['education'] = df['education'].map({
'High School': 0,
'Bachelor': 1,
'Master': 2,
'PhD': 3
})

Do not label‑encode purely nominal data like city names, as the model may infer false ordinal relationships.
2.3 Target Encoding
Replaces each category with the mean of the target variable for that group, effective for high‑cardinality columns.
from category_encoders import TargetEncoder
df['city_encoded'] = TargetEncoder().fit_transform(df['city'], df['churn'])

⚠️ Risk of leakage; use cross‑fold target encoding in production.
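A minimal sketch of that cross‑fold (out‑of‑fold) variant, assuming the same city and churn columns; each row is encoded with target means computed only on the other folds, and the city_te_cv column name is purely illustrative.

import numpy as np
from sklearn.model_selection import KFold

df['city_te_cv'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Means come from the other folds only, so a row never sees its own target
    fold_means = df.iloc[train_idx].groupby('city')['churn'].mean()
    df.loc[df.index[val_idx], 'city_te_cv'] = df.iloc[val_idx]['city'].map(fold_means)
# Categories unseen in a fold stay NaN; fall back to the global churn rate
df['city_te_cv'] = df['city_te_cv'].fillna(df['churn'].mean())
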
2.4 Frequency Encoding
Replaces each category with its occurrence frequency. Simple yet often surprisingly effective for tree‑based models.
freq_map = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq_map)

2.5 Binary Encoding
A compromise between one‑hot and label encoding, producing fewer columns for high‑cardinality features.
from category_encoders import BinaryEncoder
df_encoded = BinaryEncoder().fit_transform(df[['city']])
# 100 categories → only 7 binary columns

3. Datetime Features
Raw dates are rarely useful directly; extract informative components.
3.1 Standard Extraction
df['order_date'] = pd.to_datetime(df['order_date'])
df['month'] = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['order_date'].dt.quarter
df['days_since'] = (df['order_date'] - pd.Timestamp('2024-01-01')).dt.days

3.2 Cyclical Encoding
Months (or hours) are cyclical; encode them with sine and cosine to preserve adjacency.
import numpy as np
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

3.3 Calendar Features
import holidays
indian_holidays = holidays.India(years=2025)
df['is_holiday'] = df['order_date'].apply(lambda d: d in indian_holidays).astype(int)
df['is_month_end'] = df['order_date'].dt.is_month_end.astype(int)

4. Text Features
4.1 Basic Statistics
df['word_count'] = df['review'].str.split().str.len()
df['avg_word_len'] = df['review'].str.len() / df['word_count']
df['has_question'] = df['review'].str.contains(r'\?').astype(int)
df['uppercase_ratio'] = df['review'].apply(
lambda x: sum(c.isupper() for c in str(x)) / max(len(str(x)), 1)
)

4.2 TF‑IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2), stop_words='english')
X_tfidf = tfidf.fit_transform(df['review'])

4.3 Sentiment Score
from textblob import TextBlob
df['sentiment'] = df['review'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
# Range: -1 (very negative) to 1 (very positive)

4.4 Sentence Embeddings
Modern approach: use a pre‑trained model to obtain dense vectors that capture semantics, which often outperforms TF‑IDF in deep‑learning pipelines.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['review'].tolist())
# Shape: (n_rows, 384) – each row becomes 384 numeric features
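To feed these vectors into a tabular model alongside the other features, one option is to expand them into columns; a minimal sketch, where the emb_ prefix is just an illustrative naming choice.

import pandas as pd

emb_cols = [f'emb_{i}' for i in range(embeddings.shape[1])]
emb_df = pd.DataFrame(embeddings, columns=emb_cols, index=df.index)
df = pd.concat([df, emb_df], axis=1)
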
5. Geospatial Features
5.1 Distance to a Landmark
from math import radians, sin, cos, sqrt, atan2
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth's radius in km
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2-lat1)/2)**2 + cos(lat1)*cos(lat2)*sin((lon2-lon1)/2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1-a))
city_centre = (28.6139, 77.2090)
df['dist_to_centre_km'] = df.apply(
    lambda r: haversine(r['lat'], r['lon'], *city_centre), axis=1)

5.2 Geohash
Encodes latitude/longitude into a short string; each prefix corresponds to a geographic region, useful for spatial aggregation.
import pygeohash as pgh
df['geohash_5'] = df.apply(lambda r: pgh.encode(r['lat'], r['lon'], precision=5), axis=1)
# precision 5 ≈ 5 km area

6. Aggregation Features
Aggregated features are especially valuable in production ML systems for customer‑behavior and transaction data.
6.1 Group Aggregations
stats = df.groupby('customer_id').agg(
total_orders=('order_id', 'count'),
total_spent=('amount', 'sum'),
avg_order_value=('amount', 'mean'),
max_order=('amount', 'max')
).reset_index()
df = df.merge(stats, on='customer_id', how='left')

6.2 Lag and Rolling Features
Past N periods often provide the strongest predictive signal for time‑series data.
df = df.sort_values(['customer_id', 'order_date'])
df['prev_order_amount'] = df.groupby('customer_id')['amount'].shift(1)
df['amount_change'] = df['amount'] - df['prev_order_amount']
df['rolling_3_order_spend'] = (
    df.groupby('customer_id')['amount']
    .transform(lambda x: x.rolling(3).sum())  # sum over each customer's last 3 orders
)

7. Feature Selection
Creating features is only half the work; the other half is discarding the useless ones.
7.1 Remove Low‑Variance Features
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

7.2 Remove Highly Correlated Features
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
df.drop(columns=to_drop, inplace=True)

7.3 Feature Importance (Tree Models)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
importance = pd.Series(model.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(20))

7.4 SHAP Values
SHAP explains not only which features are important but also the direction and magnitude of each feature’s impact on individual predictions.
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)

8. Automated Feature Engineering
When the combinatorial space of candidate features becomes huge, manual construction is impractical. Use a program to generate features in bulk and then let the selection pipeline filter them.
import featuretools as ft
es = ft.EntitySet(id='orders')
es = es.add_dataframe(dataframe_name='orders', dataframe=df,
index='order_id', time_index='order_date')
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name='orders',
agg_primitives=['sum', 'mean', 'count', 'max', 'std'],
trans_primitives=['month', 'weekday', 'is_weekend'],
max_depth=2
)
print(f"Generated {len(feature_defs)} features automatically")After generation, apply variance filtering, correlation filtering, and then inspect feature‑importance scores; the remaining features are the ones worth keeping.
Conclusion
Feature engineering sits at the intersection of domain knowledge and technical skill. No algorithm can compensate for poor features. Engineers who consistently produce high‑quality models are those who understand the data deeply, start with simple features, quantify incremental gains, and only add complexity when necessary.
