Feature Engineering for Structured Data: Normalization, Encoding & Interaction
This article covers the fundamentals of feature engineering for structured data: why and how to normalize numerical features, common categorical encoding techniques, how to handle high-dimensional interaction features, and decision-tree-based strategies for efficiently discovering valuable feature combinations.
Scenario Description
Feature engineering involves finding effective features for a given problem and transforming them into a suitable input format for models. The classic "Garbage in, garbage out" principle highlights that model performance depends not only on the choice of algorithm but also on the quality of the input features.
Problem Description
Why normalize numerical features?
How to handle categorical features?
How to process high‑dimensional interaction features?
How to efficiently discover useful feature combinations?
Answer and Analysis
1. Why normalize numerical features?
Normalization scales all numerical features to a similar range. The most common method is z-score normalization, z = (x − μ) / σ, which subtracts the mean μ and divides by the standard deviation σ. This prevents features with larger ranges from dominating gradient-based optimization such as stochastic gradient descent.
For example, a feature x₁ ranging over [0, 10] and a feature x₂ ranging over [0, 3] update at different speeds under the same learning rate. After scaling both to the same interval, their gradient updates become comparable, allowing faster convergence.
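A minimal sketch of z-score normalization in NumPy follows (the toy matrix is hypothetical). In practice a utility such as scikit-learn's StandardScaler does the same thing while remembering the training-set μ and σ for reuse at inference time:

```python
import numpy as np

# Hypothetical two-feature matrix: x1 ranges over [0, 10], x2 over [0, 3].
X = np.array([[0.0, 0.0],
              [5.0, 1.5],
              [10.0, 3.0]])

# z-score normalization: z = (x - mu) / sigma, computed per column.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # ~0 for each feature
print(X_norm.std(axis=0))   # ~1 for each feature
```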
Normalization matters for models trained with gradient descent (linear regression, logistic regression, SVMs, neural networks) but not for decision-tree-based models such as C4.5, where splits depend on the relative ordering of values within a feature rather than on their absolute scale.
2. How to handle categorical features?
Categorical features (e.g., gender, blood type) are typically represented as strings. Models such as logistic regression or linear SVM require numeric input, so conversion is necessary. Common encoding methods include:
Ordinal Encoding: assign each category an integer ID that preserves any inherent order (e.g., grades high/medium/low → 3/2/1).
One-hot Encoding: represent each category as a sparse binary vector with a single 1 in the position of that category.
Binary Encoding: first assign each category an ordinal ID, then write that ID in binary, yielding far fewer dimensions than one-hot.
These encodings save space when the vectors are stored sparsely and can be combined with feature selection to mitigate over-fitting in high-dimensional settings; all three are sketched below.
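A minimal NumPy sketch on a hypothetical blood-type feature; scikit-learn's OrdinalEncoder and OneHotEncoder (and the third-party category_encoders package's BinaryEncoder) provide production-grade versions:

```python
import numpy as np

# Hypothetical blood-type feature with 4 categories.
categories = ["A", "B", "AB", "O"]
values = ["B", "O", "A", "AB"]

# Ordinal encoding: map each category to an integer ID (starting at 0 here).
ordinal = {c: i for i, c in enumerate(categories)}
ids = np.array([ordinal[v] for v in values])  # [1, 3, 0, 2]

# One-hot encoding: one binary column per category (4 dimensions).
one_hot = np.eye(len(categories), dtype=int)[ids]

# Binary encoding: write each ordinal ID in binary (ceil(log2(4)) = 2 dims).
n_bits = int(np.ceil(np.log2(len(categories))))
binary = (ids[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1

print(ids)      # integer IDs
print(one_hot)  # 4-dimensional binary vectors, one 1 per row
print(binary)   # 2-dimensional binary codes
```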
3. How to process high‑dimensional interaction features?
Interaction features are created by combining pairs of discrete features, increasing model capacity to capture complex relationships. However, naïve combination of high‑cardinality ID features leads to an explosion of parameters (e.g., m × n for user‑item IDs).
Dimensionality-reduction techniques such as low-rank factorization represent each entity with a k-dimensional vector (k ≪ m, k ≪ n), cutting the parameter count from m·n to m·k + n·k; this is essentially the matrix-factorization view familiar from recommender systems.
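A sketch of the parameter saving with hypothetical sizes m, n, and k; in a real system the vectors would be learned, for example by matrix factorization, rather than drawn at random:

```python
import numpy as np

# Hypothetical sizes: m users, n items, k-dimensional latent vectors.
m, n, k = 10_000, 5_000, 16

# Naive crossing of user ID x item ID needs m * n = 50M weights.
# Low-rank factorization keeps only m*k + n*k = 240K parameters.
rng = np.random.default_rng(0)
user_vecs = rng.normal(scale=0.1, size=(m, k))  # one k-dim vector per user
item_vecs = rng.normal(scale=0.1, size=(n, k))  # one k-dim vector per item

def interaction_score(user_id: int, item_id: int) -> float:
    """Weight of the (user, item) cross, modeled as an inner product
    <u_i, v_j> instead of a free parameter w_ij."""
    return float(user_vecs[user_id] @ item_vecs[item_id])

print(interaction_score(42, 7))
```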
As a concrete example of interaction creation, consider ad click-through prediction, where a language feature and a content-type feature are crossed into a single combined feature, as in the sketch below.
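A minimal sketch of such a cross with pandas, on hypothetical data; the combined column can then be encoded like any other categorical feature:

```python
import pandas as pd

# Hypothetical ad-click samples with two discrete features.
df = pd.DataFrame({
    "language": ["Chinese", "English", "Chinese", "English"],
    "type":     ["movie",   "movie",   "series",  "series"],
    "clicked":  [1, 0, 0, 1],
})

# Cross the two features into a single combined categorical feature.
df["language_x_type"] = df["language"] + "_" + df["type"]
print(df["language_x_type"].tolist())
# ['Chinese_movie', 'English_movie', 'Chinese_series', 'English_series']
```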
4. How to efficiently find useful feature combinations?
A decision-tree-based approach can discover valuable feature combinations automatically. Build a gradient-boosted ensemble of decision trees; each root-to-leaf path then encodes a conjunction of feature conditions (e.g., age ≤ 35 AND gender = female), and whether a sample falls into each leaf yields new binary interaction features.
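A sketch of this pipeline with scikit-learn on synthetic data (the model sizes are illustrative): the leaf indices returned by apply() are one-hot encoded into binary combination features that a linear model such as logistic regression can consume.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical binary-classification data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Each of the 50 trees partitions samples by root-to-leaf paths,
# i.e., learned conjunctions of feature conditions.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X, y)

# apply() returns, for every sample, the leaf index reached in each tree;
# the shape is (n_samples, n_estimators, 1) for binary classification.
leaves = gbdt.apply(X).reshape(X.shape[0], -1)

# One-hot encode the leaf indices to get binary combination features.
combo_features = OneHotEncoder().fit_transform(leaves)
print(combo_features.shape)  # (1000, total number of leaves across trees)
```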