
Feature Engineering: Mapping Raw Data to Machine‑Learning Features and Best Practices

This article explains how feature engineering transforms raw data into numerical representations for machine‑learning models, covering mapping of numeric and categorical values, one‑hot and multi‑hot encoding, sparse representations, scaling, handling outliers, binning, data quality checks, and feature interactions to capture non‑linear relationships.

DataFunTalk

Feature engineering is the process of converting raw data into feature vectors that machine‑learning models can consume, shifting the focus from code to feature representation.

Mapping Raw Data to Features

Most ML models require features as real‑valued vectors because feature values are multiplied by model weights. Figure 1 illustrates how raw input data on the left is transformed into a feature vector on the right.

Mapping Numerical Values

Integers and floats can be used directly; converting an integer 6 to a float 6.0 adds little value (see Figure 2).

Mapping Categorical Values

For a categorical feature such as street_name with possible values {"Charleston Road", "North Shoreline Boulevard", "Shorebird Way", "Rengstorff Avenue"}, we create a vocabulary that maps each string to an integer and reserve an OOV bucket for unseen streets.

Charleston Road → 0

North Shoreline Boulevard → 1

Shorebird Way → 2

Rengstorff Avenue → 3

All other streets (OOV) → 4
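The vocabulary above can be sketched in plain Python. The helper name `street_to_index` is illustrative, not from the original article:

```python
# Vocabulary mapping each known street to an integer, plus an OOV bucket.
VOCAB = {
    "Charleston Road": 0,
    "North Shoreline Boulevard": 1,
    "Shorebird Way": 2,
    "Rengstorff Avenue": 3,
}
OOV_INDEX = 4  # bucket for any street not in the vocabulary

def street_to_index(name: str) -> int:
    """Map a street name to its integer index, falling back to the OOV bucket."""
    return VOCAB.get(name, OOV_INDEX)
```

Any unseen street, such as a newly built road, lands in the shared OOV bucket rather than breaking the mapping.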

Using raw indices directly imposes two limitations: a single weight is learned for all streets, and multiple street values (e.g., a corner house) cannot be represented.

One‑Hot and Multi‑Hot Encoding

To overcome these limits, we create a binary vector for each categorical value (one‑hot) or allow multiple 1s (multi‑hot). Figure 3 shows the one‑hot encoding for "Shorebird Way".
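Both encodings can be sketched in a few lines; the function names here are illustrative:

```python
def one_hot(index: int, size: int) -> list[int]:
    """Binary vector with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def multi_hot(indices: list[int], size: int) -> list[int]:
    """Binary vector with a 1 at each of several indices."""
    vec = [0] * size
    for i in indices:
        vec[i] = 1
    return vec
```

With the street vocabulary above, `one_hot(2, 5)` encodes "Shorebird Way", while `multi_hot([0, 2], 5)` could represent a corner house on both Charleston Road and Shorebird Way.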

Sparse Representation

When the vocabulary contains millions of categories (e.g., street names), a dense binary vector would be inefficient. Sparse representations store only the non‑zero indices, dramatically reducing memory and computation.
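A minimal sketch of the idea: store only the non-zero entries as an index-to-value map, and compute dot products over those entries alone.

```python
def to_sparse(dense: list[float]) -> dict[int, float]:
    """Keep only the non-zero entries of a dense vector as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def sparse_dot(sparse: dict[int, float], weights: list[float]) -> float:
    """Dot product that touches only the stored non-zero entries."""
    return sum(v * weights[i] for i, v in sparse.items())
```

For a one-hot vector over millions of streets, this stores a single index instead of millions of zeros.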

Characteristics of Good Features

Avoid Rare Discrete Values

Features should appear at least ~5 times in the dataset so the model can learn meaningful associations. Extremely rare values like unique_house_id: 8SK982ZZ1242Z provide no learning signal.

Clear and Unambiguous Meaning

Features must be interpretable. For example, house_age: 27 clearly denotes age in years, whereas house_age: 851472000 is meaningless without context.

No Special Placeholder Values

Floating‑point features should not contain out‑of‑range sentinel values. If a missing value is represented by -1, split it into two features: the original value (without the sentinel) and a boolean flag is_quality_rating_defined.
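A sketch of this split, assuming -1 marks a missing rating; the fill value 0.0 and the helper name are illustrative choices:

```python
def split_sentinel(value: float, sentinel: float = -1.0) -> tuple[float, int]:
    """Replace a sentinel-coded value with a (value, is_defined) pair.

    The second element plays the role of is_quality_rating_defined.
    """
    if value == sentinel:
        return 0.0, 0  # missing: neutral fill value, flag off
    return value, 1    # present: keep the value, flag on
```

The model can now learn a weight for the flag itself instead of treating -1 as a real rating.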

Consider Upstream Instability

Feature definitions should be stable over time. Using stable identifiers like city_id: "br/sao_paulo" is safer than numeric codes that may change (e.g., inferred_city_cluster: "219").

Data Cleaning (Representation)

Just as a fruit vendor removes bad apples, ML engineers must filter out unreliable samples, missing values, duplicates, mislabeled data, and erroneous feature values.

Scaling Features

Scaling transforms raw ranges (e.g., 100–900) into a standard range such as [0, 1] or [−1, +1]. Benefits include faster gradient‑descent convergence, avoidance of NaN traps, and balanced weight learning across features.

Linear min‑max scaling: map [min, max] → [–1, +1]

Z‑score scaling: scaled_value = (value - mean) / stddev

Example: mean = 100, stddev = 20, raw = 130 → scaled = 1.5.
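Both scaling rules from above, as a minimal sketch:

```python
def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Linearly map [lo, hi] onto [-1, +1]."""
    return 2 * (value - lo) / (hi - lo) - 1

def z_score(value: float, mean: float, stddev: float) -> float:
    """Standard z-score scaling: (value - mean) / stddev."""
    return (value - mean) / stddev
```

Using the article's numbers, `z_score(130, 100, 20)` gives 1.5, and `min_max_scale` maps the midpoint of a 100–900 range to 0.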

Handling Extreme Outliers

Clipping extreme values (e.g., capping roomsPerPerson at 4.0) reduces the long tail while preserving useful information.
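Clipping is a one-liner; the bounds here mirror the roomsPerPerson example:

```python
def clip(value: float, lo: float, hi: float) -> float:
    """Cap value to the range [lo, hi]."""
    return max(lo, min(value, hi))
```

A data-entry error like `roomsPerPerson = 50.0` is capped to 4.0 instead of dominating the feature's range.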

Binning

Continuous features like latitude can be discretized into bins and then one‑hot encoded; for example, eleven latitude bins yield a single 11‑element one‑hot vector.
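A sketch of bucketizing a continuous value into a one-hot bin vector; the boundary list below (whole-degree latitudes) is an illustrative assumption:

```python
def bucketize(value: float, boundaries: list[float]) -> int:
    """Return the bin index of value given sorted bin boundaries.

    N boundaries define N + 1 bins, including the two open-ended tails.
    """
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

def binned_one_hot(value: float, boundaries: list[float]) -> list[int]:
    """One-hot encode the bin that value falls into."""
    vec = [0] * (len(boundaries) + 1)
    vec[bucketize(value, boundaries)] = 1
    return vec

# Ten boundaries at whole degrees -> eleven latitude bins.
LAT_BOUNDARIES = [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
```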

Data Auditing

Verify data quality by checking for missing values, duplicates, bad labels, and anomalous feature values. Compute summary statistics (min, max, mean, median, stddev) and inspect frequent discrete values (e.g., country:uk).
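These checks can be sketched with the standard library alone; the helper names `audit` and `top_values` are illustrative:

```python
import statistics
from collections import Counter

def audit(values: list[float]) -> dict:
    """Summary statistics for one numeric feature column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stddev": statistics.pstdev(values),
    }

def top_values(column: list[str], k: int = 3) -> list[tuple[str, int]]:
    """Most frequent discrete values, e.g. to spot anomalies like country:uk."""
    return Counter(column).most_common(k)
```

An out-of-range `max` or an unexpectedly frequent discrete value in this report is usually the first sign of a data problem.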

Feature Interactions: Encoding Non‑Linear Relationships

Linear models cannot separate non‑linear patterns (Figure 10). Creating interaction features by multiplying two or more inputs (a feature cross) encodes non‑linear information while keeping the model itself linear.

Types of Interactions

[A × B] – product of two features

[A × B × C × D × E] – product of five features

[A × A] – square of a single feature

These interactions allow linear learners to capture complex patterns efficiently.
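The three interaction types above reduce to one product over any number of features; a minimal sketch:

```python
def cross(*features: float) -> float:
    """Product of any number of numeric features: [A x B], [A x A], etc."""
    result = 1.0
    for f in features:
        result *= f
    return result

ab = cross(2.0, 3.0)  # [A x B]
aa = cross(2.0, 2.0)  # [A x A], the square of a single feature
```

The crossed value is then fed to the linear model as just another feature with its own weight.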

Combining One‑Hot Vectors

When categorical features are one‑hot encoded, their interactions become logical conjunctions (e.g., country=USA AND language=Spanish). Crossing binned latitude (5 bins) with binned longitude (5 bins) yields a 25‑element one‑hot vector with one slot per latitude‑longitude pair. The simplified listing below uses three latitude bins and two longitude bins, producing six cross features:

binned_latitude(lat) = [
  0 < lat <= 10,
  10 < lat <= 20,
  20 < lat <= 30
]

binned_longitude(lon) = [
  0 < lon <= 15,
  15 < lon <= 30
]

binned_latitude_X_longitude(lat, lon) = [
  0 < lat <= 10 AND 0 < lon <= 15,
  0 < lat <= 10 AND 15 < lon <= 30,
  10 < lat <= 20 AND 0 < lon <= 15,
  10 < lat <= 20 AND 15 < lon <= 30,
  20 < lat <= 30 AND 0 < lon <= 15,
  20 < lat <= 30 AND 15 < lon <= 30
]
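The cross in the listing above can be sketched as a nested product over the two one-hot vectors:

```python
def cross_one_hot(lat_vec: list[int], lon_vec: list[int]) -> list[int]:
    """Cross two one-hot vectors.

    Element i * len(lon_vec) + j is 1 exactly when lat bin i AND lon bin j
    are both active, matching the AND conditions in the listing above.
    """
    return [a * b for a in lat_vec for b in lon_vec]
```

With three latitude bins and two longitude bins, `cross_one_hot([0, 1, 0], [1, 0])` activates the slot for `10 < lat <= 20 AND 0 < lon <= 15` in a 6-element vector.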

Interactions can also be built from behavioral and temporal features (e.g., [behavior type X time of day] ) to dramatically improve predictive power.

In summary, careful feature engineering—mapping, encoding, scaling, cleaning, and interaction creation—greatly influences the upper bound of model performance, while the choice of algorithm and optimization fine‑tunes the solution.

Thank you for reading.
