Feature Engineering: Mapping Raw Data to Machine‑Learning Features and Best Practices
This article explains how feature engineering transforms raw data into numerical representations for machine‑learning models, covering mapping of numeric and categorical values, one‑hot and multi‑hot encoding, sparse representations, scaling, handling outliers, binning, data quality checks, and feature interactions to capture non‑linear relationships.
Feature engineering is the process of converting raw data into feature vectors that machine‑learning models can consume, shifting the focus from code to feature representation.
Mapping Raw Data to Features
Most ML models require features as real‑valued vectors because feature values are multiplied by model weights. Figure 1 illustrates how raw input data on the left is transformed into a feature vector on the right.
Mapping Numerical Values
Integer and floating‑point values can usually be copied into the feature vector directly; converting an integer such as 6 to the float 6.0 is a trivial transformation (see Figure 2).
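As a minimal sketch of this direct copy (the variable names below are illustrative, not from the original figure):

# Numeric raw values enter the feature vector as-is;
# casting the integer 6 to the float 6.0 is a trivial conversion.
raw_num_rooms = 6
feature_vector = [float(raw_num_rooms)]   # [6.0]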
Mapping Categorical Values
For a categorical feature such as street_name with possible values {"Charleston Road", "North Shoreline Boulevard", "Shorebird Way", "Rengstorff Avenue"}, we create a vocabulary that maps each string to an integer and reserve an out‑of‑vocabulary (OOV) bucket for unseen streets:
Charleston Road → 0
North Shoreline Boulevard → 1
Shorebird Way → 2
Rengstorff Avenue → 3
All other streets (OOV) → 4
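A minimal Python sketch of this lookup, assuming the vocabulary is held in a plain dict (the helper name street_to_index and the unseen street in the example are illustrative):

STREET_VOCAB = {
    "Charleston Road": 0,
    "North Shoreline Boulevard": 1,
    "Shorebird Way": 2,
    "Rengstorff Avenue": 3,
}
OOV_INDEX = 4  # bucket for streets missing from the vocabulary

def street_to_index(street_name):
    # Unseen streets fall into the out-of-vocabulary (OOV) bucket.
    return STREET_VOCAB.get(street_name, OOV_INDEX)

street_to_index("Shorebird Way")   # 2
street_to_index("Castro Street")   # 4 (OOV)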
Using raw integer indices directly imposes two limitations: the model can learn only a single weight that multiplies the street index, forcing an arbitrary ordering on streets, and multiple street values (e.g., a house on a corner) cannot be represented by one index.
One‑Hot and Multi‑Hot Encoding
To overcome these limits, we create a binary vector with one element per vocabulary entry: a one‑hot vector sets exactly one element to 1, while a multi‑hot vector allows several 1s. Figure 3 shows the one‑hot encoding for "Shorebird Way".
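One way to sketch both encodings over the five‑entry street vocabulary above (the function names are illustrative):

def one_hot(index, vocab_size=5):
    # Exactly one element is 1; every other element is 0.
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def multi_hot(indices, vocab_size=5):
    # Several elements may be 1, e.g., a house on the corner of two streets.
    vec = [0] * vocab_size
    for i in indices:
        vec[i] = 1
    return vec

one_hot(2)          # [0, 0, 1, 0, 0] -- "Shorebird Way"
multi_hot([0, 2])   # [1, 0, 1, 0, 0] -- a corner house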
Sparse Representation
When the vocabulary contains millions of categories (e.g., street names), a dense binary vector would be inefficient. Sparse representations store only the non‑zero indices, dramatically reducing memory and computation.
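A common sparse format keeps only the positions of the non‑zero entries; for a binary multi‑hot vector that is just a short list of indices:

# Dense form: one slot per vocabulary entry, almost all zeros.
dense = [0, 0, 1, 0, 1]
# Sparse form: only the positions of the 1s.
sparse_indices = [2, 4]

# With a million-street vocabulary, the dense form needs 1,000,000
# slots per example, while the sparse form grows only with the
# handful of streets actually present.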
Characteristics of Good Features
Avoid Rare Discrete Values
Features should appear at least ~5 times in the dataset so the model can learn meaningful associations. Extremely rare values like unique_house_id: 8SK982ZZ1242Z provide no learning signal.
Clear and Unambiguous Meaning
Features must be interpretable. For example, house_age: 27 clearly denotes age in years, whereas house_age: 851472000 is meaningless without context.
No Special Placeholder Values
Floating‑point features should not contain out‑of‑range sentinel values. If a missing value is represented by -1, split it into two features: the original value (without the sentinel) and a boolean flag is_quality_rating_defined.
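A sketch of the split, assuming quality_rating uses -1 as its sentinel and that missing values are replaced with 0.0 (the default is an assumption, not from the original text):

def split_sentinel(quality_rating):
    is_defined = quality_rating != -1
    # Replace the out-of-range sentinel with a neutral in-range value.
    value = quality_rating if is_defined else 0.0  # 0.0 is an assumed default
    return {
        "quality_rating": value,
        "is_quality_rating_defined": is_defined,
    }

split_sentinel(4.5)   # {'quality_rating': 4.5, 'is_quality_rating_defined': True}
split_sentinel(-1)    # {'quality_rating': 0.0, 'is_quality_rating_defined': False}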
Consider Upstream Instability
Feature definitions should be stable over time. Using a stable identifier such as city_id: "br/sao_paulo" is safer than a numeric code that an upstream system may change (e.g., inferred_city_cluster: "219").
Data Cleaning (Representation)
Just as a fruit vendor removes bad apples, ML engineers must filter out unreliable samples, missing values, duplicates, mislabeled data, and erroneous feature values.
Scaling Features
Scaling transforms raw feature ranges (e.g., 100–900) into a standard range such as [0, 1] or [-1, +1]. Benefits include faster gradient‑descent convergence, avoidance of the "NaN trap", and more balanced weight learning across features.
Linear min‑max scaling: map [min, max] → [–1, +1]
Z‑score scaling: scaled_value = (value - mean) / stddev
Example: mean = 100, stddev = 20, raw = 130 → scaled = 1.5.
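Both scalings as minimal Python functions; the z‑score call reproduces the worked example above:

def min_max_scale(value, min_val, max_val):
    # Linearly map [min_val, max_val] onto [-1, +1].
    return 2 * (value - min_val) / (max_val - min_val) - 1

def z_score_scale(value, mean, stddev):
    return (value - mean) / stddev

min_max_scale(500, 100, 900)   # 0.0, the midpoint of the 100-900 range
z_score_scale(130, 100, 20)    # 1.5, matching the example above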
Handling Extreme Outliers
Clipping extreme values (e.g., capping roomsPerPerson at 4.0) reduces the long tail while preserving useful information.
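A one‑line sketch of the cap described above (the function name is illustrative):

def clip_rooms_per_person(value, cap=4.0):
    # Values above the cap are pulled back to it; nothing is discarded.
    return min(value, cap)

clip_rooms_per_person(1.8)    # 1.8 (unchanged)
clip_rooms_per_person(55.0)   # 4.0 (outlier clipped)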
Binning
Continuous features such as latitude can be discretized into bins and then one‑hot encoded: eleven latitude bins become an 11‑element binary vector with a single 1 marking the bin into which the value falls.
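A sketch of binning plus one‑hot encoding, assuming for illustration eleven one‑degree bins spanning latitudes 32 to 43 (the bin boundaries are an assumption, not from the original text):

def bin_latitude(lat, low=32.0, num_bins=11, width=1.0):
    # Map a latitude to an 11-element one-hot vector; values outside
    # the covered range fall into the nearest edge bin.
    index = int((lat - low) / width)
    index = max(0, min(num_bins - 1, index))
    vec = [0] * num_bins
    vec[index] = 1
    return vec

bin_latitude(37.4)   # 1 in the sixth slot, 0 everywhere else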
Data Auditing
Verify data quality by checking for missing values, duplicates, bad labels, and anomalous feature values. Compute summary statistics (min, max, mean, median, stddev) and inspect the most frequent values of discrete features (e.g., country:uk).
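A sketch of such an audit pass using only the Python standard library (the function names are illustrative):

import statistics
from collections import Counter

def audit_numeric(values):
    # Basic distribution checks for one numeric feature.
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stddev": statistics.stdev(values),
    }

def top_discrete(values, k=5):
    # Most frequent values of a discrete feature, e.g., to spot an
    # implausibly common country:uk.
    return Counter(values).most_common(k)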
Feature Interactions: Encoding Non‑Linear Relationships
Linear models cannot separate non‑linear patterns (Figure 10). Creating interaction features by multiplying two or more inputs (a feature cross) encodes non‑linear information while keeping the model itself linear.
Types of Interactions
[A × B] – product of two features
[A × B × C × D × E] – product of five features
[A × A] – square of a single feature
These interactions allow linear learners to capture complex patterns efficiently.
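A sketch of the three interaction types as plain arithmetic on a feature vector (the values are illustrative):

a, b, c, d, e = 1.5, 2.0, 0.5, 3.0, 1.0

cross_ab = a * b                   # [A x B]
cross_abcde = a * b * c * d * e    # [A x B x C x D x E]
cross_aa = a * a                   # [A x A], the squared feature

# The crossed values are appended as extra inputs, so the model itself
# remains linear while seeing non-linear combinations of the raw features.
extended_features = [a, b, cross_ab, cross_aa]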
Combining One‑Hot Vectors
When categorical features are one‑hot encoded, their interactions become logical conjunctions (e.g., country=USA AND language=Spanish). Crossing latitude binned into 5 buckets with longitude binned into 5 buckets, for example, yields a 25‑element one‑hot vector with one element per latitude‑longitude pair; the smaller sketch below crosses 3 latitude bins with 2 longitude bins to produce a 6‑element vector.
def binned_latitude(lat):
    # Three latitude buckets, one boolean per bucket.
    return [0 < lat <= 10, 10 < lat <= 20, 20 < lat <= 30]

def binned_longitude(lon):
    # Two longitude buckets.
    return [0 < lon <= 15, 15 < lon <= 30]

def binned_latitude_X_longitude(lat, lon):
    # The cross is the logical AND of every latitude bucket with
    # every longitude bucket: 3 x 2 = 6 elements.
    return [
        (0 < lat <= 10) and (0 < lon <= 15),
        (0 < lat <= 10) and (15 < lon <= 30),
        (10 < lat <= 20) and (0 < lon <= 15),
        (10 < lat <= 20) and (15 < lon <= 30),
        (20 < lat <= 30) and (0 < lon <= 15),
        (20 < lat <= 30) and (15 < lon <= 30),
    ]

Interactions can also be built from behavioral and temporal features (e.g., [behavior type X time of day]) to dramatically improve predictive power.
In summary, careful feature engineering—mapping, encoding, scaling, cleaning, and interaction creation—greatly influences the upper bound of model performance, while the choice of algorithm and optimization fine‑tunes the solution.
Thank you for reading.
DataFunTalk