
Feature Engineering: Concepts, Methods, and Automation

Feature engineering transforms existing data into new predictive variables, either through manual analysis or automated pipelines. It spans single-variable encoding, pairwise arithmetic, group statistics, multi-variable combinations, and time-series and text derivations, with tools such as Deep Feature Synthesis and beam search to generate and select useful features.


Feature derivation (feature engineering) refers to creating new features from existing data. It can be divided into two main approaches: manual feature derivation, which relies on deep business and data analysis to synthesize interpretable fields, and batch (automated) feature derivation, which uses engineering tricks to generate a large pool of features and then selects useful ones.

The essence of feature derivation is the re‑arrangement of existing information. Manual methods start from ideas, analyze business background or data distribution, and then create features, while batch methods focus on systematic column‑wise transformations to produce as many features as possible.

Feature derivation methods are typically grouped into four categories: single‑variable derivation, double‑variable (or pairwise) derivation, key‑feature derivation, and multi‑variable derivation.

Single‑variable derivation includes data re‑encoding techniques such as:

Normalization (0‑1 min‑max scaling) and standardization (Z‑score)

Discretization (equal‑width, equal‑frequency, clustering bins)

Dictionary encoding and one‑hot encoding for categorical variables

Embedding encoding for high‑cardinality IDs, including graph‑based embeddings

High‑order polynomial features can also be generated from a single variable (e.g., square, cube).
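The single-variable techniques above can be sketched in a few lines of pandas and numpy (the `income` and `city` columns are illustrative, not from the original text):

```python
import pandas as pd

df = pd.DataFrame({'income': [20, 35, 50, 80, 120],
                   'city': ['a', 'b', 'a', 'c', 'b']})

# 0-1 (min-max) scaling and Z-score standardization
df['income_01'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())
df['income_z'] = (df['income'] - df['income'].mean()) / df['income'].std()

# Discretization: equal-width and equal-frequency bins
df['income_widthbin'] = pd.cut(df['income'], bins=3, labels=False)
df['income_freqbin'] = pd.qcut(df['income'], q=3, labels=False)

# Dictionary (ordinal) encoding and one-hot encoding for a categorical column
df['city_code'] = df['city'].astype('category').cat.codes
df = pd.concat([df, pd.get_dummies(df['city'], prefix='city')], axis=1)

# High-order polynomial features from a single column
df['income_sq'] = df['income'] ** 2
df['income_cube'] = df['income'] ** 3
```

Embedding encoding for high-cardinality IDs needs a trained model (or a graph-embedding library) and is not shown here.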

Double‑variable derivation often uses arithmetic operations between two columns. Typical use cases are:

Creating business supplement fields (e.g., total daily spend)

Combining all derived features after a full derivation pipeline

Special competition features such as golden‑combination or traffic‑smoothing features

Example code for common arithmetic derivations (assuming numeric `cost`, `mean`, and `var` columns; the small constant guards against division by zero):

df['cost_mean_ratio'] = df['cost'] / (df['mean'] + 1e-5)

df['cost_mean_diff'] = df['cost'] - df['mean']

df['cost_mean_z'] = (df['cost'] - df['mean']) / (np.sqrt(df['var']) + 1e-5)

Cross‑combination features are created by pairing the levels of categorical variables; the number of resulting features grows with the product of the variables' cardinalities and with the number of variables combined, so they should be used judiciously.
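A minimal pandas sketch of a two-way cross-combination (the `gender` and `plan` columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                   'plan':   ['basic', 'basic', 'pro', 'pro']})

# Pair the levels of two categorical columns into one combined feature
df['gender_x_plan'] = df['gender'].astype(str) + '_' + df['plan'].astype(str)

# One-hot encoding the combination yields up to (levels of gender) x
# (levels of plan) columns, which is why crosses must be used sparingly
cross = pd.get_dummies(df['gender_x_plan'], prefix='cross')
```

Here two binary columns already produce four cross columns; three ten-level columns would produce up to a thousand.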

Group‑statistics derivation aggregates a target variable based on the values of a grouping key (e.g., mean, variance, median). Example code:

def q1(x):
    """Lower quartile (25th percentile)."""
    return x.quantile(0.25)

def q2(x):
    """Upper quartile (75th percentile)."""
    return x.quantile(0.75)

d1 = pd.DataFrame({'x1': [3, 2, 4, 4, 2, 2],
                   'x2': [0, 1, 1, 0, 0, 0]})
aggs = {'x1': [q1, q2]}
d2 = d1.groupby('x2').agg(aggs).reset_index()

After group statistics, further arithmetic can be applied, e.g., flow‑smoothing, golden‑combination, or intra‑group normalization.
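For example, intra-group normalization and a flow-smoothing-style ratio can be built by broadcasting group statistics back to each row with `transform` (the `user`/`cost` columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                   'cost': [10.0, 30.0, 100.0, 300.0]})

# Group statistics broadcast back to each row
grp = df.groupby('user')['cost']
df['cost_grp_mean'] = grp.transform('mean')
df['cost_grp_std'] = grp.transform('std')

# Intra-group normalization: each row's z-score within its own group
df['cost_grp_z'] = (df['cost'] - df['cost_grp_mean']) / (df['cost_grp_std'] + 1e-5)

# Flow-smoothing-style ratio: row value relative to its group mean
df['cost_grp_ratio'] = df['cost'] / (df['cost_grp_mean'] + 1e-5)
```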

Multi‑variable derivation extends double‑variable techniques to three or more columns, including cross‑combination, group‑statistics, and higher‑order polynomial features. The same caution about feature explosion applies.

Automated feature derivation includes algorithms such as Deep Feature Synthesis (DFS) used by FeatureTools. DFS builds a relational graph of tables, follows relationship paths, and applies aggregation functions to generate features automatically. Relation types include forward (one‑to‑one) and backward (one‑to‑many) links.
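FeatureTools implements DFS for you; the core of a backward (one-to-many) relationship can be sketched in plain pandas as a child-to-parent aggregation (the `customers`/`orders` tables and column names are illustrative, not FeatureTools API):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2]})
orders = pd.DataFrame({'order_id': [10, 11, 12],
                       'customer_id': [1, 1, 2],
                       'amount': [5.0, 7.0, 3.0]})

# Backward (one-to-many) link: aggregate child rows up to the parent,
# which is what DFS does along each relationship path in the table graph
agg = (orders.groupby('customer_id')['amount']
             .agg(['mean', 'sum', 'count'])
             .add_prefix('orders_amount_')
             .reset_index())

# Forward (one-to-one) link back to the parent table is a plain join
features = customers.merge(agg, on='customer_id', how='left')
```

DFS generalizes this by stacking such aggregations along deeper relationship paths, e.g. "mean of the sums of a customer's orders' line items".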

Beam Search is employed by AutoCross to greedily generate higher‑order features from promising second‑order combinations, followed by feature selection methods like field‑wise logistic regression.
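The beam-search idea can be sketched generically: at each order, keep only the top-scoring crosses and extend just those (a toy sketch, not the AutoCross implementation; the scoring function stands in for field-wise model evaluation):

```python
from itertools import combinations

def beam_search_crosses(base_features, score_fn, beam_width=2, max_order=3):
    """Greedy beam search over feature crosses: keep only the best
    beam_width candidates at each order and extend only those."""
    # Order-2 candidates: all pairs of base features
    beam = sorted((tuple(sorted(c)) for c in combinations(base_features, 2)),
                  key=score_fn, reverse=True)[:beam_width]
    selected = list(beam)
    for _ in range(3, max_order + 1):
        # Extend each surviving cross by one more base feature
        candidates = {tuple(sorted(set(cross) | {f}))
                      for cross in beam for f in base_features
                      if f not in cross}
        beam = sorted(candidates, key=score_fn, reverse=True)[:beam_width]
        selected.extend(beam)
    return selected

# Toy scoring: each base feature carries a fixed usefulness weight
weights = {'a': 3, 'b': 2, 'c': 1}
crosses = beam_search_crosses(['a', 'b', 'c'],
                              score_fn=lambda cross: sum(weights[f] for f in cross))
```

Only crosses that survive the beam at order 2 are ever extended to order 3, which is what keeps the search tractable compared with enumerating every combination.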

Time‑series feature derivation extracts temporal information from datetime fields (year, month, day, hour, minute, second) and natural cycles (quarter, week‑of‑year, day‑of‑week, weekend flag, time‑of‑day bucket). Example code:

t['is_weekend'] = (t['dayofweek'] >= 5).astype(int)  # pandas uses Monday=0, so 5 and 6 are the weekend

t['hour_section'] = (t['hour'] // 6).astype(int)  # four six-hour buckets

Additional derived features can measure the distance to key timestamps (e.g., days since registration).
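Put together, the datetime derivations above might look like this (timestamps and the registration date are illustrative):

```python
import pandas as pd

t = pd.DataFrame({'ts': pd.to_datetime(['2023-01-02 08:30:00',
                                        '2023-06-17 22:15:00'])})

# Basic datetime components
t['year'] = t['ts'].dt.year
t['month'] = t['ts'].dt.month
t['hour'] = t['ts'].dt.hour

# Natural cycles
t['quarter'] = t['ts'].dt.quarter
t['dayofweek'] = t['ts'].dt.dayofweek            # Monday=0 ... Sunday=6
t['is_weekend'] = (t['dayofweek'] >= 5).astype(int)
t['time_of_day'] = (t['hour'] // 6).astype(int)  # four six-hour buckets

# Distance to a key timestamp, e.g. days since a (hypothetical) registration date
registered = pd.Timestamp('2023-01-01')
t['days_since_reg'] = (t['ts'] - registered).dt.days
```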

Text feature derivation typically starts with word‑vector conversion such as bag‑of‑words or TF‑IDF, turning each document into a numeric vector for downstream modeling.
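A tiny pure-Python illustration of bag-of-words counts and a smoothed TF-IDF weighting (the smoothing formula mirrors scikit-learn's in spirit; the three documents are made up):

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Bag-of-words: token counts per document over a shared vocabulary
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
counts = [Counter(toks) for toks in tokenized]

# Smoothed inverse document frequency per word
n_docs = len(docs)
idf = {w: math.log((1 + n_docs) / (1 + sum(w in c for c in counts))) + 1
       for w in vocab}

# One TF-IDF vector per document, in vocabulary order
tfidf = [[c[w] * idf[w] for w in vocab] for c in counts]
```

Words that appear in every document (like "the") get the minimum weight, while rarer words score higher, which is the point of TF-IDF over raw counts.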

The general automated feature‑derivation workflow consists of four steps: data re‑encoding, single‑variable derivation, cross‑combination derivation, and group‑statistics derivation.
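The four steps can be chained into one pass over a table; a minimal sketch with illustrative column names (one representative derivation per step, not an exhaustive pipeline):

```python
import pandas as pd

def derive_features(df, num_col, cat_a, cat_b):
    """One pass through the four-step workflow."""
    out = df.copy()
    # 1. Data re-encoding: ordinal-encode the categorical columns
    for c in (cat_a, cat_b):
        out[c + '_code'] = out[c].astype('category').cat.codes
    # 2. Single-variable derivation: a polynomial feature
    out[num_col + '_sq'] = out[num_col] ** 2
    # 3. Cross-combination derivation
    out['cross'] = out[cat_a].astype(str) + '_' + out[cat_b].astype(str)
    # 4. Group-statistics derivation
    out['grp_mean'] = out.groupby(cat_a)[num_col].transform('mean')
    return out

demo = pd.DataFrame({'amt': [1.0, 2.0, 3.0, 4.0],
                     'seg': ['x', 'x', 'y', 'y'],
                     'chan': ['app', 'web', 'app', 'web']})
feat = derive_features(demo, 'amt', 'seg', 'chan')
```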

Tags: machine learning, feature engineering, data preprocessing, time series, automated features, feature derivation, text features
Written by HelloTech, the official Hello technology account, sharing tech insights and developments.