
Feature Engineering: Concepts, Methods, and Automation

Feature engineering transforms existing data into new predictive variables, either through manual analysis or automated pipelines. It spans single-variable encoding, pairwise arithmetic, group statistics, multi-variable combinations, and time-series and text derivations, with tools such as Deep Feature Synthesis and beam search to generate and select useful features.


Feature derivation (feature engineering) refers to creating new features from existing data. It can be divided into two main approaches: manual feature derivation, which relies on deep business and data analysis to synthesize interpretable fields, and batch (automated) feature derivation, which uses engineering tricks to generate a large pool of features and then selects useful ones.

The essence of feature derivation is the re‑arrangement of existing information. Manual methods start from ideas, analyze business background or data distribution, and then create features, while batch methods focus on systematic column‑wise transformations to produce as many features as possible.

Feature derivation methods are typically grouped into four categories: single‑variable derivation, double‑variable (or pairwise) derivation, key‑feature derivation, and multi‑variable derivation.

Single‑variable derivation includes data re‑encoding techniques such as:

Normalization (0‑1 min‑max scaling) and standardization (Z‑score)

Discretization (equal‑width, equal‑frequency, clustering bins)

Dictionary encoding and one‑hot encoding for categorical variables

Embedding encoding for high‑cardinality IDs, including graph‑based embeddings

High‑order polynomial features can also be generated from a single variable (e.g., square, cube).
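The single-variable techniques above can be sketched in a few lines of pandas and numpy (the `income` and `city` columns are illustrative, not from the original text):

```python
import pandas as pd

df = pd.DataFrame({'income': [20, 35, 50, 80, 120],
                   'city': ['a', 'b', 'a', 'c', 'b']})

# 0-1 (min-max) scaling and Z-score standardization
df['income_01'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())
df['income_z'] = (df['income'] - df['income'].mean()) / df['income'].std()

# Discretization: equal-width and equal-frequency bins
df['income_widthbin'] = pd.cut(df['income'], bins=3, labels=False)
df['income_freqbin'] = pd.qcut(df['income'], q=3, labels=False)

# Dictionary (ordinal) encoding and one-hot encoding for a categorical column
df['city_code'] = df['city'].astype('category').cat.codes
df = pd.concat([df, pd.get_dummies(df['city'], prefix='city')], axis=1)

# High-order polynomial features from a single column
df['income_sq'] = df['income'] ** 2
df['income_cube'] = df['income'] ** 3
```

Embedding encoding for high-cardinality IDs needs a trained model (or a graph-embedding library) and is not shown here.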

Double‑variable derivation often uses arithmetic operations between two columns. Typical use cases are:

Creating business supplement fields (e.g., total daily spend)

Combining all derived features after a full derivation pipeline

Special competition features such as golden‑combination or traffic‑smoothing features

Example code for common arithmetic derivations (assuming numeric `cost`, `mean`, and `var` columns; the small constant guards against division by zero):

df['cost_mean_ratio'] = df['cost'] / (df['mean'] + 1e-5)

df['cost_mean_diff'] = df['cost'] - df['mean']

df['cost_mean_z'] = (df['cost'] - df['mean']) / (np.sqrt(df['var']) + 1e-5)

Cross‑combination features are created by pairing the levels of categorical variables; the number of resulting features grows with the product of the variables' cardinalities and with the number of variables combined, so they should be used judiciously.
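A minimal pandas sketch of a two-way cross-combination (the `gender` and `plan` columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                   'plan':   ['basic', 'basic', 'pro', 'pro']})

# Pair the levels of two categorical columns into one combined feature
df['gender_x_plan'] = df['gender'].astype(str) + '_' + df['plan'].astype(str)

# One-hot encoding the combination yields up to (levels of gender) x
# (levels of plan) columns, which is why crosses must be used sparingly
cross = pd.get_dummies(df['gender_x_plan'], prefix='cross')
```

Here two binary columns already produce four cross columns; three ten-level columns would produce up to a thousand.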

Group‑statistics derivation aggregates a target variable based on the values of a grouping key (e.g., mean, variance, median). Example code:

def q1(x):
    """Lower quartile (25th percentile)."""
    return x.quantile(0.25)

def q2(x):
    """Upper quartile (75th percentile)."""
    return x.quantile(0.75)

d1 = pd.DataFrame({'x1': [3, 2, 4, 4, 2, 2],
                   'x2': [0, 1, 1, 0, 0, 0]})
aggs = {'x1': [q1, q2]}
d2 = d1.groupby('x2').agg(aggs).reset_index()

After group statistics, further arithmetic can be applied, e.g., flow‑smoothing, golden‑combination, or intra‑group normalization.
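For example, intra-group normalization and a flow-smoothing-style ratio can be built by broadcasting group statistics back to each row with `transform` (the `user`/`cost` columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                   'cost': [10.0, 30.0, 100.0, 300.0]})

# Group statistics broadcast back to each row
grp = df.groupby('user')['cost']
df['cost_grp_mean'] = grp.transform('mean')
df['cost_grp_std'] = grp.transform('std')

# Intra-group normalization: each row's z-score within its own group
df['cost_grp_z'] = (df['cost'] - df['cost_grp_mean']) / (df['cost_grp_std'] + 1e-5)

# Flow-smoothing-style ratio: row value relative to its group mean
df['cost_grp_ratio'] = df['cost'] / (df['cost_grp_mean'] + 1e-5)
```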

Multi‑variable derivation extends double‑variable techniques to three or more columns, including cross‑combination, group‑statistics, and higher‑order polynomial features. The same caution about feature explosion applies.

Automated feature derivation includes algorithms such as Deep Feature Synthesis (DFS) used by FeatureTools. DFS builds a relational graph of tables, follows relationship paths, and applies aggregation functions to generate features automatically. Relation types include forward (one‑to‑one) and backward (one‑to‑many) links.
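FeatureTools implements DFS for you; the core of a backward (one-to-many) relationship can be sketched in plain pandas as a child-to-parent aggregation (the `customers`/`orders` tables and column names are illustrative, not FeatureTools API):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2]})
orders = pd.DataFrame({'order_id': [10, 11, 12],
                       'customer_id': [1, 1, 2],
                       'amount': [5.0, 7.0, 3.0]})

# Backward (one-to-many) link: aggregate child rows up to the parent,
# which is what DFS does along each relationship path in the table graph
agg = (orders.groupby('customer_id')['amount']
             .agg(['mean', 'sum', 'count'])
             .add_prefix('orders_amount_')
             .reset_index())

# Forward (one-to-one) link back to the parent table is a plain join
features = customers.merge(agg, on='customer_id', how='left')
```

DFS generalizes this by stacking such aggregations along deeper relationship paths, e.g. "mean of the sums of a customer's orders' line items".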

Beam Search is employed by AutoCross to greedily generate higher‑order features from promising second‑order combinations, followed by feature selection methods like field‑wise logistic regression.
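The beam-search idea can be sketched generically: at each order, keep only the top-scoring crosses and extend just those (a toy sketch, not the AutoCross implementation; the scoring function stands in for field-wise model evaluation):

```python
from itertools import combinations

def beam_search_crosses(base_features, score_fn, beam_width=2, max_order=3):
    """Greedy beam search over feature crosses: keep only the best
    beam_width candidates at each order and extend only those."""
    # Order-2 candidates: all pairs of base features
    beam = sorted((tuple(sorted(c)) for c in combinations(base_features, 2)),
                  key=score_fn, reverse=True)[:beam_width]
    selected = list(beam)
    for _ in range(3, max_order + 1):
        # Extend each surviving cross by one more base feature
        candidates = {tuple(sorted(set(cross) | {f}))
                      for cross in beam for f in base_features
                      if f not in cross}
        beam = sorted(candidates, key=score_fn, reverse=True)[:beam_width]
        selected.extend(beam)
    return selected

# Toy scoring: each base feature carries a fixed usefulness weight
weights = {'a': 3, 'b': 2, 'c': 1}
crosses = beam_search_crosses(['a', 'b', 'c'],
                              score_fn=lambda cross: sum(weights[f] for f in cross))
```

Only crosses that survive the beam at order 2 are ever extended to order 3, which is what keeps the search tractable compared with enumerating every combination.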

Time‑series feature derivation extracts temporal information from datetime fields (year, month, day, hour, minute, second) and natural cycles (quarter, week‑of‑year, day‑of‑week, weekend flag, time‑of‑day bucket). Example code:

t['is_weekend'] = (t['dayofweek'] >= 5).astype(int)  # pandas uses Monday=0, so 5 and 6 are the weekend

t['hour_section'] = (t['hour'] // 6).astype(int)  # four six-hour buckets

Additional derived features can measure the distance to key timestamps (e.g., days since registration).
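Put together, the datetime derivations above might look like this (timestamps and the registration date are illustrative):

```python
import pandas as pd

t = pd.DataFrame({'ts': pd.to_datetime(['2023-01-02 08:30:00',
                                        '2023-06-17 22:15:00'])})

# Basic datetime components
t['year'] = t['ts'].dt.year
t['month'] = t['ts'].dt.month
t['hour'] = t['ts'].dt.hour

# Natural cycles
t['quarter'] = t['ts'].dt.quarter
t['dayofweek'] = t['ts'].dt.dayofweek            # Monday=0 ... Sunday=6
t['is_weekend'] = (t['dayofweek'] >= 5).astype(int)
t['time_of_day'] = (t['hour'] // 6).astype(int)  # four six-hour buckets

# Distance to a key timestamp, e.g. days since a (hypothetical) registration date
registered = pd.Timestamp('2023-01-01')
t['days_since_reg'] = (t['ts'] - registered).dt.days
```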

Text feature derivation typically starts with word‑vector conversion such as bag‑of‑words or TF‑IDF, turning each document into a numeric vector for downstream modeling.
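A tiny pure-Python illustration of bag-of-words counts and a smoothed TF-IDF weighting (the smoothing formula mirrors scikit-learn's in spirit; the three documents are made up):

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Bag-of-words: token counts per document over a shared vocabulary
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
counts = [Counter(toks) for toks in tokenized]

# Smoothed inverse document frequency per word
n_docs = len(docs)
idf = {w: math.log((1 + n_docs) / (1 + sum(w in c for c in counts))) + 1
       for w in vocab}

# One TF-IDF vector per document, in vocabulary order
tfidf = [[c[w] * idf[w] for w in vocab] for c in counts]
```

Words that appear in every document (like "the") get the minimum weight, while rarer words score higher, which is the point of TF-IDF over raw counts.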

The general automated feature‑derivation workflow consists of four steps: data re‑encoding, single‑variable derivation, cross‑combination derivation, and group‑statistics derivation.
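The four steps can be chained into one pass over a table; a minimal sketch with illustrative column names (one representative derivation per step, not an exhaustive pipeline):

```python
import pandas as pd

def derive_features(df, num_col, cat_a, cat_b):
    """One pass through the four-step workflow."""
    out = df.copy()
    # 1. Data re-encoding: ordinal-encode the categorical columns
    for c in (cat_a, cat_b):
        out[c + '_code'] = out[c].astype('category').cat.codes
    # 2. Single-variable derivation: a polynomial feature
    out[num_col + '_sq'] = out[num_col] ** 2
    # 3. Cross-combination derivation
    out['cross'] = out[cat_a].astype(str) + '_' + out[cat_b].astype(str)
    # 4. Group-statistics derivation
    out['grp_mean'] = out.groupby(cat_a)[num_col].transform('mean')
    return out

demo = pd.DataFrame({'amt': [1.0, 2.0, 3.0, 4.0],
                     'seg': ['x', 'x', 'y', 'y'],
                     'chan': ['app', 'web', 'app', 'web']})
feat = derive_features(demo, 'amt', 'seg', 'chan')
```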

Tags: machine learning, feature engineering, data preprocessing, time series, automated features, feature derivation, text features
Written by HelloTech, the official Hello technology account, sharing tech insights and developments.