Using scikit-learn for Data Mining: Feature Engineering, Parallel Processing, Pipelines, and Model Persistence
This article demonstrates how to perform data mining with scikit-learn by detailing the full workflow—from data acquisition and feature engineering, through parallel and pipeline processing, to automated hyper‑parameter tuning and model persistence—using the Iris dataset as an example.
Data mining typically involves data acquisition, analysis, feature engineering, model training, and evaluation. The article uses scikit-learn (sklearn) to illustrate these steps on a modified Iris dataset, emphasizing the library's design, which separates the fit, transform, and fit_transform methods.
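To make the three-method design concrete, here is a minimal sketch (not from the article) using StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# fit() learns statistics from the data (here: per-column mean and std)
scaler = StandardScaler().fit(X)

# transform() applies the learned statistics to data of the same shape
X_scaled = scaler.transform(X)

# fit_transform() combines both steps in a single call
X_scaled2 = StandardScaler().fit_transform(X)

assert np.allclose(X_scaled, X_scaled2)
```

The separation matters in practice: fit on the training set only, then reuse transform on validation and test data so no statistics leak from them.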
Transformations are categorized as no‑information (e.g., log, exponential), unsupervised (e.g., standardization, PCA), and supervised (e.g., LDA, feature selection). No‑information transformations learn nothing from the data; unsupervised transformations have a fit method that extracts statistics from the features alone; only supervised transformations have a fit method that uses both the features and the target values.
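One transformer per category, sketched on the standard Iris dataset (this example is illustrative and not from the article):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# No-information: a fixed function; nothing is learned from the data
X_log = FunctionTransformer(np.log1p).fit_transform(X)

# Unsupervised: fit() learns mean/std from the features only
X_std = StandardScaler().fit_transform(X)

# Supervised: fit() needs the target y to score each feature
X_top2 = SelectKBest(chi2, k=2).fit_transform(X, y)

print(X_log.shape, X_std.shape, X_top2.shape)  # (150, 4) (150, 4) (150, 2)
```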
Parallel processing can be applied either to the whole feature matrix (overall parallelism) or to selected columns (partial parallelism). sklearn provides FeatureUnion for overall parallelism, while the article's custom FeatureUnionExt class extends it to partial parallelism by routing specified column indices to each transformer. (Recent scikit-learn releases offer ColumnTransformer for the same column-wise routing out of the box.)
Example code for overall parallel processing:
from numpy import log1p
from sklearn.preprocessing import FunctionTransformer, Binarizer
from sklearn.pipeline import FeatureUnion
step2_1 = ('ToLog', FunctionTransformer(log1p))
step2_2 = ('ToBinary', Binarizer())
step2 = ('FeatureUnion', FeatureUnion(transformer_list=[step2_1, step2_2]))

Example code for the custom partial parallel class:
from sklearn.pipeline import FeatureUnion, _fit_one_transformer, _fit_transform_one, _transform_one
from sklearn.externals.joblib import Parallel, delayed
from scipy import sparse
import numpy as np

# Note: the private _fit_*/_transform_* helpers and sklearn.externals.joblib
# come from the older scikit-learn release the article targets; both have
# since changed in newer releases.
class FeatureUnionExt(FeatureUnion):
    """FeatureUnion variant that routes a list of column indices to each transformer."""

    def __init__(self, transformer_list, idx_list, n_jobs=1, transformer_weights=None):
        self.idx_list = idx_list
        super(FeatureUnionExt, self).__init__(
            transformer_list=transformer_list,
            n_jobs=n_jobs,
            transformer_weights=transformer_weights)

    def _iter_with_idx(self):
        # Pair each (name, transformer) entry with its column indices
        return [(name, trans, idx)
                for (name, trans), idx in zip(self.transformer_list, self.idx_list)]

    def fit(self, X, y=None):
        transformers = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_one_transformer)(trans, X[:, idx], y)
            for name, trans, idx in self._iter_with_idx())
        self._update_transformer_list(transformers)
        return self

    def fit_transform(self, X, y=None, **fit_params):
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(trans, name, X[:, idx], y,
                                        self.transformer_weights, **fit_params)
            for name, trans, idx in self._iter_with_idx())
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            return sparse.hstack(Xs).tocsr()
        return np.hstack(Xs)

    def transform(self, X):
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, name, X[:, idx], self.transformer_weights)
            for name, trans, idx in self._iter_with_idx())
        if any(sparse.issparse(f) for f in Xs):
            return sparse.hstack(Xs).tocsr()
        return np.hstack(Xs)

Using these parallel tools, the article builds a complete pipeline that includes missing‑value imputation, mixed feature processing (one‑hot encoding, log transformation, binarization), scaling, chi‑square feature selection, PCA dimensionality reduction, and a logistic‑regression classifier:
from numpy import log1p
# Imputer ships with the older scikit-learn targeted here; newer releases use sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer, OneHotEncoder, FunctionTransformer, Binarizer, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
step1 = ('Imputer', Imputer())
step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
step2_2 = ('ToLog', FunctionTransformer(log1p))
step2_3 = ('ToBinary', Binarizer())
step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1,2,3], [4]]))
step3 = ('MinMaxScaler', MinMaxScaler())
step4 = ('SelectKBest', SelectKBest(chi2, k=3))
step5 = ('PCA', PCA(n_components=2))
step6 = ('LogisticRegression', LogisticRegression(penalty='l2'))
pipeline = Pipeline(steps=[step1, step2, step3, step4, step5, step6])

For automated hyper‑parameter tuning, the article employs GridSearchCV to search over the binarizer threshold and the logistic‑regression regularization parameter:
from sklearn.grid_search import GridSearchCV  # moved to sklearn.model_selection in scikit-learn 0.18+
grid_search = GridSearchCV(pipeline, param_grid={
'FeatureUnionExt__ToBinary__threshold': [1.0, 2.0, 3.0, 4.0],
'LogisticRegression__C': [0.1, 0.2, 0.4, 0.8]
})
grid_search.fit(iris.data, iris.target)

Model persistence is achieved with joblib.dump and joblib.load, allowing the trained grid-search object to be saved to disk and reloaded without retraining:
# Save the trained object
from sklearn.externals import joblib
joblib.dump(grid_search, 'grid_search.dmp', compress=3)
# Load it later
grid_search = joblib.load('grid_search.dmp')

The article notes that objects containing lambda functions cannot be pickled, which is a limitation to keep in mind when persisting scikit-learn pipelines.
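The lambda limitation is easy to demonstrate with the standard pickle module (the serialization mechanism joblib builds on); using a named function such as numpy's log1p avoids it. A minimal sketch, not from the article:

```python
import pickle
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A transformer built from a named, importable function round-trips fine
ok = FunctionTransformer(np.log1p)
restored = pickle.loads(pickle.dumps(ok))

# The same transformer built from a lambda cannot be serialized,
# because pickle stores functions by importable name
bad = FunctionTransformer(lambda x: np.log1p(x))
try:
    pickle.dumps(bad)
except Exception as exc:
    print('cannot persist:', type(exc).__name__)
```

The practical takeaway: prefer module-level named functions over lambdas anywhere inside a pipeline you intend to persist.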