Using scikit-learn for Data Mining: Feature Engineering, Parallel Processing, Pipelines, and Model Persistence
This article demonstrates how to perform data mining with scikit-learn by detailing the full workflow—from data acquisition and feature engineering, through parallel and pipeline processing, to automated hyper‑parameter tuning and model persistence—using the Iris dataset as an example.
Data mining typically involves data acquisition, analysis, feature engineering, model training, and evaluation. The article uses scikit-learn (sklearn) to illustrate these steps on a modified Iris dataset, emphasizing the library's design, which separates the fit, transform, and fit_transform methods.
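To make the three-method design concrete, here is a minimal sketch (not from the article) using StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# fit() learns statistics from the data (here: per-column mean and std)
scaler = StandardScaler().fit(X)

# transform() applies the learned statistics to data of the same shape
X_scaled = scaler.transform(X)

# fit_transform() combines both steps in a single call
X_scaled2 = StandardScaler().fit_transform(X)

assert np.allclose(X_scaled, X_scaled2)
```

The separation matters in practice: fit on the training set only, then reuse transform on validation and test data so no statistics leak from them.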
Transformations are categorized as no‑information (e.g., log, exponential), unsupervised (e.g., standardization, PCA), and supervised (e.g., LDA, feature selection). No‑information transformations learn nothing from the data; unsupervised transformations have a fit method that extracts statistics from the features alone; only supervised transformations have a fit method that uses both the features and the target values.
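One transformer per category, sketched on the standard Iris dataset (this example is illustrative and not from the article):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# No-information: a fixed function; nothing is learned from the data
X_log = FunctionTransformer(np.log1p).fit_transform(X)

# Unsupervised: fit() learns mean/std from the features only
X_std = StandardScaler().fit_transform(X)

# Supervised: fit() needs the target y to score each feature
X_top2 = SelectKBest(chi2, k=2).fit_transform(X, y)

print(X_log.shape, X_std.shape, X_top2.shape)  # (150, 4) (150, 4) (150, 2)
```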
Parallel processing can be applied either to the whole feature matrix (overall parallelism) or to selected columns (partial parallelism). sklearn provides FeatureUnion for overall parallelism, while the article's custom FeatureUnionExt class extends it to partial parallelism by routing specified column indices to each transformer. (Recent scikit-learn releases offer ColumnTransformer for the same column-wise routing out of the box.)
Example code for overall parallel processing:
from numpy import log1p
from sklearn.preprocessing import FunctionTransformer, Binarizer
from sklearn.pipeline import FeatureUnion
step2_1 = ('ToLog', FunctionTransformer(log1p))
step2_2 = ('ToBinary', Binarizer())
step2 = ('FeatureUnion', FeatureUnion(transformer_list=[step2_1, step2_2]))

Example code for the custom partial parallel class:
from sklearn.pipeline import FeatureUnion, _fit_one_transformer, _fit_transform_one, _transform_one
from sklearn.externals.joblib import Parallel, delayed
from scipy import sparse
import numpy as np

# Note: the private _fit_*/_transform_* helpers and sklearn.externals.joblib
# come from the older scikit-learn release the article targets; both have
# since changed in newer releases.
class FeatureUnionExt(FeatureUnion):
    """FeatureUnion variant that routes a list of column indices to each transformer."""

    def __init__(self, transformer_list, idx_list, n_jobs=1, transformer_weights=None):
        self.idx_list = idx_list
        super(FeatureUnionExt, self).__init__(
            transformer_list=transformer_list,
            n_jobs=n_jobs,
            transformer_weights=transformer_weights)

    def _iter_with_idx(self):
        # Pair each (name, transformer) entry with its column indices
        return [(name, trans, idx)
                for (name, trans), idx in zip(self.transformer_list, self.idx_list)]

    def fit(self, X, y=None):
        transformers = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_one_transformer)(trans, X[:, idx], y)
            for name, trans, idx in self._iter_with_idx())
        self._update_transformer_list(transformers)
        return self

    def fit_transform(self, X, y=None, **fit_params):
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(trans, name, X[:, idx], y,
                                        self.transformer_weights, **fit_params)
            for name, trans, idx in self._iter_with_idx())
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            return sparse.hstack(Xs).tocsr()
        return np.hstack(Xs)

    def transform(self, X):
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, name, X[:, idx], self.transformer_weights)
            for name, trans, idx in self._iter_with_idx())
        if any(sparse.issparse(f) for f in Xs):
            return sparse.hstack(Xs).tocsr()
        return np.hstack(Xs)

Using these parallel tools, the article builds a complete pipeline that includes missing‑value imputation, mixed feature processing (one‑hot encoding, log transformation, binarization), scaling, chi‑square feature selection, PCA dimensionality reduction, and a logistic‑regression classifier:
from numpy import log1p
# Imputer ships with the older scikit-learn targeted here; newer releases use sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer, OneHotEncoder, FunctionTransformer, Binarizer, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
step1 = ('Imputer', Imputer())
step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
step2_2 = ('ToLog', FunctionTransformer(log1p))
step2_3 = ('ToBinary', Binarizer())
step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1,2,3], [4]]))
step3 = ('MinMaxScaler', MinMaxScaler())
step4 = ('SelectKBest', SelectKBest(chi2, k=3))
step5 = ('PCA', PCA(n_components=2))
step6 = ('LogisticRegression', LogisticRegression(penalty='l2'))
pipeline = Pipeline(steps=[step1, step2, step3, step4, step5, step6])

For automated hyper‑parameter tuning, the article employs GridSearchCV to search over the binarizer threshold and the logistic‑regression regularization parameter:
from sklearn.grid_search import GridSearchCV  # moved to sklearn.model_selection in scikit-learn 0.18+
grid_search = GridSearchCV(pipeline, param_grid={
'FeatureUnionExt__ToBinary__threshold': [1.0, 2.0, 3.0, 4.0],
'LogisticRegression__C': [0.1, 0.2, 0.4, 0.8]
})
grid_search.fit(iris.data, iris.target)

Model persistence is achieved with joblib.dump and joblib.load, allowing the trained grid-search object to be saved to disk and reloaded without retraining:
# Save the trained object
from sklearn.externals import joblib
joblib.dump(grid_search, 'grid_search.dmp', compress=3)
# Load it later
grid_search = joblib.load('grid_search.dmp')

The article notes that objects containing lambda functions cannot be pickled, which is a limitation to keep in mind when persisting scikit-learn pipelines.
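The lambda limitation is easy to demonstrate with the standard pickle module (the serialization mechanism joblib builds on); using a named function such as numpy's log1p avoids it. A minimal sketch, not from the article:

```python
import pickle
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A transformer built from a named, importable function round-trips fine
ok = FunctionTransformer(np.log1p)
restored = pickle.loads(pickle.dumps(ok))

# The same transformer built from a lambda cannot be serialized,
# because pickle stores functions by importable name
bad = FunctionTransformer(lambda x: np.log1p(x))
try:
    pickle.dumps(bad)
except Exception as exc:
    print('cannot persist:', type(exc).__name__)
```

The practical takeaway: prefer module-level named functions over lambdas anywhere inside a pipeline you intend to persist.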