Automating Machine Learning Workflows with Scikit‑Learn Pipelines

This article demonstrates how to build a reproducible fraud‑detection workflow using scikit‑learn's Pipeline class, comparing a manual script with a pipeline‑based approach on the IEEE‑CIS Kaggle dataset and showing the benefits of modular, repeatable ML code.


This article introduces how to use scikit‑learn's Pipeline class to construct a machine‑learning workflow.

Data

The IEEE‑CIS Fraud Detection dataset from Kaggle is used. The data consists of two CSV files—transactions and identities—joined on TransactionID. The files are loaded into pandas and merged:

import pandas as pd

# load the transaction and identity tables
train_transaction = pd.read_csv("../data/train_transaction.csv")
train_identity = pd.read_csv("../data/train_identity.csv")
# left-join identities onto transactions on TransactionID
dataframe = pd.merge(train_transaction,
                     train_identity,
                     how="left",
                     on="TransactionID")
dataframe.head()

The merged dataframe is displayed (image omitted).

The target label isFraud defines a binary classification problem. Model performance is evaluated with ROC-AUC on a held-out test set.

Splitting the Data

from sklearn.model_selection import train_test_split

# stratified split to preserve the fraud/non-fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    dataframe.drop("isFraud", axis=1),
    dataframe["isFraud"],
    test_size=0.33,
    stratify=dataframe["isFraud"],
    random_state=25)
print(f"X_train shape: {X_train.shape}\nX_test shape: {X_test.shape}\n"
      f"y_train shape: {y_train.shape}\ny_test shape: {y_test.shape}")

The split yields 395,661 training instances and 194,879 test instances.
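Because fraud is a rare class, the stratify argument keeps the fraud ratio identical in both splits. A quick sanity check (a minimal sketch; the exact ratio depends on the dataset version) confirms this:

# confirm the class ratio is preserved by the stratified split (sketch)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))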

Why a Formal Workflow Matters

A typical ML workflow includes data preprocessing, feature engineering, feature selection, modeling, and evaluation. Without a structured pipeline, every change to preprocessing or model configuration requires manual edits across multiple scripts, and shipping a new model version means duplicating and re-running the entire code base.

Code Without a Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# copy the original dataframe
df1_ = dataframe.copy()
# separate features and target
X = df1_[all_features]
y = df1_["isFraud"]
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=25)
# impute frequent values for high‑cardinality categorical, categorical, and discrete features
imp_most_frequent = MostFrequentImputer(features=impute_freq)
imp_most_frequent.fit(X_train)
X_train = imp_most_frequent.transform(X_train)
X_test = imp_most_frequent.transform(X_test)
# aggregate high‑cardinality categoricals
aggregate_categoricals = AggregateCategorical(features=high_cardinality_cats)
aggregate_categoricals.fit(X_train)
X_train = aggregate_categoricals.transform(X_train)
X_test = aggregate_categoricals.transform(X_test)
# convert categoricals to codes
transform_dtype = CategoryConverter(features=cat_codes_cols)
transform_dtype.fit(X_train)
X_train = transform_dtype.transform(X_train)
X_test = transform_dtype.transform(X_test)
# impute mean for continuous features
imp_mean = MeanImputer(features=continuous_features)
imp_mean.fit(X_train)
X_train = imp_mean.transform(X_train)
X_test = imp_mean.transform(X_test)
# modelling
random_forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=25)
random_forest.fit(X_train, y_train)
train_preds = random_forest.predict_proba(X_train)
y_preds = random_forest.predict_proba(X_test)
# evaluate
preds_roc_auc_ = roc_auc_score(y_test, y_preds[:, 1])
train_roc_auc_ = roc_auc_score(y_train, train_preds[:, 1])
print(f"train ROC_AUC:{train_roc_auc_}
test ROC_AUC: {preds_roc_auc_}")

The above steps are:

1. Copy the retrieved raw dataframe.
2. Split the data into training and test sets.
3. Impute most-frequent values for high-cardinality categorical, categorical, and discrete features.
4. Aggregate high-cardinality features to reduce dimensionality.
5. Convert categorical features to the pandas category type and then to integer codes.
6. Impute the mean for missing continuous features.
7. Train a Random Forest model and make predictions.
8. Print the ROC-AUC scores.

The output (image omitted) shows the ROC‑AUC values for the non‑pipeline approach.
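Note that MostFrequentImputer, AggregateCategorical, CategoryConverter, and MeanImputer are custom transformers (and the feature lists such as impute_freq are defined outside the snippet); the article does not show their implementations. As a rough sketch, each one follows scikit-learn's fit/transform convention so it can later be reused inside a Pipeline. A hypothetical MostFrequentImputer might look like this:

from sklearn.base import BaseEstimator, TransformerMixin

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    """Sketch of a custom transformer: fills missing values in the given
    columns with the most frequent value learned from the training data."""

    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        # learn the mode of each configured column from the training set
        self.most_frequent_ = {col: X[col].mode()[0] for col in self.features}
        return self

    def transform(self, X):
        X = X.copy()
        for col, value in self.most_frequent_.items():
            X[col] = X[col].fillna(value)
        return X

Keeping fit and transform separate is what lets the same learned statistics be applied to both the training and test sets without leaking information from the test data.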

Building a Scikit‑Learn Pipeline

Scikit‑learn provides many classes to support workflow construction. The Pipeline class chains preprocessing steps with the estimator:

from sklearn.pipeline import Pipeline

# chain the preprocessing steps and the final estimator
fraud_detection_pipe = Pipeline([
    ("most_frequent_imputer", MostFrequentImputer(features=impute_freq)),
    ("aggregate_high_cardinality_features", AggregateCategorical(features=high_cardinality_cats)),
    ("get_categorical_codes", CategoryConverter(features=cat_codes_cols)),
    ("mean_imputer", MeanImputer(features=continuous_features)),
    ("random_forest", RandomForestClassifier(
        n_estimators=100, n_jobs=-1, random_state=25))
])

This definition lists the exact sequence of transformations and the final model.

Prediction with the Pipeline

# copy dataframe for prediction
df2_ = dataframe.copy()
X = df2_[all_features]
y = df2_["isFraud"]
# split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=25)
# fit pipeline
fraud_detection_pipe.fit(X_train, y_train)
# predict probabilities
train_preds = fraud_detection_pipe.predict_proba(X_train)
y_preds = fraud_detection_pipe.predict_proba(X_test)
# evaluate
preds_roc_auc_ = roc_auc_score(y_test, y_preds[:, 1])
train_roc_auc_ = roc_auc_score(y_train, train_preds[:, 1])
print(f"train ROC_AUC:{train_roc_auc_}
test ROC_AUC: {preds_roc_auc_}")

The resulting ROC‑AUC scores (image omitted) are higher and the code is considerably shorter, illustrating the benefits of a pipeline.
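A further benefit is that the whole workflow behaves as a single estimator, so it can be handed directly to scikit-learn's model-selection utilities. As a sketch (the parameter values below are illustrative, not from the article), hyperparameters of any step can be tuned through the step-name prefix:

from sklearn.model_selection import GridSearchCV

# tune the final estimator through the pipeline's step name (illustrative values)
param_grid = {"random_forest__n_estimators": [100, 200],
              "random_forest__max_depth": [None, 10]}
search = GridSearchCV(fraud_detection_pipe, param_grid,
                      scoring="roc_auc", cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)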

Reproducibility Challenge

Ensuring reproducibility—being able to recreate the exact model—is a key challenge in ML workflow development. Pipelines encapsulate all preprocessing and modeling steps, making experiments repeatable.
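One practical way to make a run repeatable is to persist the fitted pipeline as a single artifact. A minimal sketch using joblib (the file name is illustrative):

import joblib

# save the entire fitted pipeline (preprocessing + model) as one artifact
joblib.dump(fraud_detection_pipe, "fraud_detection_pipe.joblib")

# later, reload it and score new transactions with identical preprocessing
loaded_pipe = joblib.load("fraud_detection_pipe.joblib")
new_scores = loaded_pipe.predict_proba(X_test)[:, 1]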

Conclusion

The article explains the importance of automating machine‑learning workflows and provides a concrete example of building such a workflow with scikit‑learn's Pipeline class on a fraud‑detection dataset.
