19 Elegant Sklearn Tricks for More Efficient Machine Learning

This article presents 19 practical Sklearn tools, ranging from outlier detection to hyper-parameter search, that replace manual data-science steps. Each is illustrated with a concise code example, and some with performance comparisons.

Data STUDIO

1️⃣ EllipticEnvelope (covariance)

Detects outliers in Gaussian-distributed data. The example draws 50 samples from a normal distribution (mean 5, std 2), fits EllipticEnvelope, and predicts a label for each new value: 1 for inliers and -1 for outliers. Here the values 20, 10, and 13 are flagged as outliers.

import numpy as np
from sklearn.covariance import EllipticEnvelope
X = np.random.normal(loc=5, scale=2, size=50).reshape(-1,1)
ee = EllipticEnvelope(random_state=0)
_ = ee.fit(X)
test = np.array([6,8,20,4,5,6,10,13]).reshape(-1,1)
print(ee.predict(test))
# e.g. [ 1  1 -1  1  1  1 -1 -1] (exact labels depend on the random sample)

2️⃣ RFECV (feature_selection)

Recursive feature elimination with cross‑validation automatically discards irrelevant features. Using a synthetic regression dataset (10 informative of 15 features) and a Ridge estimator, RFECV reduces the feature matrix from 15 to 10 columns.

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
X, y = make_regression(n_samples=10000, n_features=15, n_informative=10)
rfecv = RFECV(estimator=Ridge(), cv=5)
_ = rfecv.fit(X, y)
print(rfecv.transform(X).shape)  # (10000, 10)

3️⃣ ExtraTrees (ensemble)

ExtraTreesRegressor is a more heavily randomized alternative to a random forest: instead of searching for the best split threshold, it draws random thresholds for each candidate feature and keeps the best of those, which reduces variance. On a synthetic regression set, the author reports a mean cross-validation score of 0.84 for ExtraTrees versus 0.64 for a standard RandomForest.

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
X, y = make_regression(n_samples=10000, n_features=20)
clf = DecisionTreeRegressor(random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
clf = RandomForestRegressor(n_estimators=10, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
clf = ExtraTreesRegressor(n_estimators=10, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())

4️⃣ IterativeImputer & KNNImputer (impute)

For missing‑value handling beyond SimpleImputer, KNNImputer fills gaps using the k‑nearest‑neighbors average, while IterativeImputer models each feature as a function of the others (e.g., with BayesianRidge) and iteratively predicts missing entries.

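import numpy as np
# The experimental flag must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.linear_model import BayesianRidge
X = [[1,2,np.nan],[3,4,3],[np.nan,6,5],[8,8,7]]
# Fill each gap with the mean of the 2 nearest rows
knn = KNNImputer(n_neighbors=2)
print(knn.fit_transform(X))
# Model each feature from the others and predict the missing entries
imp = IterativeImputer(estimator=BayesianRidge())
print(imp.fit_transform(X))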

5️⃣ HuberRegressor (linear_model)

Provides robust regression by down‑weighting outliers via an epsilon hyper‑parameter. Compared with Bayesian Ridge on a dataset containing extreme values, HuberRegressor with epsilon ≈ 1.5 yields a fit that is less influenced by the outliers.
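
The article's dataset for this comparison is not shown, so the following is only a minimal sketch on synthetic data. It assumes a true slope of 3.0 and five corrupted targets at the high end of the range; those numbers and all variable names are illustrative, not the author's.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, HuberRegressor

rng = np.random.RandomState(0)

# Assumed ground truth: y = 3x + 2 plus mild Gaussian noise
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=100)

# Corrupt the last five targets with extreme values
y_out = y.copy()
y_out[-5:] += 200.0

huber = HuberRegressor(epsilon=1.5).fit(X, y_out)
bayes = BayesianRidge().fit(X, y_out)

huber_slope = huber.coef_[0]
bayes_slope = bayes.coef_[0]
print(f"true slope 3.0 | Huber {huber_slope:.2f} | BayesianRidge {bayes_slope:.2f}")
```

Smaller epsilon values treat more samples as outliers; the estimator requires epsilon to be at least 1.0.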

6️⃣ plot_tree (tree)

Visualizes a single decision tree structure, useful for beginners learning tree‑based models. The function works with any fitted tree estimator, such as DecisionTreeClassifier on the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
iris = load_iris()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)
plt.figure(figsize=(15,10), dpi=200)
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names)

7️⃣ Perceptron (linear_model)

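A simple linear classifier that updates its weights only on misclassified samples. Perceptron() is equivalent to SGDClassifier(loss='perceptron', eta0=1, learning_rate='constant', penalty=None), and the author describes it as slightly faster. It is demonstrated on a synthetic 100k-sample binary classification problem, where it reaches a training accuracy of 0.92.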

from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
X, y = make_classification(n_samples=100000, n_features=20, n_classes=2)
clf = Perceptron()
clf.fit(X, y)
print(clf.score(X, y))

8️⃣ SelectFromModel (feature_selection)

A lightweight wrapper that selects features based on .feature_importances_ or .coef_. Using ExtraTreesRegressor on a dataset with 40 redundant features reduces the dimensionality to 8 important features.

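from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesRegressor
X, y = make_regression(n_samples=10000, n_features=50, n_informative=10)
# Keep features whose importance exceeds the default threshold (the mean importance)
selector = SelectFromModel(ExtraTreesRegressor()).fit(X, y)
print(selector.transform(X).shape)  # e.g. (10000, 8); the count varies between runs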

9️⃣ ConfusionMatrixDisplay (metrics)

Shows the default confusion matrix for binary classification and allows custom label ordering via ConfusionMatrixDisplay. Example builds a binary dataset, fits an ExtraTreeClassifier, and visualizes both the default and a custom‑ordered matrix.

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import ExtraTreeClassifier
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=200, n_features=5, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1121218)
clf = ExtraTreeClassifier().fit(X_train, y_train)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))
# Default matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=ax1)
# Custom order: put class 1 first so the display labels match the rows
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[1, 0])
disp = ConfusionMatrixDisplay(cm, display_labels=["Positive","Negative"])
disp.plot(ax=ax2)

🔟 Generalized Linear Models (linear_model)

Sklearn provides PoissonRegressor, TweedieRegressor, and GammaRegressor for targets that follow non-Gaussian distributions, such as counts or strictly positive, skewed amounts. They are better-suited alternatives to ordinary least squares when the response variable matches one of these families.
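
The article gives no code for this section, so here is a minimal sketch of PoissonRegressor on count data. The synthetic data, the coefficients (0.5, 1.0, -0.5), and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(1000, 2))

# Assumed log-linear mean: E[y] = exp(0.5 + 1.0*x0 - 0.5*x1)
mu = np.exp(0.5 + 1.0 * X[:, 0] - 0.5 * X[:, 1])
y = rng.poisson(mu)  # non-negative integer counts

# alpha=0.0 switches off the L2 penalty so the fit is a plain Poisson GLM
glm = PoissonRegressor(alpha=0.0).fit(X, y)
preds = glm.predict(X)

print("coefficients:", glm.coef_, "intercept:", glm.intercept_)
print("all predictions positive:", bool((preds > 0).all()))
```

Because the model uses a log link, its predictions are always positive, which ordinary least squares cannot guarantee for count targets.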

1️⃣1️⃣ IsolationForest (ensemble)

Detects anomalies by constructing extremely random trees; samples that traverse short paths across many trees are flagged as outliers. Demonstrated on a small array where the value 90 is correctly labeled -1.

from sklearn.ensemble import IsolationForest
import numpy as np
X = np.array([-1.1, 0.3, 0.5, 100]).reshape(-1,1)
clf = IsolationForest(random_state=0).fit(X)
print(clf.predict([[0.1],[0],[90]]))  # [1, 1, -1]

1️⃣2️⃣ PowerTransformer (preprocessing)

Applies a power transform (Yeo-Johnson by default, or Box-Cox for strictly positive data) to make skewed features more Gaussian, and standardizes the result by default. Illustrated on the Seaborn diamonds dataset, where the highly skewed price and carat columns become approximately normal after transformation.

from sklearn.preprocessing import PowerTransformer
import seaborn as sns
diamonds = sns.load_dataset("diamonds")
pt = PowerTransformer()
cols = ["price","carat"]
diamonds[cols] = pt.fit_transform(diamonds[cols])

1️⃣3️⃣ RobustScaler (preprocessing)

Scales features using median and inter‑quartile range, making it resistant to outliers. Suitable when data contain extreme values that would distort mean‑based scaling.
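
As a small illustration, the sketch below compares RobustScaler with StandardScaler on a toy column containing one extreme value. The data is made up for this example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values and one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

robust_scaler = RobustScaler().fit(X)
robust = robust_scaler.transform(X)
standard = StandardScaler().fit_transform(X)

# Spread of the five ordinary values after each scaling
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()

print("median used for centering:", robust_scaler.center_[0])
print("IQR used for scaling:", robust_scaler.scale_[0])
print("inlier spread (robust):", robust_spread)
print("inlier spread (standard):", standard_spread)
```

With StandardScaler the single extreme value inflates the mean and standard deviation, so the ordinary values collapse into a very narrow band; RobustScaler keeps them well separated.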

1️⃣4️⃣ make_column_transformer (compose)

Creates a ColumnTransformer without manually naming each step. Example combines StandardScaler for numeric columns and OneHotEncoder for categorical columns from the diamonds dataset.

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd, numpy as np
X = diamonds.drop("price", axis=1)
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(exclude=np.number).columns
transformer = make_column_transformer((StandardScaler(), num_cols), (OneHotEncoder(), cat_cols))

1️⃣5️⃣ make_column_selector (compose)

Generates column selectors based on dtype inclusion/exclusion, simplifying the creation of ColumnTransformer pipelines without explicit column name lists.

from sklearn.compose import make_column_transformer, make_column_selector
transformer = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(), make_column_selector(dtype_exclude=np.number))
)
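
To show the selectors resolving columns at fit time, here is a self-contained sketch on a tiny hand-made DataFrame. The frame stands in for the diamonds data, and its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the diamonds data: two numeric columns, one categorical
df = pd.DataFrame({
    "carat": [0.2, 0.3, 0.5, 0.7],
    "depth": [61.0, 62.5, 60.1, 63.2],
    "cut": ["Ideal", "Good", "Ideal", "Premium"],
})

transformer = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(), make_column_selector(dtype_exclude=np.number)),
)

# The selectors are evaluated here, against the DataFrame passed to fit
out = transformer.fit_transform(df)
print(out.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

Since the column lists are computed when fit is called, the same transformer can be reused on a DataFrame with different column names.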

1️⃣6️⃣ OrdinalEncoder (preprocessing)

Encodes ordered categorical features into integer values (0 to n‑1) and can be applied to multiple columns in a single call, unlike LabelEncoder which works only on a single target array.

from sklearn.preprocessing import OrdinalEncoder
X = [["class_1","rank_1"],["class_1","rank_3"],["class_3","rank_3"],["class_2","rank_2"]]
enc = OrdinalEncoder()
print(enc.fit_transform(X))
# [[0.,0.],[0.,2.],[2.,2.],[1.,1.]]
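
By default OrdinalEncoder assigns codes in sorted (alphabetical) order, which happens to match the intended order in the example above. When it does not, the order can be passed through the categories parameter. The sketch below uses a made-up size feature to show the difference.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [["small"], ["large"], ["medium"], ["small"]]

# Default: categories are sorted alphabetically (large < medium < small)
default_codes = OrdinalEncoder().fit_transform(X).ravel().tolist()

# Explicit order: small < medium < large
ordered = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered.fit_transform(X).ravel().tolist()

print(default_codes)  # [2.0, 0.0, 1.0, 2.0]
print(ordered_codes)  # [0.0, 2.0, 1.0, 0.0]
```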

1️⃣7️⃣ get_scorer (metrics)

Retrieves any scoring function by name without importing the metric directly, avoiding namespace pollution. Examples show fetching negative mean‑squared error, macro recall, and negative log loss.

from sklearn.metrics import get_scorer
print(get_scorer("neg_mean_squared_error"))
print(get_scorer("recall_macro"))
print(get_scorer("neg_log_loss"))
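
A returned scorer is a callable with the signature scorer(estimator, X, y). The sketch below, on assumed synthetic data, checks that the "neg_mean_squared_error" scorer returns the negated value of mean_squared_error.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import get_scorer, mean_squared_error

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = Ridge().fit(X, y)

# A scorer takes the fitted estimator, not its predictions
scorer = get_scorer("neg_mean_squared_error")
score = scorer(model, X, y)

# Same quantity computed directly from the metric function
mse = mean_squared_error(y, model.predict(X))
print(score, -mse)
```

Scorers follow a "greater is better" convention, which is why error metrics are exposed in negated form.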

1️⃣8️⃣ HalvingGridSearchCV & HalvingRandomSearchCV (model_selection)

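Experimental hyper-parameter optimizers based on successive halving. All candidates are first evaluated on a small amount of resources (by default, training samples); only the best fraction, one third with the default factor of 3, advances to the next round, which uses more resources. The author reports speed-ups of up to 11× over exhaustive grid search. Both classes must be enabled by importing enable_halving_search_cv from sklearn.experimental.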
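
The article's benchmark code is not included, so the following is only a minimal usage sketch. The dataset, estimator, and parameter grid are assumptions picked to keep the run short.

```python
# The experimental flag must be imported before the halving classes
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingGridSearchCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hypothetical grid: 3 x 3 = 9 candidates
param_grid = {"max_depth": [3, 5, None], "min_samples_split": [2, 5, 10]}

search = HalvingGridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    factor=3,        # keep roughly the best third each round
    cv=3,
    random_state=0,
).fit(X, y)

print("candidates per round:", search.n_candidates_)
print("samples per round:", search.n_resources_)
print("best parameters:", search.best_params_)
```

The n_candidates_ and n_resources_ attributes show how the candidate pool shrinks while the number of training samples grows from one round to the next.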

1️⃣9️⃣ sklearn.utils (utils)

The sklearn.utils subpackage offers many helper functions such as class_weight.compute_class_weight, estimator_html_repr, shuffle, and check_X_y, useful for building custom estimators that follow the Sklearn API.
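
A brief sketch of three of these helpers on toy arrays; the data is invented for illustration.

```python
import numpy as np
from sklearn.utils import check_X_y, shuffle
from sklearn.utils.class_weight import compute_class_weight

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])

# "balanced" weights: n_samples / (n_classes * class_count)
# class 0: 6 / (2 * 4) = 0.75, class 1: 6 / (2 * 2) = 1.5
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# Validate shapes and dtypes the way a custom estimator's fit() would
X_checked, y_checked = check_X_y(X, y)

# Shuffle X and y together so that rows stay aligned with their labels
X_shuf, y_shuf = shuffle(X, y, random_state=0)

print(weights)
print(X_shuf[:, 0] // 2, y_shuf)  # original row index next to its label
```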

