10 Essential Machine Learning Engineering Tips Every Data Scientist Should Know

This article shares ten practical Python‑focused machine‑learning engineering tips—from writing abstract classes and fixing random seeds to tracking progress, speeding up pandas, managing cloud costs, and building robust FastAPI services—helping developers write cleaner, more reproducible, and production‑ready code.

MaGe Linux Operations

Sep 12, 2020

Data scientists sometimes forget that they are developers first: their primary duty is to deliver working, bug‑free solutions quickly.

Being able to build models does not make you a god, and it is no excuse for writing sloppy code.

The ten tips below cover the skills the author uses most often in machine‑learning engineering.

Learning to Write Abstract Classes

Using abstract classes enforces consistent method names across subclasses, preventing chaos in collaborative projects.

import os
from abc import ABCMeta, abstractmethod

class DataProcessor(metaclass=ABCMeta):
    """Base processor to be used for all preparation."""
    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Processes raw data. This step should create the raw dataframe with all the required features. Shouldn't implement statistical or text cleaning."""

    @abstractmethod
    def save(self):
        """Saves processed data."""

class Trainer(metaclass=ABCMeta):
    """Base trainer to be used for all models."""
    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def preprocess(self):
        """This takes the preprocessed data and returns clean data. This is more about statistical or text cleaning."""

    @abstractmethod
    def set_model(self):
        """Define model here."""

    @abstractmethod
    def fit_model(self):
        """This takes the vectorised data and returns a trained model."""

    @abstractmethod
    def generate_metrics(self):
        """Generates metric with trained model and test data."""

    @abstractmethod
    def save_model(self, model_name):
        """This method saves the model in our required format."""

class Predict(metaclass=ABCMeta):
    """Base predictor to be used for all models."""
    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def preprocess(self):
        """This takes the raw data and returns clean data for prediction."""

    @abstractmethod
    def predict(self):
        """This is used for prediction."""

class BaseDB(metaclass=ABCMeta):
    """Base database class to be used for all DB connectors."""
    @abstractmethod
    def get_connection(self):
        """This creates a new DB connection."""

    @abstractmethod
    def close_connection(self):
        """This closes the DB connection."""
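A minimal sketch of what the base classes buy you, assuming a trimmed copy of DataProcessor so the snippet runs on its own. InMemoryProcessor and HalfDone are hypothetical subclasses invented for illustration: the first implements every abstract method, the second forgets two of them and fails as soon as it is instantiated.

```python
from abc import ABCMeta, abstractmethod


class DataProcessor(metaclass=ABCMeta):
    """Trimmed copy of the base class above, kept here so the sketch runs alone."""

    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self): ...

    @abstractmethod
    def process(self): ...

    @abstractmethod
    def save(self): ...


class InMemoryProcessor(DataProcessor):
    """Hypothetical subclass that implements all three required methods."""

    def read(self):
        self.rows = [" a ", " b "]

    def process(self):
        self.rows = [row.strip() for row in self.rows]

    def save(self):
        return list(self.rows)


class HalfDone(DataProcessor):
    """Hypothetical subclass that forgets process() and save()."""

    def read(self):
        self.rows = []


processor = InMemoryProcessor("raw/", "clean/")
processor.read()
processor.process()
saved = processor.save()

# Instantiating an incomplete subclass raises TypeError immediately,
# long before a missing method could be called in production.
try:
    HalfDone("raw/", "clean/")
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

The error appears at construction time rather than deep inside a pipeline run, which is the main practical benefit in a shared codebase.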

Fix the Random Seed

Reproducibility requires seeding every source of randomness (Python, NumPy, and PyTorch here); otherwise train/test splits and weight initializations differ from run to run.

import random

import numpy as np
import torch

def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        # Seed every GPU as well.
        torch.cuda.manual_seed_all(args.seed)
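As a quick check that seeding really makes runs repeatable, here is a small sketch that takes a plain integer instead of an args object. The names seed_everything and sample are hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed, which is an assumption made so that the snippet runs anywhere.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the standard library, plus NumPy and PyTorch when installed."""
    random.seed(seed)
    # Only affects hash randomization in subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


def sample(seed: int) -> list:
    """Draw three random numbers after seeding."""
    seed_everything(seed)
    return [random.random() for _ in range(3)]


# Same seed, same numbers; different seed, different numbers.
print(sample(42) == sample(42))
print(sample(42) == sample(43))
```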

Load a Small Amount of Data

When the dataset is large, pass the nrows argument to read only the first rows, so that test runs of the pipeline finish quickly.

import pandas as pd
df_train = pd.read_csv('train.csv', nrows=1000)

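Anticipate Failures (the Sign of a Mature Developer)

Always check your data for missing values; they cause hidden bugs later.

print(len(df))
print(df.isna().sum())  # missing values per column
df = df.dropna()        # dropna() returns a new DataFrame, so reassign it
print(len(df))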
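The same check can be wrapped in a small helper so that it runs at the start of every pipeline. This is only a sketch: drop_missing is a hypothetical name, and the toy DataFrame stands in for real data.

```python
import pandas as pd


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Report missing values per column, then return a copy without them."""
    missing = df.isna().sum()
    # Show only the columns that actually contain missing values.
    print(missing[missing > 0])
    cleaned = df.dropna()
    print(f"dropped {len(df) - len(cleaned)} of {len(df)} rows")
    return cleaned


# Toy data: one row has a missing age, another a missing city.
df = pd.DataFrame({"age": [31, None, 45], "city": ["Pune", "Oslo", None]})
clean = drop_missing(df)
```

Returning a new DataFrame instead of modifying the input keeps the original available for inspecting what was dropped.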

Show Processing Progress

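Use tqdm or fastprogress to visualize long‑running operations.

Option 1: tqdm

from tqdm import tqdm
import time

# Registers progress_apply on pandas objects.
tqdm.pandas()

# df is assumed to be an existing DataFrame with a numeric 'col' column.
df['col'] = df['col'].progress_apply(lambda x: x**2)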

text = ""
for char in tqdm(["a", "b", "c", "d"]):
    time.sleep(0.25)
    text = text + char

Option 2: fastprogress

from fastprogress.fastprogress import master_bar, progress_bar
from time import sleep
mb = master_bar(range(10))
for i in mb:
    for j in progress_bar(range(100), parent=mb):
        sleep(0.01)
        mb.child.comment = 'second bar stat'
    mb.main_bar.comment = 'first bar stat'  # older fastprogress releases call this mb.first_bar
    mb.write(f'Finished loop {i}.')

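Speed Up Slow pandas Code

modin.pandas mirrors the pandas API and parallelizes many operations across CPU cores, so swapping the import can speed up large workloads without other code changes. The gain depends on the operation and the data size, so measure it on your own workload.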

import modin.pandas as pd

Record Function Execution Time

Use a decorator to log how long functions take, helping spot hidden performance issues.

import time
from functools import wraps

def timing(f):
    """Decorator for timing functions
    Usage:
    @timing
    def function(a):
        pass
    """
    @wraps(f)
    def wrapper(*args, **kwargs):
        # perf_counter() is a monotonic clock intended for measuring intervals.
        start = time.perf_counter()
        result = f(*args, **kwargs)
        end = time.perf_counter()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result
    return wrapper

Don't Burn Money on the Cloud

Wrap the main routine in try/except and shut the instance down once the job finishes or fails, so that an idle machine does not keep running up cloud costs.

import math
import os

def run_command(cmd):
    return os.system(cmd)

def shutdown(seconds=0, os_name='linux'):
    """Shut down the system after the given delay. Useful for stopping EC2 instances to save costs."""
    if os_name == 'linux':
        # shutdown(8) takes its delay in minutes (+m), so round the seconds up.
        run_command('sudo shutdown -h +%d' % math.ceil(seconds / 60))
    elif os_name == 'windows':
        run_command('shutdown -s -t %s' % seconds)
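One way to wire this into a training script is a try/except with a finally clause, so that the shutdown runs whether the job succeeds or crashes. The sketch below is an assumption about how the pieces fit together: run_then_shutdown is a hypothetical helper, and the shutdown callable is passed in so the example can run without powering anything off.

```python
import traceback


def run_then_shutdown(job, shutdown_fn):
    """Run job(), log any exception, and always call shutdown_fn() afterwards."""
    try:
        return job()
    except Exception:
        # Keep the traceback in the logs; the machine is about to go away.
        traceback.print_exc()
        return None
    finally:
        shutdown_fn()


calls = []


def failing_job():
    raise RuntimeError("training crashed")


# A stand-in shutdown that only records that it was called.
result = run_then_shutdown(failing_job, lambda: calls.append("shutdown"))
ok = run_then_shutdown(lambda: "done", lambda: calls.append("shutdown"))
print(result, ok, calls)
```

In a real script, shutdown_fn would be the shutdown function above; passing it in as a parameter also makes the wrapper easy to test locally.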

Create and Save Report

After modeling, generate and store metrics and classification reports for stakeholders.

import json
import os
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score)

def get_metrics(y, y_pred, beta=2, average_method='macro', y_encoder=None):
    if y_encoder:
        y = y_encoder.inverse_transform(y)
        y_pred = y_encoder.inverse_transform(y_pred)
    return {
        'accuracy': round(accuracy_score(y, y_pred), 4),
        'f1_score_macro': round(f1_score(y, y_pred, average=average_method), 4),
        # beta must be passed by keyword in recent scikit-learn releases.
        'fbeta_score_macro': round(fbeta_score(y, y_pred, beta=beta, average=average_method), 4),
        'report': classification_report(y, y_pred, output_dict=True),
        # Assumption: the garbled escape sequences normalized line endings ('\n' to '\r\n').
        'report_csv': classification_report(y, y_pred, output_dict=False).replace('\n', '\r\n')
    }

def save_metrics(metrics: dict, model_directory, file_name):
    path = os.path.join(model_directory, file_name + '_report.txt')
    # classification_report_to_csv is a helper that the article does not show.
    classification_report_to_csv(metrics['report_csv'], path)
    metrics.pop('report_csv')
    path = os.path.join(model_directory, file_name + '_metrics.json')
    # Use a context manager so the file is closed even if json.dump fails.
    with open(path, 'w') as f:
        json.dump(metrics, f, indent=4)

Write a Good API

For classic ML and deep‑learning deployments under moderate load, combine FastAPI, Uvicorn, and Gunicorn.

FastAPI is one of the quickest ways to write an API, and it auto‑generates interactive docs at /docs.

Gunicorn manages multiple worker processes; keep at least two so that a single crashed or busy worker does not take the service down.

Start with four workers using the command below, then adjust the count based on load testing.

pip install fastapi uvicorn gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornH11Worker main:app
