10 Essential Machine Learning Engineering Tips Every Data Scientist Should Know
This article shares ten practical Python‑focused machine‑learning engineering tips—from writing abstract classes and fixing random seeds to tracking progress, speeding up pandas, managing cloud costs, and building robust FastAPI services—helping developers write cleaner, more reproducible, and production‑ready code.
Sometimes data scientists forget their original role as developers; the primary duty is to quickly deliver bug‑free solutions.
Being able to build models does not make you a god, and it is no excuse for writing sloppy code.
The author shares the most frequently used skills in machine‑learning engineering, presented as ten actionable tips.
Learning to Write Abstract Classes
Using abstract classes enforces consistent method names across subclasses, preventing chaos in collaborative projects.
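A minimal sketch of what that enforcement looks like at runtime, using a trimmed-down base class. The names `BaseProcessor`, `IncompleteProcessor`, `ListProcessor`, and `try_instantiate` are hypothetical and exist only for this illustration. A subclass that skips an abstract method cannot be instantiated at all, so the mistake surfaces immediately rather than deep inside a pipeline.

```python
from abc import ABCMeta, abstractmethod


class BaseProcessor(metaclass=ABCMeta):
    """Hypothetical, trimmed-down base class used only for this sketch."""

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def save(self):
        """Save processed data."""


class IncompleteProcessor(BaseProcessor):
    """Implements read() but forgets save()."""

    def read(self):
        return [1, 2, 3]


class ListProcessor(BaseProcessor):
    """Implements every abstract method, so it can be instantiated."""

    def __init__(self):
        self.saved = None

    def read(self):
        return [1, 2, 3]

    def save(self):
        self.saved = self.read()
        return self.saved


def try_instantiate(cls):
    """Return an instance, or the TypeError message if cls is still abstract."""
    try:
        return cls()
    except TypeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(try_instantiate(ListProcessor))        # a working instance
    print(try_instantiate(IncompleteProcessor))  # error text naming the missing method
```

The error message names the missing method, which makes a half-finished subclass easy to spot in code review or CI.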
import os
from abc import ABCMeta, abstractmethod


class DataProcessor(metaclass=ABCMeta):
    """Base processor to be used for all preparation."""

    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Processes raw data.

        This step should create the raw dataframe with all the required
        features. Shouldn't implement statistical or text cleaning.
        """

    @abstractmethod
    def save(self):
        """Saves processed data."""


class Trainer(metaclass=ABCMeta):
    """Base trainer to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def preprocess(self):
        """This takes the preprocessed data and returns clean data.

        This is more about statistical or text cleaning.
        """

    @abstractmethod
    def set_model(self):
        """Define model here."""

    @abstractmethod
    def fit_model(self):
        """This takes the vectorised data and returns a trained model."""

    @abstractmethod
    def generate_metrics(self):
        """Generates metric with trained model and test data."""

    @abstractmethod
    def save_model(self, model_name):
        """This method saves the model in our required format."""


class Predict(metaclass=ABCMeta):
    """Base predictor to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def preprocess(self):
        """This takes the raw data and returns clean data for prediction."""

    @abstractmethod
    def predict(self):
        """This is used for prediction."""


class BaseDB(metaclass=ABCMeta):
    """Base database class to be used for all DB connectors."""

    @abstractmethod
    def get_connection(self):
        """This creates a new DB connection."""

    @abstractmethod
    def close_connection(self):
        """This closes the DB connection."""

Fix Random Seed
Reproducibility requires setting random seeds for all libraries; otherwise training splits and weight initializations differ.
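A small sketch of the effect, using only the standard library so it runs anywhere. The helper `sample_split` and its parameters are hypothetical, and NumPy is seeded only if it happens to be installed. Calling the function twice with the same seed yields the same train/test split.

```python
import random


def set_seed(seed):
    """Seed the stdlib RNG, plus NumPy when it is installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch.


def sample_split(seed, n_items=10, n_train=7):
    """Shuffle item indices and return a (train, test) split."""
    set_seed(seed)
    indices = list(range(n_items))
    random.shuffle(indices)
    return indices[:n_train], indices[n_train:]


if __name__ == "__main__":
    first = sample_split(42)
    second = sample_split(42)
    print(first == second)  # the same seed reproduces the same split
```

Without the `set_seed` call, each run would shuffle differently and metrics would not be comparable between runs.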
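import random

import numpy as np
import torch


def set_seed(args):
    """Seed Python, NumPy, and PyTorch from args.seed; args.n_gpu is the GPU count."""
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        # Seed every visible CUDA device as well.
        torch.cuda.manual_seed_all(args.seed)

Load Small Amount of Data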
When data is large, use the nrows argument to read only a subset for quick testing.
import pandas as pd

# Read only the first 1000 rows for a quick smoke test.
df_train = pd.read_csv('train.csv', nrows=1000)

Predict Failure (Sign of a Mature Developer)
Always check for missing values; they can cause hidden bugs later.
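print(len(df))          # rows before cleaning
print(df.isna().sum())  # missing values per column
df = df.dropna()        # dropna() returns a new frame; reassign to keep the result
print(len(df))          # rows after cleaning

Show Processing Progress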
Use tqdm or fastprogress to visualize long‑running operations.
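Option 1: tqdm
from tqdm import tqdm
import time

# Progress bar for pandas apply (df is an existing DataFrame).
tqdm.pandas()
df['col'] = df['col'].progress_apply(lambda x: x**2)

# Progress bar for a plain loop.
text = ""
for char in tqdm(["a", "b", "c", "d"]):
    time.sleep(0.25)
    text = text + char

Option 2: fastprogress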
from fastprogress.fastprogress import master_bar, progress_bar
from time import sleep

mb = master_bar(range(10))
for i in mb:
    for j in progress_bar(range(100), parent=mb):
        sleep(0.01)
        mb.child.comment = 'second bar stat'
    mb.first_bar.comment = 'first bar stat'
    mb.write(f'Finished loop {i}.')

Solve Pandas Slow Problem
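Modin parallelizes pandas operations across CPU cores, and for many workloads switching is a one‑line import change. Its coverage of the pandas API is not complete and not every operation gets faster, so benchmark on your own data before committing.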
import modin.pandas as pd

Record Function Execution Time
Use a decorator to log how long functions take, helping spot hidden performance issues.
import time
from functools import wraps


def timing(f):
    """Decorator for timing functions.

    Usage:
        @timing
        def function(a):
            pass
    """
    @wraps(f)
    def wrapper(*args, **kwargs):
        # perf_counter() is a monotonic clock intended for measuring durations.
        start = time.perf_counter()
        result = f(*args, **kwargs)
        end = time.perf_counter()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result
    return wrapper

Don't Burn Money on Cloud
Wrap the main routine in try/except and shut down the instance after completion to avoid unnecessary cloud costs.
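One way to sketch that wrapping is with `try`/`except`/`finally`, so the shutdown runs whether the job succeeds or crashes. The names `run_then_shutdown`, `main`, and `shutdown` are hypothetical. Both callables are passed in as arguments so the pattern can be exercised without actually powering off a machine.

```python
import logging

logger = logging.getLogger(__name__)


def run_then_shutdown(main, shutdown):
    """Run main(), log any failure, and call shutdown() no matter what."""
    try:
        return main()
    except Exception:
        # Record the traceback; the instance still stops below.
        logger.exception("Job failed; shutting down anyway")
        return None
    finally:
        shutdown()


if __name__ == "__main__":
    # Stand-ins for a real training job and a real shutdown call.
    result = run_then_shutdown(lambda: "trained", lambda: print("shutting down"))
    print(result)
```

In a real job, `shutdown` would be a function that issues the operating system's shutdown command. Persist logs and model artifacts to durable storage before it runs, since anything left only on the instance's ephemeral disk may be lost.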
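import os


def run_command(cmd):
    return os.system(cmd)


def shutdown(seconds=0, os_name='linux'):
    """Shut down the system after the given number of seconds.

    Useful for stopping EC2 instances to save costs.
    """
    # The parameter is named os_name so it does not shadow the os module.
    if os_name == 'linux':
        # Linux `shutdown` takes minutes or `now`, not seconds, so sleep first.
        run_command('sleep %d && sudo shutdown -h now' % seconds)
    elif os_name == 'windows':
        run_command('shutdown -s -t %d' % seconds)

Create and Save Report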
After modeling, generate and store metrics and classification reports for stakeholders.
import json
import os

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score)


def get_metrics(y, y_pred, beta=2, average_method='macro', y_encoder=None):
    if y_encoder:
        # Map encoded labels back to their original names for the report.
        y = y_encoder.inverse_transform(y)
        y_pred = y_encoder.inverse_transform(y_pred)
    return {
        'accuracy': round(accuracy_score(y, y_pred), 4),
        'f1_score_macro': round(f1_score(y, y_pred, average=average_method), 4),
        # beta is keyword-only in current scikit-learn releases.
        'fbeta_score_macro': round(fbeta_score(y, y_pred, beta=beta, average=average_method), 4),
        'report': classification_report(y, y_pred, output_dict=True),
        # Assumed to convert to Windows line endings; the escape sequences
        # were lost in the source text.
        'report_csv': classification_report(y, y_pred, output_dict=False).replace('\n', '\r\n'),
    }


def save_metrics(metrics: dict, model_directory, file_name):
    # The source called an undefined helper, classification_report_to_csv;
    # a plain text write stands in for it here.
    path = os.path.join(model_directory, file_name + '_report.txt')
    with open(path, 'w', newline='') as f:
        f.write(metrics['report_csv'])
    # Save everything else as JSON without mutating the caller's dict.
    json_metrics = {k: v for k, v in metrics.items() if k != 'report_csv'}
    path = os.path.join(model_directory, file_name + '_metrics.json')
    with open(path, 'w') as f:
        json.dump(json_metrics, f, indent=4)

Write a Good API
For classic ML and deep‑learning deployments under moderate load, combine FastAPI, Uvicorn, and Gunicorn.
FastAPI is one of the quickest ways to write an API, and it auto‑generates interactive docs at /docs.
Gunicorn workers allow multiple processes; keep at least two workers for reliability.
Deploy with four workers using the command below and adjust based on load testing.
pip install fastapi uvicorn gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornH11Worker main:app
