43 Essential Rules for Building Robust Machine Learning Systems
These 43 practical rules, adapted from Martin Zinkevich’s “Rules of ML,” guide engineers through terminology, pipeline design, feature engineering, monitoring, and model deployment, offering actionable advice to avoid common pitfalls and build reliable, scalable machine‑learning‑driven products.
While building machine‑learning‑based systems at our company we ran into many pitfalls and gradually accumulated experience, but we lacked a complete framework until we read Martin Zinkevich’s “Rules of ML.” The article is concise and full of guidance for anyone with basic ML knowledge.
Terminology
Instance: The thing about which you want to make a prediction, e.g., a web page to classify as “about cats” or “not about cats.”
Label: The answer for a prediction task – either the model’s output or the ground‑truth label supplied in training data.
Feature: A property of an instance used in a prediction task, e.g., whether a web page contains the word “cat.”
Feature Column: A set of related features, such as all possible countries a user might live in.
Example: An instance together with its features and a label.
Model: A statistical representation of a prediction task, trained on examples and then used for inference.
Metric: A number you care about; it may or may not be directly optimizable.
Objective: The metric your algorithm tries to optimize.
Pipeline: The infrastructure surrounding a machine‑learning algorithm, including data collection, training, and serving.
Overview
To make great products, do machine learning like a great engineer, not like a great ML expert. Most problems are engineering problems; most gains come from great features, not fancy algorithms. The basic approach is: (1) make sure your pipeline is solid end‑to‑end, (2) start with a reasonable objective, (3) add common‑sense features in a simple way, and (4) keep the pipeline robust.
Before Machine Learning
Rule #1: Don’t be afraid to launch a product without machine learning.
Early in a project, heuristic or rule‑based solutions can provide quick wins; ML can be added later.
Rule #2: First, design and implement metrics.
Define the metrics you care about, collect historical data, visualise them, and set up reliable A/B testing.
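As an illustration of the A/B‑testing side of this rule, here is a minimal sketch of deterministic, hash‑based assignment of users to control and treatment groups; the `ab_bucket` helper and the experiment name are hypothetical, not part of any particular framework.

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to control or treatment by hashing the id."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 1000 < treatment_share * 1000 else "control"

# Hypothetical usage: the same user always lands in the same group for this experiment.
print(ab_bucket("user-42", "new_ranking_model"))
```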
Rule #3: Choose machine learning over a complex heuristic.
When rules become too tangled, replace them with a learned model to improve maintainability and scalability.
ML Phase I: Your First Pipeline
Rule #4: Keep the first model simple and get the infrastructure right.
Focus on data collection, baseline evaluation, and a clear integration path (online vs. offline).
Rule #5: Test the infrastructure independently from the machine learning.
Each pipeline component should be unit‑testable and runnable without the model code.
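A minimal sketch of what “testable without the model” can look like; `build_features`, `StubModel`, and `score_page` are illustrative names, and the stub simply stands in for the trained model so the plumbing can be exercised in a plain unit test.

```python
def build_features(raw):
    # Hypothetical preprocessing step: lower-case the text and flag the word "cat".
    text = raw.get("text", "").lower()
    return {"length": len(text), "has_cat": int("cat" in text)}

class StubModel:
    """Stands in for the real model so infrastructure tests never depend on training."""
    def predict(self, features):
        return 0.5  # constant score; only the pipeline plumbing is under test

def score_page(raw, model):
    return model.predict(build_features(raw))

def test_pipeline_without_model():
    score = score_page({"text": "A page about cats"}, StubModel())
    assert 0.0 <= score <= 1.0

if __name__ == "__main__":
    test_pipeline_without_model()
    print("pipeline plumbing OK")
```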
Rule #6: Be careful about dropped data when copying pipelines.
Avoid silently losing data when reusing code from other projects.
Rule #7: Turn heuristics into features, or handle them externally.
Convert existing rules into features, use them for preprocessing, or modify labels to reflect business goals.
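For instance, a hand‑written “blocked domain” rule can be kept as an input signal rather than a hard filter; the domain list and helper below are hypothetical.

```python
# Hypothetical example: an old rule "demote results from blocked domains"
# becomes an input feature instead of a hard filter.
BLOCKED_DOMAINS = {"spam.example", "clickbait.example"}

def heuristic_features(url: str, raw_score: float) -> dict:
    domain = url.split("/")[2] if "//" in url else url
    return {
        "from_blocked_domain": float(domain in BLOCKED_DOMAINS),  # rule as a binary feature
        "legacy_rule_score": raw_score,  # feed the old heuristic's output directly as a feature
    }

print(heuristic_features("https://spam.example/offer", raw_score=0.2))
```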
Monitoring
Rule #8: Know the freshness requirements of your system.
Determine how often the model must be retrained for your use case.
Rule #9: Detect problems before exporting models.
Validate model quality (e.g., AUC) before deployment and alert engineers on anomalies.
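A simple guard of this kind might look like the following sketch, which uses scikit‑learn's `roc_auc_score` and an assumed quality bar of 0.75.

```python
from sklearn.metrics import roc_auc_score

MIN_AUC = 0.75  # assumed quality bar; tune per product

def check_before_export(y_true, y_scores):
    """Refuse to export a model whose holdout AUC falls below the bar."""
    auc = roc_auc_score(y_true, y_scores)
    if auc < MIN_AUC:
        raise RuntimeError(f"Export blocked: holdout AUC {auc:.3f} < {MIN_AUC}")
    return auc

# Example usage with toy labels and scores:
print(check_before_export([0, 1, 1, 0, 1], [0.1, 0.9, 0.8, 0.3, 0.7]))
```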
Rule #10: Watch for silent failures.
Monitor data pipelines for missing or stale data that can silently degrade performance.
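One cheap defence is an explicit freshness check on upstream tables; the staleness budget below is an assumption to adapt per system.

```python
import time

MAX_STALENESS_HOURS = 6  # assumed freshness budget for an upstream feature table

def assert_fresh(last_update_ts, now=None):
    """Raise (and alert) when an upstream table has silently stopped updating."""
    now = time.time() if now is None else now
    age_hours = (now - last_update_ts) / 3600.0
    if age_hours > MAX_STALENESS_HOURS:
        raise RuntimeError(f"Feature table stale: {age_hours:.1f}h since last update")
    return age_hours

# Hypothetical usage: a table last refreshed two hours ago passes the check.
print(assert_fresh(time.time() - 2 * 3600))
```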
Rule #11: Give feature columns owners and documentation.
Maintain clear documentation for each feature column to aid onboarding and cross‑team collaboration.
Your First Objective
Rule #12: Don’t overthink which objective you choose to directly optimize.
Start with a single, clear objective (e.g., click‑through rate) and expand later.
Rule #13: Choose a simple, observable and attributable metric for your first objective.
Pick metrics that directly reflect user actions such as clicks, downloads, or shares.
Rule #14: Starting with an interpretable model makes debugging easier.
Linear models are easier to understand and troubleshoot than black‑box models.
Rule #15: Separate spam filtering and quality ranking in a policy layer.
Filter low‑quality content before ranking to keep the ranking system stable.
ML Phase II: Feature Engineering
Rule #16: Plan to launch and iterate.
Expect multiple releases; design features and models for easy iteration.
Rule #17: Start with directly observed and reported features as opposed to learned features.
Use raw, observable features for the first model; add learned features later.
Rule #18: Explore with features of content that generalize across contexts.
Leverage cross‑domain signals (e.g., global conversion rates) to enrich the model.
Rule #19: Use very specific features when you can.
Specific ID‑type features often provide strong signals when data volume is sufficient.
Rule #20: Combine and modify existing features to create new features in human‑understandable ways.
Apply discretisation and feature crossing while avoiding over‑crossing that leads to over‑fitting.
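A rough sketch of both techniques; the bucket boundaries and the crossed‑key format are arbitrary choices here.

```python
from bisect import bisect_right

def bucketize(value, boundaries):
    """Discretise a continuous value into the index of the bucket it falls in."""
    return bisect_right(boundaries, value)

def cross(*parts):
    """A simple feature cross: join bucketed / categorical parts into one string key."""
    return "_x_".join(str(p) for p in parts)

age_bucket = bucketize(37, boundaries=[18, 25, 35, 50, 65])   # -> 3
print(cross("age", age_bucket, "country", "JP"))              # age_x_3_x_country_x_JP
```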
Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
Scale model complexity with data size (e.g., thousands of features for millions of examples).
Rule #22: Clean up features you are no longer using.
Remove stale features to keep the system lean and maintainable.
Human Analysis of the System
Rule #23: You are not a typical end user.
Test with real users or crowdsourced feedback rather than relying on engineer intuition.
Rule #24: Measure the delta between models.
Compare new and production models on the same inputs; only ship if the difference is beneficial.
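A crude but useful delta measurement, sketched below for ranked results; the toy queries and result ids are made up.

```python
def ranking_delta(results_a, results_b, k=10):
    """Fraction of queries whose top-k results differ between two models."""
    changed = 0
    for query in results_a:
        if results_a[query][:k] != results_b.get(query, [])[:k]:
            changed += 1
    return changed / max(len(results_a), 1)

# Toy example on two queries: only the "dogs" ranking changed.
prod = {"cats": ["a", "b", "c"], "dogs": ["x", "y", "z"]}
cand = {"cats": ["a", "b", "c"], "dogs": ["y", "x", "z"]}
print(ranking_delta(prod, cand, k=3))  # 0.5 -> half the queries changed
```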
Rule #25: When choosing models, utilitarian performance trumps predictive power.
Prioritise real‑world impact over raw predictive metrics.
Rule #26: Look for patterns in the measured errors, and create new features.
Analyse mis‑classifications to engineer new informative features.
Rule #27: Try to quantify observed undesirable behavior.
Turn qualitative complaints into measurable features or metrics.
Rule #28: Be aware that identical short‑term behavior does not imply identical long‑term behavior.
Short‑term metrics may not reflect long‑term system health; consider exploration‑exploitation trade‑offs.
Training‑Serving Skew
Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
Log serving‑time features for offline training to keep pipelines aligned.
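In sketch form, with a stand‑in model and stdout in place of a real logging sink; the record layout is an assumption, not a prescribed schema.

```python
import json, sys, time

class ConstantModel:
    """Placeholder for the real serving model."""
    def predict(self, features):
        return 0.42

def serve_and_log(instance_id, features, model, log_stream):
    """Score with exactly the features used at serving time, and log them for later training."""
    score = model.predict(features)
    record = {"ts": time.time(), "id": instance_id, "features": features, "score": score}
    log_stream.write(json.dumps(record) + "\n")  # the training job later joins these records with labels
    return score

serve_and_log("page-123", {"has_cat": 1, "length": 87}, ConstantModel(), sys.stdout)
```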
Rule #30: Importance weight sampled data, don’t arbitrarily drop it!
When down‑sampling, re‑weight samples to preserve statistical validity.
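For example, if negatives are kept at a 10% rate, each surviving negative should carry a weight of 10; the sketch below assumes the training library accepts a per‑example `weight` field.

```python
import random

SAMPLE_RATE = 0.1  # keep 10% of negatives to shrink the training set

def sample_with_weight(example, is_negative):
    """Down-sample negatives but up-weight the survivors so statistics stay unbiased."""
    if is_negative:
        if random.random() >= SAMPLE_RATE:
            return None                                      # dropped
        return {**example, "weight": 1.0 / SAMPLE_RATE}      # kept, weight 10x
    return {**example, "weight": 1.0}                        # positives kept at weight 1

kept = [ex for ex in (sample_with_weight({"id": i}, is_negative=True) for i in range(20)) if ex]
print(f"kept {len(kept)} of 20 negatives, each with weight {1.0 / SAMPLE_RATE:.0f}")
```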
Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
Avoid stale joins; use immutable feature stores.
Rule #32: Re‑use code between your training pipeline and your serving pipeline whenever possible.
Share codebases or libraries to minimise discrepancies.
Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
Use a temporal hold‑out set for evaluation.
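A minimal temporal split, using the January 5th cutoff from the rule as the boundary; the example records are toy data.

```python
from datetime import date

CUTOFF = date(2024, 1, 5)  # hypothetical training cutoff

def temporal_split(examples):
    """Train on everything up to the cutoff date, evaluate on everything after it."""
    train = [e for e in examples if e["date"] <= CUTOFF]
    test = [e for e in examples if e["date"] > CUTOFF]
    return train, test

examples = [{"date": date(2024, 1, d), "label": d % 2} for d in range(1, 10)]
train, test = temporal_split(examples)
print(len(train), "train /", len(test), "test")  # 5 train / 4 test
```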
Rule #34: In binary classification for filtering, make small short‑term sacrifices in performance for very clean data.
Prioritise data cleanliness over marginal metric gains in spam‑filtering scenarios.
Rule #35: Beware of the inherent skew in ranking problems.
Regularise popular items and avoid over‑reliance on positional features.
Rule #36: Avoid feedback loops with positional features.
Do not feed the model its own ranking position as a feature without careful handling.
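One common way to handle this, sketched below: train with the real display position as a feature, but score every candidate at serving time with the position fixed to a single default value. The helper names and the default are illustrative.

```python
DEFAULT_POSITION = 0  # at serving time every candidate is scored as if shown in the top slot

def make_features(base_features, position=None, serving=False):
    """Use the real display position during training, but a fixed default when serving,
    so the model cannot feed its own ranking decisions back into itself."""
    feats = dict(base_features)
    feats["position"] = DEFAULT_POSITION if serving else position
    return feats

print(make_features({"has_cat": 1}, position=3, serving=False))  # training-time features
print(make_features({"has_cat": 1}, serving=True))               # serving-time features
```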
Rule #37: Measure Training/Serving Skew.
Track differences between training data, serving data, and next‑day data to spot pipeline bugs.
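Even a per‑feature mean comparison between training and serving logs catches many pipeline bugs; the sketch below uses toy rows and a single binary feature.

```python
def feature_skew(train_rows, serve_rows, feature):
    """Compare the mean of one feature between training and serving logs."""
    t = [r[feature] for r in train_rows if feature in r]
    s = [r[feature] for r in serve_rows if feature in r]
    t_mean = sum(t) / len(t) if t else float("nan")
    s_mean = sum(s) / len(s) if s else float("nan")
    return t_mean, s_mean, abs(t_mean - s_mean)

train = [{"has_cat": 1}, {"has_cat": 0}, {"has_cat": 1}]
serve = [{"has_cat": 0}, {"has_cat": 0}, {"has_cat": 1}]
print(feature_skew(train, serve, "has_cat"))  # (0.667, 0.333, 0.333)
```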
ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models
Rule #38: Don’t waste time on new features if unaligned objectives have become the issue.
When objectives diverge, focus on aligning them before adding more features.
Rule #39: Launch decisions are a proxy for long‑term product goals.
Model releases should be evaluated against overall product health, not just isolated metrics.
Rule #40: Keep ensembles simple.
Use straightforward model stacking; avoid overly complex ensembles that are hard to interpret.
Rule #41: When performance plateaus, look for qualitatively new sources of information rather than refining existing signals.
Seek novel data sources or deep‑learning approaches when incremental feature tweaks stop helping.
Rule #42: Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think.
Post‑process rankings to improve diversity and relevance beyond pure popularity.
Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.
Friend‑graph features often transfer across products, whereas interest signals are product‑specific.
Conclusion
The 43 rules provide concrete, experience‑backed guidance for building, scaling, and maintaining machine‑learning‑driven systems. By focusing on solid pipelines, clear metrics, thoughtful feature engineering, and continuous monitoring, engineers can avoid many common pitfalls and steadily improve their products.
