Should You Monitor Your Machine Learning Models? An Introduction with Evidently AI
The article explains why monitoring production ML models is essential to detect data and target drift, describes the open‑source Evidently AI library and its statistical tests, and demonstrates its use on a weather‑forecast example and a plant‑seedling image classification case, including dashboards, code snippets, and visual analysis of drift impact.
This article introduces the need for monitoring machine‑learning models in production, emphasizing that business operations assume the data distribution remains unchanged between training and inference. When distribution drift occurs, model accuracy can degrade, leading to adverse effects such as missed medical predictions or ineffective retail coupons.
Why Monitor Models?
Monitoring ensures that model quality metrics (such as accuracy) and operational metrics (memory and CPU usage) stay within the expected ranges established during training and testing. The article cites literature on the key failure modes, prior‑probability shift and covariate shift.
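In the standard terminology the cited literature follows, these two shifts are conditions on the joint distribution P(x, y):

Covariate shift: P_train(x) ≠ P_production(x) while P(y | x) is unchanged, i.e. the inputs move but the labeling rule does not.
Prior‑probability shift: P_train(y) ≠ P_production(y) while P(x | y) is unchanged, i.e. the target mix moves but the class‑conditional inputs do not.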
Evidently AI Library
Evidently AI is an open‑source Python library that creates visual dashboards comparing training and production datasets. It works by mapping features from the training set to the production set and applying statistical tests:
Binary categorical features: Z‑test for proportion differences.
Multiclass categorical features: Chi‑square test for distribution differences.
Numerical features: Two‑sample Kolmogorov‑Smirnov test for distribution similarity.
For each feature, the library flags whether the chosen test finds a statistically significant difference between the training and inference distributions.
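For intuition, the same three checks can be reproduced directly with SciPy. The sketch below is illustrative only; the function name, the "kind" labels, and the 0.05 threshold are assumptions for this example, not Evidently's internals:

import numpy as np
from scipy import stats

def has_drifted(train_col, prod_col, kind, alpha=0.05):
    """Illustrative per-feature drift check mirroring the tests listed above."""
    if kind == 'binary':
        # Proportion comparison via a 2x2 contingency table; the chi-square
        # test on a 2x2 table is equivalent to the two-sided z-test.
        table = [[np.sum(train_col == 1), np.sum(train_col == 0)],
                 [np.sum(prod_col == 1), np.sum(prod_col == 0)]]
        _, p_value, _, _ = stats.chi2_contingency(table)
    elif kind == 'categorical':
        # Chi-square test on the per-category frequency tables.
        cats = np.union1d(train_col, prod_col)
        counts = [[np.sum(train_col == c) for c in cats],
                  [np.sum(prod_col == c) for c in cats]]
        _, p_value, _, _ = stats.chi2_contingency(counts)
    else:
        # Two-sample Kolmogorov-Smirnov test for numerical features.
        _, p_value = stats.ks_2samp(train_col, prod_col)
    return p_value < alpha  # True means the 'no change' hypothesis is rejected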
Example 1 – Tabular Data Drift (Weather Forecast)
The first example predicts Chicago weather using temperature data from five nearby cities. To illustrate drift, the temperature data for Toronto is replaced with data from Phoenix, Arizona. Visualizations (Figures 2 and 3) show the original and shifted distributions.
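A minimal sketch of how that substitution can be made; the column name and the phoenix_temperatures series are assumptions based on the description above:

# df_old holds the reference temperatures used at training time; to simulate
# covariate shift, overwrite one city's column with readings from a much
# hotter climate and leave everything else intact.
df = df_old.copy()
df['Toronto'] = phoenix_temperatures.values  # assumed pd.Series of Phoenix data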
from evidently.dashboard import Dashboard  # pre-1.0 Evidently Dashboard API
from evidently.tabs import DataDriftTab

weather_data_drift_report = Dashboard(tabs=[DataDriftTab])
weather_data_drift_report.calculate(df_old.drop('Chicago', axis=1), df.drop('Chicago', axis=1), column_mapping=None)
weather_data_drift_report.save("gdrive/MyDrive/ModelMonitoringBlog/reports/my_report_with_2_tabs.html")

The generated dashboard (Figure 4) highlights that Toronto's distribution deviates significantly, a likely source of erroneous predictions, while Detroit shows no change.
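Opening the saved HTML report in a browser then shows, for each feature, the reference and production distributions side by side along with the outcome of the corresponding drift test.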
Target (Prior‑Probability) Shift
Evidently AI can also monitor shifts in the model’s output distribution. When input data remains stable but the target variable distribution changes (e.g., credit‑card default rate moving from 20% to 30%), business impact can be severe. Figure 5 visualizes such a shift.
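With the same pre-1.0 Dashboard API used above, a target-drift report can be built by swapping in a target-drift tab. The snippet below is a sketch: the tab name follows the old API and may differ across versions, and the reference and production frames with a 'target' column are assumptions.

from evidently.dashboard import Dashboard
from evidently.tabs import CatTargetDriftTab  # NumTargetDriftTab for numeric targets

# reference / production: DataFrames that each contain a 'target' column,
# e.g. 0/1 credit-card default labels for the two periods being compared.
target_drift_report = Dashboard(tabs=[CatTargetDriftTab])
target_drift_report.calculate(reference, production, column_mapping={'target': 'target'})
target_drift_report.save("reports/target_drift_report.html")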
Example 2 – Image Data Drift (Plant Seedlings)
The second example uses a CNN to classify 12 species of seedling images, comparing training images of beet seedlings with production images that include a different species (shepherd's purse). The workflow, sketched in code after the list below, covers loading the images, resizing them to a consistent n × n array, and converting pixel values into feature arrays:
Read a set of images.
Resize each image to the desired pixel dimensions.
Create a feature array where each pixel is a feature (n × n).
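A minimal sketch of that preprocessing, assuming grayscale conversion and a 50 × 50 target size so that each image yields the 2,500 pixel features referenced in the experiments below; the file pattern and folder layout are assumptions:

import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image

def images_to_features(folder, size=(50, 50)):
    """Load every image in a folder and flatten it into one row of pixel features."""
    rows = []
    for path in sorted(Path(folder).glob('*.png')):
        img = Image.open(path).convert('L')   # grayscale: one value per pixel
        img = img.resize(size)                # n x n, here 50 x 50
        rows.append(np.asarray(img).ravel())  # 2,500 features per image
    return pd.DataFrame(rows)                 # one column per pixel position

# reference = images_to_features('train/beet')
# production = images_to_features('production/shepherds_purse')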
After preparing the data, the feature arrays are passed to Evidently AI's data-drift module, which produces a dashboard (Figure 9). Two experiments show:
Experiment 1: Minimal drift (2.5% of features) when both the reference and production sets contain beet images, visualized in Figure 12.
Experiment 2: Significant drift (907 out of 2500 features ≈ 36%) when comparing beet images to Shepherd’s‑purse images, visualized in Figures 13 and 16.
The per-pixel True/False drift results (Figure 11) are converted back to NumPy arrays and reshaped to the original 50 × 50 image size for visual inspection.
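A hedged sketch of that inspection step; drift_flags, a length-2,500 boolean array with one entry per pixel feature, is an assumption about how the per-feature results are collected from the report:

import numpy as np
import matplotlib.pyplot as plt

drift_mask = np.asarray(drift_flags).reshape(50, 50)  # back to image geometry

plt.imshow(drift_mask, cmap='gray')  # white pixels mark drifted features
plt.title(f"Drifted pixels: {drift_mask.sum()} / {drift_mask.size}")
plt.axis('off')
plt.show()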
Classification Performance Reporting
Evidently AI also provides classification performance dashboards, supporting binary and multi‑class models. The following code creates a performance report:
import pandas as pd
from evidently.dashboard import Dashboard  # pre-1.0 Evidently Dashboard API
from evidently.tabs import ClassificationPerformanceTab

reference = pd.DataFrame(columns=['target', 'prediction'])
production = pd.DataFrame(columns=['target', 'prediction'])
reference['target'] = training_img_list['target']
reference['prediction'] = target_list
production['target'] = testing_img_list['target']
production['prediction'] = predictions_list

data_dict = {'target': 'target', 'prediction': 'prediction'}
classification_performance_report = Dashboard(tabs=[ClassificationPerformanceTab])
classification_performance_report.calculate(reference, production, column_mapping=data_dict)
classification_performance_report.save("reports/classification_performance_report.html")

Figures 19-22 display macro-averaged metrics (F1, precision, recall, accuracy), the class distribution, the confusion matrix, and per-class quality metrics, all rendered with Plotly.
Conclusion
Evidently AI’s outputs can identify all features exhibiting data drift. Whether to retrain the model based on these findings depends entirely on business considerations and the importance of the affected features for predictions.
References:
https://evidentlyai.com/
Giselsson et al., “A Public Image Database for Benchmark of Plant Seedling Classification Algorithms” (2017), arXiv:1711.05458.