Deploying Scikit‑learn and HMMlearn Models as High‑Performance Online Prediction Services Using ONNX
This article demonstrates how to convert traditional scikit‑learn and hmmlearn machine‑learning models into ONNX format and integrate them into a C++ gRPC service for fast online inference, covering environment setup, model conversion, custom operators, performance testing, and end‑to‑end pipeline construction.
The article begins by noting that, despite the hype around deep learning, many practical problems are still best solved with traditional machine‑learning models such as linear regression, decision trees, random forests, SVMs, boosting, and hidden Markov models (HMM). It explains the need to serve these models in production for online prediction.
Using a real‑world example – a dangerous‑driving‑behavior detection system in a mobile app – the workflow is outlined: (1) convert scikit‑learn models to ONNX, (2) run inference with ONNX Runtime in a C++ service, and (3) support custom model conversion and inference.
All required code is hosted in the onnx-ml-demo repository. To obtain it, run:

git clone https://github.com/bxkftechteam/onnx-ml-demo.git

Then install the Python dependencies with:

pip install -r requirements.txt
Background: The app collects GPS, gyroscope, and accelerometer data and uploads it to the cloud, where a C++ gRPC service classifies dangerous actions using a pipeline that standardizes the input, applies a random‑forest classifier, discretizes the resulting probabilities, and finally runs the Viterbi algorithm on an HMM to detect phone use while driving.
The pipeline can be implemented in three ways: (1) rewrite the whole algorithm in C++, (2) wrap the Python logic in a separate FastAPI/Flask service, or (3) use a third‑party C++ library to run the scikit‑learn and hmmlearn models. The third option is chosen, and ONNX is selected for its maturity, performance, and extensibility.
What is ONNX? ONNX (Open Neural Network Exchange) is an open format for representing both deep‑learning and traditional ML models (ONNX‑ML). Models can be converted with sklearn‑onnx and executed on various hardware via ONNX Runtime.
Model conversion and inference: Convert a RandomForestClassifier model (clf.pkl) to ONNX with a short Python script (converter/convert_basic.py) that defines the input tensor, calls convert_sklearn, and saves clf.onnx. Validate the ONNX model using onnx.checker.check_model. Run inference in Python by creating an InferenceSession, feeding a 1000 × 3 float tensor, and retrieving output_label and output_probability. To run the same model in C++, first install ONNX Runtime (Linux x86_64) with the following script:

cd /tmp && \
  echo "Downloading ONNX Runtime 1.6.0" && \
  wget -q -O "onnxruntime-linux-x64-1.6.0.tgz" https://public-assets-good-drivers-club.oss-cn-beijing.aliyuncs.com/github/microsoft/onnxruntime/onnxruntime-linux-x64-1.6.0.tgz && \
  echo "Extracting ONNX Runtime 1.6.0" && \
  tar xf onnxruntime-linux-x64-1.6.0.tgz && \
  echo "Installing ONNX Runtime 1.6.0" && \
  cp -rfv onnxruntime-linux-x64-1.6.0/include/* /usr/local/include/ && \
  cp -rfv onnxruntime-linux-x64-1.6.0/lib/* /usr/local/lib/ && \
  ldconfig && \
  echo "Cleaning up" && \
  rm -rf /tmp/onnxruntime-*

Then compile the C++ inference binary with make and execute ./infer_basic N, where N is the number of samples.
Performance tuning: Initial runs showed that reading the Sequence of Maps output (produced by a ZipMap operator) dominated runtime. By disabling ZipMap during conversion (adding the no_zipmap flag), the output becomes a plain probability vector, reducing output‑reading time from ~30% to ~1% of total inference time.
Custom HMM conversion: Since sklearn‑onnx does not support hmmlearn models, a custom converter was built. The Viterbi algorithm was extracted from hmmlearn, expressed as a minimal ONNX graph, and a custom ONNX Runtime operator was implemented in C++ (cpp/viterbi.cc) and registered in the session.
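The recurrence that the custom operator computes can be sketched in pure NumPy, in log-space as hmmlearn does; the function name and signature below are illustrative, not the repository's C++ interface:

```python
import numpy as np

def viterbi(log_startprob, log_transmat, framelogprob):
    """Most likely hidden-state path for one observation sequence (log-space)."""
    n_samples, n_states = framelogprob.shape
    lattice = np.empty((n_samples, n_states))
    lattice[0] = log_startprob + framelogprob[0]
    # Forward pass: best log-score of any path ending in state s at time t
    for t in range(1, n_samples):
        for s in range(n_states):
            lattice[t, s] = (np.max(lattice[t - 1] + log_transmat[:, s])
                             + framelogprob[t, s])
    # Backtrace: recover the argmax path from the lattice
    path = np.empty(n_samples, dtype=np.int64)
    path[-1] = np.argmax(lattice[-1])
    for t in range(n_samples - 2, -1, -1):
        path[t] = np.argmax(lattice[t] + log_transmat[:, path[t + 1]])
    return path
```

The C++ operator implements the same two passes; wrapping it as a custom op lets the HMM stage live inside the same ONNX graph as the sklearn stages.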
End‑to‑end pipeline: The three models (StandardScaler, RandomForestClassifier, MultinomialHMM) were fused into a single ONNX graph using a pipeline script (converter/convert_pipeline.py). This unified model can be served by the existing C++ service without further code changes.
Performance comparison: Benchmarks on an Intel Xeon Platinum 8269CY show that ONNX Runtime is 4–9× faster than the original scikit‑learn + hmmlearn stack for large batch sizes, and up to 300× faster for single‑sample latency.
Conclusion: By leveraging sklearn‑onnx, custom converters, and ONNX Runtime (including a custom Viterbi operator), the article provides a complete, high‑performance solution for deploying traditional ML models as online services, with all code snippets and scripts reproduced for readers.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.