
OpenMLDB: A Production‑Grade Feature Platform for Consistent Online and Offline Machine Learning

OpenMLDB is an open‑source machine‑learning database that delivers a production‑grade, consistent online‑offline feature platform for real‑time AI applications such as recommendation, risk control and fraud detection, offering millisecond‑level feature computation, dual SQL engines, extensive ecosystem integration, and a roadmap of new capabilities.

DataFunSummit

In many machine‑learning scenarios—real‑time recommendation, risk control, fraud detection—low‑latency, accurate feature supply is essential. OpenMLDB is an open‑source machine‑learning database that provides a production‑grade feature platform with consistent online and offline behavior.

The AI engineering process faces challenges such as the time cost of processing massive data, the need to supply features correctly and efficiently, and millisecond‑level computation for real‑time decisions. OpenMLDB addresses these by delivering real‑time feature computation — covering both real‑time data and real‑time computation — and supports use cases such as credit‑card fraud detection decided within 20 ms.
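To make the idea of millisecond‑level real‑time feature computation concrete, here is a standalone Python sketch — not OpenMLDB's implementation, and all names are illustrative — of the kind of sliding‑window feature a fraud model might consume: the count and total spend on a card over the last few seconds.

```python
from collections import deque

class SlidingWindowFeatures:
    """Toy per-key sliding window over recent transaction events."""

    def __init__(self, window_seconds=10):
        self.window_seconds = window_seconds
        self.events = {}  # card_id -> deque of (timestamp, amount)

    def update(self, card_id, timestamp, amount):
        q = self.events.setdefault(card_id, deque())
        q.append((timestamp, amount))
        # Evict events that have fallen out of the time window.
        while q and q[0][0] <= timestamp - self.window_seconds:
            q.popleft()
        # Features handed to the fraud model at decision time.
        return {"txn_count_10s": len(q), "txn_sum_10s": sum(a for _, a in q)}

fx = SlidingWindowFeatures()
fx.update("card-1", 100.0, 50.0)
feats = fx.update("card-1", 105.0, 25.0)
# feats == {"txn_count_10s": 2, "txn_sum_10s": 75.0}
```

Each incoming event both updates state and yields fresh features, which is what allows a fraud decision to be made on the spot rather than against stale, batch‑computed values.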

OpenMLDB’s architecture includes a batch SQL engine built on Spark for offline development and a real‑time SQL engine—a distributed, high‑availability time‑series database—optimised for feature extraction. An execution‑plan generator guarantees consistency between the two engines, and all development is performed via SQL.
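The consistency guarantee can be illustrated with a toy sketch (plain Python, not the actual execution‑plan generator; function names are invented for illustration): when both execution paths derive from a single feature definition, batch replay and incremental streaming produce identical feature values by construction.

```python
def avg_amount_last_n(rows, n=3):
    """Single source of truth: mean of the last n amounts (the shared 'SQL' logic)."""
    window = rows[-n:]
    return sum(window) / len(window)

def offline_batch(all_rows, n=3):
    # Offline engine: replay history and emit the feature for every row.
    return [avg_amount_last_n(all_rows[: i + 1], n) for i in range(len(all_rows))]

def online_stream(all_rows, n=3):
    # Online engine: compute incrementally as each event arrives.
    state, out = [], []
    for amount in all_rows:
        state.append(amount)
        out.append(avg_amount_last_n(state, n))
    return out

amounts = [10.0, 20.0, 30.0, 40.0]
assert offline_batch(amounts) == online_stream(amounts)  # identical by construction
```

In OpenMLDB the shared definition is the SQL itself, and the plan generator compiles it for both the Spark‑based batch engine and the real‑time engine, eliminating the training‑serving skew that arises when the two paths are written by hand.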

The platform supports the full lifecycle from offline feature extraction and model training to online inference, ensuring low latency, high concurrency, and high availability. It offers both in‑memory and disk storage engines and uses pre‑aggregation, doubly‑linked‑list data structures, and other optimisations to achieve millisecond‑level processing.
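Pre‑aggregation can be sketched as follows — a hedged toy model, where the class name and bucket size are assumptions and a production engine would also combine partial buckets with raw events at the window edges for exact results. The idea is to keep per‑bucket partial sums so that a long‑window aggregate touches a handful of buckets instead of every raw event.

```python
class PreAggregatedSum:
    """Toy pre-aggregation: partial sums per fixed-size time bucket."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.buckets = {}  # bucket index -> partial sum

    def insert(self, timestamp, amount):
        idx = int(timestamp // self.bucket_seconds)
        self.buckets[idx] = self.buckets.get(idx, 0.0) + amount

    def query(self, now, window_seconds):
        # Sum only the buckets overlapping the window, not the raw events.
        first = int((now - window_seconds) // self.bucket_seconds)
        last = int(now // self.bucket_seconds)
        return sum(self.buckets.get(i, 0.0) for i in range(first, last + 1))

agg = PreAggregatedSum(bucket_seconds=60)
agg.insert(30, 5.0)    # bucket 0
agg.insert(90, 7.0)    # bucket 1
agg.insert(400, 2.0)   # bucket 6, outside a 2-minute window ending at t=120
total = agg.query(now=120, window_seconds=120)  # combines buckets 0..2 -> 12.0
```

The trade‑off is bucket‑granularity precision at the window edges in exchange for query cost that is independent of the number of raw events, which is what makes long windows over high‑volume streams answerable in milliseconds.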

OpenMLDB integrates with data‑ecosystem tools (Kafka, Pulsar, Flink, HDFS, Hive, etc.), machine‑learning frameworks (XGBoost, TensorFlow, PyTorch), and orchestration platforms (Airflow, DolphinScheduler). Recent releases added new SQL syntax, batch support for the online engine, auto‑feature engineering, a Go SDK, and roadmap items such as expanded SQL capabilities and stability enhancements.

A case study of Akulaku, a fintech company, shows 4 ms latency on 1 billion orders. The Q&A section covers consistency between offline and online data, future support for Hudi/Iceberg, separation of offline/online tasks, and extensibility via UDFs written in C++ (with Python support planned).

Tags: Data Engineering · SQL · AI · Feature Store · OpenMLDB · real-time ML
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
