How OpenMLDB Guarantees Real‑Time, Consistent Features for Machine Learning at Scale

This article explains the data and feature‑engineering challenges of deploying machine learning and introduces OpenMLDB’s open‑source architecture: offline Spark‑based processing, a high‑availability online engine with two‑layer in‑memory indexes, snapshot/binlog persistence, and pre‑aggregation. It then presents real‑world case studies and the project’s roadmap.

ITPUB

Dec 21, 2022

Background and Challenges

Machine‑learning pipelines spend up to 95% of their effort on data preparation, and keeping the features used in offline training identical to those computed for online inference is a major engineering obstacle. Real‑time inference also demands millisecond‑level latency under high concurrency, especially in finance and fraud‑detection scenarios.

OpenMLDB Overview

OpenMLDB is a purpose‑built database for machine‑learning feature serving. It supports both offline model development and online inference with a unified SQL‑based workflow, guaranteeing feature consistency across the two stages.

System Architecture

The platform consists of two engines:

Offline engine: a Spark‑derived SQL processor that extracts and materialises features for model training.

Online engine: a self‑developed real‑time SQL engine that computes features on streaming data.

Both engines share a single execution‑plan generator, ensuring identical feature calculations.
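To make the consistency idea concrete, here is a minimal Python sketch. It is not OpenMLDB code. It assumes a hypothetical feature (the sum of a user's amounts over the trailing 60 seconds) that is defined once in `window_sum` and reused by a batch path (`offline_features`) and a request path (`online_feature`). All names and the window length are illustrative assumptions; in OpenMLDB the shared definition would be the SQL script compiled by the common plan generator.

```python
from typing import List, Tuple

Row = Tuple[str, int, float]  # (user_id, timestamp_seconds, amount)

WINDOW_SECONDS = 60  # hypothetical window length


def window_sum(history: List[Row], current: Row) -> float:
    """Single feature definition: sum of amounts for the same user
    within the trailing window, including the current row."""
    user, ts, amount = current
    total = amount
    for h_user, h_ts, h_amount in history:
        if h_user == user and ts - WINDOW_SECONDS <= h_ts <= ts:
            total += h_amount
    return total


def offline_features(rows: List[Row]) -> List[float]:
    """Batch path: replay rows in time order, as when building training data."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [window_sum(ordered[:i], row) for i, row in enumerate(ordered)]


def online_feature(stored: List[Row], request: Row) -> float:
    """Request path: compute the feature for one incoming row."""
    return window_sum(stored, request)


rows = [("u1", 0, 10.0), ("u2", 5, 7.0), ("u1", 30, 5.0), ("u1", 100, 2.0)]
offline = offline_features(rows)
# Simulate serving each row at the moment it arrives.
online = [online_feature(rows[:i], row) for i, row in enumerate(rows)]
```

Because both paths call the same definition, the training values and the serving values cannot drift apart, which is the property the shared plan generator is meant to provide.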

High‑Availability Online Engine

The online engine is composed of three core modules:

ZooKeeper – stores cluster metadata and coordinates metadata updates.

Nameserver – manages tablet nodes and handles failover.

Tablets – store table partitions and execute distributed SQL.

Data is indexed with a two‑layer skip‑list structure: the first layer is keyed by the grouping column (e.g., a GROUP BY key), and the second layer orders each key's rows by timestamp for fast window queries. The design supports efficient inserts, queries, and TTL‑based expiration, while snapshots and a binlog persist the in‑memory data for durability.
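The two‑layer layout can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the engine's implementation: a dictionary plays the role of the first (key) layer, and timestamp‑sorted Python lists stand in for the ordered second layer. Sorted lists give logarithmic search via `bisect` but linear‑time inserts, whereas a structure such as a skip list keeps both logarithmic on average. The class and method names are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class TwoLayerIndex:
    """Toy index: layer 1 maps key -> series, layer 2 is ordered by timestamp."""

    def __init__(self) -> None:
        # Parallel lists per key, both kept sorted by timestamp.
        self._ts: Dict[str, List[int]] = defaultdict(list)
        self._rows: Dict[str, List[Any]] = defaultdict(list)

    def insert(self, key: str, ts: int, row: Any) -> None:
        pos = bisect.bisect_right(self._ts[key], ts)
        self._ts[key].insert(pos, ts)
        self._rows[key].insert(pos, row)

    def window(self, key: str, start_ts: int, end_ts: int) -> List[Tuple[int, Any]]:
        """Rows for `key` with start_ts <= ts <= end_ts, oldest first."""
        ts_list = self._ts.get(key, [])
        row_list = self._rows.get(key, [])
        lo = bisect.bisect_left(ts_list, start_ts)
        hi = bisect.bisect_right(ts_list, end_ts)
        return list(zip(ts_list[lo:hi], row_list[lo:hi]))

    def expire(self, cutoff_ts: int) -> int:
        """TTL-style cleanup: drop rows with ts < cutoff_ts; return count removed."""
        removed = 0
        for key in list(self._ts):
            cut = bisect.bisect_left(self._ts[key], cutoff_ts)
            removed += cut
            del self._ts[key][:cut]
            del self._rows[key][:cut]
            if not self._ts[key]:  # drop keys that no longer hold data
                del self._ts[key]
                del self._rows[key]
        return removed


index = TwoLayerIndex()
index.insert("card_1", 100, {"amount": 20})
index.insert("card_1", 50, {"amount": 5})
index.insert("card_2", 70, {"amount": 9})
index.insert("card_1", 160, {"amount": 1})

recent = index.window("card_1", 90, 160)  # window query on one key
removed = index.expire(60)                # expire everything older than ts=60
```

A window query touches only one key's series and only the slice inside the time range, and expiration trims from the old end of each series, which is why the layout suits time‑windowed features.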

Performance Optimisations

OpenMLDB offers two storage options:

In‑memory engine – millisecond‑level latency, high concurrency, higher cost.

Disk‑based engine (RocksDB) – lower cost, suitable when latency requirements are relaxed.

Pre‑aggregation reduces the cost of large time‑window calculations by summarising data during ingestion.
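As a rough illustration of pre‑aggregation, the Python sketch below maintains per‑bucket sums at ingestion time and answers a long window query by combining whole buckets with the raw rows at the two partial edges. The one‑hour bucket size, the `PreAggregatedSum` class, and its methods are assumptions made for this example; they are not OpenMLDB's actual data structures or API.

```python
import bisect
from collections import defaultdict
from typing import Dict, List

BUCKET_SECONDS = 3600  # hypothetical bucket size: one hour


class PreAggregatedSum:
    def __init__(self) -> None:
        self._ts: List[int] = []        # raw timestamps, kept sorted
        self._values: List[float] = []  # raw values, parallel to _ts
        self._bucket_sum: Dict[int, float] = defaultdict(float)

    def ingest(self, ts: int, value: float) -> None:
        """Store the raw row and update its bucket's running sum."""
        pos = bisect.bisect_right(self._ts, ts)
        self._ts.insert(pos, ts)
        self._values.insert(pos, value)
        self._bucket_sum[ts // BUCKET_SECONDS] += value

    def _raw_sum(self, lo: int, hi: int) -> float:
        """Sum raw values with lo <= ts <= hi (empty if lo > hi)."""
        i = bisect.bisect_left(self._ts, lo)
        j = bisect.bisect_right(self._ts, hi)
        return sum(self._values[i:j])

    def window_sum(self, start_ts: int, end_ts: int) -> float:
        """Sum of values with start_ts <= ts <= end_ts."""
        first_full = -(-start_ts // BUCKET_SECONDS)     # first bucket starting at or after start_ts
        last_full = (end_ts + 1) // BUCKET_SECONDS - 1  # last bucket ending at or before end_ts
        if first_full > last_full:
            # The window contains no complete bucket: fall back to raw rows.
            return self._raw_sum(start_ts, end_ts)
        total = sum(self._bucket_sum.get(b, 0.0)
                    for b in range(first_full, last_full + 1))
        total += self._raw_sum(start_ts, first_full * BUCKET_SECONDS - 1)  # left edge
        total += self._raw_sum((last_full + 1) * BUCKET_SECONDS, end_ts)   # right edge
        return total


agg = PreAggregatedSum()
events = [(1800, 1.0), (3600, 2.0), (5000, 3.0), (7200, 4.0), (10000, 5.0), (11000, 6.0)]
for ts, value in events:
    agg.ingest(ts, value)

long_window = agg.window_sum(1000, 10900)   # spans two complete buckets plus edges
short_window = agg.window_sum(4000, 6000)   # no complete bucket, raw rows only
```

The cost of a long window then scales with the number of buckets it spans rather than the number of rows, at the price of extra work and memory during ingestion.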

Benchmarks show latency under 20 ms even for window sizes of 10K, and throughput remains stable across varying data volumes thanks to the logarithmic‑time lookups of the two‑layer index.

Case Studies

Akulaku – a Southeast Asian fintech company that processes ~10 billion transactions per day. Using OpenMLDB, it achieved sub‑4 ms inference latency and unified its offline and online feature pipelines.

37手游 (37 Mobile Games) – a mobile‑gaming company that uses OpenMLDB to predict user churn over 3‑ to 15‑day windows. It deploys a three‑node cluster that serves feature extraction for both offline training and online scoring.

Development History and Future Roadmap

OpenMLDB was open‑sourced in June 2021 (v0.1) after an internal closed‑source phase (RTIDB/FEDB). Subsequent releases (v0.6 in Aug 2022, v0.6.4 in Oct 2022) added stability and feature enhancements. The upcoming v0.7 will extend SQL capabilities, improve stability, and enhance usability.

OpenMLDB is already adopted by customers such as Akulaku, 37手游, Huawei, and JD Tech, demonstrating its suitability for high‑throughput, low‑latency feature serving in production environments.
