Feature Production Scheduling: Architecture Evolution and Core Technologies
Using Meituan‑Dianping’s hospitality online feature system as a case study, this article describes how feature production scheduling evolved from offline batch ETL to automated, metadata‑driven pipelines and then to sub‑second streaming. It covers the underlying architecture along with incremental updates, storage abstraction, write peak shaving, atomic updates, and recovery mechanisms.
In the previous article "Data Access Techniques in Online Feature Systems", the authors introduced storage and retrieval aspects of online feature systems. This follow‑up focuses on the equally important topic of feature production scheduling, using Meituan-Dianping's hospitality online feature system as a case study.
Feature Production Scheduling Evolution
From Offline to Online
The goal of an online feature system is to expose offline‑computed features through an API for downstream strategy services. The requirements are daily updates, data volumes on the order of hundreds of billions of records, and response latency under 20 ms at a peak load of millions of queries per second. In the initial architecture, an ETL job writes each day's offline features into a distributed KV store (Tair), and a Thrift‑based RPC service serves them. Features are abstracted as Domain objects (e.g., Domain=ABC for user profile features), each of which encapsulates a feature set and its query dimension.
From Manual to Automated
As the number of Domains grew, manual ETL development became a bottleneck. The team introduced a metadata‑driven, platform‑based import workflow: users fill a small form, the system stores metadata (source DB, table, storage engine, key/value fields, update schedule, partitioning, etc.) in a MySQL Settings module, and a scheduler automatically generates and runs the import jobs. This reduced onboarding time from hours to minutes and added support for multiple storage engines (Tair, Squirrel, etc.).
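A minimal sketch of how a metadata record could drive job generation is shown here. The class `DomainSettings`, the function `build_import_job`, the field names, the cron-style schedule string, and the sample table are all hypothetical; the source only lists which kinds of metadata are stored, not how the scheduler consumes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainSettings:
    """Hypothetical metadata record mirroring the fields named in the text."""
    domain: str
    source_db: str
    source_table: str
    storage_engine: str      # e.g. "tair" or "squirrel"
    key_field: str
    value_fields: tuple
    schedule: str            # assumed cron-style update schedule
    partition_field: str


def build_import_job(settings: DomainSettings, partition_value: str) -> dict:
    """Turn one metadata record into a job description a scheduler could run."""
    columns = ", ".join((settings.key_field, *settings.value_fields))
    query = (
        f"SELECT {columns} FROM {settings.source_db}.{settings.source_table} "
        f"WHERE {settings.partition_field} = '{partition_value}'"
    )
    return {
        "job_name": f"import_{settings.domain}_{partition_value}",
        "query": query,
        "sink": settings.storage_engine,
        "schedule": settings.schedule,
    }


settings = DomainSettings(
    domain="user_profile",
    source_db="dw",
    source_table="user_profile_daily",
    storage_engine="tair",
    key_field="user_id",
    value_fields=("age_bucket", "city_id"),
    schedule="0 3 * * *",
    partition_field="dt",
)
job = build_import_job(settings, "20240101")
```

Because the job is derived entirely from the stored record, onboarding a new Domain in this sketch means adding a row of metadata rather than writing a new ETL script.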
From Day‑Level to Second‑Level
Real‑time features require sub‑second freshness. The team built a streaming platform based on Storm that consumes Kafka topics, applies configurable aggregation logic (sum, count, max, min, avg, distinct count, last, list) over fixed, sliding, or infinite windows, and writes results back to the KV store. A delay‑queue mechanism enables sliding‑window updates without retaining all raw events.
Real‑time Feature Computation Platform
The platform supports 24 common feature types (combinations of three window kinds and eight aggregation functions). The processing flow consists of three abstract steps: read prior state, compute new value, and write back. Implementations for fixed windows embed the timestamp in the key; sliding windows use a delay queue to offset expired contributions; infinite windows rely on offline baselines plus incremental real‑time updates.
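The fixed-window and sliding-window ideas can be sketched as follows, for a sum aggregation only. This is an illustration under assumptions, not the platform's Storm implementation: the dictionary stands in for the KV store, a `heapq` list stands in for the delay queue, and the names `fixed_window_key` and `SlidingSum` are hypothetical. Each event is assumed to count toward the window for exactly `window_seconds` after its timestamp.

```python
import heapq


def fixed_window_key(key, ts, window_seconds):
    """Fixed window: embed the window start time in the storage key."""
    bucket_start = ts - ts % window_seconds
    return f"{key}:{bucket_start}"


class SlidingSum:
    """Sliding-window sum that cancels expired contributions via a delay queue."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.store = {}          # stand-in for the KV store
        self.delay_queue = []    # min-heap of (due_time, key, delta)

    def _apply(self, key, delta):
        # The three abstract steps: read prior state, compute, write back.
        prior = self.store.get(key, 0)
        self.store[key] = prior + delta

    def advance(self, now):
        """Apply every delayed offset whose due time has been reached."""
        while self.delay_queue and self.delay_queue[0][0] <= now:
            _, key, delta = heapq.heappop(self.delay_queue)
            self._apply(key, delta)

    def on_event(self, key, value, ts):
        self.advance(ts)
        self._apply(key, value)
        # Schedule the opposite delta for when this event leaves the window.
        heapq.heappush(self.delay_queue, (ts + self.window, key, -value))


window = SlidingSum(window_seconds=60)
window.on_event("u1", 5, ts=0)
window.on_event("u1", 3, ts=30)
```

The stored value is always the current aggregate; the only per-event data kept is the pending offset in the delay queue, which is discarded once applied.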
Real‑time Feature Optimization
To handle high QPS, the system adopts incremental computation (e.g., maintaining sum and count for averages) and approximate algorithms such as HyperLogLog for distinct counts.
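The average case can be illustrated with a small sketch: the state kept per key is a (sum, count) pair, and each new value updates it in constant time. The function name `update_average` and the tuple layout are assumptions for illustration; HyperLogLog is not shown here.

```python
def update_average(state, value):
    """Fold one new value into a (sum, count) state and return the new average."""
    total, count = state
    total += value
    count += 1
    return (total, count), total / count


state = (0.0, 0)
state, avg_after_first = update_average(state, 10)
state, avg_after_second = update_average(state, 20)
```

Storing the pair rather than the average itself matters because an average alone cannot be updated correctly without knowing how many values produced it.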
Feature Production Scheduling Techniques
Logical Storage Layer
Domain metadata is decoupled from storage details via a Storage entity. This enables versioned data, read/write separation, and atomic switches between storage versions.
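One way to picture the version switch is a logical handle that points at one of several physical data versions. The sketch below is a simplified in-memory model with hypothetical names (`LogicalStorage`, `load_version`, `switch`); the real Storage entity and its backing engines are not described in enough detail in the source to reproduce.

```python
class LogicalStorage:
    """Logical storage that maps reads to whichever data version is active."""

    def __init__(self):
        self.versions = {}    # version label -> physical data (a dict here)
        self.active = None    # label of the version that serves reads

    def load_version(self, version, records):
        """Write a full data version without affecting current readers."""
        self.versions[version] = dict(records)

    def switch(self, version):
        """Point reads at a fully loaded version in a single assignment."""
        if version not in self.versions:
            raise KeyError(f"version {version!r} has not been loaded")
        self.active = version

    def get(self, key):
        if self.active is None:
            return None
        return self.versions[self.active].get(key)


storage = LogicalStorage()
storage.load_version("20240101", {"u1": "gold"})
storage.switch("20240101")
storage.load_version("20240102", {"u1": "platinum"})  # readers still see 20240101
value_before_switch = storage.get("u1")
storage.switch("20240102")
value_after_switch = storage.get("u1")
```

Readers never observe a half-written version in this model, because a version only becomes visible after it is completely loaded and the pointer moves.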
Incremental Update and Data Consistency
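Instead of reloading the full data set every day, the team computes the diff between successive daily snapshots and writes only the keys that changed, which substantially reduces write load. Each record also carries a lease: a record whose lease has expired is included in the next diff even if its value is unchanged, so every record is rewritten at least once per lease period. This guarantees eventual consistency.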
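The diff-plus-lease idea can be sketched as below. The modeling is an assumption: leases are represented as a `last_written` map from key to the day number of the last write, a lease expires after `lease_days`, and the function name `compute_diff` is hypothetical.

```python
def compute_diff(previous, current, last_written, today, lease_days):
    """Return (records to write, keys to delete) for an incremental sync.

    A record is written if it is new, if its value changed, or if its
    lease has expired (it was last written lease_days or more days ago).
    """
    to_write = {}
    for key, value in current.items():
        changed = key not in previous or previous[key] != value
        expired = (
            key not in last_written
            or today - last_written[key] >= lease_days
        )
        if changed or expired:
            to_write[key] = value
    # Keys that disappeared from the new snapshot must be removed.
    to_delete = sorted(key for key in previous if key not in current)
    return to_write, to_delete


previous = {"a": 1, "b": 2, "c": 3}
current = {"a": 1, "b": 20, "d": 4}
last_written = {"a": 0, "b": 5, "c": 5}

writes_day6, deletes_day6 = compute_diff(previous, current, last_written, 6, 7)
writes_day7, deletes_day7 = compute_diff(previous, current, last_written, 7, 7)
```

On day 6 only the changed key `b` and the new key `d` are written; on day 7 the unchanged key `a` is written as well because its lease has run out.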
Write Peak Shaving
Offline jobs are throttled by the scheduler (max concurrent sync jobs × per‑job concurrency ≤ storage write capacity). Real‑time writes are mediated by an Updater service that enforces per‑client rate limits and can reject or delay excess traffic.
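Both limits can be sketched in a few lines. The helper `max_sync_jobs` applies the scheduler inequality directly, and `TokenBucket` is one common way to implement a per-client rate limit; the source does not say which algorithm the Updater service uses, so the token bucket is an assumption. The clock value is passed in explicitly to keep the example deterministic.

```python
def max_sync_jobs(storage_write_capacity, per_job_concurrency):
    """Largest job count with jobs * per_job_concurrency <= write capacity."""
    return storage_write_capacity // per_job_concurrency


class TokenBucket:
    """Per-client limiter: refills at rate_per_sec, holds at most burst tokens."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now, cost=1):
        """Return True and consume tokens if the write may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller may reject the write or retry later


jobs = max_sync_jobs(storage_write_capacity=100, per_job_concurrency=8)
bucket = TokenBucket(rate_per_sec=2, burst=2)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0),
             bucket.allow(0.5), bucket.allow(0.5)]
```

A rejected write corresponds to the "reject" option in the text; delaying instead would mean queuing the request until `allow` succeeds.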
Atomic Update
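Offline updates are atomic at day granularity because the logical storage layer switches to a new data version only after it is fully loaded. Real‑time updates achieve atomicity in one of two ways: by grouping keys so that all updates to a given key are processed by a single thread, or by exposing a CAS (compare‑and‑swap) API.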
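A CAS-based update typically follows a read, compute, conditional-write loop that retries on conflict. The sketch below uses a hypothetical in-memory `VersionedKV` class with per-key version numbers; it does not reproduce the API of Tair or any other real store.

```python
import threading


class VersionedKV:
    """In-memory stand-in for a KV store that supports compare-and-swap."""

    def __init__(self):
        self._data = {}                # key -> (version, value)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def cas(self, key, expected_version, new_value):
        """Write only if nobody has updated the key since it was read."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, new_value)
            return True


def atomic_add(store, key, delta, max_retries=100):
    """Read-modify-write loop that retries when a concurrent writer wins."""
    for _ in range(max_retries):
        version, value = store.get(key)
        if store.cas(key, version, (value or 0) + delta):
            return True
    return False


store = VersionedKV()
atomic_add(store, "u1:order_count", 3)
atomic_add(store, "u1:order_count", 4)
stale_write_accepted = store.cas("u1:order_count", 1, 999)  # version is now 2
```

Compared with single-threaded key groups, CAS lets any worker update any key, at the cost of retries when several workers contend for the same key.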
Data Fusion and Recovery
Offline calculations provide baselines for long‑term windows, while real‑time streams handle recent data. Periodic offline snapshots allow fast recovery: if a real‑time failure occurs, the system can roll back to the latest snapshot and replay streams from that point.
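The fusion and recovery steps can be sketched for a simple count feature. The assumptions here are that the event stream is replayable by offset, that a snapshot records the offset it covers, and that the names `fused_count` and `recover` are hypothetical.

```python
def fused_count(offline_baseline, realtime_increment):
    """Long-window value = offline baseline + events seen since the baseline."""
    return offline_baseline + realtime_increment


def recover(snapshot_state, snapshot_offset, event_log):
    """Rebuild real-time state from a snapshot by replaying later events.

    event_log is an iterable of (offset, key, delta) tuples; only events
    with an offset greater than snapshot_offset are applied.
    """
    state = dict(snapshot_state)
    for offset, key, delta in event_log:
        if offset > snapshot_offset:
            state[key] = state.get(key, 0) + delta
    return state


event_log = [(1, "u1", 1), (2, "u1", 1), (3, "u2", 1), (4, "u1", 1)]
snapshot_state = {"u1": 2}       # state after the first two events
recovered = recover(snapshot_state, snapshot_offset=2, event_log=event_log)
total_orders = fused_count(offline_baseline=120, realtime_increment=recovered["u1"])
```

Replaying only the events after the snapshot offset avoids double counting, which is why the snapshot has to record the stream position it corresponds to.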
Conclusion
The online feature system now covers loading, computation, import, storage, and retrieval, but further work remains: supporting more offline frameworks, richer real‑time types, high‑availability real‑time computation, faster recovery, and integrated monitoring. The authors invite interested engineers to join Meituan’s data mining team.
Authors
Yang Hao – Head of Data Mining Systems, Meituan Platform & Hospitality Group (Peking University, 2011). Wei Bin – Data Mining Systems Engineer, Meituan Platform & Hospitality Group (Dalian University of Technology, 2015).
Recruitment notice: Meituan’s data mining team is hiring for algorithm, big‑data system development, and Java backend positions. Interested candidates may send resumes to yanghao13#meituan.com.
