
Online Learning and Real‑Time Model Updating in JD Retail Search Using Flink

The article describes JD's end‑to‑end online learning pipeline for retail search, covering the background, system architecture, real‑time feature collection, sample stitching, Flink‑based incremental training, parameter updates, and full‑link monitoring to achieve low‑latency, high‑accuracy model serving.

DataFunTalk

Background: JD retail search requires increasingly timely data and models that can capture real‑time signals; online learning is introduced to continuously adjust models based on live feedback, improving prediction accuracy under rapidly changing environments.

System Overview: The architecture separates static model components (trained offline) from dynamic components (doc‑level weight vectors) that are updated by an online learning task. Core modules include model inference service, ranking feature logging, Flink sample‑stitching, and the online learning task that consumes streaming samples and updates a parameter server (PS).
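The static/dynamic split described above can be sketched in a few lines of Python. This is an illustration of the pattern only: the `ParameterServer` class, the method names, and the scoring formula are all assumptions for the sketch, not JD's actual API.

```python
# Sketch: combining a static (offline-trained) score with a dynamic,
# doc-level weight vector fetched from a parameter server (PS).
# All names here are illustrative assumptions, not JD's actual system.

class ParameterServer:
    """Toy in-memory PS mapping doc_id -> weight vector."""
    def __init__(self):
        self._weights = {}

    def get(self, doc_id, dim):
        # Cold docs fall back to zeros, so inference degrades gracefully.
        return self._weights.get(doc_id, [0.0] * dim)

    def update(self, doc_id, new_weights):
        self._weights[doc_id] = new_weights


def score(static_score, doc_id, realtime_features, ps):
    """Final score = offline model output + dot(dynamic weights, live features)."""
    w = ps.get(doc_id, len(realtime_features))
    return static_score + sum(wi * xi for wi, xi in zip(w, realtime_features))


ps = ParameterServer()
ps.update("doc_42", [0.5, -0.2])             # written by the online learning task
print(score(1.0, "doc_42", [2.0, 1.0], ps))  # ≈ 1.8
print(score(1.0, "doc_99", [2.0, 1.0], ps))  # 1.0 (unseen doc -> static score only)
```

The key property the sketch captures is that the offline model never has to be retrained or redeployed for the ranking to react to live feedback; only the per-doc vectors in the PS change.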

Real‑Time Feature and Sample Processing: A feature collector filters and de‑duplicates raw search features, handling roughly 24 million feature records per second. A Flink job then stitches these features with user‑behavior logs to produce real‑time samples at a peak of 50k QPS, using union + timer logic, a RocksDB state backend, and extensive optimizations to mitigate data skew and keep checkpoint sizes manageable.
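The union + timer pattern can be illustrated outside Flink with a small Python simulation: feature events and label events arrive on one merged, time-ordered stream; features are buffered in keyed state until their label arrives; and a timer evicts features that never receive a label. The event shape, key name, and TTL value below are illustrative assumptions.

```python
# Sketch: simulating Flink's "union + timer" sample stitching in plain Python.
# Feature events and user-behavior (label) events share a key (e.g. a request id);
# a feature waits in keyed state until its label arrives or a timer expires.
# This illustrates the pattern, not JD's actual Flink job.

FEATURE_TTL = 10  # seconds a feature waits for a label (assumed value)

def stitch(events):
    """events: list of (timestamp, kind, key, payload), time-ordered.
    kind is 'feature' or 'label'. Returns stitched (key, features, label) samples."""
    pending = {}   # keyed state: key -> (deadline, features)
    samples = []
    for ts, kind, key, payload in events:
        # Fire expired "timers": evict features whose deadline has passed.
        for k in [k for k, (dl, _) in pending.items() if dl <= ts]:
            del pending[k]
        if kind == "feature":
            pending[key] = (ts + FEATURE_TTL, payload)
        elif kind == "label" and key in pending:
            _, feats = pending.pop(key)
            samples.append((key, feats, payload))
    return samples

events = [
    (0, "feature", "req1", {"pos": 3}),
    (1, "feature", "req2", {"pos": 7}),
    (4, "label",   "req1", "click"),   # within TTL -> stitched into a sample
    (20, "label",  "req2", "click"),   # too late -> feature already evicted
]
print(stitch(events))  # [('req1', {'pos': 3}, 'click')]
```

In the real job, `pending` corresponds to RocksDB-backed keyed state and the eviction loop to registered event-time timers, which is what keeps checkpoint size bounded even at tens of thousands of samples per second.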

Flink Real‑Time Training: Samples are ordered with Flink watermarks and processed in count‑window batches with configurable timeouts. Training updates are written back to the PS, which serves both offline and online inference; keyed state updates are asynchronous, while operator state updates are synchronous. The system supports multiple label streams (click, add‑to‑cart, purchase, etc.) and customizable sample ratios.
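The count-window-with-timeout behavior can be sketched as a small generator: a batch is emitted when it reaches the configured size, or when the timeout elapses since the first sample in the window, so a quiet stream still produces timely updates. Parameter names and values are illustrative assumptions.

```python
# Sketch: count-window mini-batching with a timeout, as used to feed the trainer.
# A batch is emitted when it reaches `batch_size` or when `timeout` has elapsed
# since the first sample in the window. Names/values are assumed for illustration.

def batch_samples(stream, batch_size=3, timeout=5):
    """stream: iterable of (timestamp, sample) in watermark order.
    Yields lists of samples (the mini-batches sent to training)."""
    window, window_start = [], None
    for ts, sample in stream:
        if window and ts - window_start >= timeout:
            yield window           # "timer" fired before the count was reached
            window, window_start = [], None
        if not window:
            window_start = ts
        window.append(sample)
        if len(window) == batch_size:
            yield window           # count window is full
            window, window_start = [], None
    if window:
        yield window               # flush remainder at end of stream

stream = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (10, "e")]
print(list(batch_samples(stream)))  # [['a', 'b', 'c'], ['d'], ['e']]
```

Each emitted batch would then drive one incremental update pushed to the PS; mixing multiple label streams (click, add-to-cart, purchase) at configurable ratios happens upstream of this batching step.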

Full‑Link Monitoring: Comprehensive monitoring covers predictor health, feature dumps, sample association rates, latency, and A/B metrics, as well as container‑level CPU/memory metrics. This ensures rapid detection and resolution of any node failures that could affect end‑to‑end ML performance.
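One of the monitored signals, the sample association rate, reduces to a simple ratio check with an alert threshold. The sketch below assumes a 95% threshold and hypothetical function names; the article does not specify JD's actual thresholds or alerting interface.

```python
# Sketch: a sample-association-rate check with an alert threshold.
# "Association rate" here means the fraction of label events successfully
# joined to their features; the 0.95 threshold is an assumed example value.

def association_rate(joined, total):
    return joined / total if total else 1.0

def check(joined, total, threshold=0.95):
    rate = association_rate(joined, total)
    if rate < threshold:
        return f"ALERT: association rate {rate:.2%} below {threshold:.0%}"
    return f"OK: association rate {rate:.2%}"

print(check(980, 1000))  # OK: association rate 98.00%
print(check(900, 1000))  # ALERT: association rate 90.00% below 95%
```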

Conclusion: Flink provides strong performance, fault tolerance, and batch‑stream integration for real‑time ML. As data volumes and timeliness requirements grow, online learning evolves from a supplement to offline training into a core component of efficient model systems.

Tags: monitoring, Flink, feature engineering, search ranking, online learning, model serving, real-time ML
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
