
Building a Unified Online Machine Learning Platform with Ray for Alipay’s “Collect Five Blessings” Campaign

The article describes how Alipay tackled the cold‑start, conversion, and user‑experience challenges of its time‑limited “Collect Five Blessings” activity by designing a unified online machine‑learning system based on the Ray distributed‑computing framework, emphasizing stability, efficiency, simplicity, multi‑language support, and fault‑tolerant scheduling.

AntTech

During the Spring Festival, Alipay’s “Collect Five Blessings” campaign required rapid, personalized reward matching for millions of users, exposing cold‑start and conversion‑rate problems that could be framed as an online learning optimization task.

The initial solution involved many fragmented modules—log collection, stream processing, feature stitching, model training, validation, and real‑time model serving—leading to high system complexity, SLA degradation, inefficient I/O, and heavy development/operations costs.

To address these issues, the team defined three goals: “steady, fast, simple” (稳快简). They needed end‑to‑end data and computation consistency, internal task scheduling instead of cross‑system orchestration, and a single platform that reduced integration effort for developers and operators.

Ray, an open‑source distributed‑computing framework co‑developed by UC Berkeley’s RISELab and Ant Financial, was chosen as the foundation because it offers agile scheduling, heterogeneous resource management, and built‑in fault tolerance, and because it covers three core capabilities on one runtime: data processing, model training, and model serving.

Ray’s programming model maps distributed primitives (tasks, objects, services) to familiar concepts (functions, variables, classes). By adding the @ray.remote decorator, developers turn a function into a distributed task and a class into a stateful service (an actor), so single‑machine code can be converted to distributed execution with few changes.

The platform supports multi‑language APIs, allowing Java‑based stream operators and Python‑based machine‑learning models to coexist, which matches Ant Financial’s internal technology stack.

Dynamic DAG generation and Ray’s Actor model provide on‑the‑fly task insertion and robust fault‑tolerant training, ensuring that online learning pipelines can adapt without interrupting existing services.

Stability is addressed on three fronts: system stability (real‑time, strong consistency), model stability (handling noise and combining online/offline features), and mechanism stability (race‑condition handling and fast rollback).
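The fast-rollback mechanism is not detailed in the source. One common shape for it, sketched here with a hypothetical ModelRegistry class, an assumed AUC validation metric, and an assumed 0.6 threshold, is to gate each new model on a validation metric and keep earlier versions so serving can revert in one step.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical store of published model versions, oldest first."""

    history: list = field(default_factory=list)

    def publish(self, version, auc, min_auc=0.6):
        # Reject a candidate whose validation metric is below the floor.
        if auc < min_auc:
            return False
        self.history.append(version)
        return True

    @property
    def serving(self):
        # The newest published version is the one being served.
        return self.history[-1] if self.history else None

    def rollback(self):
        # Drop the newest version, but always keep at least one to serve.
        if len(self.history) > 1:
            self.history.pop()
        return self.serving


registry = ModelRegistry()
registry.publish("v1", auc=0.71)
accepted = registry.publish("v2", auc=0.55)  # noisy update, fails the gate
registry.publish("v3", auc=0.74)
current = registry.serving
reverted = registry.rollback()
```

In this sketch a noisy online update never reaches serving, and a regression that slips through can be undone without retraining.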

Since its rollout in February 2023, the unified platform has achieved 99.9% end‑to‑end SLA, 2‑40% business metric improvements, reduced model latency from tens of minutes to 4‑5 minutes, and a 60% reduction in machine usage across multiple Alipay scenarios.

The success demonstrates that a Ray‑based fusion of stream processing and machine learning can deliver a cloud‑native, open, and extensible architecture for complex, time‑critical financial services.

system architecture, distributed computing, Ray, Alipay, Online Machine Learning