
Ant Financial’s Online Learning System Built on Ray: Architecture, Challenges, and Future Plans

This interview describes how Ant Financial moved from offline to online machine learning by adopting the Ray distributed engine. It covers the company's open architecture, its fusion computing approach, Ray's technical advantages, the pitfalls encountered, and plans to open‑source the system for broader AI and big‑data use.

AntTech

Overview: Most AI applications today rely on offline supervised learning: models are trained offline and then deployed for online inference, which limits how quickly they can respond to new data. Ant Financial aims to move toward online learning, in which models adapt to a dynamic environment as data arrives, achieving faster model iteration and more accurate predictions.
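To make the contrast with offline training concrete, here is a minimal sketch of online learning in plain Python. It is not Ant Financial's system and does not use Ray; the class `OnlineLogisticRegression` and the helper `run_stream` are hypothetical names chosen for illustration. The model predicts on each incoming example and is then updated with that same example, so it never waits for a full offline retraining cycle.

```python
import math
from typing import Iterable, List, Tuple


class OnlineLogisticRegression:
    """Illustrative logistic regression updated one example at a time (SGD)."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: List[float]) -> float:
        # Probability that the label is 1 under the current weights.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x: List[float], y: int) -> None:
        # One gradient step on the log loss for a single example.
        error = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error


def run_stream(
    model: OnlineLogisticRegression,
    stream: Iterable[Tuple[List[float], int]],
) -> List[float]:
    """Predict first, then train, for every example in the stream.

    Returns the predictions made before each update.
    """
    predictions = []
    for x, y in stream:
        predictions.append(model.predict_proba(x))
        model.learn_one(x, y)
    return predictions


if __name__ == "__main__":
    # Toy stream (made-up data): positive feature -> label 1, negative -> label 0.
    toy_stream = [([1.0], 1), ([-1.0], 0)] * 50
    model = OnlineLogisticRegression(n_features=1)
    run_stream(model, toy_stream)
    print(model.predict_proba([1.0]), model.predict_proba([-1.0]))
```

In an offline setup the same model would be fit on a stored dataset and redeployed periodically; here every event both receives a prediction and contributes to the next one.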

Adoption of Ray: Since July 2018, Ant Financial has built a financial‑grade online learning system on the Ray distributed engine, improving end‑to‑end latency, stability, and development efficiency compared with traditional online learning frameworks.

Evolution of Ant Financial’s Data Architecture: The architecture progressed from Hadoop‑based offline computing (2011‑2013) to Storm‑based real‑time streaming (post‑2013), and finally to an open, modular architecture (from 2017) that treats computation engines as plug‑in components, enabling seamless integration of batch, stream, graph, and machine‑learning workloads.

Fusion Computing Concept: By abstracting computation modes (stream, batch, graph, ML) from the underlying distributed service layer, Ant Financial uses Ray as a native, extensible framework that can support multiple modes without being bound to a single paradigm, addressing limitations of Spark and Flink.
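The idea of separating computation modes from a shared service layer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Runtime`, `batch_engine`, and `stream_engine` are hypothetical names, everything runs locally in one process, and nothing here reflects Ray's actual API or Ant Financial's implementation. The point is that each mode is a plug-in registered against one common execution primitive.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

# An engine receives the shared runtime, a user function, and input data.
Engine = Callable[["Runtime", Callable[[Any], Any], Iterable[Any]], Any]


class Runtime:
    """Hypothetical stand-in for the shared distributed service layer."""

    def __init__(self) -> None:
        self._engines: Dict[str, Engine] = {}
        self.tasks_executed = 0

    def execute(self, func: Callable[[Any], Any], item: Any) -> Any:
        # Single execution primitive that every engine builds on.
        # A real system would schedule this on a cluster; here it runs inline.
        self.tasks_executed += 1
        return func(item)

    def register(self, mode: str, engine: Engine) -> None:
        # Plug in an engine for a computation mode ("batch", "stream", ...).
        self._engines[mode] = engine

    def run(self, mode: str, func: Callable[[Any], Any], data: Iterable[Any]) -> Any:
        if mode not in self._engines:
            raise KeyError(f"no engine registered for mode {mode!r}")
        return self._engines[mode](self, func, data)


def batch_engine(runtime: Runtime, func: Callable[[Any], Any], data: Iterable[Any]) -> List[Any]:
    # Batch mode: process the whole input and return all results at once.
    return [runtime.execute(func, item) for item in data]


def stream_engine(runtime: Runtime, func: Callable[[Any], Any], data: Iterable[Any]) -> Iterator[Any]:
    # Stream mode: yield each result as soon as its record has been processed.
    for item in data:
        yield runtime.execute(func, item)


if __name__ == "__main__":
    runtime = Runtime()
    runtime.register("batch", batch_engine)
    runtime.register("stream", stream_engine)

    print(runtime.run("batch", lambda x: x * 2, [1, 2, 3]))
    for result in runtime.run("stream", lambda x: x * 2, iter([4, 5])):
        print(result)
```

Adding a graph or machine-learning mode in this sketch means registering one more engine function; the runtime itself does not change, which is the property the fusion computing description attributes to the architecture.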

Advantages and Drawbacks of Ray: Ray’s strengths lie in its native, highly extensible design and strong scalability, allowing flexible modifications. Its drawbacks stem from its youth; the engine is still evolving, with many features and ecosystem tools requiring further development.

Challenges and Pitfalls: Engineering challenges include scaling Ray from laboratory tests to production workloads, ensuring reliability, integrating with TensorFlow, handling multi‑language scheduling, and maintaining low‑latency online training. Operational issues involve version control, model isolation, and robust deployment pipelines.

Open‑Source Plans: Ant Financial intends to open‑source its online learning framework and Ray customizations in March 2020, aiming to reduce duplication of effort for other companies and foster community contributions.

Future Directions: The roadmap emphasizes an open, fused computing architecture, unified graph computing, and tighter hardware‑software integration, while also monitoring emerging technologies such as data lakes and large‑scale dynamic graph processing.

Interviewee: Zhou Jiayin, senior technical expert at Ant Financial, leads the online computing team and has overseen the evolution from offline to online data processing.

big data, AI, online learning, distributed computing, Ray, Ant Financial