Building a Fusion Engine with Ray: Ant Group’s Large‑Scale Distributed Computing Practices
The article explains how Ant Group tackles the challenge of tightly integrating multiple computing paradigms by building a Ray‑based fusion engine, detailing its architecture, features, large‑scale applications in online machine learning and parallel processing, and outlining future development and recruitment opportunities.
With the rapid emergence of new Internet technologies, solving complex problems often requires deep integration of multiple domains such as OLAP, graph computing, stream processing, and deep learning. Ant Group therefore built a fusion engine on the open‑source distributed framework Ray to address the resulting coordination, performance, and development‑efficiency challenges.
Ray provides three basic distributed primitives: tasks, objects, and actors (the "services" of this article). They map neatly to the three fundamental programming concepts of functions, variables, and classes, so single‑machine code can be converted to distributed code with simple transformations.
The main challenges when combining different technologies are complex system coordination, performance optimization across heterogeneous engines, and reduced development efficiency due to difficulty in debugging and locating issues.
The fusion engine offers a simple, universal API, multi‑language support, elastic and customizable task scheduling, distributed state management, easy error handling and fault recovery, and low‑cost DevOps.
Ray, initiated by UC Berkeley’s RISELab and co‑developed with Ant Group, aims to make distributed systems simpler to develop and use; it features agile scheduling and heterogeneous resource allocation.
In Ray's Python API, adding the @ray.remote decorator to a function turns it into a distributed task. The task is invoked with .remote(), which returns immediately with an object reference (a future) that can be passed to further remote calls or resolved with ray.get().
Similarly, decorating a class with @ray.remote turns it into an actor, a stateful distributed service whose methods are also invoked with .remote(). This allows single‑machine programs to be converted into distributed ones with few code changes.
In Ant’s financial‑intelligence architecture, the fusion engine runs on Kubernetes, using Ray as the underlying distributed platform to support dynamic graph computing, online machine learning, and other real‑time risk‑control applications.
Ant Group is a top contributor to the Ray project, providing Java APIs, GCS fault‑tolerance features, and substantial code contributions, which have helped Ray scale in production.
One major application is an end‑to‑end online machine‑learning system built on Ray, comprising real‑time data processing, distributed training, and model deployment, offering exactly‑once semantics, cross‑language development, and automated model updates.
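One common way to obtain exactly‑once results in a streaming pipeline is to checkpoint operator state together with the input offset, replay the input after a failure, and skip anything the restored state already covers. The sketch below illustrates that general idea in plain Python. It is not Ant Group's implementation, and Checkpoint, CountingOperator, and run_with_failure are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    # State and input position are saved together so they never disagree.
    next_offset: int = 0
    counts: dict = field(default_factory=dict)


class CountingOperator:
    """Counts events per key and remembers the next offset it expects."""

    def __init__(self, checkpoint=None):
        if checkpoint is None:
            checkpoint = Checkpoint()
        self.next_offset = checkpoint.next_offset
        self.counts = dict(checkpoint.counts)

    def process(self, offset, key):
        # Replayed events that the restored state already covers are skipped.
        if offset < self.next_offset:
            return
        self.counts[key] = self.counts.get(key, 0) + 1
        self.next_offset = offset + 1

    def snapshot(self):
        return Checkpoint(self.next_offset, dict(self.counts))


def run_with_failure(events, checkpoint_every=2, fail_at=3):
    operator = CountingOperator()
    last_checkpoint = operator.snapshot()
    for offset, key in enumerate(events):
        if offset == fail_at:
            break  # simulated crash: work since the last checkpoint is lost
        operator.process(offset, key)
        if (offset + 1) % checkpoint_every == 0:
            last_checkpoint = operator.snapshot()

    # Recovery: restore the checkpoint, then let the source replay everything.
    operator = CountingOperator(last_checkpoint)
    for offset, key in enumerate(events):
        operator.process(offset, key)
    return operator.counts


if __name__ == "__main__":
    print(run_with_failure(["a", "b", "a", "c", "a"]))
```

In this run the event at offset 2 is processed once before the simulated crash and once again during replay, yet the final counts match a failure‑free run because the pre‑crash work was never part of a checkpoint.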
Another use case is Ray‑MPP, an online computation system for large‑scale parallel processing in financial decision‑making, delivering ultra‑low latency, high availability, and exactly‑once data guarantees for payment‑success‑rate calculations during peak periods.
Looking ahead, the Ray project continues to attract contributions from companies like Uber, Intel, Microsoft, and ByteDance, and Ant Group plans to enhance Ray’s scalability, performance, scheduling, elasticity, and ecosystem.
Finally, Ant Group is recruiting 2021‑2022 graduates for R&D and algorithm engineering positions, inviting applications to [email protected].
AntTech
Technology is the core driver of Ant's future creation.