Building a Fusion Engine with Ray: Ant Group’s Large‑Scale Distributed Computing Practices

The article explains how Ant Group tackles the challenge of tightly integrating multiple computing paradigms by building a Ray‑based fusion engine, detailing its architecture, features, large‑scale applications in online machine learning and parallel processing, and outlining future development and recruitment opportunities.

AntTech

Mar 1, 2021

New Internet technologies are emerging rapidly, and solving complex problems often requires deep integration of multiple computing domains such as OLAP, graph computing, stream processing, and deep learning. Ant Group therefore built a fusion engine on the open‑source distributed framework Ray to address the resulting coordination, performance, and development‑efficiency challenges.

Ray provides three basic distributed primitives: tasks, objects, and services (called actors in Ray's own terminology). They map directly to the three fundamental programming concepts of functions, variables, and classes, so a program can be converted between its single‑machine and distributed forms with simple transformations.

The main challenges when combining different technologies are complex system coordination, performance optimization across heterogeneous engines, and reduced development efficiency due to difficulty in debugging and locating issues.

The fusion engine offers a simple, universal API, multi‑language support, elastic and customizable task scheduling, distributed state management, easy error handling and fault recovery, and low‑cost DevOps.

Ray, initiated by UC Berkeley’s RISELab and co‑developed with Ant Group, aims to make distributed systems simpler to develop and use, and features agile scheduling and heterogeneous resource allocation.

Adding the @ray.remote decorator to a function turns it into a distributed task. The task is invoked with .remote(), which returns immediately with a reference to the future result; that reference behaves like a variable and can be passed to further computations.
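The function-to-task conversion can be sketched as follows. This is a minimal illustration, not code from Ant Group: the function names `square`, `add`, and `run_distributed` are hypothetical, and the sketch assumes Ray is installed (`pip install ray`). Calling `ray.remote(fn)` is the function-call form of the decorator, used here so that the plain functions stay usable on a single machine.

```python
def square(x: int) -> int:
    """Plain single-machine function."""
    return x * x


def add(a: int, b: int) -> int:
    """Plain single-machine function that combines two results."""
    return a + b


def run_distributed() -> int:
    """Run the same functions as Ray tasks (assumes Ray is installed)."""
    import ray  # imported lazily so the functions above work without Ray

    ray.init(ignore_reinit_error=True)

    # Same effect as decorating the functions with @ray.remote.
    remote_square = ray.remote(square)
    remote_add = ray.remote(add)

    # .remote() returns an object reference immediately, not the value.
    a_ref = remote_square.remote(3)
    b_ref = remote_square.remote(4)

    # References can be passed straight into another task; Ray resolves
    # them to their values before the task body runs.
    total_ref = remote_add.remote(a_ref, b_ref)

    result = ray.get(total_ref)  # block until the final value is ready
    ray.shutdown()
    return result
```

The single-machine equivalent is `add(square(3), square(4))`. The distributed version keeps the same data flow and only changes how each call is issued.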

Similarly, decorating a class with @ray.remote turns it into a distributed service whose methods are invoked remotely, so single‑machine programs can be converted into distributed ones with minimal changes.
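The class-to-service conversion looks much the same. The sketch below is illustrative only: the `Counter` class and `run_as_service` helper are hypothetical names, and Ray is assumed to be installed. In Ray's terminology the resulting stateful service is an actor.

```python
class Counter:
    """Plain single-machine class holding mutable state."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self, amount: int = 1) -> int:
        self.value += amount
        return self.value


def run_as_service(calls: int = 3) -> list:
    """Host Counter as a Ray actor and call it remotely (assumes Ray is installed)."""
    import ray  # imported lazily so Counter works without Ray

    ray.init(ignore_reinit_error=True)

    # Same effect as decorating the class with @ray.remote.
    RemoteCounter = ray.remote(Counter)

    # .remote() on the class starts a worker process that owns one instance.
    counter = RemoteCounter.remote()

    # Each method call is also issued with .remote() and returns a reference.
    refs = [counter.increment.remote() for _ in range(calls)]

    values = ray.get(refs)  # fetch the results of all calls
    ray.shutdown()
    return values
```

The state lives inside the actor process, so callers share one counter rather than each holding a copy. This is what lets a class act as a long-running service.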

In Ant’s financial‑intelligence architecture, the fusion engine runs on Kubernetes, using Ray as the underlying distributed platform to support dynamic graph computing, online machine learning, and other real‑time risk‑control applications.

Ant Group is a top contributor to the Ray project, providing Java APIs, GCS fault‑tolerance features, and substantial code contributions, which have helped Ray scale in production.

One major application is an end‑to‑end online machine‑learning system built on Ray, comprising real‑time data processing, distributed training, and model deployment, offering exactly‑once semantics, cross‑language development, and automated model updates.

Another use case is Ray‑MPP, an online computation system for large‑scale parallel processing in financial decision‑making. It delivers ultra‑low latency, high availability, and exactly‑once data guarantees for payment‑success‑rate calculations during peak periods.

Looking ahead, the Ray project continues to attract contributions from companies like Uber, Intel, Microsoft, and ByteDance, and Ant Group plans to enhance Ray’s scalability, performance, scheduling, elasticity, and ecosystem.

Finally, Ant Group is recruiting 2021‑2022 graduates for R&D and algorithm engineering positions, inviting applications to [email protected].
