How Ant Group Supercharged AI Data Pipelines with Ray: Boosting Index Build Speed and Reliability
This article details how Ant Group used the Ray distributed computing framework to accelerate massive data indexing: migrating a monolithic C++ engine to Ray, implementing elastic resource scheduling, improving long‑tail task efficiency, and building a robust RAG operator system with comprehensive governance. The migration achieved up to 2× speed gains and >99.9% job success rates.
Ray‑Based Massive Data Construction Efficiency
Ant Group replaced a monolithic C++ index‑building engine (≈200k lines) with a Ray‑based architecture to construct trillion‑scale forward and inverted indexes (KV, KKV). Ray provides elastic resource scheduling, task‑level granularity, and built‑in fault tolerance, eliminating the Kubernetes container contention and OOM failures seen with the old engine.
Elastic resource scheduling: Ray dynamically creates containers for each task, allowing on‑demand scaling and preventing resource starvation.
Long‑tail scenario optimization: A three‑stage Processor‑Builder‑Merge pipeline runs small‑batch jobs concurrently, reusing resources and cutting P95 latency for sub‑100 GB tables from tens of minutes to ~10 minutes, roughly doubling overall throughput.
Stability and success rate: Integrated task retry, failure isolation, and state persistence raise job success to >99.9%.
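The three‑stage pipeline described above can be sketched with Ray tasks. This is a minimal illustration, not Ant Group's actual code: the function names (process_batch, build_segment, merge_segments) and the record layout are assumptions; only the Ray primitives (ray.remote, ray.get, ray.is_initialized) are real API.

```python
# Minimal sketch of a Processor -> Builder -> Merge index pipeline.
# All names and data shapes are illustrative assumptions.
try:
    import ray  # optional: the plain functions below also run without Ray
except ImportError:
    ray = None

def process_batch(records):
    """Processor stage: normalize raw records into (key, value) pairs."""
    return [(r["id"], r["text"].lower()) for r in records]

def build_segment(pairs):
    """Builder stage: build a small forward-index segment."""
    return dict(pairs)

def merge_segments(segments):
    """Merge stage: combine per-batch segments into one index."""
    index = {}
    for seg in segments:
        index.update(seg)
    return index

def run_pipeline(batches):
    if ray is not None and ray.is_initialized():
        # Each small batch becomes a process-level Ray task, so many
        # sub-100 GB tables can share one pool of reusable workers.
        proc = ray.remote(process_batch)
        build = ray.remote(build_segment)
        segs = [build.remote(proc.remote(b)) for b in batches]
        return merge_segments(ray.get(segs))
    # Fallback: run the same three stages serially without Ray.
    return merge_segments(build_segment(process_batch(b)) for b in batches)
```

Because each batch is an independent task, long‑tail small tables no longer wait behind large ones for a full container allocation.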
Ray‑Based Retrieval‑Augmented Generation (RAG) Operator System
A unified operator framework on Ray addresses fragmented data‑processing pipelines in RAG scenarios. The system consists of three layers:
Operator marketplace: Supports multi‑tenant registration, permission control, and billing. Custom operators are uploaded and automatically registered with metadata.
Operator programming model: Python decorators define explicit input/output contracts for operators such as SourceOp, ParseOp, and ChunkOp. The model enforces type checking and JSON‑Schema/Protobuf validation, promoting reusable components.
Operator service: Ray‑driven execution guarantees SLAs for batch, streaming, and online serving workloads. Health‑check, fail‑over, and auto‑scaling mechanisms ensure high availability.
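A decorator‑based contract like the one the operator programming model describes might look as follows. This is a hypothetical stand‑in, not Ant Group's published API: the `operator` decorator, the `OPERATORS` registry, and the toy ChunkOp are all assumptions used to show the idea of declared, enforced input/output contracts.

```python
from typing import Callable, Dict, Type

# Hypothetical in-process registry standing in for the operator marketplace.
OPERATORS: Dict[str, Callable] = {}

def operator(name: str, input_type: Type, output_type: Type):
    """Declare an operator with an explicit input/output contract.

    Registers the function under `name` and enforces the declared
    types at call time (a lightweight analogue of the article's
    JSON-Schema/Protobuf validation).
    """
    def wrap(fn: Callable) -> Callable:
        def checked(value):
            if not isinstance(value, input_type):
                raise TypeError(f"{name}: expected {input_type.__name__}")
            result = fn(value)
            if not isinstance(result, output_type):
                raise TypeError(f"{name}: must return {output_type.__name__}")
            return result
        OPERATORS[name] = checked
        return checked
    return wrap

@operator("ChunkOp", input_type=str, output_type=list)
def chunk(text: str, size: int = 16) -> list:
    """Toy ChunkOp: split a document into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Explicit contracts let the operator service validate payloads at the boundary instead of failing deep inside a pipeline, which is what makes operators safely reusable across tenants.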
API Governance Layer
The API layer provides six core capabilities: call metering, OAuth‑based permission isolation, flexible billing, rate limiting, circuit breaking, and intelligent routing. Precise tracking of CPU, memory, and GPU usage enables fine‑grained cost allocation.
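Two of these capabilities, call metering and rate limiting, can be sketched together with a token bucket. The class below is an illustrative toy, not the production governance layer (which also covers OAuth isolation, billing, circuit breaking, and routing); all names are assumptions.

```python
import time
from collections import defaultdict

class MeteredLimiter:
    """Toy sketch of two governance capabilities: per-tenant call
    metering plus token-bucket rate limiting."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate                  # tokens refilled per second
        self.burst = burst                # bucket capacity
        self.tokens = float(burst)        # start with a full bucket
        self.last = time.monotonic()
        self.calls = defaultdict(int)     # per-tenant admitted-call meter

    def allow(self, tenant: str) -> bool:
        # Refill tokens based on elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            self.calls[tenant] += 1       # meter only admitted calls
            return True
        return False                      # rejected: caller should back off
```

The same metering counters that enforce limits can feed the billing and cost‑allocation capabilities the article lists.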
Migration Benefits and Performance Gains
Over six months, 80% of the legacy C++ index‑building tasks were migrated to Ray. Key changes include:
Actor refactor: Workers were converted to Ray actors, shifting execution granularity from container level to process level. This reduces cold‑start overhead and speeds up small‑table indexing.
Container spec simplification: More than ten container variants were consolidated into two standard sizes. Ray's dynamic allocation eliminates scheduling overhead and mitigates Kubernetes resource conflicts.
Reliability enhancements: Integrated health checks, fail‑over, and auto‑scaling raise the overall success rate to >99.9% and lower the operational burden.
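The actor refactor above can be illustrated with a stateful worker class that Ray promotes to a process‑level actor. The `IndexWorker` class and its methods are hypothetical; `ray.remote(Class).remote()` is the standard Ray pattern for creating an actor from a plain class.

```python
# Sketch of the container-to-actor refactor: a long-lived worker that
# keeps warm state across many small tables instead of paying a
# container cold start per table. Class and method names are assumptions.
try:
    import ray
except ImportError:
    ray = None

class IndexWorker:
    """Holds reusable state across many small-table builds."""

    def __init__(self):
        self.built = 0  # tables indexed by this worker so far

    def build(self, table: list) -> dict:
        self.built += 1
        return {row["id"]: row for row in table}

def make_worker():
    if ray is not None and ray.is_initialized():
        # Process-level Ray actor; call methods via .remote() + ray.get().
        return ray.remote(IndexWorker).remote()
    return IndexWorker()  # plain in-process fallback for local runs
```

Because the actor stays alive between tasks, per‑table startup cost drops from container provisioning to a method call, which is the main source of the small‑table speedup.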
Future Outlook
Ant Group plans to evolve the AI‑Native data construction engine along three directions:
Deep AI‑Native integration for unified structured and unstructured data processing.
Unification of the Remote Shuffle Service (RSS, now Apache Celeborn) to improve large‑scale data exchange, especially for Ray Data shuffle phases.
Embedding evolution to support multimodal perception and shift pipelines from compute‑driven to intelligence‑driven designs.
Q&A Highlights
Q1: Are online RAG pipelines built on Ray Serve?
A1: No. They run on Ant Group’s proprietary online service engine, not Ray Serve.
Q2: Is the Remote Shuffle Service (RSS) related to Ray and used with Ray Data?
A2: Yes. RSS (now Apache Celeborn) is integrated with Ray, providing a unified shuffle service for Spark, Flink, and Ray Data, thereby enhancing the performance and stability of Ray Data’s data‑exchange stage.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
