How Pinterest Scaled LLM Data Pipelines with Ray: Boosting Throughput and Cutting Costs

This article details how Pinterest’s senior staff engineer Dr. Luo leveraged the open‑source Ray framework to overcome LLM data‑preprocessing bottlenecks, describing the system’s architecture, key features such as map_batches, Carry‑Over Columns and Accumulators, and the dramatic performance and cost improvements achieved.

DataFunSummit

Pinterest LLM Processing Pain Points

Pinterest, a global image and video sharing platform with over 500 million monthly active users, migrated from OpenAI APIs to a privately hosted large‑model service. Even with custom models in place, the data‑preprocessing stage suffered from low GPU utilization (below 40%) and a fixed CPU:GPU ratio that throttled throughput. The affected PB‑scale pipelines were built on Spark, TensorFlow and PyTorch and orchestrated by Airflow.
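To see why a fixed CPU:GPU ratio can hold GPU utilization down, consider a back-of-the-envelope sketch. Every rate below is a hypothetical number chosen for illustration, not a Pinterest measurement, and `gpu_utilization` is an illustrative helper, not part of any library.

```python
# All rates are hypothetical, picked only to illustrate the bottleneck.
CPU_ROWS_PER_SEC_PER_CORE = 50   # assumed preprocessing rate of one CPU core
GPU_ROWS_PER_SEC = 2000          # assumed inference rate of one GPU


def gpu_utilization(cpu_cores_per_gpu: int) -> float:
    """Fraction of GPU capacity used when CPU preprocessing feeds the GPU."""
    feed_rate = cpu_cores_per_gpu * CPU_ROWS_PER_SEC_PER_CORE
    return min(feed_rate, GPU_ROWS_PER_SEC) / GPU_ROWS_PER_SEC


# Ratio fixed by the machine type: the GPU waits on preprocessing.
fixed = gpu_utilization(cpu_cores_per_gpu=14)
# CPU workers scaled independently of the GPU: the GPU stays busy.
decoupled = gpu_utilization(cpu_cores_per_gpu=40)

print(f"fixed ratio: {fixed:.0%}, decoupled: {decoupled:.0%}")
```

Under these assumptions the GPU is starved whenever preprocessing cannot keep up, and the only remedy is to add CPU capacity without adding GPUs, which a fixed ratio does not allow.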

Ray’s Breakthrough Approach

Ray, an open‑source distributed compute framework, was adopted to address the "last‑mile" bottleneck. Its core advantages include:

Distributed processing: parallel execution across many nodes.

Intelligent resource management: fine‑grained scheduling of CPUs, GPUs, and other heterogeneous resources.

Ecosystem compatibility: native support for TensorFlow, PyTorch and other AI frameworks.

Ray Architecture

Head node: runs the Global Control Service (GCS), which maintains cluster metadata and coordinates worker allocation, and serves the dashboard UI.

Worker node: executes tasks under its local Raylet process, which manages actor lifecycles and interacts with the distributed object store.

Coordination: the head node coordinates work across workers through the GCS; workers write results to the object store and synchronize state via their Raylets.

Implementation: From Proof of Concept to Production

Key Ray features used in Pinterest’s pipeline:

map_batches() – the core Ray Data batch API that splits datasets into batches and applies user‑defined functions in parallel, dramatically increasing throughput compared with per‑record map() calls.

Carry‑Over Columns – a mechanism that avoids transmitting non‑computational columns (e.g., user ID, session ID) across nodes, reducing network traffic and memory usage.

Accumulators – distributed state‑sharing tools for real‑time stream aggregation, gradient accumulation in distributed training, and multi‑actor coordination.
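The three features above can be pictured together in a small single-machine sketch. It deliberately does not call Ray: threads stand in for cluster workers, and every name in it (`map_batches_local`, `make_stage`, `Accumulator`, the `num_tokens` column, the sample rows) is hypothetical. With Ray Data, a batch function of this shape would be handed to `Dataset.map_batches`, and shared counters would more plausibly live in a Ray actor; the source does not describe how Pinterest implemented either piece.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List

Batch = Dict[str, List]  # column name -> values, one entry per row


class Accumulator:
    """Thread-safe counters shared by all workers (stand-in for cluster-wide state)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._totals: Counter = Counter()

    def add(self, key: str, amount: int = 1) -> None:
        with self._lock:
            self._totals[key] += amount

    def snapshot(self) -> Dict[str, int]:
        with self._lock:
            return dict(self._totals)


def iter_batches(rows: List[dict], batch_size: int) -> Iterator[Batch]:
    """Group row dicts into columnar batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


def map_batches_local(rows: List[dict], fn: Callable[[Batch], Batch],
                      batch_size: int, workers: int = 4) -> List[Batch]:
    """Apply fn to whole batches in parallel, preserving batch order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, iter_batches(rows, batch_size)))


def make_stage(compute_cols: List[str], acc: Accumulator) -> Callable[[Batch], Batch]:
    """Build a batch function that hands only compute_cols to the expensive step."""

    def stage(batch: Batch) -> Batch:
        compute = {k: v for k, v in batch.items() if k in compute_cols}
        carried = {k: v for k, v in batch.items() if k not in compute_cols}
        # Hypothetical "model": token count per prompt. Only `compute` would
        # need to travel to a GPU worker; `carried` is never touched.
        outputs = {"num_tokens": [len(p.split()) for p in compute["prompt"]]}
        acc.add("rows", len(outputs["num_tokens"]))
        acc.add("tokens", sum(outputs["num_tokens"]))
        return {**carried, **outputs}  # rows stay aligned by position

    return stage


rows = [
    {"user_id": 1, "session_id": "a", "prompt": "cozy reading nook"},
    {"user_id": 2, "session_id": "b", "prompt": "garden ideas"},
    {"user_id": 3, "session_id": "c", "prompt": "minimalist desk setup"},
]
acc = Accumulator()
results = map_batches_local(rows, make_stage(["prompt"], acc), batch_size=2)
totals = acc.snapshot()
```

Here `results` holds one output batch per input batch, with `user_id` and `session_id` carried through unchanged next to the computed column, while `totals` aggregates row and token counts across all workers. The function is called once per batch rather than once per record, which is the property that makes batch mapping cheaper than a per-record map.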

Performance Gains

After migrating core workloads from Spark/Torch DataLoader to Ray:

Related‑Pins metric improved 4.7×.

Job runtime dropped from 90 minutes to 19 minutes.

Annual infrastructure cost reduced by 30× thanks to better GPU/CPU scheduling and multi‑model inference parallelism.
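As a quick arithmetic check on what these figures imply, the short calculation below uses only the runtime and cost numbers reported above; the variable names are illustrative.

```python
before_min, after_min = 90, 19   # job runtime reported above, in minutes
cost_factor = 30                 # reported cost reduction factor

speedup = before_min / after_min            # how many times faster the job runs
runtime_cut = 1 - after_min / before_min    # fraction of runtime eliminated
remaining_cost = 1 / cost_factor            # fraction of the prior cost still paid

print(f"{speedup:.1f}x faster, {runtime_cut:.0%} less runtime, "
      f"{remaining_cost:.1%} of the prior cost")
```

In other words, a 30× reduction means paying roughly one thirtieth of the earlier bill, and the runtime drop alone corresponds to a speedup of a little under 5×.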

Future Directions

The roadmap focuses on auto‑scaling, fine‑tuning support, DataLoader throughput enhancements, and stronger fault tolerance for large‑scale clusters.

Conclusion

Ray has become Pinterest’s central platform for all machine‑learning workloads, delivering substantial efficiency, cost, and scalability benefits while simplifying development of large‑model data pipelines.
