How Data-Juicer Powers Multi‑Modal Data Processing for Large Language Models

This article explains the evolution of Data‑Juicer from a pure‑text preprocessing tool to a full‑stack multi‑modal data engine, detailing its architecture, operator library, Ray‑based distributed execution, performance benchmarks, integration with AI agents, and roadmap for future AI‑centric data workflows.


Background: From Model‑Centric to Data‑Centric AI

Since 2022, research on large language models (LLMs) has shifted from focusing solely on model architecture to emphasizing high‑quality, multi‑modal data as a core factor for model performance. Papers on "Data & MLLMs" have surged, highlighting the need for scalable data cleaning, augmentation, and annotation pipelines.

Data‑Juicer Overview

Data‑Juicer, an open‑source project from Tongyi Lab, started as a pure‑text pre‑training pipeline (version 1.0) and has evolved into a 2.0 release that supports video, audio, and other rich media. It provides a modular operator library (≈200 operators), a user‑friendly low‑code UI, and a distributed execution engine built on Ray.

Key Features

Operator Library: Includes filters, mappers, aggregators, and AI‑specific operators (e.g., language ID, deduplication, safety checks, back‑translation, synthetic Q&A).

Low‑Code / No‑Code Interface: Web UI, CLI, and RESTful API allow users to compose pipelines without writing code; an Agent‑style conversational interface can translate natural‑language intents into operator chains.

Distributed Execution: Ray‑based engine supports CPU/GPU hybrid execution, automatic fault tolerance, operator fusion, and dynamic resource scheduling.
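The filter/mapper composition described above can be sketched in plain Python. This is an illustrative model of the operator pattern, not Data‑Juicer's actual API; the class and function names (`Mapper`, `Filter`, `run_pipeline`) are assumptions for the sake of the example.

```python
# Illustrative sketch of a Data-Juicer-style operator pipeline.
# Mappers transform a sample in place; filters decide whether to keep it.

class Mapper:
    def __call__(self, sample: dict) -> dict:
        raise NotImplementedError

class Filter:
    def __call__(self, sample: dict) -> bool:
        raise NotImplementedError

class LowercaseMapper(Mapper):
    def __call__(self, sample):
        sample["text"] = sample["text"].lower()
        return sample

class MinLengthFilter(Filter):
    def __init__(self, min_len=5):
        self.min_len = min_len

    def __call__(self, sample):
        return len(sample["text"]) >= self.min_len

def run_pipeline(samples, ops):
    """Apply each operator in order; drop a sample as soon as a filter rejects it."""
    out = []
    for s in samples:
        keep = True
        for op in ops:
            if isinstance(op, Filter):
                if not op(s):
                    keep = False
                    break
            else:
                s = op(s)
        if keep:
            out.append(s)
    return out

data = [{"text": "Hello WORLD"}, {"text": "Hi"}]
result = run_pipeline(data, [LowercaseMapper(), MinLengthFilter(5)])
# keeps only the first sample, lowercased
```

In the real system, recipes of this shape are typically declared in configuration rather than hand-written loops, which is what makes the low-code interface possible.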

Architecture Details

The system consists of four pillars: atomic operators, a user‑friendly interface, a Ray‑powered distributed engine, and utility validation modules that close the loop between data and model training.

Operators are organized along four dimensions (operator type, modality type, functional type, and implementation type), enabling systematic reuse and clear semantics. For example, clean_html_mapper removes HTML tags, while text_classifier_mapper can load a remote PyTorch model on specified GPUs.
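To make the semantics of a mapper concrete, here is a heavily simplified sketch of what a clean_html-style operator does. Data‑Juicer's actual clean_html_mapper is more robust (it handles malformed markup and more entity cases); this regex-based version only illustrates the transform.

```python
import re
from html import unescape

def clean_html(text: str) -> str:
    """Simplified sketch of a clean_html-style mapper:
    strip tags, unescape HTML entities, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)   # replace every tag with a space
    unescaped = unescape(no_tags)             # &amp; -> &, &lt; -> <, etc.
    return re.sub(r"\s+", " ", unescaped).strip()

clean_html("<p>Hello <b>world</b></p>")  # -> "Hello world"
```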

Runtime Layer

Data‑Juicer builds a computation graph (DJ‑Dataset and DJ‑Operator) that is executed by Ray. The graph supports batch processing, operator fusion, and fine‑grained fault tolerance. Ray’s map_batches and custom executors (DJRayExecutor, PartitionedRayExecutor) enable high‑throughput data ingestion from OSS/S3 and efficient CPU/GPU scheduling.
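The batching and operator-fusion ideas can be sketched without Ray itself: fuse several per-sample mappers into one batch-level function, so a scheduler like Ray's map_batches launches one task per batch rather than one task per operator. The helper names (`fuse`, `batches`) are assumptions for this sketch, not Data‑Juicer internals.

```python
from typing import Callable, Iterable, List

def fuse(*ops: Callable[[dict], dict]) -> Callable[[List[dict]], List[dict]]:
    """Fuse per-sample mappers into a single batch-level function,
    cutting per-task scheduling overhead."""
    def fused_batch(batch: List[dict]) -> List[dict]:
        out = []
        for sample in batch:
            for op in ops:
                sample = op(sample)
            out.append(sample)
        return out
    return fused_batch

def batches(data: Iterable[dict], size: int):
    """Yield fixed-size batches from a sample stream."""
    buf = []
    for item in data:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

strip_op = lambda s: {**s, "text": s["text"].strip()}
lower_op = lambda s: {**s, "text": s["text"].lower()}
fused = fuse(strip_op, lower_op)

data = [{"text": "  A  "}, {"text": "B"}, {"text": " C"}]
processed = [s for b in batches(data, 2) for s in fused(b)]
```

In production the fused function would be handed to Ray's map_batches, which distributes batches across the cluster and retries failed tasks for fault tolerance.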

Performance Benchmarks

On a 6400‑core CPU cluster, Data‑Juicer processes 70 billion samples in 0.45 h; on an 8 × 64 A100 GPU cluster, the same workload finishes in 1.8 h. The custom ray_bts_minhash_deduplicator operator achieves a 3.3× speed‑up over native Ray GroupBy by using BTS‑based load‑balancing, hash partitioning, and Cython‑level optimizations.
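To ground what the deduplicator computes, here is a toy MinHash sketch. The real ray_bts_minhash_deduplicator adds banding/LSH, BTS load balancing, and distributed union-find on top of this idea; the function names and parameters below are illustrative only.

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 16, shingle: int = 3):
    """Toy MinHash: hash character shingles with salted MD5 and keep the
    minimum hash per salt. Matching signature components estimate the
    Jaccard similarity of the underlying shingle sets."""
    shingles = {text[i:i + shingle] for i in range(max(1, len(text) - shingle + 1))}
    sig = []
    for salt in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return tuple(sig)

def jaccard_estimate(a, b):
    """Fraction of matching signature components."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = minhash_signature("the quick brown fox jumps over the lazy dog")
s2 = minhash_signature("the quick brown fox jumps over the lazy dog")
s3 = minhash_signature("an entirely different sentence with no overlap")
# identical texts always produce identical signatures
```

Grouping samples by signature (or signature bands) is what turns this into deduplication, and it is that grouping step where the BTS-based load balancing beats a naive Ray GroupBy.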

Integration with AI Workflows

Data‑Juicer integrates with AgentScope to provide an "Agent" that can automatically recommend operators based on natural‑language prompts, generate pipelines, and visualize data quality metrics. This enables a "recommend‑edit‑validate‑feedback" loop for rapid data‑centric experimentation.
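The "recommend" step of that loop can be caricatured as mapping an intent to candidate operators. A real Data‑Juicer + AgentScope agent would use an LLM for this; the keyword table below and the filter operator names in it are placeholders for illustration.

```python
# Hypothetical sketch of operator recommendation from a natural-language
# prompt, using keyword matching in place of an LLM.

OPERATOR_HINTS = {
    "clean_html_mapper": ["html", "tags", "markup"],
    "language_id_score_filter": ["language", "english", "lang"],
    "document_deduplicator": ["duplicate", "dedup", "repeated"],
}

def recommend_operators(prompt: str):
    """Return operators whose hint keywords appear in the prompt."""
    prompt = prompt.lower()
    return [op for op, kws in OPERATOR_HINTS.items()
            if any(kw in prompt for kw in kws)]

recommend_operators("Remove HTML tags and deduplicate repeated documents")
```

The agent then lets the user edit the suggested chain, runs it, and feeds quality metrics back into the next recommendation round.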

Future Roadmap

Planned enhancements include deeper integration of AI operators (AFlow), cost‑based optimization (HBO), high‑performance shuffle via RayData, and an open community hub (Data‑Juicer Hub) for sharing reusable recipes across LLM, RAG, and multimodal tasks.

Conclusion

Data‑Juicer demonstrates how a unified, open‑source data engine can bridge the gap between traditional batch processing (Spark/Flink) and modern AI‑centric pipelines, delivering scalable, fault‑tolerant, and extensible data processing for next‑generation multimodal models.

Tags: data processing, large language models, Multi-Modal, Ray, Data-Juicer
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
