How Data‑Juicer Supercharges LLM Training with High‑Quality Multimodal Data
Data‑Juicer is an open‑source, one‑stop multimodal data processing system that provides fine‑grained operators, scalable pipelines, and ready‑made recipes to deliver high‑quality, diverse, and model‑friendly data for large language model pre‑training, fine‑tuning, and multimodal applications.
Data‑Juicer: A One‑Stop Multimodal Data Processing System for LLMs
In the era of rapidly advancing large language models (LLMs), data quality is a decisive factor for model performance. Even the most advanced architectures cannot benefit from low‑quality, inconsistent, or unsuitable data, just as a healthy body cannot be sustained by junk food.
Alibaba researchers built Data‑Juicer, a dedicated data‑nutrition platform for LLMs. It automatically generates data recipes, explores data‑mix combinations, and evaluates their impact on LLM performance, delivering higher‑quality, richer, and easier‑to‑"digest" data for healthier model development.
Core Features and Advantages
Systematic and Reusable: Over 80 core operators, 20+ configuration recipes, and a dedicated tool pool enable data pipelines that are not tied to any specific LLM dataset.
Data‑Feedback Loop & Sandbox Lab: Integrated analysis, automatic reporting, and multi‑dimensional evaluation create a closed loop between data processing and model training, allowing rapid iteration.
Comprehensive Processing Recipes: Dozens of pre‑built recipes for pre‑training, fine‑tuning, and bilingual scenarios accelerate onboarding and customization.
Efficient Parallel Processing: Pipelines can run on Aliyun‑PAI, Ray, Slurm, or CUDA, and operator fusion reduces memory and CPU overhead for large‑scale datasets.
User‑Friendly: Full documentation, simple guides, and easy addition or removal of operators make the system accessible to beginners.
Flexible and Extensible: Supports JSONL, Parquet, CSV, and many multimodal formats; users can compose existing operators or develop new ones.
Operator System and Data Pipeline
Data‑Juicer provides more than 80 operators grouped into five types:
Formatter: Discover, load, and normalize raw data.
Mapper: Edit and transform data samples.
Filter: Remove low‑quality samples.
Deduplicator: Detect and delete duplicate samples.
Selector: Rank and select high‑quality samples.
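The five operator roles listed above can be illustrated with a minimal sketch in plain Python. This is not Data‑Juicer's API: every function name here (format_raw, whitespace_mapper, min_words_filter, exact_deduplicator, longest_selector) and the single-field "text" schema are hypothetical, chosen only to show how each role differs in what it takes in and what it returns.

```python
import hashlib
from typing import Dict, List

Sample = Dict[str, str]  # assumed schema: one "text" field per sample


def format_raw(lines: List[str]) -> List[Sample]:
    """Formatter role: normalize raw strings into a unified sample schema."""
    return [{"text": line} for line in lines]


def whitespace_mapper(sample: Sample) -> Sample:
    """Mapper role: return an edited copy of one sample (collapse whitespace)."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_words_filter(sample: Sample, min_words: int = 3) -> bool:
    """Filter role: decide whether one sample is kept."""
    return len(sample["text"].split()) >= min_words


def exact_deduplicator(samples: List[Sample]) -> List[Sample]:
    """Deduplicator role: drop samples whose text hash was already seen."""
    seen, unique = set(), []
    for sample in samples:
        digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def longest_selector(samples: List[Sample], top_k: int) -> List[Sample]:
    """Selector role: rank the whole dataset and keep the top_k samples."""
    return sorted(samples, key=lambda s: len(s["text"]), reverse=True)[:top_k]


raw = [
    "hello   world  again",
    "hi",
    "hello world again",
    "a much longer sentence than the rest",
]
samples = format_raw(raw)
samples = [whitespace_mapper(s) for s in samples]
samples = [s for s in samples if min_words_filter(s)]
samples = exact_deduplicator(samples)
result = longest_selector(samples, top_k=1)
print(result)
```

Mappers and filters work one sample at a time, while deduplicators and selectors need to see the whole dataset. That difference is one reason the order of operators in a pipeline matters.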
The pipeline executes operators in a user‑defined order. It supports parallel execution, multiple runtime environments (Aliyun‑PAI, Ray, Slurm, CUDA), and operator fusion for higher efficiency.
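To give a rough intuition for operator fusion, the sketch below assumes that fusion means letting several operators share one intermediate result (here, a tokenized word list) instead of each recomputing it. That reading is an assumption, and Data‑Juicer's actual mechanism may differ. All names (tokenize, enough_words, low_repetition, run_unfused, run_fused) are hypothetical.

```python
from typing import List

TOKENIZE_CALLS = 0  # counts how often the shared intermediate is computed


def tokenize(text: str) -> List[str]:
    """Stand-in for an expensive step that several filters depend on."""
    global TOKENIZE_CALLS
    TOKENIZE_CALLS += 1
    return text.lower().split()


def enough_words(words: List[str], min_words: int = 3) -> bool:
    return len(words) >= min_words


def low_repetition(words: List[str], max_ratio: float = 0.5) -> bool:
    """Keep samples whose share of repeated words is at most max_ratio."""
    if not words:
        return False
    return 1 - len(set(words)) / len(words) <= max_ratio


def run_unfused(texts: List[str]) -> List[str]:
    """Two separate passes; each pass tokenizes on its own."""
    kept = [t for t in texts if enough_words(tokenize(t))]
    return [t for t in kept if low_repetition(tokenize(t))]


def run_fused(texts: List[str]) -> List[str]:
    """One pass; the word list is computed once and shared by both checks."""
    kept = []
    for t in texts:
        words = tokenize(t)
        if enough_words(words) and low_repetition(words):
            kept.append(t)
    return kept


texts = ["the cat sat on the mat", "spam spam spam spam", "too short"]

TOKENIZE_CALLS = 0
unfused = run_unfused(texts)
unfused_calls = TOKENIZE_CALLS

TOKENIZE_CALLS = 0
fused = run_fused(texts)
fused_calls = TOKENIZE_CALLS

print(unfused == fused, unfused_calls, fused_calls)
```

Both variants keep the same samples, but the fused loop tokenizes each text only once. On large datasets, avoiding this kind of repeated work is where CPU and memory savings would come from.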
Supported Data Formats and Multimodal Capabilities
Data‑Juicer handles text, image, audio, and video data. For video, it provides decoding, frame extraction, feature extraction, and subtitle extraction. It also offers filtering, mapping, deduplication, format conversion, and aesthetic scoring for multimodal inputs.
Installation and Usage
Requirements: Python ≥ 3.8, gcc ≥ 5 (C++14 support), Linux/macOS. Installation can be performed from source or via pre‑compiled packages.
# Clone repository
git clone https://github.com/modelscope/data-juicer.git
cd data-juicer
# Install in editable mode with development dependencies
pip install -e ".[dev]"
# Optionally use uv for virtual-env management
curl -LsSf https://astral.sh/uv/install.sh | sh  # install uv
uv venv --python 3.10                             # create venv
source .venv/bin/activate                         # activate
uv pip install -e .                               # install minimal deps
Typical workflow:
Configure data sources (paths, formats).
Select and configure operators (Formatter, Mapper, Filter, etc.).
Define the pipeline order and parallelism.
Execute the pipeline to produce cleaned, deduplicated, and transformed data.
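The four steps above can be sketched as a small config-driven runner. This is a toy stand-in, not Data‑Juicer itself: the config keys (dataset_path, export_path, process), the operator names in OPS, and the run function are all assumptions made for illustration and are not claimed to match Data‑Juicer's configuration schema. The sketch runs sequentially and leaves out parallelism.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

Sample = Dict[str, str]


def strip_mapper(samples: List[Sample]) -> List[Sample]:
    return [{**s, "text": s["text"].strip()} for s in samples]


def min_length_filter(samples: List[Sample], min_len: int = 1) -> List[Sample]:
    return [s for s in samples if len(s["text"]) >= min_len]


def exact_deduplicator(samples: List[Sample]) -> List[Sample]:
    seen, unique = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            unique.append(s)
    return unique


# Hypothetical operator registry: config name -> dataset-level function.
OPS = {
    "strip_mapper": strip_mapper,
    "min_length_filter": min_length_filter,
    "exact_deduplicator": exact_deduplicator,
}


def run(config: dict) -> List[Sample]:
    # Step 1: load the configured data source (JSONL, one sample per line).
    with open(config["dataset_path"], encoding="utf-8") as f:
        samples = [json.loads(line) for line in f if line.strip()]
    # Steps 2-3: apply the configured operators in the listed order.
    for step in config["process"]:
        name, params = next(iter(step.items()))
        samples = OPS[name](samples, **params)
    # Step 4: export the processed dataset.
    with open(config["export_path"], "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
    return samples


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "raw.jsonl", Path(tmp) / "clean.jsonl"
    src.write_text(
        '{"text": "  good sample  "}\n{"text": "x"}\n{"text": "good sample"}\n',
        encoding="utf-8",
    )
    config = {
        "dataset_path": str(src),
        "export_path": str(dst),
        "process": [
            {"strip_mapper": {}},
            {"min_length_filter": {"min_len": 5}},
            {"exact_deduplicator": {}},
        ],
    }
    cleaned = run(config)
    exported = dst.read_text(encoding="utf-8").splitlines()

print(cleaned)
```

Keeping the operator list in a config rather than in code means the same runner can be reused with a different recipe by swapping only the config.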
Application Scenarios and Real‑World Cases
Data‑Juicer is used for pre‑training data cleaning, fine‑tuning data preparation, and multimodal LLM training across text, image, audio, and video domains. Its modular design allows easy adaptation to specific tasks and datasets.
Comparison with General‑Purpose Data Tools
Designed specifically for LLMs, offering operators and metrics aligned with language‑model needs.
Native multimodal support (text, image, audio, video).
Highly parallelizable pipelines for massive datasets.
Extensive ready‑made recipes reduce engineering effort.
Future Development
Further enhance multimodal (especially video) processing capabilities.
Optimize operator implementations for even higher throughput.
Expand data‑quality evaluation metrics and automated reporting.
Add new recipes for emerging LLM use‑cases.
Conclusion
Data‑Juicer acts as a professional data‑nutritionist for LLMs, providing systematic, reusable, and efficient data processing that transforms raw, heterogeneous data into high‑quality, model‑ready inputs. Its open‑source nature invites community contributions to continuously advance LLM data engineering.
