Artificial Intelligence 14 min read

Building DataFlow: An Industrial‑Grade LLM Data Pipeline from Documents to Training

The article presents DataFlow, an open‑source, GPU‑centric data‑engineering framework that tackles LLM data‑preparation bottlenecks by defining a two‑level operator taxonomy, a LLM‑driven WebAgent for automatic crawling, a PDF‑to‑Markdown MinerU, a Ray‑based distributed runtime, and extensive multimodal extensions, and validates the design with quantitative experiments showing significant quality gains across math, code, and reasoning benchmarks.

DataFunSummit

Jun 22, 2026

Building DataFlow: An Industrial‑Grade LLM Data Pipeline from Documents to Training

LLM Data Preparation Challenges

80‑90% of effort in large‑model projects is spent on data engineering; noisy web crawls and costly manual annotation become the primary bottleneck in the second phase of model development.

DataFlow Architecture

Operator Abstraction

DataFlow defines a two‑level operator taxonomy. The first level groups operators by capability source: Core (≈20‑30 operators covering speech, text, and vision) and application‑specific operators built on Core for tasks such as RAG, chemistry, and reasoning. The second level classifies operators by execution pattern: Generate (new key creation), Quality Evaluation (scoring), Filter (reducing data volume), and Refine (improving quality). This design lets different teams develop operators independently while a higher‑level Agent can automatically understand and compose pipelines.

WebAgent – Decision‑Execution‑Governance

An LLM‑driven policy engine translates vague requirements into concrete crawling strategies, dynamically routes requests to search engines, dataset platforms, or vertical forums, and finally uses a DOM‑deep‑traversal engine (MinerU) to extract and denoise HTML content into ShareGPT‑compatible training formats.

MinerU – PDF Parsing

MinerU is an open‑source PDF parser that extracts formulas, flowcharts, and tables into Markdown. It has attracted >50 k stars on GitHub and outperforms GPT‑4o and Qwen2.5‑VL‑72B on document‑parsing benchmarks. MinerU is used in internal projects such as Shusheng·PuYu and Shusheng·WanXiang, as well as in knowledge‑base construction for major enterprises.

Syntax Constraints & Compile‑time Checks

Each operator must declare all hyper‑parameters in an init block, receive external arguments via llm_serving, and accept a storage argument as the first positional parameter. Input fields must start with input_, outputs with output_. The compile() function performs static verification of field flows, reducing Agent debugging cycles, while forward(resume_step) enables checkpoint‑based recovery.

RayOrch – Distributed Execution

RayOrch orchestrates multiple conda environments, heterogeneous models (LLM, YOLO, FastText) and multimodal data (CV, NLP, Audio). It supports API‑style nn.Module objects, leverages Ray for full‑cluster GPU utilization, parallel I/O, DAG execution, and NVIDIA Nsight timing optimizations. A custom operator example demonstrates 8×A100 acceleration for a SumOp compared with a single‑GPU run. Small‑model scoring experiments show that using 2, 4, and 8 GPUs yields 1.8×, 3.6×, and 6.1× speed‑ups respectively, and the system integrates vLLM/SGLang for large‑model parallel inference.

Multimodal & Multi‑Source Support

DataFlow‑MM extends the core engine to images, video, and audio. It provides caption synthesis, VQA generation, image‑text interleaved data creation, strong‑reasoning image datasets, speech‑to‑label pipelines, and video‑COT construction. Experiments on Qwen2.5‑VL‑Instruct improve Meteor from 15.73 to 16.27 and Cider from 34.63 to 45.92. Video‑COT pipelines generate chain‑of‑thought answers and achieve gains on VSI‑Bench (27.7 → 31.8) and MMVU (59.2 → 61.3).

Full‑Modality Data Adaptation

DataFlow ingests structured (SQL, CSV, Excel), semi‑structured (JSON/BSON, XML), and unstructured data (PDF, TXT). An Agent automatically detects field types, primary/foreign keys, and dimensions, eliminating manual ETL scripts. The Text2Data capability lets users retrieve and aggregate data across databases with a single natural‑language command.

TableAgent

TableAgent orchestrates a library of table‑processing operators for cleaning, transformation, augmentation, and matching. On a table‑cleaning benchmark it achieves 97.37% task completion and 78.51% success, far surpassing DeepAnalyze (92.11% / 57.85%) and ChatDev (38.89% / 27.09%).

DataFlow‑Graph – Knowledge‑Graph Construction

DataFlow‑Graph converts heterogeneous sources into a unified knowledge graph, handling cross‑domain attribute mismatches and update frequencies. A SFT pipeline built on the graph (knowledge‑graph → cleaning → entity alignment → rule matching → QA pair generation) trained on only ~2 k synthetic entries outperformed all public datasets in K12 book experiments.

Open‑Source Ecosystem

Main repository: https://github.com/OpenDCAI/DataFlow. The project is part of the OpenDCAI matrix, which also hosts agents, data‑governance tools, and related libraries (DataFlow‑Omni, WorldModel, AgentData, AI4S, Industry, etc.). Developers can import operators with familiar Python syntax, e.g., from DataFlow_XXXX import MyOperator, enabling independent development across teams.

Programming Example

Using a PyTorch‑style API, a workflow can be built by swapping prompts, allowing the Agent to automatically generate pipelines, compile them for static error checking, and resume from checkpoints via forward(resume_step).

Featured Pipelines

Pipeline 1 automates textbook question‑answer extraction for VQA training. Pipeline 2 synthesizes complex reasoning data; it won the KDD 2026 competition and achieved higher scores than SYNTHETIC‑1‑10K and Open‑R1‑10K on math and code benchmarks (e.g., MATH 62.8 → 73.8, GSM8K 67.1 → 88.2).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

data pipeline LLM operator multimodal Ray synthetic data DataFlow

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.