Tagged articles

DataFlow

7 articles · Page 1 of 1

Jun 22, 2026 · Artificial Intelligence

Building DataFlow: An Industrial‑Grade LLM Data Pipeline from Documents to Training

The article presents DataFlow, an open‑source, GPU‑centric data‑engineering framework that tackles LLM data‑preparation bottlenecks. It combines a two‑level operator taxonomy, an LLM‑driven WebAgent for automatic crawling, the MinerU PDF‑to‑Markdown converter, a Ray‑based distributed runtime, and extensive multimodal extensions, and it validates the design with quantitative experiments showing significant quality gains across math, code, and reasoning benchmarks.

DataFlow · LLM · Multimodal

0 likes · 14 min read

AI Waka

May 22, 2026 · Artificial Intelligence

How Skills Can Cut Costs and Speed Up High‑Quality LLM Data Pipelines

The article explains how the open‑source DataFlow‑Skills framework lets LLM agents plan, validate, and execute data cleaning and synthesis pipelines with strict field contracts and specialized operators, dramatically reducing costly failures and accelerating high‑quality training data production.

AI data engineering · DataFlow · LLM data pipelines

0 likes · 15 min read

21CTO

Jan 24, 2025 · Fundamentals

Why Traditional Code Fails on Multicore CPUs and How Dataflow Languages Help

The article explains that, despite decades of programming following the von Neumann model, modern multicore processors expose the limitations of sequential code. It illustrates this with simple examples in Python and Go, and proposes data‑flow programming, exemplified by the experimental Nevalang language, as a more natural, parallel‑friendly paradigm.

DataFlow · Nevalang · Programming Paradigms

0 likes · 5 min read

21CTO

Jul 15, 2024 · Big Data

Twitter’s Kappa Architecture: Scaling Real-Time Processing of Billions of Events

Twitter migrated from a Lambda-based dual‑pipeline system to a Kappa architecture that relies on a single real‑time stream built on Kafka, Google Pub/Sub, Dataflow, and Bigtable, dramatically reducing latency, increasing throughput, and improving data accuracy when processing billions of events per day.

Big Data · Cloud Computing · DataFlow

0 likes · 9 min read

Programmer DD

Dec 9, 2020 · Big Data

Master Apache Beam: Build a Portable Word Count Pipeline in Minutes

This tutorial introduces Apache Beam’s unified programming model for batch and streaming, explains its core concepts and terminology, compares it with other runners, and walks through a complete Java word‑count example—including dependencies, pipeline construction, transforms, and execution with DirectRunner.

Apache Beam · DataFlow · Distributed Processing

0 likes · 8 min read

Big Data Technology & Architecture

Sep 16, 2019 · Big Data

Comprehensive Flink Interview Guide: Architecture, APIs, Operators, and Advanced Topics

This guide provides a detailed overview of Apache Flink covering its core streaming engine, APIs (DataSet, DataStream, Table), architectural components, comparison with Spark Streaming, partitioning, parallelism, restart strategies, state backends, time semantics, watermarks, SQL processing, fault‑tolerance mechanisms, memory management, serialization, RPC framework, back‑pressure handling, operator chaining, and practical tips for interview preparation.

Apache Flink · Big Data · DataFlow

0 likes · 22 min read

Big Data Technology & Architecture

Jul 23, 2019 · Big Data

Understanding Google Dataflow: Model, Windowing, Triggers, and Incremental Processing

This article explains the Google Dataflow model, covering its unified batch‑and‑stream architecture, windowing and triggering mechanisms, core primitives, time domains, and how these concepts form the foundation of modern big‑data stream processing systems.

Big Data · DataFlow · Google Cloud

0 likes · 13 min read
