Tagged articles

data pipelines

45 articles · Page 1 of 1

Dec 25, 2025 · Fundamentals

Boost Python Automation Efficiency with toolz: A Practical Refactoring Guide

This article explains how the pure‑Python functional library toolz can transform tangled automation scripts into clear, composable data pipelines, reducing code size, improving testability, and eliminating hidden technical debt through concrete examples and step‑by‑step refactoring.

Automation · Functional Programming · Python

0 likes · 15 min read

DevOps Coach

Sep 25, 2025 · Fundamentals

Unlock Python Speed: 12 Little‑Known Tricks to Turbocharge Your Code

Python is praised for its clarity but often deemed slow; this article reveals twelve overlooked, sometimes unconventional techniques—from using enumerate instead of range loops to leveraging Numba, Polars, and mypyc—that can dramatically accelerate data pipelines, APIs, and scientific workloads without rewriting code in another language.

Optimization · Polars · Profiling

0 likes · 9 min read

DevOps Cloud Academy

Sep 20, 2025 · Artificial Intelligence

How to Build Scalable AI Infrastructure: A Complete Guide

This article explains why robust AI infrastructure is essential, outlines its key components—from specialized hardware and orchestration platforms to security and governance—and provides a step‑by‑step roadmap, real‑world case studies, and best‑practice recommendations for constructing and continuously optimizing AI systems.

data pipelines · devops

0 likes · 17 min read

Amazon Cloud Developers

Sep 9, 2025 · Big Data

Build an Intelligent Data Quality Monitoring System with Strands Agents

The article explains why data quality is critical for modern businesses, outlines the challenges data engineers face when monitoring hundreds of models, and provides a step‑by‑step guide to constructing an automated, AI‑driven data quality monitoring system using Strands Agents, Amazon Bedrock, dbt and Amazon Redshift, including code examples, workflow orchestration, security hardening, containerized deployment and observability.

AI Agents · Amazon Redshift · Data Quality

0 likes · 14 min read

Alibaba Cloud Native

Jul 9, 2025 · Cloud Native

How SPL’s New Operators Supercharge Log‑to‑Metric Processing in Cloud‑Native Environments

The article introduces SPL’s latest operators—pack-fields, log-to-metric, and metric-to-metric—explaining their smart field aggregation, trimming, type inference, wildcard matching, and label manipulation capabilities, and demonstrates through code examples and performance benchmarks how they dramatically improve data processing efficiency and observability in cloud‑native log services.

Cloud Native · Log Processing · SPL

0 likes · 10 min read

Su San Talks Tech

Jul 1, 2025 · Big Data

How to Build Lightweight Batch Jobs with Spring Batch: A Practical Guide

This article explains the need for lightweight batch processing, outlines a layered architecture and robustness strategies, and demonstrates how Spring Batch implements these concepts with clear interfaces, job management, and support for ignore, retry, and restart mechanisms.

Batch Processing · Java · Robustness

0 likes · 10 min read

Alibaba Cloud Infrastructure

Jun 27, 2025 · Cloud Native

Why Argo Workflows Is the Leading Cloud‑Native Engine for AI & Data Pipelines

Argo Workflows, a CNCF graduated project, extends Kubernetes to orchestrate AI, ML, and data pipelines with a scalable, cloud‑native architecture, offering powerful scheduling, Python SDK support, and new plugins for Spark, Ray, and PyTorch.

AI · Argo Workflows · Cloud Native

0 likes · 9 min read

Linux Cloud Computing Practice

May 29, 2025 · Big Data

Why Learn Kafka? Core Benefits, Use Cases, and Key Interview Topics

This article explains why Kafka is essential for modern data engineering, highlighting its widespread adoption, high throughput, scalability, durability, integration with streaming ecosystems, and common real‑time use cases, while also providing a concise list of interview topics for aspiring engineers.

Real-time Processing · Streaming · data pipelines

0 likes · 6 min read

Airbnb Technology Team

Mar 24, 2025 · Artificial Intelligence

Chronon: Open‑Source Feature Platform for Machine Learning – Architecture, Workflow, and Code Examples

Chronon is an open‑source ML feature platform that lets engineers declaratively define, compute, and serve both batch and real‑time features with built‑in observability, data‑quality checks, and a low‑latency retrieval API, ensuring online‑offline consistency while simplifying pipeline management and enabling future automation.

Chronon · Observability · Streaming

0 likes · 13 min read

DataFunSummit

Mar 3, 2025 · Artificial Intelligence

DeepSeek Open Source Week: Seven Core Technologies Reshaping Large‑Model Training

The DeepSeek open‑source week introduced seven breakthrough technologies—FlashMLA, DeepGEMM, DeepEP, DualPipe, EPLB, 3FS, and Smallpond—that together overhaul data flow, algorithmic complexity, hardware utilization, MoE communication, and resource balancing, dramatically improving large‑model training efficiency and lowering entry barriers for the AI industry.

AI hardware · DeepSeek · data pipelines

0 likes · 17 min read

DataFunSummit

Feb 24, 2025 · Big Data

Building Real-Time Data Synchronization Pipelines with Apache SeaTunnel

Apache SeaTunnel is an open‑source, distributed data integration platform that enables efficient real‑time data synchronization across diverse sources and destinations, supporting both streaming and batch processing, with detailed architecture, connector plugins, CDC handling, transform capabilities, and deployment strategies for large‑scale data pipelines.

Apache SeaTunnel · CDC · Real-Time Data Integration

0 likes · 34 min read

macrozheng

Dec 20, 2024 · Big Data

Master Data Pipelines with Kestra: Open‑Source Workflow Engine Explained

This article introduces the open‑source Kestra workflow engine, outlines its key features for building scalable data pipelines, provides step‑by‑step Docker installation and YAML workflow examples, and showcases its visual UI for monitoring and managing complex ETL and automation tasks.

Docker · Kestra · Workflow Orchestration

0 likes · 6 min read

Rare Earth Juejin Tech Community

Nov 29, 2024 · Big Data

How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray

The article details ByteDance's use of Ray and RayData to construct scalable audio and video data processing pipelines for multimodal AI models, addressing challenges of massive data volume, resource constraints, and fault tolerance through pipeline design, RayCore enhancements, and custom scheduling optimizations.

AI · Big Data · ByteDance

0 likes · 16 min read

DataFunSummit

Sep 24, 2024 · Artificial Intelligence

Streaming Data Pipelines and Scaling Laws for Efficient Large‑Model Training

The article discusses the challenges of training ever‑larger AI models on internet‑scale data, critiques traditional batch ETL pipelines, and proposes a streaming data‑flow architecture with dynamic data selection and a shared‑memory/Alluxio middle layer to decouple data processing from model training, improving efficiency and scalability.

AI Infrastructure · Multimodal Data · data pipelines

0 likes · 20 min read

Alibaba Cloud Observability

May 29, 2024 · Big Data

How SPL Boosts iLogtail 2.0: Combining Performance and Flexibility in Log Processing

This article traces the evolution of streaming processing languages, compares iLogtail's native and extended pipeline modes, and demonstrates how the new SPL syntax in iLogtail 2.0 delivers high‑performance, flexible log and time‑series data processing with unified, SQL‑like commands and interactive debugging tools.

Log Analytics · SPL · data pipelines

0 likes · 13 min read

Test Development Learning Exchange

Mar 31, 2024 · Big Data

Apache Airflow Overview and Advanced Usage Examples

This article introduces Apache Airflow, explains its core concepts such as DAGs, tasks, operators, executors, and the web UI, and provides multiple practical Python code examples for Bash commands, Python functions, SQL queries, task dependencies, sensors, dynamic DAGs, SubDAGs, XCom, email alerts, and error handling.

Apache Airflow · DAG · Python

0 likes · 7 min read

Inke Technology

Nov 24, 2023 · Backend Development

Building a Scalable Overseas Ad Platform: Architecture, Permissions & Automation

To support rapid overseas expansion, the article outlines a comprehensive backend architecture—including management, data ingestion, device tracking, attribution, and offline tasks—while detailing fine-grained user permission controls, automated product onboarding, batch ad creation, and server‑side attribution workflows, plus future enhancements.

advertising platform · backend · data pipelines

0 likes · 12 min read

DataFunTalk

Aug 3, 2023 · Game Development

Applying A/B Testing to Drive Growth in Tencent Overseas Games

This article explains how Tencent leverages A/B testing across its overseas games, detailing market differences, experimental methodology, multi‑cloud platform compliance, data architecture, and case studies that illustrate how targeted experiments improve user onboarding, gameplay settings, and email‑based re‑engagement.

A/B testing · Game Analytics · data pipelines

0 likes · 12 min read

Alibaba Cloud Developer

Jul 31, 2023 · Big Data

From BI to Kappa: How Data Architecture Evolved in the Big Data Era

This article traces the evolution of data architecture from early BI systems through traditional big‑data stacks, streaming, Lambda and Kappa designs, and explains how a unified stream‑batch model simplifies development while keeping logic consistent across data‑analysis and pipeline applications.

BI systems · Big Data · Data Architecture

0 likes · 16 min read

SQB Blog

Jul 20, 2023 · Artificial Intelligence

How We Built and Optimized a Multi‑Pool Recommendation System for Boss Circle

This article explains the design, implementation, and iterative optimization of Boss Circle's recommendation engine, covering the initial simple ranking, the introduction of Elasticsearch‑based scoring, multi‑pool data sources, machine‑learning experiments, real‑time feature handling, and future personalization challenges.

Elasticsearch · Ranking · data pipelines

0 likes · 17 min read

Java Architecture Diary

Jul 5, 2023 · Cloud Native

Deploy and Explore StreamPipes: A Self‑Service Industrial IoT Toolbox

This guide introduces StreamPipes, an end‑to‑end industrial IoT toolbox for non‑technical users, outlines its key features, shows how to connect data sources, build pipelines, visualize data, and provides step‑by‑step Docker‑Compose installation, configuration, and development instructions.

Docker Compose · Industrial IoT · StreamPipes

0 likes · 8 min read

DevOps Cloud Academy

Feb 28, 2023 · Operations

Understanding Apache Airflow Celery Executor: Architecture, Setup, and Task Execution

This article explains how Apache Airflow's Celery Executor works, covering its key features, installation steps, configuration details, architectural components, and the complete task execution process that enables scalable, distributed workflow orchestration for data pipelines.

Apache Airflow · Celery Executor · Task scheduling

0 likes · 15 min read

DevOps Cloud Academy

Oct 20, 2022 · Big Data

Installing Apache Airflow, Creating Users, and Using Basic Commands

This guide explains how to install Apache Airflow in a virtual environment, set up the Airflow home, create an admin user, understand role‑based access control, and run essential Airflow CLI commands for managing DAGs and tasks.

Airflow Roles · Apache Airflow · CLI

0 likes · 6 min read

DevOps Cloud Academy

Oct 15, 2022 · Big Data

Introduction to Apache Airflow

Apache Airflow is an open‑source platform for programmatically authoring, scheduling, and monitoring workflows using Directed Acyclic Graphs (DAGs), featuring components such as Scheduler, Web Server, Database, and various Executors, and offering easy‑to‑use, extensible, scalable, and robust integrations for data pipeline management.

Apache Airflow · DAG · Executor

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Oct 12, 2022 · Artificial Intelligence

Unlock Vision AI: How EasyCV Streamlines Datasets and Model Training

This article introduces EasyCV, an open‑source all‑in‑one visual algorithm platform that abstracts diverse data sources, provides SOTA self‑supervised models, and offers ready‑to‑download datasets for image classification, object detection, segmentation, and pose estimation, complete with configuration examples.

EasyCV · computer vision · data pipelines

0 likes · 9 min read

DataFunSummit

Aug 25, 2022 · Big Data

Managing the Full Lifecycle of Risk Features: Pitfalls, Solutions, and Future Directions

The talk by Tang Gengyang from Citic Baixin Bank details the challenges faced in risk feature engineering, presents two solution frameworks (1.0 and 2.0) for accelerating deployment, improving reuse, handling offline/online consistency, and outlines future enhancements for a more efficient, automated feature pipeline.

Flink · asynchronous processing · data pipelines

0 likes · 14 min read

Top Architect

Jun 7, 2022 · Databases

An Introduction to Change Data Capture (CDC) Practices and Modern Approaches

This article introduces the concept of Change Data Capture (CDC), explains why traditional batch reporting strains resources, describes how CDC captures only data changes to keep source databases performant, and outlines modern CDC architectures, production‑ready considerations, and best‑practice guidelines for building reliable data pipelines.

CDC · Change Data Capture · Data Integration

0 likes · 16 min read

Big Data Technology Architecture

Jun 3, 2022 · Operations

Understanding Apache Airflow DAGs, Operators, and Scheduling

This article explains Apache Airflow's core concepts, including DAG definitions, scheduling intervals, task dependencies, various operators such as BashOperator, PythonOperator, Branch operators, sensors, and custom operators, and provides code examples and configuration details for building robust data pipelines.

Apache Airflow · DAG · Scheduling

0 likes · 15 min read

Big Data Technology Architecture

May 31, 2022 · Big Data

Comprehensive Guide to Installing and Using Apache Airflow with Docker on Windows

This article provides a detailed tutorial on Apache Airflow fundamentals, Docker-based installation on Windows, Dockerfile creation, container deployment via Docker run and Docker Compose, Airflow configuration, and practical usage of DAGs, tasks, connections, and UI features for data pipeline orchestration.

Apache Airflow · Docker · Docker Compose

0 likes · 14 min read

ByteDance Data Platform

Apr 8, 2022 · Operations

How Baseline Monitoring Transforms Data Pipeline Reliability at ByteDance

This article explains ByteDance's baseline monitoring system for data pipelines, detailing its motivation, core concepts, architecture, instance generation, alert types, and handling of complex task dependencies to reduce operational costs and improve SLA compliance across hundreds of projects.

Alerting · Big Data · baseline monitoring

0 likes · 21 min read

21CTO

Nov 1, 2021 · Big Data

Essential Data Engineering Roadmap: Skills, Tools, and Technologies to Master

This guide outlines the fast‑growing data engineering career path, covering essential Linux fundamentals, programming languages, testing, database concepts, data warehouses, processing frameworks, messaging systems, cluster computing, workflow scheduling, monitoring, infrastructure as code, and CI/CD tools.

Big Data · Data Engineering · data pipelines

0 likes · 5 min read

IT Architects Alliance

Oct 13, 2021 · Industry Insights

Is Kafka Still Worth the Effort? Rethinking Data Pipeline Costs and Alternatives

The article examines Apache Kafka's strengths and shortcomings, explores the operational complexities of managing a Kafka deployment, and encourages organizations to reassess its value versus emerging alternatives by weighing maturity, scalability, and total cost of ownership.

Alternative Platforms · Operational Challenges · Streaming

0 likes · 12 min read

Programmer DD

Sep 26, 2021 · Big Data

What’s New in Apache Kafka 3.0? Key Features and Improvements Explained

Apache Kafka 3.0.0 brings a host of changes, including the deprecation of Java 8 and Scala 2.12 support, Raft metadata snapshots, stronger producer guarantees, MirrorMaker 2 upgrades, and Kafka Streams improvements, while continuing to serve real‑time data pipelines and streaming applications.

Apache Kafka · Big Data · Kafka 3.0

0 likes · 3 min read

Efficient Ops

Jun 1, 2021 · Artificial Intelligence

How Time‑Series Analysis Powers AIOps: Overcoming Real‑World Challenges

At the 16th GOPS Global Operations Conference, Shen Hui of DingMao Technology explained how time‑series data analysis underpins AIOps, outlining its four‑step workflow, key challenges, and the company’s three‑pipeline solution that enables trend forecasting, fault prediction, and a robust AI‑driven operational platform.

AI · AIOps · Operations

0 likes · 7 min read

DataFunTalk

Mar 18, 2021 · Fundamentals

Building Popper: Tubi’s Scalable Experimentation Platform

Tubi’s Popper platform combines a Scala‑based experiment engine, reproducible JSON‑stored configurations, a React UI, and data pipelines using Spark and Akka to enable fast, cross‑team A/B testing, automated analysis, health checks, and data‑driven decision making across mobile and OTT services.

A/B testing · Akka · Experimentation platform

0 likes · 15 min read

DataFunSummit

Feb 4, 2021 · Artificial Intelligence

Full-Stack Machine Learning Platform: Architecture, Key Factors, and Implementation Details

This article examines the evolution of user data, computing power, and models, and presents the design principles, key architectural factors, and practical implementation techniques for building a full‑stack machine learning platform that supports large‑scale data processing, distributed training, and low‑latency online serving.

Machine Learning Platform · Resource Scheduling · big data integration

0 likes · 15 min read

58 Tech

Dec 23, 2019 · Operations

Intelligent Duty Robot for Real‑Estate Data Job Monitoring and Automation

The article describes an intelligent duty robot that uses a sense‑think‑act framework and job dependency graphs to automatically monitor, diagnose, and remediate data pipeline jobs in a real‑estate platform, reducing operational pressure and achieving over 98% notification accuracy.

Automation · Real Estate Data · data pipelines

0 likes · 9 min read

Big Data Technology & Architecture

Jun 22, 2019 · Backend Development

Understanding Back Pressure in Flink and Its Implementation

The article explains what back pressure is in Flink streaming jobs, why it occurs when data generation outpaces downstream consumption, how Flink monitors it via stack‑trace sampling, configurable parameters, Web UI visualization, and compares the approach with Spark Streaming's back pressure mechanism.

Flink · Spark · data pipelines

0 likes · 5 min read

AntTech

May 7, 2019 · Artificial Intelligence

SQLFlow: Bridging SQL Engines and AI Platforms for End‑to‑End Machine Learning

SQLFlow is an open‑source project that connects diverse SQL engines (MySQL, Hive, SparkSQL, etc.) with AI frameworks (TensorFlow, PyTorch, XGBoost, etc.) through extended SQL syntax, enabling analysts to train and predict models using only a few SQL statements while aiming for high scalability and performance.

AI integration · Go · SQL

0 likes · 13 min read

Ctrip Technology

Feb 13, 2019 · Artificial Intelligence

Understanding TensorFlow Extended (TFX): Concepts, Data Preparation, and Model Deployment

This article introduces TensorFlow Extended (TFX), illustrating practical TensorFlow examples such as ship trajectory classification, insurance premium adjustments, and car auction pricing, then explains TFX’s data validation, schema generation, model analysis, and deployment options to streamline machine‑learning pipelines.

AI · TFX · TensorFlow

0 likes · 12 min read

58 Tech

Jan 11, 2019 · Artificial Intelligence

Design and Implementation of an End-to-End Efficiency Optimization Platform for 58.com Classified Listings

This article describes the design and implementation of a comprehensive efficiency‑optimization platform at 58.com, detailing its end‑to‑end workflow—from log aggregation and feature extraction through machine learning model training and online experimentation—highlighting modular, configurable, and scalable solutions for multi‑business, multi‑product ranking.

click-through rate · conversion rate · data pipelines

0 likes · 25 min read

21CTO

Feb 20, 2018 · Big Data

Why Real-Time Streaming Is the Next Big Data Revolution for Developers

This article explains how real-time streaming has evolved from batch Hadoop systems through Lambda architecture to modern Kappa-style pipelines, highlighting its growing importance for developers, enterprises, and the integration of streaming with microservices, AI, and cloud-native technologies.

AI integration · Big Data · Kappa architecture

0 likes · 8 min read

Alibaba Cloud Developer

Jan 4, 2017 · Big Data

How Alibaba Powered Double 11 with Real‑Time Big Data Processing

Alibaba’s Double 11 live‑data dashboards required ultra‑high‑precision, low‑latency real‑time processing of billions of events, and the article explains the end‑to‑end architecture—including DRC, TimeTunnel, Galaxy, OTS, XTool, and OneService—used to achieve million‑plus QPS, fault‑tolerance, and flexible data collection.

Alibaba · Big Data Architecture · Real-time Streaming

0 likes · 14 min read

21CTO

Sep 28, 2015 · Artificial Intelligence

How Meituan Built a Scalable AI‑Powered Recommendation Engine

This article details Meituan's end‑to‑end recommendation system, covering its four‑layer architecture, data sources, candidate‑generation strategies, fusion methods, and both linear and non‑linear re‑ranking models, while highlighting practical optimizations like AB testing and online learning.

Meituan · data pipelines · machine learning

0 likes · 15 min read

Qunar Tech Salon

Jul 12, 2015 · Big Data

Airbnb OpenAir Conference: Open‑Source Tools Airpal, Aerosolve, and Airflow

At Airbnb’s inaugural OpenAir conference, the company unveiled three open‑source big‑data tools—Airpal, a Presto‑based visual SQL query engine; Aerosolve, an interpretable machine‑learning engine for pricing recommendations; and Airflow, an internal platform for orchestrating and monitoring data pipelines.

Airbnb · Big Data · OpenAir

0 likes · 4 min read
