Tagged articles
44 articles
Data STUDIO
Dec 25, 2025 · Fundamentals

Boost Python Automation Efficiency with toolz: A Practical Refactoring Guide

This article explains how the pure‑Python functional library toolz can transform tangled automation scripts into clear, composable data pipelines, reducing code size, improving testability, and eliminating hidden technical debt through concrete examples and step‑by‑step refactoring.

Automation · Python · data pipelines
0 likes · 15 min read
DevOps Coach
Sep 25, 2025 · Fundamentals

Unlock Python Speed: 12 Little‑Known Tricks to Turbocharge Your Code

Python is praised for its clarity but often deemed slow; this article reveals twelve overlooked, sometimes unconventional techniques—from using enumerate instead of range loops to leveraging Numba, Polars, and mypyc—that can dramatically accelerate data pipelines, APIs, and scientific workloads without rewriting code in another language.

Polars · Profiling · Python
0 likes · 9 min read
DevOps Cloud Academy
Sep 20, 2025 · Artificial Intelligence

How to Build Scalable AI Infrastructure: A Complete Guide

This article explains why robust AI infrastructure is essential, outlines its key components—from specialized hardware and orchestration platforms to security and governance—and provides a step‑by‑step roadmap, real‑world case studies, and best‑practice recommendations for constructing and continuously optimizing AI systems.

DevOps · data pipelines
0 likes · 17 min read
Alibaba Cloud Native
Jul 9, 2025 · Cloud Native

How SPL’s New Operators Supercharge Log‑to‑Metric Processing in Cloud‑Native Environments

The article introduces SPL’s latest operators—pack-fields, log-to-metric, and metric-to-metric—explaining their smart field aggregation, trimming, type inference, wildcard matching, and label manipulation capabilities, and demonstrates through code examples and performance benchmarks how they dramatically improve data processing efficiency and observability in cloud‑native log services.

Cloud Native · Log Processing · SPL
0 likes · 10 min read
Linux Cloud Computing Practice
May 29, 2025 · Big Data

Why Learn Kafka? Core Benefits, Use Cases, and Key Interview Topics

This article explains why Kafka is essential for modern data engineering, highlighting its widespread adoption, high throughput, scalability, durability, integration with streaming ecosystems, and common real‑time use cases, while also providing a concise list of interview topics for aspiring engineers.

Real-time Processing · Streaming · data pipelines
0 likes · 6 min read
Airbnb Technology Team
Mar 24, 2025 · Artificial Intelligence

Chronon: Open‑Source Feature Platform for Machine Learning – Architecture, Workflow, and Code Examples

Chronon is an open‑source ML feature platform that lets engineers declaratively define, compute, and serve both batch and real‑time features with built‑in observability, data‑quality checks, and a low‑latency retrieval API, ensuring online‑offline consistency while simplifying pipeline management and enabling future automation.

Chronon · Observability · Streaming
0 likes · 13 min read
DataFunSummit
Mar 3, 2025 · Artificial Intelligence

DeepSeek Open Source Week: Seven Core Technologies Reshaping Large‑Model Training

The DeepSeek open‑source week introduced seven breakthrough technologies—FlashMLA, DeepGEMM, DeepEP, DualPipe, EPLB, 3FS, and Smallpond—that together overhaul data flow, algorithmic complexity, hardware utilization, MoE communication, and resource balancing, dramatically improving large‑model training efficiency and lowering entry barriers for the AI industry.

AI hardware · DeepSeek · data pipelines
0 likes · 17 min read
DataFunSummit
Feb 24, 2025 · Big Data

Building Real-Time Data Synchronization Pipelines with Apache SeaTunnel

Apache SeaTunnel is an open‑source, distributed data integration platform that enables efficient real‑time data synchronization across diverse sources and destinations, supporting both streaming and batch processing, with detailed architecture, connector plugins, CDC handling, transform capabilities, and deployment strategies for large‑scale data pipelines.

Apache SeaTunnel · CDC · Real-Time Data Integration
0 likes · 34 min read
macrozheng
Dec 20, 2024 · Big Data

Master Data Pipelines with Kestra: Open‑Source Workflow Engine Explained

This article introduces the open‑source Kestra workflow engine, outlines its key features for building scalable data pipelines, provides step‑by‑step Docker installation and YAML workflow examples, and showcases its visual UI for monitoring and managing complex ETL and automation tasks.

Docker · Kestra · Workflow Orchestration
0 likes · 6 min read
Rare Earth Juejin Tech Community
Nov 29, 2024 · Big Data

How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray

The article details ByteDance's use of Ray and RayData to construct scalable audio and video data processing pipelines for multimodal AI models, addressing challenges of massive data volume, resource constraints, and fault tolerance through pipeline design, RayCore enhancements, and custom scheduling optimizations.

AI · Big Data · ByteDance
0 likes · 16 min read
DataFunSummit
Sep 24, 2024 · Artificial Intelligence

Streaming Data Pipelines and Scaling Laws for Efficient Large‑Model Training

The article discusses the challenges of training ever‑larger AI models on internet‑scale data, critiques traditional batch ETL pipelines, and proposes a streaming data‑flow architecture with dynamic data selection and a shared‑memory/Alluxio middle layer to decouple data processing from model training, improving efficiency and scalability.

AI Infrastructure · Multimodal Data · data pipelines
0 likes · 20 min read
Alibaba Cloud Observability
May 29, 2024 · Big Data

How SPL Boosts iLogtail 2.0: Combining Performance and Flexibility in Log Processing

This article traces the evolution of streaming processing languages, compares iLogtail's native and extended pipeline modes, and demonstrates how the new SPL syntax in iLogtail 2.0 delivers high‑performance, flexible log and time‑series data processing with unified, SQL‑like commands and interactive debugging tools.

Log Analytics · SPL · data pipelines
0 likes · 13 min read
Test Development Learning Exchange
Mar 31, 2024 · Big Data

Apache Airflow Overview and Advanced Usage Examples

This article introduces Apache Airflow, explains its core concepts such as DAGs, tasks, operators, executors, and the web UI, and provides multiple practical Python code examples for Bash commands, Python functions, SQL queries, task dependencies, sensors, dynamic DAGs, SubDAGs, XCom, email alerts, and error handling.

Apache Airflow · DAG · Python
0 likes · 7 min read
Inke Technology
Nov 24, 2023 · Backend Development

Building a Scalable Overseas Ad Platform: Architecture, Permissions & Automation

To support rapid overseas expansion, the article outlines a comprehensive backend architecture—including management, data ingestion, device tracking, attribution, and offline tasks—while detailing fine-grained user permission controls, automated product onboarding, batch ad creation, and server‑side attribution workflows, plus future enhancements.

Backend · advertising platform · data pipelines
0 likes · 12 min read
DataFunTalk
Aug 3, 2023 · Game Development

Applying A/B Testing to Drive Growth in Tencent Overseas Games

This article explains how Tencent leverages A/B testing across its overseas games, detailing market differences, experimental methodology, multi‑cloud platform compliance, data architecture, and case studies that illustrate how targeted experiments improve user onboarding, gameplay settings, and email‑based re‑engagement.

A/B testing · Game Analytics · data pipelines
0 likes · 12 min read
Alibaba Cloud Developer
Jul 31, 2023 · Big Data

From BI to Kappa: How Data Architecture Evolved in the Big Data Era

This article traces the evolution of data architecture from early BI systems through traditional big‑data stacks, streaming, Lambda and Kappa designs, and explains how a unified stream‑batch model simplifies development while keeping logic consistent across data‑analysis and pipeline applications.

BI systems · Big Data · Data Architecture
0 likes · 16 min read
SQB Blog
Jul 20, 2023 · Artificial Intelligence

How We Built and Optimized a Multi‑Pool Recommendation System for Boss Circle

This article explains the design, implementation, and iterative optimization of Boss Circle's recommendation engine, covering the initial simple ranking, the introduction of Elasticsearch‑based scoring, multi‑pool data sources, machine‑learning experiments, real‑time feature handling, and future personalization challenges.

Elasticsearch · data pipelines · personalization
0 likes · 17 min read
Java Architecture Diary
Jul 5, 2023 · Cloud Native

Deploy and Explore StreamPipes: A Self‑Service Industrial IoT Toolbox

This guide introduces StreamPipes, an end‑to‑end industrial IoT toolbox for non‑technical users, outlines its key features, shows how to connect data sources, build pipelines, visualize data, and provides step‑by‑step Docker‑Compose installation, configuration, and development instructions.

Docker Compose · Industrial IoT · StreamPipes
0 likes · 8 min read
DevOps Cloud Academy
Oct 15, 2022 · Big Data

Introduction to Apache Airflow

Apache Airflow is an open‑source platform for programmatically authoring, scheduling, and monitoring workflows using Directed Acyclic Graphs (DAGs), featuring components such as Scheduler, Web Server, Database, and various Executors, and offering easy‑to‑use, extensible, scalable, and robust integrations for data pipeline management.

Apache Airflow · DAG · Executor
0 likes · 10 min read
Alibaba Cloud Big Data AI Platform
Oct 12, 2022 · Artificial Intelligence

Unlock Vision AI: How EasyCV Streamlines Datasets and Model Training

This article introduces EasyCV, an open‑source all‑in‑one visual algorithm platform that abstracts diverse data sources, provides SOTA self‑supervised models, and offers ready‑to‑download datasets for image classification, object detection, segmentation, and pose estimation, complete with configuration examples.

Computer Vision · Datasets · Deep Learning
0 likes · 9 min read
DataFunSummit
Aug 25, 2022 · Big Data

Managing the Full Lifecycle of Risk Features: Pitfalls, Solutions, and Future Directions

The talk by Tang Gengyang from Citic Baixin Bank details the challenges faced in risk feature engineering, presents two solution frameworks (1.0 and 2.0) for accelerating deployment, improving reuse, handling offline/online consistency, and outlines future enhancements for a more efficient, automated feature pipeline.

Flink · asynchronous processing · data pipelines
0 likes · 14 min read
Top Architect
Jun 7, 2022 · Databases

An Introduction to Change Data Capture (CDC) Practices and Modern Approaches

This article introduces the concept of Change Data Capture (CDC), explains why traditional batch reporting strains resources, describes how CDC captures only data changes to keep source databases performant, and outlines modern CDC architectures, production‑ready considerations, and best‑practice guidelines for building reliable data pipelines.

CDC · Change Data Capture · Data Integration
0 likes · 16 min read
Big Data Technology Architecture
Jun 3, 2022 · Operations

Understanding Apache Airflow DAGs, Operators, and Scheduling

This article explains Apache Airflow's core concepts, including DAG definitions, scheduling intervals, task dependencies, various operators such as BashOperator, PythonOperator, Branch operators, sensors, and custom operators, and provides code examples and configuration details for building robust data pipelines.

Apache Airflow · DAG · Scheduling
0 likes · 15 min read
Big Data Technology Architecture
May 31, 2022 · Big Data

Comprehensive Guide to Installing and Using Apache Airflow with Docker on Windows

This article provides a detailed tutorial on Apache Airflow fundamentals, Docker-based installation on Windows, Dockerfile creation, container deployment via Docker run and Docker Compose, Airflow configuration, and practical usage of DAGs, tasks, connections, and UI features for data pipeline orchestration.

Apache Airflow · Docker · Docker Compose
0 likes · 14 min read
21CTO
Nov 1, 2021 · Big Data

Essential Data Engineering Roadmap: Skills, Tools, and Technologies to Master

This guide outlines the fast‑growing data engineering career path, covering essential Linux fundamentals, programming languages, testing, database concepts, data warehouses, processing frameworks, messaging systems, cluster computing, workflow scheduling, monitoring, infrastructure as code, and CI/CD tools.

Big Data · data engineering · data pipelines
0 likes · 5 min read
Programmer DD
Sep 26, 2021 · Big Data

What’s New in Apache Kafka 3.0? Key Features and Improvements Explained

Apache Kafka 3.0.0 introduces a host of changes, including the deprecation of Java 8 and Scala 2.12 support, Raft metadata snapshots, stronger producer guarantees, MirrorMaker 2 upgrades, and Kafka Streams improvements, while continuing to serve real‑time data pipelines and streaming applications.

Apache Kafka · Big Data · Kafka 3.0
0 likes · 3 min read
Efficient Ops
Jun 1, 2021 · Artificial Intelligence

How Time‑Series Analysis Powers AIOps: Overcoming Real‑World Challenges

At the 16th GOPS Global Operations Conference, Shen Hui of DingMao Technology explained how time‑series data analysis underpins AIOps, outlining its four‑step workflow, key challenges, and the company’s three‑pipeline solution that enables trend forecasting, fault prediction, and a robust AI‑driven operational platform.

AI · Operations · Time Series Analysis
0 likes · 7 min read
DataFunTalk
Mar 18, 2021 · Fundamentals

Building Popper: Tubi’s Scalable Experimentation Platform

Tubi’s Popper platform combines a Scala‑based experiment engine, reproducible JSON‑stored configurations, a React UI, and data pipelines using Spark and Akka to enable fast, cross‑team A/B testing, automated analysis, health checks, and data‑driven decision making across mobile and OTT services.

A/B testing · Akka · Experimentation platform
0 likes · 15 min read
DataFunSummit
Feb 4, 2021 · Artificial Intelligence

Full-Stack Machine Learning Platform: Architecture, Key Factors, and Implementation Details

This article examines the evolution of user data, computing power, and models, and presents the design principles, key architectural factors, and practical implementation techniques for building a full‑stack machine learning platform that supports large‑scale data processing, distributed training, and low‑latency online serving.

Big Data Integration · Machine Learning Platform · data pipelines
0 likes · 15 min read
58 Tech
Dec 23, 2019 · Operations

Intelligent Duty Robot for Real‑Estate Data Job Monitoring and Automation

The article describes an intelligent duty robot that uses a sense‑think‑act framework and job dependency graphs to automatically monitor, diagnose, and remediate data pipeline jobs in a real‑estate platform, reducing operational pressure and achieving over 98% notification accuracy.

Automation · Operational Efficiency · Real Estate Data
0 likes · 9 min read
Big Data Technology & Architecture
Jun 22, 2019 · Backend Development

Understanding Back Pressure in Flink and Its Implementation

The article explains what back pressure is in Flink streaming jobs, why it occurs when data generation outpaces downstream consumption, how Flink monitors it via stack‑trace sampling, configurable parameters, Web UI visualization, and compares the approach with Spark Streaming's back pressure mechanism.

Flink · Spark · data pipelines
0 likes · 5 min read
AntTech
May 7, 2019 · Artificial Intelligence

SQLFlow: Bridging SQL Engines and AI Platforms for End‑to‑End Machine Learning

SQLFlow is an open‑source project that connects diverse SQL engines (MySQL, Hive, SparkSQL, etc.) with AI frameworks (TensorFlow, PyTorch, XGBoost, etc.) through extended SQL syntax, enabling analysts to train and predict models using only a few SQL statements while aiming for high scalability and performance.

AI integration · Go · SQL
0 likes · 13 min read
Ctrip Technology
Feb 13, 2019 · Artificial Intelligence

Understanding TensorFlow Extended (TFX): Concepts, Data Preparation, and Model Deployment

This article introduces TensorFlow Extended (TFX), illustrating practical TensorFlow examples such as ship trajectory classification, insurance premium adjustments, and car auction pricing, then explains TFX’s data validation, schema generation, model analysis, and deployment options to streamline machine‑learning pipelines.

AI · TFX · TensorFlow
0 likes · 12 min read
58 Tech
Jan 11, 2019 · Artificial Intelligence

Design and Implementation of an End-to-End Efficiency Optimization Platform for 58.com Classified Listings

This article describes the design and implementation of a comprehensive efficiency‑optimization platform at 58.com, detailing its end‑to‑end workflow—from log aggregation and feature extraction through machine learning model training and online experimentation—highlighting modular, configurable, and scalable solutions for multi‑business, multi‑product ranking.

click-through rate · conversion rate · data pipelines
0 likes · 25 min read
21CTO
Feb 20, 2018 · Big Data

Why Real-Time Streaming Is the Next Big Data Revolution for Developers

This article explains how real-time streaming has evolved from batch Hadoop systems through Lambda architecture to modern Kappa-style pipelines, highlighting its growing importance for developers, enterprises, and the integration of streaming with microservices, AI, and cloud-native technologies.

AI integration · Big Data · Kappa architecture
0 likes · 8 min read
Alibaba Cloud Developer
Jan 4, 2017 · Big Data

How Alibaba Powered Double 11 with Real‑Time Big Data Processing

Alibaba’s Double 11 live‑data dashboards required ultra‑high‑precision, low‑latency real‑time processing of billions of events, and the article explains the end‑to‑end architecture—including DRC, TimeTunnel, Galaxy, OTS, XTool, and OneService—used to achieve million‑plus QPS, fault‑tolerance, and flexible data collection.

Alibaba · Big Data Architecture · Real-time Streaming
0 likes · 14 min read
21CTO
Sep 28, 2015 · Artificial Intelligence

How Meituan Built a Scalable AI‑Powered Recommendation Engine

This article details Meituan's end‑to‑end recommendation system, covering its four‑layer architecture, data sources, candidate‑generation strategies, fusion methods, and both linear and non‑linear re‑ranking models, while highlighting practical optimizations like AB testing and online learning.

Meituan · Online Learning · data pipelines
0 likes · 15 min read
Qunar Tech Salon
Jul 12, 2015 · Big Data

Airbnb OpenAir Conference: Open‑Source Tools Airpal, Aerosolve, and Airflow

At Airbnb’s inaugural OpenAir conference, the company unveiled three open‑source big‑data tools—Airpal, a Presto‑based visual SQL query engine; Aerosolve, an interpretable machine‑learning engine for pricing recommendations; and Airflow, an internal platform for orchestrating and monitoring data pipelines.

Airbnb · Big Data · OpenAir
0 likes · 4 min read