Tagged articles

ETL

304 articles · Page 1 of 4

Jun 28, 2026 · Big Data

How Cainiao Uses DataWorks Data Agent to Deploy AI-Powered SuperETL

Cainiao combines a decade of logistics data-warehouse experience with Alibaba Cloud’s DataWorks Data Agent to build the SuperETL intelligent system, which orchestrates nine fine-grained skills, enforces safety hooks, and boosts data-development efficiency by 2-3× while achieving over 80% AI automation in key scenarios.

AI · Cainiao · Data Agent

0 likes · 12 min read

dbaplus Community

Jun 23, 2026 · Big Data

How AI‑Powered Skills Cut 70% of Repetitive Data Development Work

After an ADS table stopped updating, the root cause was found in three seconds and the data warehouse was rebuilt in three hours using a Claude‑based Skill, which eliminated about 70% of the manual, repetitive steps traditionally required in data development, testing, deployment, and operations.

AI automation · Claude · Data Development

0 likes · 12 min read

AI Engineer Programming

Jun 20, 2026 · Artificial Intelligence

RAG Data Ingestion: Managing Heterogeneous Sources and Unified Metadata

The article analyzes common pitfalls in RAG data ingestion—connection failures and incomplete records—advocates defining required metadata fields before integration, and provides source‑specific guidelines for databases, APIs, object storage, web crawlers, and manual uploads to ensure reliable downstream governance.

AI · ETL · Knowledge Base

0 likes · 17 min read

Spring Full-Stack Practical Cases

Jun 6, 2026 · Artificial Intelligence

Essential ETL Techniques for Spring AI RAG – A Must‑Read Guide

This article explains how Spring AI implements the ETL pipeline for Retrieval‑Augmented Generation, detailing the three core components—DocumentReader, DocumentTransformer, and DocumentWriter—along with concrete code examples, configuration parameters, and processing steps for text, PDF, and Tika document sources.

DocumentReader · ETL · KeywordMetadataEnricher

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

May 27, 2026 · Artificial Intelligence

DataWorks Data Agent Powers AI‑Driven Data Development: 2‑3× Faster and 80% Automation with SuperETL

The article details how DataWorks Data Agent integrates logistics industry standards and a skill‑based orchestration to overhaul the data development workflow, delivering 2‑3× efficiency gains and up to 80% AI‑automated task completion through SuperETL, hooks, and CLI tools.

AI · Automation · Data Engineering

0 likes · 10 min read

Big Data Tech Team

May 24, 2026 · Big Data

Data Warehouse Interview Pitfall Guide 2.0: Avoid Common SQL, Modeling, and ETL Mistakes

This guide compiles the most frequent interview pitfalls for data warehouse roles, covering SQL join and aggregation errors, window function misuse, subquery versus CTE performance myths, dimensional modeling mistakes, SCD implementation traps, layered design issues, data quality handling, ETL traps, Hive and Spark performance questions, real‑time warehousing considerations, and effective interview strategies.

Big Data · ETL · Hive

0 likes · 3 min read

Big Data Tech Team

May 19, 2026 · Big Data

Enterprise Data Warehouse Development Playbook: Standard Engineering Edition

This playbook provides enterprise‑level data warehouse engineers, ETL developers, data modelers, and data‑team managers with a complete, logical, and actionable set of standards, processes, and best‑practice guidelines covering architecture, development principles, role responsibilities, end‑to‑end workflow, metadata, security, performance metrics, and team collaboration.

Data Quality · ETL · Metadata

0 likes · 18 min read

Architect's Guide

May 9, 2026 · Databases

Alibaba’s Open‑Source DataX: Fast, Easy Offline Data Synchronization

This article introduces Alibaba’s open‑source DataX tool, explains its framework‑plugin architecture for heterogeneous database sync, walks through Linux installation, job configuration, full‑ and incremental MySQL synchronization, and shares performance results and practical tips.

Data synchronization · DataX · ETL

0 likes · 15 min read

Open Source Tech Hub

Apr 8, 2026 · Backend Development

Master Efficient PHP Data Pipelines with the Low‑Memory Flow Framework

This article introduces the Flow PHP data‑processing framework, highlights its ultra‑low memory footprint and extensible pipeline capabilities, and provides step‑by‑step installation and code examples for handling in‑memory arrays and CSV files in ETL workflows.

ETL · PHP · backend

0 likes · 4 min read

Big Data Tech Team

Jan 26, 2026 · Big Data

Master DWD, DWS, and Wide‑Table Modeling for Scalable Data Warehouses

This guide explains the DWD (detail) and DWS (summary) layered modeling approach combined with wide‑table‑driven design, covering model positioning, design principles, concrete schema examples, implementation techniques, performance tips, and common pitfalls to help build clean, reusable, high‑performance enterprise data warehouses.

DWD · DWS · Data Warehouse

0 likes · 9 min read

Big Data Tech Team

Jan 15, 2026 · Big Data

Mastering Data Warehousing: Core Concepts, Tools, and Future Trends

This article outlines a comprehensive roadmap for data warehousing, covering fundamental concepts, essential big‑data tools, practical implementation steps, advanced architectural topics, and emerging trends such as cloud‑native warehouses and machine‑learning integration, helping readers build a solid knowledge base.

Cloud Data Warehouse · Data Warehouse · ETL

0 likes · 9 min read

Big Data Tech Team

Jan 12, 2026 · Fundamentals

Why Wide Tables Are Essential in DWS Layer: 10 Real-World Modeling Scenarios

This article explains the purpose of the DWS (Data Warehouse Service) layer, why wide‑table modeling is crucial for performance and service‑oriented interfaces, and provides ten practical wide‑table designs with core field definitions, CREATE TABLE statements, and sample INSERT queries for common business domains such as products, users, orders, regions, channels, suppliers, services, finance, logistics, and data quality monitoring.

Analytics · ETL · SQL

0 likes · 34 min read

Big Data Tech Team

Jan 12, 2026 · Big Data

Avoid the 5 Fatal DWS Design Traps and Build Scalable Data Warehouses

This article dissects the five most common pitfalls when transitioning from DWD to DWS aggregation tables—such as chimney‑style designs, over‑wide tables, grain mismatches, missing drill‑down keys, and performance neglect—and offers concrete, production‑ready solutions to create reusable, efficient, and cost‑effective data‑warehouse layers.

DWS Design · Data Warehouse · ETL

0 likes · 9 min read

Big Data Tech Team

Dec 25, 2025 · Big Data

How to Build an End‑to‑End E‑Commerce Data Warehouse for Interview Success

This guide walks you through designing and implementing a complete e‑commerce data‑warehouse project—from raw data ingestion and ODS/DWD/DWS/ADS layers to optional real‑time analytics—while highlighting interview‑ready resume tips, common pitfalls, and performance‑tuning tricks.

Big Data · ETL · Flink

0 likes · 10 min read

Java Architect Handbook

Dec 15, 2025 · Industry Insights

How DBSyncer Simplifies Multi‑Source Data Synchronization Across Databases

The article introduces the open‑source DBSyncer middleware that enables full‑stack data synchronization across MySQL, Oracle, SQL Server, PostgreSQL, Elasticsearch and Kafka, outlines its visual composition, full and incremental sync, and real‑time monitoring, and provides step‑by‑step installation instructions, while also mentioning related Java learning projects.

DBSyncer · Data synchronization · ETL

0 likes · 6 min read

AI Insight Log

Dec 8, 2025 · Artificial Intelligence

How to Teach Claude Any Framework in 20 Minutes with Skill Seekers

This article explains how the open‑source Skill Seekers tool automates the extraction, cleaning, structuring, and packaging of documentation, code repositories, and PDFs into Claude‑compatible Skills, enabling rapid onboarding of obscure frameworks while highlighting conflict detection, MCP integration, and practical usage tips.

AI Tooling · Claude · ETL

0 likes · 9 min read

Top Architect

Dec 1, 2025 · Big Data

Master DataX: Fast MySQL‑to‑MySQL Data Synchronization and Incremental Updates

This guide walks you through installing JDK, Python and DataX on Linux, configuring MySQL sources, creating the necessary tables and stored procedures, and using DataX's JSON job definitions to perform both full‑load and incremental data synchronization between two MySQL instances, complete with performance metrics and troubleshooting tips.

Data synchronization · DataX · ETL

0 likes · 16 min read

Java Architect Handbook

Nov 23, 2025 · Big Data

Master Data Synchronization with Alibaba DataX: From Installation to Incremental Sync

This guide explains how to use Alibaba's open‑source DataX tool to synchronize large MySQL datasets, covering the tool’s architecture, installation on Linux, job configuration with JSON, full‑load and incremental sync examples, and performance results, all without relying on mysqldump or manual storage methods.

Big Data · Data synchronization · DataX

0 likes · 17 min read

Alibaba Cloud Developer

Nov 20, 2025 · Big Data

Mastering Large‑Scale Data Migration: Challenges, Strategies and Real‑World Solutions

This article explains why data migration is the essential first step for cloud modernization, outlines the technical challenges of moving terabytes to petabytes, compares physical and logical migration methods, and presents practical solutions and real‑world case studies across Hive, cloud warehouses, lake‑house formats and analytic databases.

Big Data · Data Migration · ETL

0 likes · 56 min read

Ray's Galactic Tech

Nov 18, 2025 · Big Data

Master Spark SQL: From DataFrames to Catalyst Optimization and Real-World Use Cases

This comprehensive guide walks you through Spark SQL fundamentals—including DataFrame and Dataset APIs—delves into the Catalyst optimizer and Tungsten engine, presents practical Java examples, and shares concrete tuning techniques and real-world ETL scenarios for handling large‑scale data.

Catalyst · ETL · Optimization

0 likes · 8 min read

Alibaba Cloud Developer

Nov 7, 2025 · Big Data

Unlock Enterprise‑Grade Data Pipelines with DMS Airflow: Features, Integration & Code Samples

This article introduces DMS Airflow, an enterprise‑level data workflow orchestration platform built on Apache Airflow, covering its advanced DAG capabilities, deep DMS integration, scheduling, task dependency management, dynamic task generation, resource scaling, security features, and practical code examples for SQL, Spark, DTS, and Notebook tasks.

Airflow · Big Data · DMS

0 likes · 20 min read

Big Data Tech Team

Oct 30, 2025 · Big Data

Mastering the ADS Layer: Design Principles, Modeling, and Real‑Time Data Services

This article provides a comprehensive analysis of the ADS (Application Data Service) layer in a data‑warehouse architecture, covering its core positioning, design goals, modeling strategies, dimension‑optimization techniques, API services, typical challenges, and practical best‑practice recommendations for high‑performance, flexible, and secure data delivery.

ADS layer · ETL · SQL

0 likes · 8 min read

Selected Java Interview Questions

Oct 21, 2025 · Big Data

How to Sync Massive MySQL Datasets Efficiently with DataX

This guide walks through the challenges of synchronizing tens of millions of records between heterogeneous MySQL databases, explains why traditional mysqldump or file‑based methods fail, and provides a step‑by‑step tutorial on installing, configuring, and using Alibaba's open‑source DataX tool for both full and incremental data synchronization.

Big Data · Data synchronization · DataX

0 likes · 15 min read

Instant Consumer Technology Team

Oct 14, 2025 · Big Data

How to Boost Spark SQL DAG Efficiency with Regex‑Driven Temporary Views

This article explains how to reduce intermediate tables, simplify dependencies, and improve execution efficiency in Spark SQL pipelines by using session‑level temporary views and regex‑based SQL parsing to automatically merge and rewrite DAG tasks in large‑scale data platforms.

Big Data · DAG Optimization · ETL

0 likes · 13 min read

Baidu Geek Talk

Oct 13, 2025 · Big Data

How Baidu Scaled Its Data Warehouse to Handle Billions of PVs and Petabytes

This article details Baidu APP's massive data‑warehouse overhaul, describing the two‑step strategy that stabilized log cleaning, modernized the ETL framework, introduced wide‑table architectures, and implemented tiered storage to dramatically improve processing speed, reliability, and cost efficiency for petabyte‑scale workloads.

Big Data · Data Warehouse · ETL

0 likes · 25 min read

Big Data Tech Team

Sep 15, 2025 · Interview Experience

When to Use Materialized Views in Production: Benefits, Types, and Pitfalls

This article explains what materialized views are, outlines their advantages such as query acceleration, lightweight ETL, and lake‑warehouse integration, classifies them by sync mode, table count, and refresh strategy, and highlights their limitations and best‑practice recommendations for production use.

Data Warehousing · Database Performance · ETL

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

Aug 4, 2025 · Big Data

Decoupling Ops Troubleshooting: Building a DataOps Warehouse with ETL

This article explains how to transform traditional SRE troubleshooting into a data‑driven process by pre‑collecting operational metrics into a data warehouse, using ETL to create layered data models (ODS, DIM, DWD, DWS) that enable efficient, repeatable analysis while balancing data freshness and storage costs.

Big Data · Data Warehouse · DataOps

0 likes · 7 min read

Big Data Tech Team

Jul 23, 2025 · Big Data

From Beginner to Data Warehouse Architect: A Complete Roadmap

This guide walks you through every essential topic—from data warehouse architecture and layering, through ETL, OLAP, Hadoop, and Flink, to visualization tools, learning paths, recommended resources, and the management skills needed to become a proficient data warehouse architect.

Data Warehouse · ETL · Flink

0 likes · 9 min read

Architect

Jul 7, 2025 · Big Data

How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

Data Warehouse · ETL · Parquet

0 likes · 21 min read

Architect's Guide

Jun 14, 2025 · Big Data

Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling

This article explains the core components of a data warehouse ecosystem, distinguishes fact and dimension tables, outlines synchronization strategies, introduces star, snowflake, and constellation schemas, and details the layered architecture from ODS to data marts for effective big‑data analytics.

Data Warehouse · ETL · Fact Table

0 likes · 15 min read

Su San Talks Tech

May 29, 2025 · Big Data

How to Sync Massive MySQL Data with Alibaba DataX – Step‑by‑Step Guide

Starting from a 50‑million‑row project plagued by inaccurate reports and cross‑database operations, this guide explains why mysqldump and simple storage methods fail, introduces Alibaba’s open‑source DataX middleware, and details its architecture, installation, and step‑by‑step configuration for full and incremental MySQL data synchronization.

Data synchronization · DataX · ETL

0 likes · 14 min read

Zhuanzhuan Tech

May 21, 2025 · Big Data

How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse

This article details the evolution of a fast‑growing e‑commerce finance platform from a monolithic microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big‑data warehouse using SparkSQL, layered data models, and optimized scheduling, achieving ten‑fold performance gains and near‑zero failure rates.

Big Data · Data Warehouse · ETL

0 likes · 21 min read

Java Backend Technology

May 21, 2025 · Big Data

Master DataX: Fast Offline Data Sync for MySQL without mysqldump

This guide explains how to use Alibaba's open‑source DataX tool to perform high‑performance offline synchronization between heterogeneous MySQL databases, covering installation, framework design, job configuration, full‑ and incremental sync, and practical command‑line examples.

Big Data · Data synchronization · DataX

0 likes · 15 min read

Java Tech Enthusiast

May 13, 2025 · Big Data

Using Alibaba DataX 3.0 for MySQL Data Synchronization: Installation, Configuration, and Incremental Sync

This article introduces Alibaba DataX 3.0, explains its architecture and role‑based design, walks through Linux installation, JDK setup, MySQL preparation, and provides step‑by‑step examples of full‑load and incremental data synchronization between two MySQL instances using JSON job configurations and command‑line execution.

Data synchronization · DataX · ETL

0 likes · 14 min read

macrozheng

May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

Big Data · Data synchronization · DataX

0 likes · 15 min read

Top Architect

May 7, 2025 · Big Data

Using DataX for Efficient MySQL Data Synchronization

This article provides a comprehensive guide on using Alibaba's open‑source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full‑ and incremental data transfer, and practical code examples.

Big Data · Data synchronization · DataX

0 likes · 18 min read

Architecture Digest

May 6, 2025 · Big Data

Using DataX for Efficient Data Synchronization Between MySQL Databases

This article explains how to employ Alibaba's open‑source DataX tool to perform fast, reliable full‑ and incremental data synchronization between MySQL instances, covering installation, framework design, job execution, and practical shell commands for Linux environments.

Big Data · Data synchronization · DataX

0 likes · 16 min read

Big Data Tech Team

Apr 26, 2025 · Big Data

Mastering the Data Development Roadmap: From Infrastructure to AI Integration

This guide outlines a comprehensive data development roadmap, covering infrastructure setup, governance frameworks, automated pipelines, BI and analytics tools, AI/ML integration, cultural adoption, and continuous performance monitoring to enable intelligent business transformation.

AI integration · Analytics · Big Data

0 likes · 5 min read

ITPUB

Apr 17, 2025 · Databases

Migrate 700TB Over 2Mbps: Scripts, Sneakernet & Practical Steps

When a manager demands a script to move a 700‑terabyte database under a 2 Mbps bandwidth cap, the realistic solution combines physical Sneakernet transfer with a carefully staged export‑transform‑load script that handles field mapping, compression, rate‑limited transport, and fault‑tolerant import.

ETL · Java · large data transfer

0 likes · 8 min read

Big Data Tech Team

Apr 16, 2025 · Operations

Mastering Data Warehouse Naming: A Complete Guide to Standards and Processes

This article provides a comprehensive, step‑by‑step guide to data‑warehouse development, covering the full R&D workflow, data modeling layers, data dictionary creation, naming conventions for tables, columns, indexes and ETL jobs, metric standardization, and governance processes to ensure consistent, maintainable data assets across the organization.

ETL · Metadata · data dictionary

0 likes · 28 min read

Big Data Tech Team

Mar 17, 2025 · Big Data

How to Design and Review a Data Warehouse Model: A Complete Guide

This document outlines a comprehensive data warehouse model design and review process, covering revision records, project overview, business requirements, conceptual and logical modeling, ETL workflow, exception handling, and acceptance criteria with practical examples and templates.

Data Warehouse · ETL · Model Design

0 likes · 6 min read

Ma Wei Says

Mar 16, 2025 · Databases

Mastering Slowly Changing Dimensions: Which SCD Strategy Fits Your Data Warehouse?

This article explains the concept of Slowly Changing Dimensions (SCD) in data warehouses, compares six common SCD handling methods—including SCD0, SCD1, SCD2, SCD3, combined SCD2+SCD3, and historical tables—and guides you on selecting the most suitable approach for your business needs.

Data Warehouse · ETL · SCD Types

0 likes · 9 min read

Ma Wei Says

Mar 11, 2025 · Big Data

Mastering DWS Layer Design: Principles, Steps, and Best Practices

This article explains the role of the DWS layer in data warehouses; outlines design principles, step‑by‑step modeling, naming conventions, and field design; and provides concrete DDL/ETL examples, common pitfalls, and guidance on building reusable, performant summary tables for analytics.

Big Data · DWS Layer · Data Warehouse

0 likes · 15 min read

Ma Wei Says

Feb 26, 2025 · Databases

Understanding Fact Tables: Types, Granularity, and Design Best Practices

This article explains fact tables in data warehousing, covering their definition, granularity, additive classifications, null handling, consistency rules, and the various types such as transaction, snapshot, cumulative, factless, and aggregate tables, along with design trade‑offs and ETL considerations.

BI · ETL · dimensional modeling

0 likes · 17 min read

vivo Internet Technology

Dec 18, 2024 · Big Data

Kafka Streams: Architecture, Configuration, and Monitoring Use Cases

Kafka Streams is a client library that enables low‑latency, fault‑tolerant real‑time processing of Kafka data through configurable topologies, time semantics, and state stores. The article explains its architecture, essential configurations, a monitoring‑focused ETL example, performance tuning, and strategies for handling partition skew.

Big Data · ETL · Java

0 likes · 25 min read

Test Development Learning Exchange

Dec 1, 2024 · Big Data

How to Install Apache Airflow and Build a Simple Data Processing Pipeline

This tutorial guides you through installing Apache Airflow, initializing its database, starting the web server and scheduler, creating a Python DAG that reads, cleans, groups, and saves CSV data, configuring the DAG directory, and monitoring the pipeline via the Airflow web UI.

Apache Airflow · DAG · ETL

0 likes · 6 min read

Big Data Technology & Architecture

Oct 21, 2024 · Big Data

Key New Features of Apache Doris 3.0: Storage‑Compute Separation, Lakehouse Integration, Semi‑Structured Data, ETL Enhancements, Materialized Views, and Java UDTF

Apache Doris 3.0 introduces storage‑compute separation, native lakehouse write‑back, optimized Variant handling for semi‑structured data, stronger ETL transaction support, enhanced multi‑table materialized views, and Java UDTF capabilities, providing developers with more flexible, cost‑effective, and high‑performance analytics solutions.

Apache Doris · Data Warehouse · ETL

0 likes · 7 min read

Huolala Tech

Oct 17, 2024 · Big Data

How Filing 1.0 Revolutionizes Heterogeneous Data Archiving for High‑Scale Transactions

Filing 1.0 is a no‑code, heterogeneous data‑archiving platform that unifies MySQL, HBase, Hive, and Elasticsearch, addressing massive order volumes, multi‑domain requirements, and hot‑cold data separation through a star‑shaped architecture, flexible scheduling, and a four‑component archiving engine.

Data Archiving · Distributed Processing · ETL

0 likes · 14 min read

macrozheng

Sep 27, 2024 · Big Data

Master DataX: Efficient Offline Data Sync for Heterogeneous Sources

This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.

Big Data · Data Integration · DataX

0 likes · 15 min read

dbaplus Community

Sep 5, 2024 · Databases

How to Migrate Data from MongoDB to MySQL Using DuckDB

This guide explains how to export MongoDB collections to JSON, load them into DuckDB, generate compatible table schemas, and then transfer the data efficiently into MySQL using DuckDB as an intermediate processing engine.

Data Migration · DuckDB · ETL

0 likes · 6 min read

Alibaba Cloud Developer

Sep 3, 2024 · Big Data

Mastering Data Modeling: From Raw Data to Insightful Warehouses

This article walks through the fundamentals of data modeling, explaining what data is, the DIKW framework, why modeling matters, and detailing the end‑to‑end process from conceptual design through logical and physical layers, including DIM, DWD, DWS, and ADM tables with practical tips and naming conventions.

Data Warehouse · ETL · data modeling

0 likes · 11 min read

IT Xianyu

Aug 26, 2024 · Big Data

Hive Data Warehouse: Modeling, Partitioning, and ID‑Mapping for User Profiles

This article explains how Hive serves as a data‑warehouse layer for user‑profile tagging, covering data‑warehouse fundamentals, fact‑and‑dimension modeling, partitioned storage, label aggregation, and ID‑mapping techniques with practical Hive DDL/DML examples.

Big Data · Data Warehouse · ETL

0 likes · 11 min read

DataFunTalk

Aug 8, 2024 · Big Data

Building a User Profile Data Warehouse at 58.com: Architecture, Modeling, and Practices

This article details the design and implementation of a user‑profile data warehouse at 58.com, covering data‑warehouse fundamentals, user‑profile tag generation, layered architecture, dimensional modeling choices, ETL migration from Hive to Spark, data‑quality safeguards, and the resulting scale of tables, metrics and tags.

ETL · dimensional modeling · user profiling

0 likes · 20 min read

DataFunTalk

Jul 10, 2024 · Big Data

Apache SeaTunnel: A Next‑Generation Data Integration Platform for ETL/ELT and OLAP

This article introduces Apache SeaTunnel, a modern data integration platform designed for the EtLT era, detailing its architecture, core connector APIs, checkpoint mechanism, model inference, multi‑table synchronization, the high‑performance SeaTunnel Zeta engine, OLAP use cases, community roadmap, and the commercial WhaleTunnel product.

Apache SeaTunnel · Big Data · ELT

0 likes · 22 min read

DaTaobao Tech

Jul 8, 2024 · Big Data

ODPS (MaxCompute) SQL Basics, Data Integration and Hologres Import Guide

This guide provides a comprehensive, beginner‑to‑advanced reference for ODPS (MaxCompute) SQL, covering table creation, DDL/DML commands, query syntax, join hints, MySQL‑to‑ODPS synchronization, one‑click and custom imports into Hologres, and scheduling variables for automated data pipelines.

Data Integration · ETL · Hologres

0 likes · 37 min read

DevOps

Jun 27, 2024 · Big Data

Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration

This article explores agile data engineering, advocating code‑as‑infrastructure practices such as code‑everything, data and code reuse, and ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.

Big Data · Code as Infrastructure · Data Engineering

0 likes · 22 min read

dbaplus Community

May 27, 2024 · Backend Development

Why Cache Warm‑up Is Critical and How to Do It Effectively

The article recounts a painful production incident caused by missing cache warm‑up, explains why pre‑loading caches is essential for performance and reliability, and presents practical strategies such as gray‑scale rollout, database scanning, and ETL‑driven cache filling.

Backend Engineering · Cache Warm-up · ETL

0 likes · 8 min read

Big Data Technology & Architecture

May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow · Big Data · Cloud Computing

0 likes · 26 min read

DataFunTalk

May 26, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform for Sparkle Thinking

The article details how Sparkle Thinking built the Athena Data Factory—a comprehensive, self‑service data development and governance platform that integrates data integration, ETL, real‑time processing, monitoring, and analytics, describing its architecture, key technologies, implementation timeline, operational practices, performance gains, and future directions.

Airflow · ETL · Flink

0 likes · 26 min read

DataFunTalk

May 13, 2024 · Big Data

Data Integration Maturity Model: From ETL to EtLT

The article examines the evolution of data integration architectures—from traditional ETL through ELT to the emerging EtLT model—highlighting their advantages, disadvantages, industry trends, maturity stages, and practical guidance for enterprises and professionals navigating modern big‑data pipelines.

Big Data · Data Integration · DataOps

0 likes · 31 min read

Test Development Learning Exchange

May 9, 2024 · Fundamentals

Getting Started with petl: Installation, Basic Operations, and Practical Examples

This article introduces the Python petl library for easy ETL tasks, explains how to install it via pip, and demonstrates core operations such as loading CSV data, viewing, filtering, sorting, converting, aggregating, joining, deduplicating, and performing basic statistical analysis with clear code examples.

ETL · Python · petl

0 likes · 4 min read

DataFunSummit

May 2, 2024 · Big Data

Building an Attribution System for NetEase Cloud Music Data Warehouse: Challenges and Solutions

This article presents the problems faced by NetEase Cloud Music's data warehouse attribution system and details a comprehensive solution that includes upgrading the event‑tracking framework, redesigning the attribution model, and launching a unified management platform to improve stability, accuracy, and scalability.

Analytics · Big Data · Data Warehouse

0 likes · 13 min read

Alibaba Cloud Native

Mar 27, 2024 · Big Data

How to Route Kafka Messages to MongoDB DML with Alibaba Cloud Function Compute

This guide explains how to use Alibaba Cloud Function Compute to inspect Kafka message keys and automatically perform insert, update, or delete operations on MongoDB, detailing the architecture, advantages, prerequisites, step‑by‑step deployment, and current limitations.

Big Data · ETL · Function Compute

0 likes · 8 min read

dbaplus Community

Mar 24, 2024 · Databases

How StarRocks Revamped Ctrip’s Ticket Metrics Platform for Lightning‑Fast Queries

Ctrip’s ticket business rebuilt its multi‑engine metrics platform by consolidating ClickHouse, Kylin, and Presto into a single StarRocks database, introducing temporary tables, materialized views, and streamlined ETL, which cut complex query times from minutes to seconds and doubled user traffic.

Data Warehouse · ETL · Metrics Platform

0 likes · 13 min read

DataFunSummit

Mar 24, 2024 · Big Data

Design and Implementation of a User Data Warehouse and Profiling System at 58.com

This article details the design and implementation of a user data warehouse at 58.com, covering data warehouse fundamentals, user profiling concepts, multi‑layer architecture, modeling methods, ETL migration from Hive to Spark, data quality assurance, and the resulting achievements.

Big Data · Data Warehouse · ETL

0 likes · 20 min read

DataFunTalk

Mar 1, 2024 · Big Data

Understanding Data Fabric and Data Virtualization: Concepts, Practices, and Real‑World Case Study

This article explains the fundamentals of Data Fabric and data virtualization, highlights the limitations of traditional centralized data warehouses, describes the three‑layer virtualization architecture, and presents a detailed securities‑industry case study that demonstrates cost, efficiency, and compliance benefits.

Big Data · Data Fabric · Data Integration

0 likes · 17 min read

Sohu Tech Products

Jan 31, 2024 · Operations

Logstash Grok Filter: Complete Guide for Log Data Parsing and ETL

This guide explains Logstash’s Grok filter plugin, detailing how its roughly 120 built‑in patterns, together with custom ones, turn unstructured logs (such as Apache, MySQL, or HiveServer2 logs) into structured fields through named regex captures, with support for type conversion, cleaning, debugging, and efficient ETL for analysis and monitoring.

ETL · Grok filter · Logstash

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Dec 28, 2023 · Big Data

How LLMs Can Revolutionize Data Warehouse ETL: From Push‑Pull to Stable Queries

This article explores the challenges of traditional data‑warehouse ETL, compares push and pull models, and presents an LLM‑driven architecture that generates both on‑demand SQL queries and streaming ETL code with automatic error‑feedback loops, dramatically improving cost, accuracy, and maintainability.

Big Data · Data Warehouse · ETL

0 likes · 16 min read

DataFunTalk

Dec 5, 2023 · Big Data

Design and Practice of Xiaomi’s One‑Stop Data Production Platform

This article presents a comprehensive overview of Xiaomi’s data production platform, detailing the full data lifecycle, the technical‑driven product design methodology, the platform’s architecture and core capabilities, as well as real‑world case studies and a Q&A session that illustrate how the system improves data collection, storage, processing, and usage across the organization.

Data Engineering · Data Lifecycle · Data Platform

0 likes · 17 min read

DataFunSummit

Dec 1, 2023 · Big Data

Bilibili's Event Tracking Standardization: Practices, Challenges, and Future Directions

This article details Bilibili's comprehensive approach to standardizing event tracking (埋点), covering its definition, data pipeline, common business issues, metadata‑driven management strategies, efficiency gains, and future prospects for unified real‑time and batch processing.

Analytics · Bilibili · Data Standardization

0 likes · 21 min read

HomeTech

Nov 28, 2023 · Big Data

Evolution of Payment Reconciliation Architecture: From MySQL to StarRocks with Flink and DataX

This article describes how a payment reconciliation system progressed from a simple MySQL‑based solution through a Hive‑based big‑data approach to a high‑performance StarRocks architecture, detailing the integration of Flink, DataX, and SQL adaptations that dramatically improved query speed and operational efficiency while reducing cost.

Big Data · ETL · Flink

0 likes · 8 min read

Alibaba Cloud Native

Nov 23, 2023 · Cloud Native

How CDC + Serverless Functions Enable Real‑Time ETL in Cloud Native Architectures

This article explains how Alibaba Cloud's Serverless Function Compute combined with Database Change Data Capture (CDC) creates a complete, real‑time ETL pipeline, detailing the ETL model, DTS integration, architecture components, event‑driven processing, and practical use cases such as OLTP‑to‑OLAP data flow.

Alibaba Cloud · CDC · Data Integration

0 likes · 10 min read

dbaplus Community

Nov 8, 2023 · Big Data

Choosing Between Data Warehouse, Data Lake, and Lakehouse: When to Use Each

This article compares traditional data warehouses, modern data lakes, and emerging lakehouse architectures, explaining their design patterns, advantages, disadvantages, and suitable use cases, while detailing implementation considerations such as schema design, ETL/ELT processes, table formats like Delta, Iceberg, and Hudi, and factors influencing platform selection.

Apache Spark · Data Lake · Data Warehouse

0 likes · 20 min read

Code Ape Tech Column

Oct 24, 2023 · Big Data

Synchronizing MySQL Data to Elasticsearch Using Logstash

This tutorial explains how to set up the environment, configure Elasticsearch and Logstash, create the necessary MySQL tables, and use a Logstash pipeline to continuously sync MySQL records into an Elasticsearch index, while also covering common pitfalls and troubleshooting steps.

Data synchronization · ETL · Elasticsearch

0 likes · 12 min read

Rare Earth Juejin Tech Community

Oct 14, 2023 · Backend Development

Cache Warm-up Strategies and Lessons Learned from a Production Incident

The article recounts a developer's painful experience with a Redis cache that had not been warmed up and caused severe latency spikes, then outlines practical cache warm‑up techniques such as gray‑release traffic, database scanning, and ETL‑driven data pipelines to prevent performance degradation and cache avalanches.

Cache · ETL · Redis

0 likes · 7 min read

dbaplus Community

Oct 14, 2023 · Big Data

What Is a Data Warehouse? From Basics to Modern Practices

This article explains what a data warehouse is, contrasts it with traditional databases, outlines the evolution from classic to internet‑scale warehouses, details modeling approaches and layered architectures, discusses KPI dictionaries, date dimensions, naming standards, data governance, incremental loading techniques, and upstream/downstream coordination.

Big Data · Data Governance · ETL

0 likes · 25 min read

DaTaobao Tech

Oct 11, 2023 · Big Data

Fundamental Data Skills and Complex Query Techniques in MaxCompute

The article teaches developers essential MaxCompute data‑processing skills—from creating and naming tables, handling strings and dates, and writing basic SELECTs, joins, and aggregations, to employing advanced techniques such as temporary tables, CTEs, partitioning, and map‑join hints for efficient complex queries.

Data Engineering · ETL · MaxCompute

0 likes · 15 min read

Architects Research Society

Sep 26, 2023 · Big Data

From a Single Data Lake to a Decentralized Data Mesh: A Step‑by‑Step Migration Guide

This article explains why traditional centralized data lakes hinder modern software development, introduces the data‑mesh concept as a decentralized alternative, and walks through an e‑commerce microservice example with concrete steps, data‑API designs, and migration tactics to transition from a monolithic lake to a distributed data mesh.

Data Lake · Data Mesh · Data Platform

0 likes · 22 min read

DataFunTalk

Sep 3, 2023 · Big Data

Evolution of OLAP at Xingyun Retail Credit Using Apache Doris

This article details how Xingyun Retail Credit transitioned from traditional data warehouses to an Apache Doris‑based OLAP solution, covering data demand generation, OLAP engine selection challenges, multi‑stage implementation, performance optimizations, data‑warehouse construction, real‑world use cases, and future roadmap.

Apache Doris · Big Data · Data Warehouse

0 likes · 16 min read

Weimob Technology Center

Aug 25, 2023 · Fundamentals

How to Build a Scalable Data Warehouse for the New WOS System

This article outlines the end‑to‑end process of designing, building, and governing a data‑warehouse model for the new commercial WOS system, covering business research, data‑domain division, multi‑layer architecture, modeling methods, practical case studies, governance challenges, and improvement strategies.

ETL · Governance · modeling

0 likes · 27 min read

Java Backend Technology

Aug 19, 2023 · Big Data

Top ETL Tools Compared: Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, Canal

This guide reviews the most popular ETL and data integration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, and typical use cases to help you choose the right solution for data migration and synchronization.

Big Data · CDC · Data Integration

0 likes · 13 min read

21CTO

Aug 16, 2023 · Big Data

6 Must-Have Snowflake Tools to Supercharge Your Data Workflow

This guide reviews six popular Snowflake‑compatible tools—covering data preparation, visualization, integration/ETL, business intelligence, and governance—that can dramatically boost productivity for data professionals.

Business Intelligence · Data Governance · Data Visualization

0 likes · 11 min read

Qunar Tech Salon

Aug 3, 2023 · Big Data

Optimizing Hotel List Page Data Warehouse Flow at Qunar.com: A Technical Case Study

This article presents a comprehensive case study of how Qunar.com’s hotel data warehouse team identified performance bottlenecks in the L‑page traffic table, applied a series of "split, lift, parallel, delay" strategies, and achieved significant reductions in processing time, storage usage, and query latency.

Data Warehouse · ETL

0 likes · 21 min read

ByteDance Data Platform

Jul 5, 2023 · Cloud Native

How to Seamlessly Integrate ByteHouse Cloud Data Warehouse with Apache Airflow

This guide explains how to combine ByteHouse's cloud‑native data warehouse with Apache Airflow to build scalable, automated, and easy‑to‑manage data pipelines, covering business scenarios, data flow, and step‑by‑step installation and DAG creation.

Apache Airflow · ByteHouse · Cloud Data Warehouse

0 likes · 10 min read

DataFunTalk

Jul 4, 2023 · Big Data

Integrating Apache Airflow with ByteHouse: A Step‑by‑Step Guide

This guide explains how to integrate Apache Airflow with ByteHouse, highlighting scalability, automated workflow management, and simple deployment, and provides a step‑by‑step tutorial—including prerequisites, installation, configuration, DAG creation, and execution commands—to build a robust data pipeline for analytics and machine learning.

Apache Airflow · ByteHouse · Cloud Data Warehouse

0 likes · 10 min read

dbaplus Community

Jun 26, 2023 · Databases

Migrate MySQL 8.0 to MariaDB or ClickHouse with the binlog_parse_sql Toolkit

This guide explains how to use the open‑source binlog_parse_sql tool to parse MySQL 8.0 binary logs, generate SQL statements, and seamlessly migrate data to MariaDB or ClickHouse, covering installation, execution modes, troubleshooting, and practical ETL scenarios.

Binlog · ClickHouse · Data Migration

0 likes · 8 min read

Architects Research Society

Jun 24, 2023 · Databases

Understanding OLTP and OLAP: Differences, Use Cases, and ETL Integration

The article explains the fundamental differences between OLTP (online transaction processing) and OLAP (online analytical processing), describes how ETL bridges the two, and provides a detailed side‑by‑side comparison of their characteristics, purposes, and design considerations.

Data Warehousing · ETL · OLAP

0 likes · 9 min read

Ctrip Technology

Jun 15, 2023 · Databases

Rebuilding Ctrip Train Ticket Metrics Platform with StarRocks: Architecture, Data Synchronization, and Performance Gains

The article details how Ctrip's train ticket business revamped its multi‑engine OLAP metrics platform by consolidating to the StarRocks MPP database, describing the new architecture, query workflow, data synchronization strategies, practical lessons, and the resulting dramatic improvement in query latency and reliability.

ETL · Metrics Platform · SQL

0 likes · 15 min read

DataFunSummit

Jun 12, 2023 · Big Data

From Data Integration to the Modern Data Stack: Concepts, Tools, and Practices

This article explains data integration fundamentals, compares data integration tools such as Stitch, Fivetran, and Airbyte, describes the concepts of data warehouses and data lakes, outlines ETL vs ELT processes, and explores building modern data stacks with Flink CDC and cloud services.

Big Data · Data Integration · ELT

0 likes · 17 min read

Data Thinking Notes

Jun 4, 2023 · Big Data

How Distributed Lakehouse Architecture Solves Data Swamp Challenges

This article examines the explosion of heterogeneous data sources, the limitations of traditional data lakes and warehouses, and proposes a distributed lakehouse architecture that integrates advanced management layers to improve data governance, reliability, and support both SQL and advanced analytics workloads.

Data Governance · Data Lake · Data Warehouse

0 likes · 29 min read

Data Thinking Notes

May 31, 2023 · Big Data

Why Data Lineage Is Essential: From Concepts to Practical Implementation

This article explains what data lineage is, its components, why it matters for data quality, security, and operational efficiency, and provides a comprehensive implementation guide covering open‑source tools, commercial platforms, custom builds, graph‑database modeling, automatic and manual lineage capture, visualization, analytics, and evaluation metrics.

Data Governance · ETL · data lineage

0 likes · 18 min read

DataFunTalk

May 14, 2023 · Databases

Design and Implementation of Materialized Views in YouShu BI for Performance Optimization

This article presents a comprehensive overview of YouShu BI's materialized view product, detailing performance pain points of traditional BI, the fundamentals and architecture of database materialized views, the product's design, ETL generation, query rewriting, scheduling, intelligent recommendation, and future development directions.

BI · ClickHouse · ETL

0 likes · 15 min read

MaGe Linux Operations

Apr 28, 2023 · Big Data

How to Sync 50 Million Rows Efficiently with Alibaba’s DataX

This guide explains why traditional mysqldump and file‑based methods fail for massive cross‑database sync, introduces Alibaba’s open‑source DataX middleware, details its framework and plugin architecture, walks through installation on Linux, shows how to configure MySQL source and target, and demonstrates both full and incremental data synchronization with practical JSON job examples.

DataX · ETL · MySQL

0 likes · 14 min read

ITPUB

Apr 25, 2023 · Big Data

Top 8 Open‑Source ETL Tools for Data Migration and Integration

This article reviews eight widely used ETL and data‑migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, supported data sources, and typical usage scenarios to help practitioners choose the right solution.

Big Data · Data Integration · Data Migration

0 likes · 13 min read

Data Thinking Notes

Apr 9, 2023 · Big Data

Why Data Quality Is the Hidden Driver of Big Data Success

In the big‑data era, high‑quality data are essential for reliable analytics, and this article explains data‑quality concepts, key dimensions, analysis methods for missing values, outliers, inconsistencies and duplicates, as well as practical management practices to ensure data assets become a competitive advantage.

Big Data · Data Governance · Data Management

0 likes · 15 min read

Java High-Performance Architecture

Mar 29, 2023 · Backend Development

Boost Data Sync Speed 8‑10×: Integrating Alibaba DataX into Spring Boot

This article explains how to replace a slow Kettle‑based ETL process with Alibaba DataX, covering environment setup, compilation, Maven integration, Java invocation, and performance results that show a ten‑fold speed increase for syncing over a million records.

DataX · ETL · Java

0 likes · 6 min read

macrozheng

Mar 27, 2023 · Big Data

Top 8 Open-Source ETL Tools for Efficient Data Migration

This guide reviews eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, and use cases to help engineers choose the right solution for reliable data integration.

Big Data · Data Integration · Data Migration

0 likes · 14 min read

Su San Talks Tech

Mar 24, 2023 · Big Data

Top 8 Open-Source ETL Tools You Should Know for Efficient Data Migration

Explore a comprehensive overview of eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their features, architectures, and use cases to help you choose the right solution for efficient data integration.

Big Data · Data Integration · Data Migration

0 likes · 13 min read

Architecture Digest

Mar 22, 2023 · Big Data

Performance Platform: Accelerating Data Production and Consumption

This article details how the Performance Platform at Baidu speeds up data production and consumption across the company's R&D pipelines by introducing five optimization paths, 18 concrete methods, service tiering, compliance measures, and self‑service analytics for both real‑time memory tables and offline disk tables.

Data Engineering · ETL · Self‑service analytics

0 likes · 13 min read
