Tagged articles
297 articles
Page 1 of 3
Architect's Guide
May 9, 2026 · Databases

Alibaba’s Open‑Source DataX: Fast, Easy Offline Data Synchronization

This article introduces Alibaba’s open‑source DataX tool, explains its framework‑plugin architecture for heterogeneous database sync, walks through Linux installation, job configuration, full‑ and incremental MySQL synchronization, and shares performance results and practical tips.

DataX · ETL · Incremental Sync
0 likes · 15 min read
Big Data Tech Team
Jan 26, 2026 · Big Data

Master DWD, DWS, and Wide‑Table Modeling for Scalable Data Warehouses

This guide explains the DWD (detail) and DWS (summary) layered modeling approach combined with wide‑table driving, covering model positioning, design principles, concrete schema examples, implementation techniques, performance tips, and common pitfalls to help build clean, reusable, high‑performance enterprise data warehouses.

DWD · DWS · ETL
0 likes · 9 min read
Big Data Tech Team
Jan 15, 2026 · Big Data

Mastering Data Warehousing: Core Concepts, Tools, and Future Trends

This article outlines a comprehensive roadmap for data warehousing, covering fundamental concepts, essential big‑data tools, practical implementation steps, advanced architectural topics, and emerging trends such as cloud‑native warehouses and machine‑learning integration, helping readers build a solid knowledge base.

ETL · OLAP · cloud data warehouse
0 likes · 9 min read
Big Data Tech Team
Jan 12, 2026 · Fundamentals

Why Wide Tables Are Essential in DWS Layer: 10 Real-World Modeling Scenarios

This article explains the purpose of the DWS (Data Warehouse Service) layer, why wide‑table modeling is crucial for performance and service‑oriented interfaces, and provides ten practical wide‑table designs with core field definitions, CREATE TABLE statements, and sample INSERT queries for common business domains such as products, users, orders, regions, channels, suppliers, services, finance, logistics, and data quality monitoring.

Analytics · ETL · sql
0 likes · 34 min read
Big Data Tech Team
Jan 12, 2026 · Big Data

Avoid the 5 Fatal DWS Design Traps and Build Scalable Data Warehouses

This article dissects the five most common pitfalls when transitioning from DWD to DWS aggregation tables—such as chimney‑style designs, over‑wide tables, grain mismatches, missing drill‑down keys, and performance neglect—and offers concrete, production‑ready solutions to create reusable, efficient, and cost‑effective data‑warehouse layers.

DWS Design · ETL · Performance Optimization
0 likes · 9 min read
Java Architect Handbook
Dec 15, 2025 · Industry Insights

How DBSyncer Simplifies Multi‑Source Data Synchronization Across Databases

The article introduces the open‑source DBSyncer middleware that enables full‑stack data synchronization across MySQL, Oracle, SQL Server, PostgreSQL, Elasticsearch and Kafka, outlines its visual composition, full‑ and incremental sync, real‑time monitoring, and provides step‑by‑step installation instructions while also mentioning related Java learning projects.

DBSyncer · ETL · Open-source
0 likes · 6 min read
AI Insight Log
Dec 8, 2025 · Artificial Intelligence

How to Teach Claude Any Framework in 20 Minutes with Skill Seekers

This article explains how the open‑source Skill Seekers tool automates the extraction, cleaning, structuring, and packaging of documentation, code repositories, and PDFs into Claude‑compatible Skills, enabling rapid onboarding of obscure frameworks while highlighting conflict detection, MCP integration, and practical usage tips.

AI tooling · Claude · ETL
0 likes · 9 min read
Top Architect
Dec 1, 2025 · Big Data

Master DataX: Fast MySQL‑to‑MySQL Data Synchronization and Incremental Updates

This guide walks you through installing JDK, Python and DataX on Linux, configuring MySQL sources, creating the necessary tables and stored procedures, and using DataX's JSON job definitions to perform both full‑load and incremental data synchronization between two MySQL instances, complete with performance metrics and troubleshooting tips.

DataX · ETL · Linux
0 likes · 16 min read
Alibaba Cloud Developer
Nov 20, 2025 · Big Data

Mastering Large‑Scale Data Migration: Challenges, Strategies and Real‑World Solutions

This article explains why data migration is the essential first step for cloud modernization, outlines the technical challenges of moving terabytes to petabytes, compares physical and logical migration methods, and presents practical solutions and real‑world case studies across Hive, cloud warehouses, lake‑house formats and analytic databases.

Big Data · Data Migration · ETL
0 likes · 56 min read
Alibaba Cloud Developer
Nov 7, 2025 · Big Data

Unlock Enterprise‑Grade Data Pipelines with DMS Airflow: Features, Integration & Code Samples

This article introduces DMS Airflow, an enterprise‑level data workflow orchestration platform built on Apache Airflow, covering its advanced DAG capabilities, deep DMS integration, scheduling, task dependency management, dynamic task generation, resource scaling, security features, and practical code examples for SQL, Spark, DTS, and Notebook tasks.

Airflow · Big Data · DMS
0 likes · 20 min read
Big Data Tech Team
Oct 30, 2025 · Big Data

Mastering the ADS Layer: Design Principles, Modeling, and Real‑Time Data Services

This article provides a comprehensive analysis of the ADS (Application Data Service) layer in a data‑warehouse architecture, covering its core positioning, design goals, modeling strategies, dimension‑optimization techniques, API services, typical challenges, and practical best‑practice recommendations for high‑performance, flexible, and secure data delivery.

ADS layer · ETL · sql
0 likes · 8 min read
Selected Java Interview Questions
Oct 21, 2025 · Big Data

How to Sync Massive MySQL Datasets Efficiently with DataX

This guide walks through the challenges of synchronizing tens of millions of records between heterogeneous MySQL databases, explains why traditional mysqldump or file‑based methods fail, and provides a step‑by‑step tutorial on installing, configuring, and using Alibaba's open‑source DataX tool for both full and incremental data synchronization.

Big Data · DataX · ETL
0 likes · 15 min read
Baidu Geek Talk
Oct 13, 2025 · Big Data

How Baidu Scaled Its Data Warehouse to Handle Billions of PVs and Petabytes

This article details Baidu APP's massive data‑warehouse overhaul, describing the two‑step strategy that stabilized log cleaning, modernized the ETL framework, introduced wide‑table architectures, and implemented tiered storage to dramatically improve processing speed, reliability, and cost efficiency for petabyte‑scale workloads.

Big Data · ETL · Performance Optimization
0 likes · 25 min read
Big Data Tech Team
Sep 15, 2025 · Interview Experience

Top Data Warehouse Engineer Interview Questions & Answers Revealed

This article compiles three interview rounds for a data warehouse engineer role, covering fundamental concepts, practical skills, and leadership thinking with detailed Q&A on ETL, Hadoop components, schema design, data quality, data lake vs. warehouse, ACID properties, cloud solutions, SQL optimization, real‑time processing, security, and team management.

ETL · Hadoop · SQL optimization
0 likes · 12 min read
Big Data Technology & Architecture
Sep 10, 2025 · Databases

When to Use Materialized Views in Production: Benefits, Types, and Pitfalls

This article explains what materialized views are, outlines their advantages such as query acceleration, lightweight ETL, and lake‑warehouse integration, classifies them by sync mode, table count, and refresh strategy, and highlights their limitations and best‑practice recommendations for production use.

Data Warehousing · Database Performance · ETL
0 likes · 6 min read
Big Data Tech Team
Jul 23, 2025 · Big Data

From Beginner to Data Warehouse Architect: A Complete Roadmap

This guide walks you through every essential topic—from data warehouse architecture and layering, through ETL, OLAP, Hadoop, and Flink, to visualization tools, learning paths, recommended resources, and the management skills needed to become a proficient data warehouse architect.

ETL · Flink · Hadoop
0 likes · 9 min read
Architect
Jul 7, 2025 · Big Data

How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

ETL · Parquet · Spark
0 likes · 21 min read
Architect's Guide
Jun 14, 2025 · Big Data

Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling

This article explains the core components of a data warehouse ecosystem, distinguishes fact and dimension tables, outlines synchronization strategies, introduces star, snowflake, and constellation schemas, and details the layered architecture from ODS to data marts for effective big‑data analytics.

ETL · Fact Table · data-warehouse
0 likes · 15 min read
Su San Talks Tech
May 29, 2025 · Big Data

How to Sync Massive MySQL Data with Alibaba DataX – Step‑by‑Step Guide

For a 50‑million‑row project troubled by inaccurate reports and cross‑database operations, this guide explains why mysqldump and simple storage methods fail, introduces Alibaba’s open‑source DataX middleware, and details its architecture, installation, and step‑by‑step configuration for full and incremental MySQL data synchronization.

DataX · ETL · Incremental Sync
0 likes · 14 min read
Zhuanzhuan Tech
May 21, 2025 · Big Data

How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse

This article details the evolution of a fast‑growing e‑commerce finance platform from a microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big‑data warehouse using SparkSQL, layered data models, and optimized scheduling, achieving ten‑fold performance gains and near‑zero failure rates.

Architecture · Big Data · ETL
0 likes · 21 min read
Java Backend Technology
May 21, 2025 · Big Data

Master DataX: Fast Offline Data Sync for MySQL without mysqldump

This guide explains how to use Alibaba's open‑source DataX tool to perform high‑performance offline synchronization between heterogeneous MySQL databases, covering installation, framework design, job configuration, full‑ and incremental sync, and practical command‑line examples.

Big Data · DataX · ETL
0 likes · 15 min read
Java Tech Enthusiast
May 13, 2025 · Big Data

Using Alibaba DataX 3.0 for MySQL Data Synchronization: Installation, Configuration, and Incremental Sync

This article introduces Alibaba DataX 3.0, explains its architecture and role‑based design, walks through Linux installation, JDK setup, MySQL preparation, and provides step‑by‑step examples of full‑load and incremental data synchronization between two MySQL instances using JSON job configurations and command‑line execution.

DataX · ETL · Incremental Sync
0 likes · 14 min read
macrozheng
May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

Big Data · DataX · ETL
0 likes · 15 min read
Top Architect
May 7, 2025 · Big Data

Using DataX for Efficient MySQL Data Synchronization

This article provides a comprehensive guide on using Alibaba's open‑source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full‑ and incremental data transfer, and practical code examples.

Big Data · DataX · ETL
0 likes · 18 min read
ITPUB
Apr 17, 2025 · Databases

Migrate 700TB Over 2Mbps: Scripts, Sneakernet & Practical Steps

When a manager demands a script to move a 700‑terabyte database under a 2 Mbps bandwidth cap, the realistic solution combines physical Sneakernet transfer with a carefully staged export‑transform‑load script that handles field mapping, compression, rate‑limited transport, and fault‑tolerant import.

ETL · java · large data transfer
0 likes · 8 min read
Big Data Tech Team
Apr 16, 2025 · Operations

Mastering Data Warehouse Naming: A Complete Guide to Standards and Processes

This article provides a comprehensive, step‑by‑step guide to data‑warehouse development, covering the full R&D workflow, data modeling layers, data dictionary creation, naming conventions for tables, columns, indexes and ETL jobs, metric standardization, and governance processes to ensure consistent, maintainable data assets across the organization.

ETL · data dictionary · metadata
0 likes · 28 min read
Big Data Tech Team
Mar 17, 2025 · Big Data

How to Design and Review a Data Warehouse Model: A Complete Guide

This document outlines a comprehensive data warehouse model design and review process, covering revision records, project overview, business requirements, conceptual and logical modeling, ETL workflow, exception handling, and acceptance criteria with practical examples and templates.

ETL · Model Design · data modeling
0 likes · 6 min read
Ma Wei Says
Mar 16, 2025 · Databases

Mastering Slowly Changing Dimensions: Which SCD Strategy Fits Your Data Warehouse?

This article explains the concept of Slowly Changing Dimensions (SCD) in data warehouses, compares six common SCD handling methods—including SCD0, SCD1, SCD2, SCD3, combined SCD2+SCD3, and historical tables—and guides you on selecting the most suitable approach for your business needs.

ETL · SCD Types · Slowly Changing Dimension
0 likes · 9 min read
Ma Wei Says
Mar 11, 2025 · Big Data

Mastering DWS Layer Design: Principles, Steps, and Best Practices

This article explains the role of the DWS layer in data warehouses, outlines design principles, step‑by‑step modeling, naming conventions, field design, provides concrete DDL/ETL examples, common pitfalls, and how to build reusable, performant summary tables for analytics.

Big Data · DWS Layer · ETL
0 likes · 15 min read
Ma Wei Says
Feb 26, 2025 · Databases

Understanding Fact Tables: Types, Granularity, and Design Best Practices

This article explains fact tables in data warehousing, covering their definition, granularity, additive classifications, null handling, consistency rules, and the various types such as transaction, snapshot, cumulative, fact‑less, and aggregate tables, along with design trade‑offs and ETL considerations.

BI · ETL · dimensional modeling
0 likes · 17 min read
vivo Internet Technology
Dec 18, 2024 · Big Data

Kafka Streams: Architecture, Configuration, and Monitoring Use Cases

Kafka Streams is a client library that enables low‑latency, fault‑tolerant real‑time processing of Kafka data through configurable topologies, time semantics, and state stores, and the article explains its architecture, essential configurations, monitoring‑focused ETL example, performance tuning, and strategies for handling partition skew.

Big Data · ETL · Stream Topology
0 likes · 25 min read
Big Data Technology & Architecture
Oct 21, 2024 · Big Data

Key New Features of Apache Doris 3.0: Storage‑Compute Separation, Lakehouse Integration, Semi‑Structured Data, ETL Enhancements, Materialized Views, and Java UDTF

Apache Doris 3.0 introduces storage‑compute separation, native lakehouse write‑back, optimized Variant handling for semi‑structured data, stronger ETL transaction support, enhanced multi‑table materialized views, and Java UDTF capabilities, providing developers with more flexible, cost‑effective, and high‑performance analytics solutions.

Apache Doris · ETL · Java UDTF
0 likes · 7 min read
macrozheng
Sep 27, 2024 · Big Data

Master DataX: Efficient Offline Data Sync for Heterogeneous Sources

This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.

Big Data · Data Integration · DataX
0 likes · 15 min read
dbaplus Community
Sep 5, 2024 · Databases

How to Migrate Data from MongoDB to MySQL Using DuckDB

This guide explains how to export MongoDB collections to JSON, load them into DuckDB, generate compatible table schemas, and then transfer the data efficiently into MySQL using DuckDB as an intermediate processing engine.

Data Migration · DuckDB · ETL
0 likes · 6 min read
Alibaba Cloud Developer
Sep 3, 2024 · Big Data

Mastering Data Modeling: From Raw Data to Insightful Warehouses

This article walks through the fundamentals of data modeling, explaining what data is, the DIKW framework, why modeling matters, and detailing the end‑to‑end process from conceptual design through logical and physical layers, including DIM, DWD, DWS, and ADM tables with practical tips and naming conventions.

ETL · data modeling · data-warehouse
0 likes · 11 min read
DataFunTalk
Aug 8, 2024 · Big Data

Building a User Profile Data Warehouse at 58.com: Architecture, Modeling, and Practices

This article details the design and implementation of a user‑profile data warehouse at 58.com, covering data‑warehouse fundamentals, user‑profile tag generation, layered architecture, dimensional modeling choices, ETL migration from Hive to Spark, data‑quality safeguards, and the resulting scale of tables, metrics and tags.

ETL · dimensional modeling · user profiling
0 likes · 20 min read
DataFunTalk
Jul 10, 2024 · Big Data

Apache SeaTunnel: A Next‑Generation Data Integration Platform for ETL/ELT and OLAP

This article introduces Apache SeaTunnel, a modern data integration platform designed for the EtLT era, detailing its architecture, core connector APIs, checkpoint mechanism, model inference, multi‑table synchronization, the high‑performance SeaTunnel Zeta engine, OLAP use cases, community roadmap, and the commercial WhaleTunnel product.

Apache SeaTunnel · Big Data · ELT
0 likes · 22 min read
DaTaobao Tech
Jul 8, 2024 · Big Data

ODPS (MaxCompute) SQL Basics, Data Integration and Hologres Import Guide

This guide provides a comprehensive, beginner‑to‑advanced reference for ODPS (MaxCompute) SQL, covering table creation, DDL/DML commands, query syntax, join hints, MySQL‑to‑ODPS synchronization, one‑click and custom imports into Hologres, and scheduling variables for automated data pipelines.

Data Integration · ETL · Hologres
0 likes · 37 min read
DevOps
Jun 27, 2024 · Big Data

Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration

This article explores agile data engineering, advocating code‑as‑infrastructure practices such as code‑everything, data and code reuse, and ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.

Agile Development · Big Data · Code as Infrastructure
0 likes · 22 min read
dbaplus Community
May 27, 2024 · Backend Development

Why Cache Warm‑up Is Critical and How to Do It Effectively

The article recounts a painful production incident caused by missing cache warm‑up, explains why pre‑loading caches is essential for performance and reliability, and presents practical strategies such as gray‑scale rollout, database scanning, and ETL‑driven cache filling.

Backend Engineering · Cache Warm-up · ETL
0 likes · 8 min read
Big Data Technology & Architecture
May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow · Big Data · Cloud Computing
0 likes · 26 min read
DataFunTalk
May 26, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform for Sparkle Thinking

The article details how Sparkle Thinking built the Athena Data Factory—a comprehensive, self‑service data development and governance platform that integrates data integration, ETL, real‑time processing, monitoring, and analytics, describing its architecture, key technologies, implementation timeline, operational practices, performance gains, and future directions.

Airflow · ETL · Flink
0 likes · 26 min read
DataFunTalk
May 13, 2024 · Big Data

Data Integration Maturity Model: From ETL to EtLT

The article examines the evolution of data integration architectures—from traditional ETL through ELT to the emerging EtLT model—highlighting their advantages, disadvantages, industry trends, maturity stages, and practical guidance for enterprises and professionals navigating modern big‑data pipelines.

Big Data · Data Integration · DataOps
0 likes · 31 min read
DataFunTalk
Mar 1, 2024 · Big Data

Understanding Data Fabric and Data Virtualization: Concepts, Practices, and Real‑World Case Study

This article explains the fundamentals of Data Fabric and data virtualization, highlights the limitations of traditional centralized data warehouses, describes the three‑layer virtualization architecture, and presents a detailed securities‑industry case study that demonstrates cost, efficiency, and compliance benefits.

Big Data · Data Fabric · Data Integration
0 likes · 17 min read
Sohu Tech Products
Jan 31, 2024 · Operations

Logstash Grok Filter: Complete Guide for Log Data Parsing and ETL

This guide explains Logstash’s Grok filter plugin, detailing how its 120 built‑in and custom patterns transform unstructured logs—such as Apache, MySQL, or HiveServer2—into structured fields through named regex captures, supporting type conversion, cleaning, debugging, and efficient ETL for analysis and monitoring.

ETL · Grok filter · Logstash
0 likes · 8 min read
DataFunTalk
Dec 5, 2023 · Big Data

Design and Practice of Xiaomi’s One‑Stop Data Production Platform

This article presents a comprehensive overview of Xiaomi’s data production platform, detailing the full data lifecycle, the technical‑driven product design methodology, the platform’s architecture and core capabilities, as well as real‑world case studies and a Q&A session that illustrate how the system improves data collection, storage, processing, and usage across the organization.

Data Lifecycle · Data Platform · ETL
0 likes · 17 min read
Alibaba Cloud Native
Nov 23, 2023 · Cloud Native

How CDC + Serverless Functions Enable Real‑Time ETL in Cloud Native Architectures

This article explains how Alibaba Cloud's Serverless Function Compute combined with Database Change Data Capture (CDC) creates a complete, real‑time ETL pipeline, detailing the ETL model, DTS integration, architecture components, event‑driven processing, and practical use cases such as OLTP‑to‑OLAP data flow.

Alibaba Cloud · CDC · Data Integration
0 likes · 10 min read
dbaplus Community
Nov 8, 2023 · Big Data

Choosing Between Data Warehouse, Data Lake, and Lakehouse: When to Use Each

This article compares traditional data warehouses, modern data lakes, and emerging lakehouse architectures, explaining their design patterns, advantages, disadvantages, and suitable use cases, while detailing implementation considerations such as schema design, ETL/ELT processes, file formats like Delta, Iceberg, and Hudi, and factors influencing platform selection.

Apache Spark · Data Lake · Delta Lake
0 likes · 20 min read
Code Ape Tech Column
Oct 24, 2023 · Big Data

Synchronizing MySQL Data to Elasticsearch Using Logstash

This tutorial explains how to set up the environment, configure Elasticsearch and Logstash, create the necessary MySQL tables, and use a Logstash pipeline to continuously sync MySQL records into an Elasticsearch index, while also covering common pitfalls and troubleshooting steps.

ETL · Elasticsearch · Linux
0 likes · 12 min read
dbaplus Community
Oct 14, 2023 · Big Data

What Is a Data Warehouse? From Basics to Modern Practices

This article explains what a data warehouse is, contrasts it with traditional databases, outlines the evolution from classic to internet‑scale warehouses, details modeling approaches and layered architectures, discusses KPI dictionaries, date dimensions, naming standards, data governance, incremental loading techniques, and upstream/downstream coordination.

Big Data · Data Governance · ETL
0 likes · 25 min read
DaTaobao Tech
Oct 11, 2023 · Big Data

Fundamental Data Skills and Complex Query Techniques in MaxCompute

The article teaches developers essential MaxCompute data‑processing skills—from creating and naming tables, handling strings and dates, and writing basic SELECTs, joins, and aggregations, to employing advanced techniques such as temporary tables, CTEs, partitioning, and map‑join hints for efficient complex queries.

ETL · MaxCompute · data engineering
0 likes · 15 min read
Architects Research Society
Sep 26, 2023 · Big Data

From a Single Data Lake to a Decentralized Data Mesh: A Step‑by‑Step Migration Guide

This article explains why traditional centralized data lakes hinder modern software development, introduces the data‑mesh concept as a decentralized alternative, and walks through an e‑commerce microservice example with concrete steps, data‑API designs, and migration tactics to transition from a monolithic lake to a distributed data mesh.

Data Lake · Data Mesh · Data Platform
0 likes · 22 min read
DataFunTalk
Sep 3, 2023 · Big Data

Evolution of OLAP at Xingyun Retail Credit Using Apache Doris

This article details how Xingyun Retail Credit transitioned from traditional data warehouses to an Apache Doris‑based OLAP solution, covering data demand generation, OLAP engine selection challenges, multi‑stage implementation, performance optimizations, data‑warehouse construction, real‑world use cases, and future roadmap.

Apache Doris · Big Data · ETL
0 likes · 16 min read
Weimob Technology Center
Aug 25, 2023 · Fundamentals

How to Build a Scalable Data Warehouse for the New WOS System

This article outlines the end‑to‑end process of designing, building, and governing a data‑warehouse model for the new commercial WOS system, covering business research, data‑domain division, multi‑layer architecture, modeling methods, practical case studies, governance challenges, and improvement strategies.

ETL · Modeling · governance
0 likes · 27 min read
21CTO
Aug 16, 2023 · Big Data

6 Must-Have Snowflake Tools to Supercharge Your Data Workflow

This guide reviews six popular Snowflake‑compatible tools—covering data preparation, visualization, integration/ETL, business intelligence, and governance—that can dramatically boost productivity for data professionals.

Business Intelligence · Data Governance · Data visualization
0 likes · 11 min read
DataFunTalk
Jul 4, 2023 · Big Data

Integrating Apache Airflow with ByteHouse: A Step‑by‑Step Guide

This guide explains how to integrate Apache Airflow with ByteHouse, highlighting scalability, automated workflow management, and simple deployment, and provides a step‑by‑step tutorial—including prerequisites, installation, configuration, DAG creation, and execution commands—to build a robust data pipeline for analytics and machine learning.

Apache Airflow · ByteHouse · ETL
0 likes · 10 min read
Ctrip Technology
Jun 15, 2023 · Databases

Rebuilding Ctrip Train Ticket Metrics Platform with StarRocks: Architecture, Data Synchronization, and Performance Gains

The article details how Ctrip's train ticket business revamped its multi‑engine OLAP metrics platform by consolidating to the StarRocks MPP database, describing the new architecture, query workflow, data synchronization strategies, practical lessons, and the resulting dramatic improvement in query latency and reliability.

ETL · Metrics Platform · StarRocks
0 likes · 15 min read
Data Thinking Notes
Jun 4, 2023 · Big Data

How Distributed Lakehouse Architecture Solves Data Swamp Challenges

This article examines the explosion of heterogeneous data sources, the limitations of traditional data lakes and warehouses, and proposes a distributed lakehouse architecture that integrates advanced management layers to improve data governance, reliability, and support both SQL and advanced analytics workloads.

Data Governance · Data Lake · ELT
0 likes · 29 min read
Data Thinking Notes
May 31, 2023 · Big Data

Why Data Lineage Is Essential: From Concepts to Practical Implementation

This article explains what data lineage is, its components, why it matters for data quality, security, and operational efficiency, and provides a comprehensive implementation guide covering open‑source tools, commercial platforms, custom builds, graph‑database modeling, automatic and manual lineage capture, visualization, analytics, and evaluation metrics.

Data Governance · Data Lineage · ETL
0 likes · 18 min read
MaGe Linux Operations
Apr 28, 2023 · Big Data

How to Sync 50 Million Rows Efficiently with Alibaba’s DataX

This guide explains why traditional mysqldump and file‑based methods fail for massive cross‑database sync, introduces Alibaba’s open‑source DataX middleware, details its framework and plugin architecture, walks through installation on Linux, shows how to configure MySQL source and target, and demonstrates both full and incremental data synchronization with practical JSON job examples.

DataX · ETL · Incremental Sync
0 likes · 14 min read
ITPUB
Apr 25, 2023 · Big Data

Top 8 Open‑Source ETL Tools for Data Migration and Integration

This article reviews eight widely used ETL and data‑migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, supported data sources, and typical usage scenarios to help practitioners choose the right solution.

Big Data · Data Integration · Data Migration
0 likes · 13 min read
Data Thinking Notes
Apr 9, 2023 · Big Data

Why Data Quality Is the Hidden Driver of Big Data Success

In the big‑data era, high‑quality data are essential for reliable analytics. This article explains data‑quality concepts and key dimensions, methods for analyzing missing values, outliers, inconsistencies, and duplicates, and management practices that help turn data assets into a competitive advantage.

Big Data · Data Governance · Data Management
0 likes · 15 min read
macrozheng
Mar 27, 2023 · Big Data

Top 8 Open-Source ETL Tools for Efficient Data Migration

This guide reviews eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, and use cases to help engineers choose the right solution for reliable data integration.

Big Data · Data Integration · Data Migration
0 likes · 14 min read
Su San Talks Tech
Mar 24, 2023 · Big Data

Top 8 Open-Source ETL Tools You Should Know for Efficient Data Migration

Explore a comprehensive overview of eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their features, architectures, and use cases to help you choose the right solution for efficient data integration.

Big Data · Data Integration · Data Migration
0 likes · 13 min read
Architecture Digest
Mar 22, 2023 · Big Data

Performance Platform: Accelerating Data Production and Consumption

This article details how the Performance Platform at Baidu speeds up data production and consumption across the company's R&D pipelines by introducing five optimization paths, 18 concrete methods, service tiering, compliance measures, and self‑service analytics for both real‑time memory tables and offline disk tables.

ETL · Self-Service Analytics · data compliance
0 likes · 13 min read
政采云技术
Mar 9, 2023 · Fundamentals

Redesigning Data Warehouse Models: When and How to Use Dimensional Modeling

This article explains the concept of data models, why warehouse models need reconstruction, compares normalized and dimensional modeling approaches, and provides a step‑by‑step guide—including information gathering, design, and implementation—to build efficient, maintainable data warehouse architectures.

Big Data · Database design · ETL
0 likes · 12 min read
政采云技术
Mar 7, 2023 · Databases

Data Warehouse Modeling: Concepts, Methods, and Implementation

This article explains what data models are, why model refactoring is necessary, compares normalized and dimensional data warehouse modeling approaches, and details a three‑step implementation process—including information research, model design, and model deployment—while highlighting best‑practice naming conventions and practical examples.

Big Data · Database design · ETL
0 likes · 14 min read
Architects Research Society
Mar 5, 2023 · Big Data

Best Open‑Source and Commercial ETL Tools: Detailed Comparison

This article introduces the concept of ETL, explains its importance for modern data‑driven applications, and provides a comprehensive comparison of the most popular open‑source and commercial ETL platforms—including their key features, supported data sources, and deployment options—helping readers choose the right tool for their data integration needs.

Big Data · Data Integration · ETL
0 likes · 19 min read
Architects Research Society
Feb 18, 2023 · Big Data

Key Factors to Consider When Building Your Own Data Warehouse

This article examines the essential considerations for selecting a modern data warehouse—including data volume, staffing, scalability, and pricing models—while comparing on‑premise and cloud solutions such as Redshift, BigQuery, and Snowflake to help organizations make informed decisions.

ELT · ETL · Scalability
0 likes · 9 min read