Tagged articles

ETL

304 articles · Page 2 of 4

Mar 9, 2023 · Fundamentals

Redesigning Data Warehouse Models: When and How to Use Dimensional Modeling

This article explains what data models are and why warehouse models need reconstruction, compares normalized and dimensional modeling approaches, and provides a step‑by‑step guide (information gathering, design, and implementation) to building efficient, maintainable data warehouse architectures.

Big Data · Data Warehouse · Database Design

0 likes · 12 min read

Architects Research Society

Mar 7, 2023 · Big Data

Best Open‑Source ETL Tools: Detailed Comparison and Recommendations

This article provides an overview of the most popular ETL tools—both open‑source and commercial—explaining their core features, use cases, and how they simplify data extraction, transformation, and loading for modern data‑driven applications.

Big Data · Data Integration · Data Warehouse

0 likes · 10 min read

Zhengcaiyun Tech (政采云技术)

Mar 7, 2023 · Databases

Data Warehouse Modeling: Concepts, Methods, and Implementation

This article explains what data models are, why model refactoring is necessary, compares normalized and dimensional data warehouse modeling approaches, and details a three‑step implementation process—including information research, model design, and model deployment—while highlighting best‑practice naming conventions and practical examples.

Big Data · Data Warehouse · Database Design

0 likes · 14 min read

Architects Research Society

Mar 5, 2023 · Big Data

Best Open‑Source and Commercial ETL Tools: Detailed Comparison

This article introduces the concept of ETL, explains its importance for modern data‑driven applications, and provides a comprehensive comparison of the most popular open‑source and commercial ETL platforms—including their key features, supported data sources, and deployment options—helping readers choose the right tool for their data integration needs.

Big Data · Data Integration · Data Warehouse

0 likes · 19 min read

Huolala Tech

Mar 2, 2023 · Big Data

Building a Unified Data Warehouse for Moving Services: Boosting Efficiency and Data Quality

This article details the challenges of fragmented ODS data in the moving‑service domain and explains how a dedicated public‑layer data warehouse, with layered architecture and quality monitoring, was designed and implemented to improve data reuse, reduce redundancy, and stabilize downstream analytics.

Big Data · Data Quality · Data Warehouse

0 likes · 15 min read

Architects Research Society

Feb 18, 2023 · Big Data

Key Factors to Consider When Building Your Own Data Warehouse

This article examines the essential considerations for selecting a modern data warehouse—including data volume, staffing, scalability, and pricing models—while comparing on‑premise and cloud solutions such as Redshift, BigQuery, and Snowflake to help organizations make informed decisions.

Cloud · ELT · ETL

0 likes · 9 min read

Java High-Performance Architecture

Feb 3, 2023 · Big Data

How to Use Alibaba DataX for Efficient MySQL Data Synchronization

This guide explains how to install DataX, set up MySQL environments, configure JSON job files, and run both full and incremental data synchronization between heterogeneous databases using DataX's Reader/Writer framework and job scheduling features.

Big Data · Data synchronization · DataX

0 likes · 14 min read

Architecture Digest

Feb 3, 2023 · Databases

Comprehensive Guide to Using DataX for Data Synchronization

This article provides a step‑by‑step tutorial on installing, configuring, and using Alibaba's open‑source DataX tool to perform both full and incremental data synchronization between MySQL databases on Linux, covering framework design, job architecture, JSON job files, and practical command‑line examples.

Data synchronization · DataX · ETL

0 likes · 14 min read

Data Thinking Notes

Jan 31, 2023 · Fundamentals

Mastering Data Governance: From Metadata to ETL in One Guide

This comprehensive guide walks you through the entire data governance ecosystem, covering metadata fundamentals, classification, maturity models, data standards, modeling, integration, lifecycle management, quality assurance, security, and ETL processes, all illustrated with clear diagrams and practical steps.

Data Governance · Data Integration · Data Quality

0 likes · 13 min read

Selected Java Interview Questions

Jan 29, 2023 · Backend Development

Using DataX for MySQL Data Synchronization: Full and Incremental Sync Guide

This article explains how to install DataX, configure MySQL readers and writers, and execute both full and incremental data synchronization jobs between two MySQL instances, providing step‑by‑step commands, JSON job templates, and troubleshooting tips for large‑scale data transfers.

Data synchronization · DataX · ETL

0 likes · 13 min read

Code Ape Tech Column

Jan 28, 2023 · Big Data

Using Alibaba DataX for Offline Data Synchronization and Incremental Sync

This article introduces Alibaba DataX, explains its architecture and role in offline heterogeneous data synchronization, provides step‑by‑step Linux installation, demonstrates full‑load and incremental MySQL‑to‑MySQL sync with JSON job templates, and shares practical tips for handling large data volumes.

Data Integration · DataX · ETL

0 likes · 15 min read

DataFunSummit

Jan 24, 2023 · Databases

Practical Experience of Using Apache Doris for Real‑Time Data Warehouse at Tongcheng Data Science

This article details how Tongcheng Data Science built a real‑time analytical data warehouse using Apache Doris, covering business scenarios, the evolution from a legacy 1.0 architecture to a Doris‑based 2.0 design, deployment topology, development workflow, operational benefits, and future roadmap.

Apache Doris · Big Data · Data Architecture

0 likes · 10 min read

Data Thinking Notes

Jan 12, 2023 · Big Data

Mastering Alibaba DataWorks: Data Warehouse Architecture & Modeling Guide

This comprehensive tutorial walks you through Alibaba DataWorks' data warehouse architecture, covering technical stack selection, three‑layer warehouse design (ODS, CDM, ADS), detailed data modeling with DDL examples, storage strategies, dimension and fact table conventions, and best‑practice hierarchical call standards.

DataModeling · DataWarehouse · DataWorks

0 likes · 27 min read

Ctrip Technology

Jan 12, 2023 · Big Data

Evolution of Ctrip's Log System: From Elasticsearch to ClickHouse and Log 3.0

This article details the evolution of Ctrip's log infrastructure, describing the shift from fragmented departmental logging to a unified Elasticsearch-based platform, the migration to ClickHouse for cost‑effective, high‑performance storage, and the subsequent Log 3.0 redesign that leverages Kubernetes, sharding, and a unified query governance layer to handle petabyte‑scale data.

Big Data · ClickHouse · ETL

0 likes · 16 min read

DataFunTalk

Jan 6, 2023 · Big Data

ZhongAn's Hundred‑Billion‑Scale Data Integration Service: Architecture, Business Support, and Evolution

This article presents the architecture and practical experience of ZhongAn's hundred‑billion‑scale data integration service, covering common integration technologies, business support scenarios for offline and real‑time data, technical challenges, evolution from single‑machine to service‑oriented designs, and future directions using Flink and DataX.

Data Integration · Data Platform · DataX

0 likes · 31 min read

Selected Java Interview Questions

Dec 26, 2022 · Big Data

Using DataX for Efficient MySQL Data Synchronization (Full and Incremental)

This article introduces DataX, an open‑source data integration tool, explains its architecture, and provides step‑by‑step instructions—including environment setup, installation, job JSON creation, and command execution—to achieve fast full‑ and incremental synchronization between MySQL databases.

Data synchronization · DataX · ETL

0 likes · 13 min read

DataFunTalk

Dec 24, 2022 · Big Data

Evolution of Data Platforms: From Early Computers to the Modern Data Stack

This article traces the history of data platforms—from the first general‑purpose computers and traditional BI, through the rise of data warehouses, big‑data frameworks like Hadoop, Spark and Flink, to the modern data‑stack era with cloud‑native architectures, Lambda/Kappa models, and emerging tools—highlighting key technologies, architectural shifts, and future prospects.

Big Data · Cloud Computing · Data Warehouse

0 likes · 26 min read

Data Thinking Notes

Dec 23, 2022 · Big Data

How Real-Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices

This article explains why real‑time data warehouses are becoming essential, outlines their goals, compares them with traditional offline warehouses, and presents detailed design patterns, naming conventions, and case studies from Didi, Kuaishou, Tencent, Youzan and other enterprises, highlighting challenges and solutions for streaming, storage, and query layers.

Big Data Architecture · Data Lake · ETL

0 likes · 49 min read

Ziru Technology

Dec 16, 2022 · Big Data

How to Effectively Test Offline Data Metrics and Data Warehouse Pipelines

This article explains what data metrics are, compares offline metric testing with traditional testing, and provides a comprehensive step‑by‑step guide for testing data collection, ETL, warehouse models, metric calculations, scheduling, security, and API outputs in a Hive‑based data warehouse.

Data Validation · Data Warehouse · ETL

0 likes · 9 min read

Data Thinking Notes

Dec 15, 2022 · Big Data

Why 80% of Data Analysis Time Is Spent on Data Preparation—and How to Master It

Data preparation consumes about 80% of the entire analytics workflow, making data collection, quality assurance, and governance critical pillars—spanning metadata, master data, storage layers like data lakes and warehouses, and rigorous preprocessing—to turn raw information into reliable insights.

Big Data · Data Governance · Data Management

0 likes · 12 min read

Data Thinking Notes

Dec 8, 2022 · Big Data

Why Layer Your Data Warehouse? Unlock Performance, Cost Savings, and Maintainability

This article explains the purpose and benefits of data‑warehouse layering, outlines the four ETL steps, describes each architectural layer from ODS to ADS, presents modeling principles, naming conventions, and includes sample DDL to illustrate how layered design improves data quality, reuse, and operational efficiency.

Big Data · Data Warehouse · ETL

0 likes · 36 min read

Architecture Digest

Dec 1, 2022 · Big Data

Understanding Data Warehouse Architecture and Layered Design

This article explains the concepts, architecture, and layered design of data warehouses, covering data flow, ETL processes, ODS, DWD, DWM, DWS, ADS layers, their characteristics, differences from databases, and the role of data marts in supporting OLAP and decision‑making.

Analytics · Big Data · Data Layers

0 likes · 13 min read

DevOps Cloud Academy

Nov 22, 2022 · Big Data

Components and Key Terminology in Apache Airflow

Apache Airflow’s architecture consists of schedulers, executors, workers, a web server, and a metadata database, enabling scalable workflow orchestration, while essential terminology such as DAGs, operators, and sensors defines how tasks are organized, executed, and monitored within data pipelines.

Apache Airflow · Big Data · DAG

0 likes · 8 min read

Data Thinking Notes

Nov 16, 2022 · Big Data

Why Metadata Management Is Essential for Data Warehouses

This article explains the concept of metadata, its role in data warehouses, why managing metadata is critical for building, maintaining, and scaling data warehouse systems, and outlines practical steps, use cases, and tools for effective metadata management.

Data Governance · Data Warehouse · ETL

0 likes · 15 min read

Tencent Cloud Developer

Nov 7, 2022 · Big Data

Data Engineering and Data Warehouse Design: Principles, Practices, and Governance

The article outlines comprehensive data‑engineering and warehouse‑design principles—covering collection (four Ws and methods like SDK, point‑code, binlog), reporting strategies, source selection, modeling with fact, aggregation, dimension and model tables, quality checks, and governance practices such as standardized SDKs, metric libraries, automated lineage, and cost optimization—to share actionable experience for any organization.

Big Data · Data Engineering · Data Governance

0 likes · 32 min read

Architecture Digest

Nov 5, 2022 · Big Data

Why Data Warehouse Modeling and Layered Architecture Matter

Data warehouse modeling organizes data into layered structures—ODS, DWD, DWS, and ADS—to improve performance, reduce costs, ensure data quality, enable traceability, simplify maintenance, and support both batch and real‑time analytics, while outlining best practices for ETL processes and schema design.

ETL · SQL · layered architecture

0 likes · 37 min read

dbaplus Community

Oct 30, 2022 · Big Data

Why Layered Data Warehouse Modeling Boosts Performance and Cuts Costs

This article explains the importance of layering in data warehouse modeling, outlines the four ETL steps, describes common pitfalls, presents a typical technical stack, and details each warehouse layer (ODS, DWD, DWS, ADS) along with best‑practice naming conventions and implementation tips for big‑data environments.

ETL · Hive · Spark

0 likes · 38 min read

Big Data Technology Architecture

Oct 25, 2022 · Big Data

Rebuilding Shopee's Data Integration Platform with Apache SeaTunnel

Shopee faced fragmented data‑ingestion pipelines, limited source support, and high maintenance overhead, so it evaluated open‑source tools and adopted Apache SeaTunnel to unify batch and streaming data transfers, simplify ETL workflows, and provide a scalable, extensible solution for its multi‑TB daily data processing needs.

Data Integration · ETL · SeaTunnel

0 likes · 17 min read

Big Data Technology & Architecture

Oct 24, 2022 · Big Data

Comprehensive Guide to Big Data Modeling and Data Warehouse Design

This article provides an in‑depth overview of big‑data modeling concepts, covering why data modeling is essential, relational versus analytical systems, common warehouse modeling methodologies, Alibaba's practical implementations, dimension design techniques, and detailed fact‑table design principles for modern data platforms.

ETL · dimensional modeling

0 likes · 50 min read

Xingsheng Youxuan Technology Community

Oct 14, 2022 · Big Data

How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi

This article explains how an e‑commerce company designed and implemented a modern data warehouse—combining batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage—to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.

Data Lake · Data Warehouse · ETL

0 likes · 16 min read

Zhuanzhuan Tech

Aug 24, 2022 · Big Data

Real-Time Data Warehouse Architecture Using Flink: Design, Implementation, and Challenges

This article details the design and implementation of a real‑time data warehouse for an advertising platform, covering business background, challenges, a Lambda‑based architecture, Flink stream processing setup, ETL logic, sink handling, and performance results, concluding with future improvement directions.

ETL · Flink · Lambda architecture

0 likes · 11 min read

DevOps Cloud Academy

Aug 8, 2022 · Operations

Understanding DataOps ETL: Benefits, Automation, and Implementation Guide

This article explains DataOps and its role in modern ETL pipelines, outlines the benefits of DataOps for efficiency and reliability, and provides a detailed roadmap and best‑practice guidelines for planning, implementing, and optimizing DataOps‑driven ETL in cloud‑native environments.

Automation · Cloud · Data Engineering

0 likes · 13 min read

37 Interactive Technology Team

Aug 8, 2022 · Backend Development

Time Management in Programming: Concepts, Practices, and Common Pitfalls

Time management in programming spans human concepts of time, language-specific handling of zones and timestamps, 32‑bit overflow risks, sync versus async processing, log timestamping, business‑level period calculations, and common pitfalls, emphasizing that mastering these nuances prevents bugs, improves performance, and enables reliable analytics.

Backend Development · ClickHouse · ETL

0 likes · 20 min read

Snowball Engineer Team

Aug 5, 2022 · Big Data

Snowball Data Warehouse Modeling and OneData System Implementation

This article outlines Snowball's data warehouse background, compares major modeling approaches such as ER, dimensional, DataVault and Anchor models, describes the current challenges of their dimensional model, and details the OneData methodology—including OneModel, OneID, and OneService—along with its practical implementation, results, and future plans.

Big Data · Data Governance · Data Warehouse

0 likes · 23 min read

Big Data Technology & Architecture

Aug 4, 2022 · Big Data

Comprehensive Guide to DataX: Introduction, Architecture, Usage, and Deployment

This article provides a detailed overview of DataX, covering its purpose, framework design, core architecture, scheduling process, practical examples of MySQL-to-MySQL synchronization, step‑by‑step installation and configuration of DataX‑WEB, UI usage, routing strategies, task types, and advanced task building techniques.

Big Data · Data Integration · DataX

0 likes · 14 min read

Alibaba Cloud Native

Jul 15, 2022 · Cloud Native

Boost Data Analysis and ETL with Alibaba Cloud Function Compute Async Tasks

This guide explains how to use Alibaba Cloud Function Compute asynchronous tasks for large‑scale data analysis, database autonomous services, Kafka‑based ETL pipelines, and high‑performance video transcoding, highlighting architecture migration, cost reduction, deployment steps, and observable serverless task capabilities.

Async Tasks · ETL · Serverless

0 likes · 16 min read

Programmer DD

Jul 14, 2022 · Big Data

Master Fast Data Synchronization with Alibaba DataX: A Step‑by‑Step Guide

This article explains why traditional mysqldump and file‑based methods struggle with massive tables, introduces Alibaba DataX as a high‑performance offline data integration tool, details its architecture, and provides comprehensive installation and configuration steps for full and incremental MySQL‑to‑MySQL synchronization using JSON job files.

Big Data · DataX · ETL

0 likes · 15 min read

Baidu Geek Talk

Jun 15, 2022 · Big Data

Replacing Classic Data Warehouse with a One‑Layer Wide Table Model: Architecture, Benefits, and Challenges

The article proposes replacing the traditional multi‑layered data‑warehouse architecture (ODS‑DWD‑DWS‑ADS) with a single, column‑store wide table per business theme, achieving roughly 30% storage savings and faster queries, while acknowledging higher ETL complexity, back‑tracking costs, and production timing challenges.

Big Data · Data Warehouse · ETL

0 likes · 11 min read

Architect's Tech Stack

May 28, 2022 · Big Data

Data Lake Challenges and the Open SPL Computing Engine

The article examines the inherent trade‑offs of data lakes—maintaining raw data, enabling efficient computation, and keeping costs low—explains why traditional data‑warehouse approaches fall short, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance analytics to overcome these limitations.

Big Data · Data Lake · ETL

0 likes · 12 min read

Architect

May 25, 2022 · Big Data

Metadata Infrastructure and Governance in Bilibili's Data Platform

The article details how Bilibili built a unified metadata infrastructure—including a URN‑based model, collection pipelines, quality assurance, storage in TiDB/ES/HugeGraph, and query services—to support data discovery, lineage, impact analysis, and governance across its growing data platform.

Big Data · Data Catalog · Data Governance

0 likes · 21 min read

DataFunTalk

May 24, 2022 · Big Data

Integrating Apache Flink with Apache Hudi: From Data Warehouse to Data Lake

This article explains how Apache Flink integrates with Apache Hudi to enable real‑time data lake ingestion, covering the evolution from traditional data warehouses to data lakes, Hudi’s core concepts such as timeline and file grouping, copy‑on‑write vs merge‑on‑read modes, and Flink’s CDC‑based ETL pipeline.

Big Data · CDC · Data Lake

0 likes · 18 min read

DataFunTalk

May 19, 2022 · Big Data

SeaTunnel: Distributed Data Integration Platform and Its Application in Traffic Management

This article introduces Apache SeaTunnel, a distributed, high‑performance data integration platform built on Spark and Flink, outlines its technical features, workflow, and plugin ecosystem, and details a concrete traffic‑management use case involving incremental Oracle‑to‑warehouse data synchronization with Spark resources and scheduled shell scripts.

Apache Flink · Apache Spark · Big Data

0 likes · 12 min read

ITPUB

Apr 27, 2022 · Databases

Mastering Data Warehouse Standards: Architecture, Layer Design, and Naming Conventions

This comprehensive guide explains data‑warehouse construction standards, covering model architecture principles, public development rules, layer‑by‑layer design specifications, and systematic naming conventions for tables, dimensions, and metrics to ensure consistency, scalability, and reliable data governance.

Big Data · Data Warehouse · Database Standards

0 likes · 26 min read

Snowball Engineer Team

Apr 21, 2022 · Big Data

Migrating from Hive3 on Tez to Spark SQL: Practices, Challenges, and Performance Evaluation

This article details the Snowball data team's migration from Hive3 on Tez to Spark SQL, covering the motivations, comparative performance tests, encountered compatibility issues, configuration work‑arounds, and future plans for consolidating ETL workloads on Spark.

Big Data · Data Warehouse · ETL

0 likes · 13 min read

DataFunSummit

Apr 4, 2022 · Big Data

User Portrait Scenarios and Technical Implementation Solutions

This article presents a comprehensive overview of user portrait applications across various industries, detailing common scenarios, product functionalities, and a step‑by‑step technical solution that includes data collection, tag management, ETL pipelines, and service architecture for real‑time and offline processing.

ETL · SCRM · Tag Management

0 likes · 18 min read

58 Tech

Mar 29, 2022 · Big Data

Design and Implementation of the 58 Group Penalty Data Center

This article presents the design, architecture, and implementation of a unified penalty data center for 58 Group, detailing the challenges of heterogeneous data sources, the selection of Flink for real‑time ETL, the use of a DSL and LRU aggregation, and the adoption of MVEL for feature recognition to achieve standardized, high‑performance penalty data processing.

Big Data · Data Engineering · ETL

0 likes · 13 min read

Big Data Technology & Architecture

Mar 8, 2022 · Big Data

Flink CDC 2.0: Concepts, Architecture, and Hands‑On Implementation

This article introduces the fundamentals of Flink CDC, explains its application scenarios and underlying technologies, compares query‑based and log‑based CDC, showcases open‑source solutions, and provides detailed Java and SQL examples for building real‑time ETL pipelines with MySQL and Flink.

Apache Flink · Change Data Capture · ETL

0 likes · 24 min read

DataFunTalk

Mar 5, 2022 · Big Data

Designing Cross‑Period Dependencies in Data Scheduling Systems

This article explains how data scheduling systems manage task execution, ETL processes, and cross‑period dependencies by linking task versions, data partitions, and time parameters, and introduces the offset‑and‑cnt model to express dynamic dependencies in big‑data pipelines.

DAG · Data Scheduling · ETL

0 likes · 14 min read

ByteDance Data Platform

Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · Data Warehouse · ETL

0 likes · 19 min read

dbaplus Community

Feb 15, 2022 · Big Data

Mastering Data Warehouse Architecture: Concepts, Modeling Techniques, and Real‑Time Strategies

This comprehensive guide explains data warehouse fundamentals, architecture layers, modeling methods such as dimensional and entity modeling, metadata management, and the transition from offline to real‑time processing with Lambda and Kappa architectures, providing practical steps, best practices, and key terminology for building robust analytical platforms.

Big Data · Data Warehouse · ETL

0 likes · 63 min read

DataFunTalk

Feb 15, 2022 · Big Data

SeaTunnel Multi‑Dimensional Practice at Vipshop: ClickHouse‑Hive Integration and Data Platform Integration

The article details Vipshop's multi‑dimensional use of SeaTunnel to integrate Hive and ClickHouse, describing data import/export challenges, tool selection among DataX, SeaTunnel and Spark, custom configurations, platform integration, and future improvements for high‑performance OLAP pipelines.

Big Data · ClickHouse · Data Integration

0 likes · 15 min read

DataFunTalk

Jan 22, 2022 · Big Data

Alibaba Cloud Data Integration (DataX) Architecture, Design Principles, and Solution Overview

This presentation details Alibaba Cloud DataWorks Data Integration (DataX), covering its architecture, core design principles, offline and real‑time synchronization mechanisms, deployment modes, product positioning, use‑case scenarios, and its role within the broader DataWorks ecosystem, highlighting its capabilities for large‑scale data movement and processing.

Alibaba Cloud · Big Data · Data Integration

0 likes · 19 min read

Big Data Technology & Architecture

Dec 31, 2021 · Big Data

Apache SeaTunnel Joins the Apache Incubator: Overview, Features, and Real‑World Use Cases

SeaTunnel, the China‑originated data‑integration platform built on Spark and Flink, has been accepted into the Apache Incubator, and this article introduces its history, architecture, plugin ecosystem, deployment requirements, and numerous enterprise deployments across batch and streaming big‑data scenarios.

Big Data · Data Integration · ETL

0 likes · 7 min read

Big Data Technology & Architecture

Nov 30, 2021 · Big Data

User Portrait Development Process and Key Deliverables

This article outlines a comprehensive seven‑stage workflow for building enterprise user portraits—from goal interpretation and requirement analysis through tag development, scheduling, service‑layer integration, productization, optimization, and finally deployment and performance tracking—highlighting critical outputs and common challenges at each step.

Data Engineering · ETL · tag development

0 likes · 8 min read

Big Data Technology & Architecture

Nov 28, 2021 · Big Data

Designing Hive Data Warehouse Schemas: Fact & Dimension Tables, Partitioning, Tag Aggregation, and ID Mapping

This article explains how to design Hive data warehouse schemas, covering fact and dimension table modeling, partitioned storage strategies, tag aggregation techniques, and ID‑mapping implementations using Hive SQL and UDFs to support user profiling and analytics.

Big Data · Data Warehouse · ETL

0 likes · 15 min read

dbaplus Community

Nov 27, 2021 · Big Data

How Vipshop’s Hera Data Service Boosts Big Data Access and Performance

The article details the design, architecture, core features, scheduling logic, and performance gains of Vipshop’s self‑built Hera data service, which unifies data‑warehouse access, supports multiple engines, adapts SQL execution, and dramatically improves SLA for both B‑to‑B and B‑to‑C workloads.

Big Data · Data Service · Distributed Computing

0 likes · 22 min read

DataFunTalk

Nov 20, 2021 · Big Data

How to Build a Big Data Platform from Zero to One: Architecture, Components, and Best Practices

This article provides a comprehensive guide to designing and implementing a big‑data platform, covering architecture overview, data ingestion with Flume, storage on HDFS/Hive/HBase, processing engines such as Hive, Spark and Flink, scheduling solutions like Azkaban and Airflow, and the construction of self‑service analytics systems.

Big Data · Data Engineering · ETL

0 likes · 29 min read

Big Data Technology Architecture

Nov 13, 2021 · Big Data

Case Study: Migrating Baicaowei's On‑Premise Hadoop Data Platform to Alibaba Cloud Native Data Lake

This article details Baicaowei's migration from an IDC‑hosted Hadoop cluster to a cloud‑native data lake on Alibaba Cloud, outlining the business drivers, pain points of the legacy platform, architectural goals, design principles, solution selection, implementation steps, and future outlook for the new big‑data ecosystem.

Alibaba Cloud · Big Data · Cloud Migration

0 likes · 16 min read

dbaplus Community

Oct 30, 2021 · Big Data

Building a Scalable Big Data Service Platform: Architecture & Low‑Code Orchestration

This article explains the end‑to‑end big data processing pipeline, outlines the diverse challenges of data interfaces, storage and performance, introduces the unified "Three Ones" approach, details a three‑layer service architecture, and shows how low‑code orchestration can streamline API creation and composition.

ETL · SQL · data services

0 likes · 12 min read

Architects' Tech Alliance

Sep 11, 2021 · Big Data

Understanding Data Warehouses: Definitions, Differences, Architecture, Modeling, and Best Practices

This article explains what a data warehouse is, contrasts it with traditional databases, outlines how to design and build a warehouse—including model selection, subject‑area definition, bus matrix, layering, and data quality—while also covering related concepts such as data middle platforms, data lakes, metadata, and modeling techniques.

Big Data · Data Quality · Data Warehouse

0 likes · 16 min read

ITPUB

Sep 9, 2021 · Big Data

Why Data Lakes Are Essential for Modern Data Platforms: Goals, Architecture, and Governance

This article explains the origins and purpose of data lakes, outlines four key construction goals, details common ingestion methods and storage technologies, and describes essential governance practices such as cataloging, data quality, and regulatory compliance.

Data Governance · Data Lake · ETL

0 likes · 18 min read

Architects Research Society

Sep 7, 2021 · Big Data

Key Factors to Consider When Building Your Own Data Warehouse

This article examines essential considerations such as data volume, personnel support, scalability, and pricing models when selecting a data warehouse solution, comparing on‑premise options with modern cloud services like Redshift, BigQuery, and Snowflake for various workload sizes.

Cloud · Data Warehouse · ELT

0 likes · 11 min read

dbaplus Community

Aug 31, 2021 · Big Data

How Meituan Waimai Built and Evolved Its Massive Data Warehouse from V1 to V3

This article details Meituan Waimai's data warehouse evolution—covering business context, four‑layer architecture, Spark‑based ETL, successive V1.0, V2.0, and V3.0 redesigns, data governance practices, resource‑optimization tactics, security measures, and future road‑maps—illustrated with diagrams and concrete technical choices.

Data Governance · Data Security · ETL

0 likes · 24 min read

The Dominant Programmer

Aug 24, 2021 · Databases

Syncing SQL Server Tables to MySQL with Kettle Open‑Source ETL

Kettle, a pure‑Java open‑source ETL tool (Chinese name "Shuihu"), enables efficient, installation‑free data extraction and synchronization, allowing you to copy tables or views from a SQL Server database directly into a MySQL table using a straightforward workflow.

Data Migration · ETL · Java

0 likes · 2 min read

DataFunSummit

Aug 22, 2021 · Big Data

Evolution and Optimization of Meituan Waimai Offline Data Warehouse: Architecture, ETL, Modeling, Governance, and Future Plans

This article details the historical development, architectural layers, ETL migration to Spark, data modeling standards, governance processes, resource optimization, security measures, and future roadmap of Meituan Waimai's offline data warehouse, illustrating how the team addressed scalability and efficiency challenges.

Big Data · Data Governance · Data Warehouse

0 likes · 21 min read

IT Architects Alliance

Aug 22, 2021 · Big Data

Understanding ETL and Building Enterprise Data Warehouses: Concepts, Architecture, and Step‑by‑Step Techniques

This article explains the fundamentals of ETL, describes data warehouse architectures such as star and snowflake schemas, outlines a five‑step methodology for constructing enterprise‑level data warehouses, and discusses advanced ETL techniques, tools, and algorithm choices for effective data integration and management.

DW Architecture · Data Warehouse · ETL

0 likes · 24 min read

Architect

Aug 21, 2021 · Databases

ETL and Data Warehouse Architecture: Concepts, Five‑Step Process, and Advanced Techniques

This article explains the fundamentals of ETL, describes data‑warehouse architectures such as star and snowflake schemas, outlines a five‑step enterprise‑level ETL workflow, and discusses advanced techniques, tools, and algorithms for building robust data‑warehouse solutions.

DW · Data Integration · Data Warehouse

0 likes · 24 min read

Qunar Tech Salon

Aug 16, 2021 · Operations

Design and Practice of Qunar Data Synchronization Platform: ES Multi‑Version Migration, High Availability, and Data Consistency

The article details Qunar's data synchronization platform that aggregates MySQL data into Elasticsearch, covering its architecture, component choices, ES5‑to‑ES7 migration, hot‑plugging, reindexing, high‑availability design, consistency guarantees, operational optimizations, and future roadmap.

Data synchronization · ETL · Elasticsearch

0 likes · 16 min read

IT Architects Alliance

Aug 14, 2021 · Big Data

An Introduction to Dimensional Modeling in Data Warehousing

This article provides a comprehensive overview of data warehouse concepts, compares classic warehouse models, explains dimensional modeling fundamentals such as fact and dimension tables, demonstrates a practical e‑commerce scenario with schema design and SQL query examples, and discusses real‑world trade‑offs.

Big Data · ETL · SQL

0 likes · 9 min read

IT Architects Alliance

Aug 9, 2021 · Big Data

Data Warehouse Architecture Overview: Layers, Sources, Modeling, Storage, and Management

This article explains the logical layered architecture of modern data warehouses, covering data sources, ODS, DW/DWS layers, collection, storage on HDFS, synchronization tools, dimensional modeling (star, snowflake, constellation), metadata management, and task scheduling and monitoring, highlighting best practices for scalable big‑data solutions.

Data Warehouse · ETL · Metadata

0 likes · 12 min read

ITFLY8 Architecture Home

Aug 3, 2021 · Big Data

How BIGO Scaled Real‑Time Messaging by Migrating from Kafka to Pulsar

BIGO replaced its Kafka‑based message‑flow platform with Apache Pulsar to overcome scaling, stability, and operational‑cost challenges. It relies on Pulsar’s storage‑compute separation, seamless horizontal expansion, low latency, and tight integration with Flink for real‑time ETL and A/B‑test pipelines, and now processes billions of messages daily at half the hardware cost.

Apache Pulsar · ETL · Flink

0 likes · 17 min read

ITFLY8 Architecture Home

Jul 7, 2021 · Big Data

Mastering Data Middle Platforms: From Ingestion to Real‑Time Analytics

This comprehensive guide explains the concepts, architecture, and best practices of data middle platforms, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, and implementation details for building scalable big‑data solutions.

Data Governance · Data Platform · ETL

0 likes · 23 min read

Didi Tech

Jun 22, 2021 · Big Data

MySQL Binlog Real‑time Collection and Hive Ingestion at DiDi: Architecture and Practices

DiDi’s real‑time MySQL‑to‑Hive pipeline captures row‑mode binlog with a custom Canal component, converts it to JSON, streams it via Kafka to HDFS, restores it into Hive tables, and uses Dquality for integrity, achieving millisecond latency for over 19,000 daily sync tasks handling roughly 50 TB of data.

Big Data · Binlog · Canal

0 likes · 13 min read

DataFunTalk

Jun 11, 2021 · Big Data

Comprehensive Guide to Fast and Stable Hive‑to‑HBase Data Transfer Using Bulkload, MapReduce, and Spark

This article explains how to efficiently move large volumes of data from Hive to HBase by leveraging HBase's bulkload mechanism, detailing the original MapReduce workflow, its performance bottlenecks, and a rewritten Spark‑based solution that simplifies ETL, improves partitioning, and achieves several‑fold speedup.

Big Data · ETL · HBase

0 likes · 17 min read

Big Data Technology & Architecture

Jun 6, 2021 · Big Data

Understanding Data Warehouses: Concepts, Architecture, Modeling, and Governance

This article provides a comprehensive overview of data warehouses, explaining their purpose, differences from databases, OLTP vs OLAP, traditional versus internet data warehouse models, layered architecture, modeling theories, metric dictionaries, date dimensions, naming conventions, data governance, and incremental synchronization techniques with practical SQL examples.

Big Data · Data Governance · ETL

0 likes · 24 min read

dbaplus Community

Jun 2, 2021 · Databases

How to Build a Mature Data Warehouse: 7 Essential Steps and Best Practices

This article explains why data warehouses are critical for decision‑making, outlines the challenges of immature warehouses, and provides a step‑by‑step framework—including goal setting, technology selection, problem identification, domain modeling, layer design, modeling principles, and governance standards—to help teams build a robust, maintainable data warehouse.

Big Data · Data Architecture · Data Warehouse

0 likes · 22 min read

IT Architects Alliance

May 30, 2021 · Big Data

NetEase Game Streaming ETL Architecture and Practices Based on Flink

This article presents NetEase Game's Flink‑based streaming ETL system, detailing business background, log classifications, specialized and generic ETL services, Python UDF integration, runtime optimizations, HDFS write tuning, SLA metrics, fault‑tolerance mechanisms, and future roadmap for unified data lakes and PyFlink support.

Big Data · Data Integration · ETL

0 likes · 19 min read

IT Architects Alliance

May 25, 2021 · Big Data

How Modern Data Middle Platforms Power Real‑Time and Offline Analytics

This article provides a comprehensive technical overview of data middle platforms, covering data aggregation, offline and real‑time development, smart operations, data asset management, governance, service layers, platform implementations, warehouse layering, and key differences between offline and real‑time data warehouses.

Big Data · Data Governance · Data Platform

0 likes · 26 min read

Programmer DD

May 22, 2021 · Big Data

What Is a Data Lake? Origins, Architecture, and How It Powers Modern Big Data

This article explains the concept of a data lake—its origin in 2011, how it differs from traditional databases and data warehouses, its core characteristics such as raw data storage, on‑demand computing, and schema‑on‑read, as well as its advantages, challenges, architectural components, and future outlook within the big‑data ecosystem.

Big Data · Data Architecture · Data Governance

0 likes · 20 min read

JD Retail Technology

May 13, 2021 · Big Data

Evolution and Architecture of JD.com Self‑Operated Rebate Platform

The article details the development, challenges, and redesign of JD.com’s self‑operated rebate system, describing its early monolithic architecture, data‑intensive processing pipeline, migration to a modular, high‑availability platform built on Spark, Hive, and Elasticsearch, and the resulting performance and operational improvements.

Big Data · ETL · High Availability

0 likes · 16 min read

Architecture Digest

May 7, 2021 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Practices

This article provides a detailed introduction to data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, monitoring, and deployment patterns, illustrating how enterprises build unified data ecosystems across various industries.

Big Data · Data Governance · Data Platform

0 likes · 25 min read

ITFLY8 Architecture Home

May 3, 2021 · Big Data

Unlocking the Power of Data Middle Platforms: Key Concepts and Best Practices

This article provides a comprehensive overview of data middle platforms, covering data aggregation, collection tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, data governance, service layers, monitoring, and the architectural differences between offline and real‑time data warehouses.

Data Warehouse · ETL · Real-time Processing

0 likes · 26 min read

Big Data Technology & Architecture

Mar 31, 2021 · Big Data

Practical Tips for Using Impala with Kudu for Real-Time Data Processing

This article provides step‑by‑step guidance on importing data into Kudu via Sqoop and Impala, performance tuning recommendations for Impala‑Kudu workloads, best practices for queries, data deletion, comparisons with Parquet, and a brief overview of StreamSets as an ETL tool.

Data Warehouse · ETL · Impala

0 likes · 6 min read

Aikesheng Open Source Community

Mar 25, 2021 · Databases

Using MySQL LOAD DATA for Importing Text and Fixed‑Length Files

This article explains how to use MySQL's LOAD DATA command to import CSV and fixed‑length text files, covering basic parameters, sample data and table structures, four practical scenarios, and best‑practice tips for handling mismatched schemas and large files.

ETL · Fixed-length · MySQL

0 likes · 12 min read

DataFunTalk

Mar 24, 2021 · Big Data

Practical Experience of Using DorisDB for Real-Time and Offline Analytics in KuJiaLe's Big Data Platform

This article details how KuJiaLe's big data team replaced their legacy ADB and Presto clusters with a DorisDB MPP database, achieving sub‑second query latency, unified real‑time and offline analytics, simplified ETL pipelines, and significant cost savings while supporting billion‑row tables and high‑QPS workloads.

Big Data · DorisDB · ETL

0 likes · 9 min read

Big Data Technology Architecture

Mar 23, 2021 · Big Data

Integrating Apache Kylin with MLSQL for In‑Place ETL and Analytics

The article explains how Apache Kylin and MLSQL complement each other, detailing Kylin's OLAP strengths, MLSQL's data‑processing and AI capabilities, and demonstrates a low‑code integration that enables users to perform ETL directly within Kylin’s interface while outlining future deep‑link scenarios.

Data Integration · ETL · Kylin

0 likes · 10 min read

DataFunTalk

Mar 11, 2021 · Big Data

Data Warehouse Modeling Architecture and Methodology: Building Robust, High‑Quality Data Models

This article explains the importance of data‑warehouse modeling, outlines a layered architecture (DWD, DWS, DIM, ADS), describes a systematic modeling process, and presents design principles and practical examples to achieve high‑quality, stable, and efficient data models for large‑scale analytics.

DWD · Data Architecture · Data Warehouse

0 likes · 11 min read

DataFunTalk

Mar 7, 2021 · Big Data

Building Stream‑Batch Integrated ETL with Flink SQL: Data Warehouse and Data Integration

This article explains how Flink SQL can be used to construct a unified stream‑batch ETL pipeline for data warehouses and data lakes, covering data integration, CDC support, streaming writes to Hive and Iceberg, and various join techniques such as regular, interval, and temporal joins.

CDC · Data Integration · ETL

0 likes · 20 min read

Big Data Technology & Architecture

Mar 2, 2021 · Big Data

An Introduction to Kafka Connect: Architecture, Components, and Hands‑On Setup

This article introduces Kafka Connect, explaining its purpose as a scalable and reliable tool for moving data between Apache Kafka and external systems, detailing its core concepts, architecture, deployment modes, configuration files, and a step‑by‑step example that streams data from a file source to a file sink.

Data Integration · ETL · Streaming

0 likes · 12 min read

dbaplus Community

Feb 23, 2021 · Big Data

How NetEase Game Teams Built a Scalable Flink‑Based Streaming ETL Platform

This article explains how NetEase games collect heterogeneous logs, design a Flink‑driven streaming ETL pipeline, handle schema‑free sources, implement Python UDFs with Jython, optimize HDFS writes, manage real‑time and offline warehouses, and share practical tuning and fault‑tolerance techniques.

ETL · Flink · Hive

0 likes · 22 min read

Architects' Tech Alliance

Feb 21, 2021 · Big Data

Data Warehouse and Data Lake: Concepts, Architecture, and Comparison

This article provides an extensive overview of data warehouse and data lake concepts, their architectures, differences, components, and implementation considerations, covering topics such as OLTP/OLAP, ETL processes, data quality, cloud solutions, and the role of data platforms in modern enterprises.

Cloud Computing · Data Architecture · Data Lake

0 likes · 92 min read

DataFunTalk

Feb 10, 2021 · Big Data

AirWorks Data Intelligence Platform: Architecture, Cloud‑Native Ingestion, and Financial Asset Management Use Case

The article presents Entropy Simplify's AirWorks data intelligence platform, detailing its three‑layer architecture, cloud‑native multi‑source data ingestion system, low‑code ETL capabilities, technical features such as multi‑engine cooperation and data‑skew handling, and a financial asset‑management case study.

Big Data · ETL · data ingestion

0 likes · 16 min read

ITFLY8 Architecture Home

Feb 4, 2021 · Big Data

Unlocking Data Middle Platform: From Ingestion to Real‑Time Analytics

This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, recommendation dependencies, data permissions, layered data architecture (ODS, DW, DWD, DWS, TDM, ADS), asset management, governance, service APIs, query and analysis services, as well as monitoring, alerting, and operational best practices for building robust big‑data solutions.

Big Data · Data Warehouse · ETL

0 likes · 25 min read

dbaplus Community

Jan 18, 2021 · Backend Development

How to Build a Scalable Elasticsearch Sync Framework for Real-Time Business Search

This article explains Ctrip's design and implementation of a flexible Elasticsearch data‑synchronization framework that handles full, incremental, ID‑based, and time‑based syncing from multiple sources, addressing the shortcomings of existing tools and simplifying complex data assembly for business search.

ETL · Elasticsearch · Indexing

0 likes · 10 min read

NetEase Smart Enterprise Tech+

Jan 14, 2021 · Big Data

How Yidun Achieves Real-Time, High-Performance Public-Opinion Data Cleaning with Groovy and JVM

Yidun’s public-opinion monitoring platform transforms massive raw web data into a unified format by separating dynamic Groovy-script-driven cleaning from static processing, achieving real-time source integration, high throughput, scalability, and high availability while addressing format diversity, team coordination, and performance-flexibility trade-offs.

Big Data · ETL · Groovy

0 likes · 5 min read

58 Tech

Jan 4, 2021 · Big Data

Building a Real‑Time Data Warehouse with Flink: Architecture, Implementation and Lessons Learned

This article describes how a fast‑growing company built a layered real‑time data warehouse on Flink, detailing the evolution from a simple 1.0 pipeline to a 2.0 architecture with ODS, DWD and ADS layers, dimension joins, exactly‑once sinks, HDFS partitioning, monitoring, and future improvements.

Big Data · ETL · Flink

0 likes · 14 min read

Big Data Technology & Architecture

Jan 3, 2021 · Big Data

A Comprehensive Introduction to Apache Airflow: Architecture, Installation, and Usage

This article provides an in‑depth overview of Apache Airflow, covering its core concepts, advantages, architecture components, installation steps, example ETL DAG code, common command‑line tools, and practical tips for leveraging Airflow in data engineering workflows.

Airflow · Data Engineering · ETL

0 likes · 13 min read

Architect

Dec 22, 2020 · Big Data

Dimensional Modeling in Data Warehousing: Concepts, Theory, and Practical Example

This article explains data warehouse fundamentals, reviews classic warehouse models such as ER, dimensional, Data Vault and Anchor, then dives deep into dimensional modeling concepts, star and snowflake schemas, and demonstrates a practical e‑commerce scenario with SQL examples and trade‑offs.

Big Data · Data Warehouse · ETL

0 likes · 11 min read

Big Data Technology & Architecture

Dec 21, 2020 · Big Data

Understanding Slowly Changing Dimensions (SCD) in Data Warehousing

The article explains the concept of Slowly Changing Dimensions (SCD) in data warehouses, illustrates why tracking dimension changes is essential for accurate historical analysis, and details Kimball's five primary SCD types (0‑4) with their implementation strategies and trade‑offs.

Data Warehouse · ETL · SCD Types

0 likes · 7 min read

ITFLY8 Architecture Home

Dec 18, 2020 · Big Data

Unlocking the Data Middle Platform: From Ingestion to Real‑Time Analytics

This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, collection tools, development modules, job scheduling, baseline control, heterogeneous storage, permission management, real‑time and offline processing, governance, services, and implementation details for building robust big‑data solutions.

Data Governance · Data Platform · ETL

0 likes · 25 min read
