Tagged articles

ETL

304 articles · Page 3 of 4

Dec 11, 2020 · Big Data

Data Synchronization from MySQL to Elasticsearch using DataX and Canal

The article explains how to improve query performance by flattening multi‑table MySQL data and synchronizing it to Elasticsearch—using DataX for one‑time bulk loading and Canal (with Canal‑Adapter) for real‑time binlog‑driven incremental updates—while detailing configuration steps, job examples, and common pitfalls.

Canal · Data synchronization · DataX

0 likes · 14 min read

Big Data Technology & Architecture

Nov 29, 2020 · Big Data

Installing and Configuring Kettle (Pentaho Data Integration) on Linux for Hadoop ETL

This guide provides a step‑by‑step tutorial on preparing a Linux environment, installing Java, GNOME Desktop, VNC remote access, Chinese language support, downloading and extracting Kettle, configuring its startup scripts, creating desktop shortcuts, and managing essential Kettle configuration files for successful Hadoop ETL development.

ETL · Installation · Kettle

0 likes · 37 min read

Big Data Technology & Architecture

Nov 28, 2020 · Big Data

ETL Fundamentals and Introduction to Kettle (Pentaho Data Integration)

This article provides an in-depth overview of ETL concepts, including extraction, transformation, loading, data warehouse architecture, and detailed discussion of Kettle (Pentaho Data Integration) features, design principles, components, transformations, jobs, database connections, metadata management, and practical examples for building robust data integration pipelines.

Data Integration · Data Warehouse · ETL

0 likes · 57 min read

dbaplus Community

Nov 26, 2020 · Big Data

Silicon Valley's Data Middle Platform Secrets: EA, Twitter, Airbnb, Uber

This article examines how leading Silicon Valley companies such as EA, Twitter, Airbnb, and Uber design and operate data middle platforms—detailing their architectures, data collection pipelines, standardization efforts, real‑time and batch processing, and the business impact of shared data capabilities.

Big Data · Cloud · Data Architecture

0 likes · 25 min read

DataFunTalk

Nov 26, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Architecture and Technology

This article details the evolution of 58.com’s commercial data warehouse across three phases—1.0, 2.0, and 3.0—covering its scale, four‑layer architecture, migration from legacy Hadoop‑MapReduce pipelines to Flume/Kafka and Flink streaming, code optimizations, monitoring, and productization for real‑time business insights.

Big Data · ETL · Hadoop

0 likes · 9 min read

DataFunTalk

Nov 18, 2020 · Big Data

Meituan Waimai Traffic Data Collection, Data Warehouse Construction, and Application Practices

This article details Meituan Waimai's traffic data collection history, the design and implementation of its large‑scale data warehouse—including ODL, IDL, CDL, MDL, and DIM layers—along with attribution modeling, data governance, and practical applications for analytics and product development.

Data Warehouse · ETL · Meituan

0 likes · 26 min read

Beike Product & Technology

Nov 13, 2020 · Big Data

Beike One‑Stop Big Data Development Platform: Architecture, Evolution, and Future Outlook

The article summarizes Beike's one‑stop big data development platform, describing its data business background, the evolution from a simple Hadoop‑Kafka‑Hive stack to a metadata‑driven, asset‑oriented platform, and outlines current capabilities in data management, integration, scheduling, quality, openness, and future plans.

Big Data · Data Engineering · Data Governance

0 likes · 11 min read

Big Data Technology & Architecture

Nov 4, 2020 · Big Data

Comprehensive Overview of Data Warehouse Concepts, Architecture, and Modeling

This article provides an extensive introduction to data warehouses, covering their origins, development, definition, advantages, components, comparisons with databases, ODS and data marts, architectural approaches, modeling techniques, and dimensional modeling processes for enterprise‑level analytics.

Data Warehouse · ETL · Inmon

0 likes · 47 min read

dbaplus Community

Nov 3, 2020 · Big Data

How Ctrip Boosted Hotel Data Warehouse Performance 400% with ClickHouse

Ctrip’s hotel data team tackled a 3 TB daily data load by building a ClickHouse cluster on VMware, creating custom sync and execution tools, applying query optimizations, and handling merge and memory errors, ultimately achieving over 400% performance gains across multiple reporting themes.

Big Data · ClickHouse · Data Warehouse

0 likes · 7 min read

DataFunTalk

Oct 29, 2020 · Big Data

Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink

Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.

AWS · Data Lake · ETL

0 likes · 12 min read

Big Data Technology & Architecture

Oct 21, 2020 · Big Data

An Introduction to Apache Hudi: Concepts, Design Principles, and Architecture

This article introduces Apache Hudi, explaining its core concepts, design principles, table architecture, write and compaction mechanisms, and the three query modes that enable efficient batch and incremental processing on modern data lakes.

Apache Hudi · Big Data · Data Lake

0 likes · 21 min read

DataFunTalk

Oct 9, 2020 · Big Data

NetEase’s Data Lake Iceberg: Challenges, Core Principles, and Practical Implementation

This article examines the pain points of traditional data warehouse platforms, explains the core concepts and advantages of the Iceberg data lake table format, compares it with Metastore, reviews the current Iceberg community ecosystem, and details NetEase’s practical integration with Hive, Impala, and Flink to improve ETL efficiency and support unified batch‑stream processing.

Data Lake · ETL · Flink

0 likes · 13 min read

JD Retail Technology

Sep 28, 2020 · Artificial Intelligence

Why AI Testing Is Still Painful and How to Solve It

The talk explores the current pain points of AI testing, outlines data‑quality analysis methods, highlights critical ETL and model‑testing considerations, and shares practical case studies and platform designs to improve machine‑learning quality assurance.

AI testing · Data Quality · ETL

0 likes · 5 min read

DataFunTalk

Sep 25, 2020 · Big Data

Meituan Waimai Data Warehouse: Architecture Evolution, Governance, and Future Roadmap

The article details Meituan Waimai's offline data warehouse evolution from its initial V1.0 design through V2.0 improvements to the V3.0 modeling‑tool driven architecture, covering the four‑layer framework, Spark‑based ETL, data governance processes, resource optimization, security measures, and future development plans.

Big Data · Data Governance · ETL

0 likes · 22 min read

Fulu Network R&D Team

Sep 21, 2020 · Big Data

Data Development and Testing: Process, Key Concerns, and Quality Monitoring

This article outlines the data development lifecycle, distinguishes it from application development, details the responsibilities and focus areas for data testers, and presents a comprehensive end‑to‑end quality monitoring and alert system for big‑data pipelines.

Data Testing · ETL · Quality Monitoring

0 likes · 14 min read

Huawei Cloud Developer Alliance

Sep 15, 2020 · Big Data

Mastering ETL: 8 Essential Algorithms for Modern Data Warehouses

This article explains why ETL is a critical step in building data warehouses, introduces eight core ETL algorithms—including full delete/insert, upsert, append, and various link‑table models—describes their ideal use cases, and provides ready‑to‑run SQL code examples for each.

Big Data · Data Warehouse · ETL

0 likes · 12 min read
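
Three of the load strategies named in the summary above (full delete/insert, upsert, append) can be sketched roughly as follows. This is only an illustrative sketch, not the article's own SQL: the `target` and `target_log` tables, their columns, and the use of an in-memory SQLite database are all assumptions made for the example, and SQLite's `INSERT OR REPLACE` stands in for whatever upsert syntax a real warehouse would use.

```python
import sqlite3


def full_reload(conn, rows):
    """Full delete/insert: wipe the target, then load the whole snapshot."""
    conn.execute("DELETE FROM target")
    conn.executemany("INSERT INTO target (id, name) VALUES (?, ?)", rows)


def upsert(conn, rows):
    """Upsert: insert new keys and overwrite rows whose key already exists."""
    conn.executemany(
        "INSERT OR REPLACE INTO target (id, name) VALUES (?, ?)", rows
    )


def append(conn, rows):
    """Append: add rows without touching existing ones (log-style table)."""
    conn.executemany("INSERT INTO target_log (id, name) VALUES (?, ?)", rows)


def demo():
    # Hypothetical schema: a keyed table and an unkeyed log table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE target_log (id INTEGER, name TEXT)")

    full_reload(conn, [(1, "a"), (2, "b")])
    upsert(conn, [(2, "B"), (3, "c")])   # updates id 2, inserts id 3
    append(conn, [(1, "a")])
    append(conn, [(1, "a")])             # duplicates are kept on purpose

    target = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
    log_count = conn.execute("SELECT COUNT(*) FROM target_log").fetchone()[0]
    conn.close()
    return target, log_count
```

Roughly speaking, full reload suits small tables that are re-extracted in full, upsert suits keyed incremental extracts, and append suits event or log data where history must be preserved.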

Didi Tech

Aug 24, 2020 · Big Data

Evolution and Architecture of DiDi Data Channel Service

DiDi’s Data Channel Service evolved from a fragmented component system into a unified, SLA‑driven platform with a UI‑based Sync Center and Flink‑powered StreamSQL engine, dramatically improving task creation speed, resource utilization, and reliability while automating issue diagnosis for company‑wide real‑time and offline data synchronization.

Big Data · Data synchronization · ETL

0 likes · 12 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Project: Refactoring, Scala Traits, ETL Processor, and Project Entry

This article walks through a Spark and Kudu advertising project, explaining the refactoring approach, Scala trait usage, implementation of ETL and province‑city statistics processors, and shows the complete Spark application entry point with full code examples.

Big Data · ETL · Kudu

0 likes · 7 min read

Architects Research Society

Aug 20, 2020 · Big Data

Differences Between Talend and Pentaho ETL Tools

The article explains the fundamentals of ETL, compares Talend and Pentaho in terms of openness, connectivity, support, performance, GUI usability, deployment flexibility, and cost, and concludes with guidance on choosing the appropriate tool based on specific business and technical requirements.

Comparison · Data Integration · ETL

0 likes · 7 min read

Big Data Technology & Architecture

Aug 19, 2020 · Big Data

Big Data ETL Project: Parsing Advertising JSON with Spark, IP Lookup, and Storing into Kudu

This tutorial describes how to place advertising JSON data on HDFS, use Spark for ETL and analysis, enrich logs with IP lookup, and persist the results into Kudu with daily scheduling, including code examples and schema definitions.

Big Data · ETL · IP lookup

0 likes · 17 min read

Architects' Tech Alliance

Aug 11, 2020 · Big Data

Comprehensive Overview of Data Middle Platform Architecture, Components, and Practices

This article provides an extensive summary of data middle platform concepts, covering data aggregation, collection tools, offline and real‑time development, data governance, service layers, warehouse construction, and operational practices, illustrating how enterprises build and manage a unified data ecosystem.

Big Data · Data Governance · Data Middle Platform

0 likes · 27 min read

Architects' Tech Alliance

Aug 5, 2020 · Big Data

Data Middle Platform: Concepts, Architecture, and Implementation

This article provides a comprehensive overview of data middle platforms, covering data aggregation, collection tools, offline and real‑time development, job scheduling, data governance, multi‑layer architecture, ETL processes, and various industry use cases, illustrating how enterprises build and manage unified data assets.

Data Platform · ETL

0 likes · 23 min read

DataFunTalk

Aug 2, 2020 · Big Data

Building Real-Time Data Warehouses with Apache Flink: Goals, Architecture, and Best Practices

This article presents a comprehensive guide to constructing real-time data warehouses using Apache Flink, covering the motivations, design principles, application scenarios, layer-by-layer architecture, metadata and lineage management, quality assurance, and the supporting toolchain for reliable streaming analytics.

Data Architecture · ETL · Flink

0 likes · 24 min read

21CTO

Aug 1, 2020 · Big Data

Mastering User Profiling: A Comprehensive Big Data Blueprint

This article explains how enterprises can leverage massive raw and business data to build detailed user profiles, covering tag types, data architecture, development modules, project phases, key deliverables, and a real-world e‑commerce case study.

Big Data · Data Warehouse · ETL

0 likes · 22 min read

DataFunTalk

Aug 1, 2020 · Big Data

User Profiling Methodology and Engineering Solutions

This article explains the fundamentals of user profiling in the big data era, covering tag types, data architecture, development modules, a step‑by‑step implementation process, a practical e‑commerce case study, table design strategies, and both quantitative and qualitative profiling methods.

Big Data · ETL · Machine Learning

0 likes · 22 min read

Beike Product & Technology

Jul 16, 2020 · Backend Development

Kafka Connect: Introduction and Concepts for Data Pipelines

This article introduces Kafka Connect, a framework for building scalable data pipelines between Kafka and other systems, covering its architecture, key concepts like connectors and tasks, and practical deployment examples.

Backend Development · Big Data · ETL

0 likes · 20 min read

58 Tech

Jul 13, 2020 · Big Data

Design and Implementation of a Financial Data Warehouse: Architecture, Modeling, Quality Monitoring, and Metadata Management

This article presents a comprehensive design and implementation guide for a financial data warehouse, covering background needs, modeling methodology choices, a layered architecture, data quality monitoring, metadata management, naming and coding standards, and future development directions.

Big Data · Data Quality · Data Warehouse

0 likes · 11 min read

Sohu Tech Products

Jul 8, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction: A Layered Task‑Instance Approach

The article analyzes data‑warehouse workflow scenarios, explains core concepts such as OLAP, multidimensional modeling and layer architecture, reviews existing workflow engines like Azkaban, Oozie and Airflow, and proposes a task‑and‑instance layered optimization that simplifies dependency configuration, improves collaboration, and supports complex scheduling in modern big‑data environments.

Big Data · ETL · Task scheduling

0 likes · 21 min read

Big Data Technology Architecture

Jul 8, 2020 · Big Data

Key Interview Questions on Data Warehousing, Data Platforms, and Related Technologies

This article compiles a comprehensive set of 32 interview questions covering data warehouse fundamentals, data platform construction, modeling approaches, real‑time architectures, data quality, governance, Hive optimization, and related analytical techniques to help candidates prepare for data engineering roles.

Data Platform · Data Warehouse · ETL

0 likes · 4 min read

Big Data Technology Architecture

Jun 11, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction

This article analyzes workflow scenarios in data warehouse construction, proposes an optimization scheme that abstracts workflow nodes into task and instance layers, and demonstrates how task attributes and generation rules can improve configurability, dependency management, and collaborative development for large‑scale data warehouse projects.

Big Data · ETL · Task scheduling

0 likes · 19 min read

DataFunTalk

Jun 6, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction: A Task‑Instance Layered Approach

The article analyzes workflow scenarios in data‑warehouse projects, proposes a two‑level model that abstracts workflow nodes into tasks and instances, defines period and dependency attributes, and presents generation rules that simplify configuration, improve collaboration, and support complex data‑processing schedules in modern big‑data environments.

Data Warehouse · ETL · Task scheduling

0 likes · 19 min read

Big Data Technology & Architecture

Jun 4, 2020 · Big Data

Building a Data Warehouse: Architecture, Storage Selection, Dimensional Modeling, and ETL with Airflow

This article describes the design and implementation of a data warehouse, covering storage engine choices, dimensional modeling techniques, ETL processes using Python scripts, and workflow management with Apache Airflow to address data integration, scalability, and maintenance challenges.

Airflow · ETL · MySQL

0 likes · 11 min read

HomeTech

May 20, 2020 · Big Data

AutoHome Data Warehouse Architecture and Layered Model Design

This article describes AutoHome's data warehouse architecture, detailing its background, business pain points, layered model design (RDM, ADM, GDM, FDM, Stage/BDM, DIM, TMP), advantages in performance, cost, efficiency, quality, and various application scenarios including BI, analytics, and decision support.

BI · ETL · data modeling

0 likes · 10 min read

dbaplus Community

Mar 19, 2020 · Big Data

Inside Ctrip Flight Ticket Data Warehouse: Evolution, Architecture, and Real‑Time Challenges

This article details the evolution of Ctrip's flight ticket data warehouse, describing its historical tech stack, current architecture—including Hive, Presto, ClickHouse, CrateDB, and Flink—data synchronization methods, layer design, quality monitoring, and a real‑time price‑monitoring use case.

Big Data · Ctrip · Data Quality

0 likes · 19 min read

Youzan Coder

Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

Airflow · Big Data · Data Governance

0 likes · 20 min read

Ctrip Technology

Feb 20, 2020 · Big Data

Ctrip Flight Ticket Data Warehouse: Architecture, Technology Stack, and Practical Practices

This article outlines Ctrip's flight ticket data warehouse evolution, current big‑data technology stack, data synchronization methods, layered architecture, quality monitoring system, and a real‑time price anomaly detection case, providing practical insights for building scalable, reliable data warehousing solutions.

Ctrip · Data Quality · Data Warehouse

0 likes · 20 min read

Yanxuan Tech Team

Feb 17, 2020 · Big Data

Why Data Warehouses Matter: From Basics to the Hadoop Ecosystem

This article explains the purpose of data as a strategic asset, compares traditional databases with data warehouses, outlines key characteristics and related concepts of data warehouses, and introduces the Hadoop ecosystem components that support large‑scale data storage and analysis.

Analytics · ETL · Hadoop

0 likes · 14 min read

58 Tech

Feb 10, 2020 · Big Data

Construction and Practice of a Site-wide User Behavior Data Warehouse at 58.com

This article systematically describes the challenges, design principles, modeling methods, layered architecture, implementation steps, and standards used in building a comprehensive user behavior data warehouse for 58.com, highlighting practical experiences and future improvement directions.

Big Data · Data Quality · Data Warehouse

0 likes · 11 min read

dbaplus Community

Jan 14, 2020 · Big Data

How OPPO Built a Real‑Time Data Warehouse with Flink SQL

This article details OPPO's evolution from an offline data warehouse to a real‑time platform, describing the business scale, data‑mid platform architecture, migration strategy using Flink SQL, extensions like AthenaX, and practical use cases such as real‑time ETL, CTR calculation, and tag import.

Data Engineering · ETL · Flink

0 likes · 18 min read

Big Data Technology & Architecture

Jan 2, 2020 · Databases

Comprehensive Overview of Data Warehousing, ETL, OLAP, and Data Cube Operations

This article provides a thorough introduction to data warehousing, covering warehouse creation, the ETL process, OLAP/BI tools, data cube concepts, common OLAP operations, and the three main OLAP architectural models (MOLAP, ROLAP, HOLAP).

Data Warehouse · ETL · OLAP

0 likes · 12 min read

Big Data Technology & Architecture

Jan 1, 2020 · Big Data

Understanding the Origins, Significance, and Construction of Data Warehouses

This article explains the historical background of databases and data warehouses, outlines why data warehouses are essential for modern enterprises, and provides a step‑by‑step guide to building a data warehouse using Kimball’s dimensional modeling approach.

ETL · Kimball · dimensional modeling

0 likes · 8 min read

Architecture Digest

Dec 26, 2019 · Databases

Data Warehouse Fundamentals, Modeling Techniques, and the Evolution of Maoyan’s Warehouse

This article explains the origins and challenges of scattered enterprise data, defines the data warehouse concept, details its four core characteristics, compares entity, normalization, and dimensional modeling methods, and illustrates Maoyan’s three‑stage data‑warehouse evolution with practical examples and diagrams.

Data Warehouse · ETL · dimensional modeling

0 likes · 17 min read

Architecture Digest

Dec 24, 2019 · Big Data

Design Architecture and Technical Strategies for Big Data Products

This article systematically outlines the architecture and technical strategy of big‑data product design, detailing a five‑step process from front‑end data collection and ETL to data warehousing, modeling, algorithm design, and personalized user‑centric delivery, while highlighting common platform challenges and future deep‑learning enhancements.

Data Architecture · ETL · user profiling

0 likes · 14 min read

vivo Internet Technology

Dec 18, 2019 · Big Data

Comprehensive Overview of Big Data Architecture, Lambda/Kappa Models, and End-to-End Data Platform Design

The article surveys modern big‑data architecture, contrasting Lambda and Kappa models, highlights common governance and integration pain points, and proposes an end‑to‑end platform featuring unified metadata, stream‑batch processing, one‑click ingestion, standardized modeling, intelligent query abstraction, and a comprehensive development IDE.

Big Data · Data Platform · ETL

0 likes · 13 min read

YooTech Youzu Tech Team

Nov 28, 2019 · Big Data

How Data Ingestion Evolved at Youzu: From HTTP to Real‑Time DTS & ETL

This article traces the evolution of Youzu's data platform ingestion, comparing early HTTP/script methods with modern DTS and real‑time ETL solutions, evaluating middleware choices, detailing core system architectures, and outlining future improvements for reliable, scalable data access.

Big Data · DTS · ETL

0 likes · 6 min read

Big Data Technology & Architecture

Nov 28, 2019 · Big Data

Resolving Unsupported Oracle Data Types in Spark SQL via Custom JdbcDialects

This article explains how to overcome Spark SQL's inability to handle certain Oracle data types, such as Timestamp with local timezone and FLOAT(126), by creating and registering a custom JdbcDialect that remaps unsupported types to compatible Spark types.

Big Data · Custom Dialect · ETL

0 likes · 8 min read

HomeTech

Nov 27, 2019 · Databases

Migrating AutoHome Community from SQL Server to TiDB: Architecture, Testing, and Lessons Learned

This article details the AutoHome community's migration from a monolithic SQL Server database to the distributed TiDB platform, covering the performance bottlenecks that prompted the change, the evaluation of candidate databases, extensive OLTP/OLAP testing, the full‑ and incremental‑sync migration strategy, rollback mechanisms, and the resulting operational improvements.

ETL · SQL Server · TiDB

0 likes · 15 min read

Big Data Technology & Architecture

Nov 25, 2019 · Big Data

Lightweight Dimension Table Join in Flink Using a Scheduled Cache

The article demonstrates how to enrich Flink streaming ETL jobs with slowly changing dimension data by periodically loading MySQL tables into an in‑memory cache and performing a simple map‑side join within a custom RichMapFunction implementation.

Cache · Dimension join · ETL

0 likes · 5 min read
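
The scheduled-cache join described in the summary above can be sketched, very loosely, in plain Python. This is not Flink code and not the article's implementation: the class only mirrors the `open`/`map`/`close` lifecycle of a Flink `RichMapFunction`, and the `load_dimension` callable, the `city_id` and `city_name` fields, and the refresh interval are hypothetical names chosen for the example. In a real job, `load_dimension` would query the MySQL dimension table.

```python
import threading


class CachedDimensionJoin:
    """Enrich stream records from an in-memory copy of a dimension table."""

    def __init__(self, load_dimension, refresh_seconds=60.0):
        # load_dimension: hypothetical callable returning {key: attribute}.
        self._load_dimension = load_dimension
        self._refresh_seconds = refresh_seconds
        self._cache = {}
        self._stop = threading.Event()
        self._thread = None

    def open(self):
        """Load once up front, then refresh on a background schedule."""
        self.refresh()
        self._thread = threading.Thread(target=self._refresh_loop, daemon=True)
        self._thread.start()

    def refresh(self):
        # Build the new snapshot first, then swap the reference in one step,
        # so map() never observes a half-loaded table.
        self._cache = dict(self._load_dimension())

    def _refresh_loop(self):
        # Event.wait returns False on timeout and True once close() is called.
        while not self._stop.wait(self._refresh_seconds):
            self.refresh()

    def map(self, record):
        """Map-side join: attach the dimension value, or None if unknown."""
        return {**record, "city_name": self._cache.get(record["city_id"])}

    def close(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

The trade-off of this pattern is staleness: a dimension change becomes visible only after the next refresh, which is usually acceptable for slowly changing dimensions and keeps the per-record lookup to a dictionary access.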

DataFunTalk

Nov 19, 2019 · Big Data

Comprehensive Overview of Data Warehouses: Concepts, Evolution, Architecture, and Real‑time vs Offline Practices

This article provides a thorough introduction to data warehouses, traces their evolution, explains construction methodologies, compares offline, Lambda, and Kappa architectures, and presents real‑time warehouse case studies from Alibaba, Meituan, Xiaomi, Netflix, and OPPO, highlighting practical implementation details and challenges.

Data Warehouse · ETL · Flink

0 likes · 14 min read

Qunar Tech Salon

Nov 18, 2019 · Databases

Data Synchronization Architecture and Refactoring for Large-Scale Travel Data at Qunar

This article describes the challenges of handling billions of travel records in Qunar's MySQL databases, compares open‑source data sync solutions like Databus and Canal, outlines the legacy system’s issues, and presents a refactored architecture that introduces Otter, ES gateway, and improved aggregation to achieve low‑latency, reliable, and scalable data synchronization.

Data synchronization · Databases · ETL

0 likes · 19 min read

Architects Research Society

Oct 23, 2019 · Big Data

Talend Performance Tuning Strategy: Identifying and Eliminating Bottlenecks

This article presents a structured, repeatable approach for Talend data‑integration jobs that guides readers through pinpointing performance bottlenecks, testing individual pipeline stages, and applying targeted optimizations to sources, targets, and transformations to achieve higher throughput and more reliable ETL processes.

Bottleneck Analysis · Data Integration · ETL

0 likes · 9 min read

Mafengwo Technology

Sep 26, 2019 · Big Data

Mafengwo’s Data Warehouse & Middle Platform: Architecture, Modeling, Toolchain

This article details Mafengwo’s journey in constructing a data warehouse and data middle platform, covering the core three‑layer architecture, hybrid modeling approaches, the supporting toolchain for data synchronization, scheduling, and metadata management, and the design of an indicator platform for business analytics.

Big Data Architecture · Data Middle Platform · Data Warehouse

0 likes · 18 min read

Big Data Technology & Architecture

Sep 13, 2019 · Big Data

Data Warehouse Overview, Architecture, and Modeling Methodology

This article provides a comprehensive introduction to data warehouses, covering their definition, architectural layers, characteristics, modeling approaches such as Inmon and Kimball, fact and dimension table design, star and snowflake schemas, and best‑practice principles for building scalable, maintainable warehouse solutions in the big‑data ecosystem.

Database Design · ETL · OLAP

0 likes · 19 min read

Big Data Technology & Architecture

Sep 11, 2019 · Big Data

Big Data Technology and Architecture: Case Studies of Taobao, Didi, and Meituan

This article reviews the evolution and key components of big data platforms at leading Chinese internet companies—Taobao, Didi, and Meituan—detailing their data sources, synchronization tools, storage layers, processing engines, and scheduling systems to provide practical guidance for building robust big data infrastructures.

Big Data · Data Platform · ETL

0 likes · 9 min read

58 Tech

Sep 6, 2019 · Big Data

Architecture and Technical Implementation of the WMDA Data Analytics Platform

The article details WMDA's end‑to‑end data analytics architecture, covering zero‑event data collection, real‑time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components collaborate to deliver comprehensive user behavior analysis.

Big Data · Druid · ETL

0 likes · 11 min read

Big Data Technology & Architecture

Aug 27, 2019 · Big Data

Building a Data Warehouse: Architecture, ETL, Layering, Modeling, and Governance

This article explains how to build a data warehouse from scratch, covering its definition, system and collaboration layers, ETL requirements, data layering design, modeling steps, common challenges, and governance practices such as temporary table management and coding standards.

Big Data · Data Governance · Data Warehouse

0 likes · 13 min read

dbaplus Community

Jul 24, 2019 · Big Data

Essential Open-Source Tools Every Big Data Engineer Should Know

This article compiles a comprehensive list of common open‑source tools for big data platforms—covering programming languages, data collection, ETL, storage, analysis, query, management, and monitoring—to help learners and practitioners quickly locate and understand the technologies they need.

Big Data · Data Engineering · ETL

0 likes · 15 min read

Big Data Technology Architecture

Jul 16, 2019 · Big Data

Optimizing HBase‑to‑Hive Data Transfer with SnapshotScanMR to Reduce RegionServer Load

The article describes how a large‑scale ETL process that previously used HBaseStorageHandler caused severe region server pressure, and how a new HBase‑to‑Hive task based on SnapshotScanMR was designed to bypass region servers, halve execution time, and double scanning performance.

ETL · HBase · Hive

0 likes · 6 min read

Zhongtong Tech

Jul 5, 2019 · Big Data

How SnapshotScanMR Doubles HBase‑to‑Hive ETL Speed and Relieves Cluster Load

This article explains how leveraging HBase's SnapshotScanMR feature to create a custom hbase2hiveBySnapshot task dramatically reduces region server pressure, halves ETL execution time, and improves cluster stability for large‑scale data back‑fill operations.

Big Data · ETL · HBase

0 likes · 6 min read

DataFunTalk

Jul 1, 2019 · Artificial Intelligence

Data-Driven Foundations for Building Recommendation Systems

The article explains how data serves as a critical asset for recommendation systems, outlining the necessary steps from understanding business problems and data dimensions to collection, cleaning, integration, and analysis, while distinguishing explicit and implicit user feedback and emphasizing data quality, timeliness, and relevance.

Data Quality · ETL · Recommendation Systems

0 likes · 11 min read

Dada Group Technology

Jun 11, 2019 · Big Data

Building and Evolving the Dada‑JD Daojia Big Data Platform: Architecture, Strategies, and Lessons Learned

This article presents a comprehensive case study of the Dada‑JD Daojia big data platform, detailing its evolution from a MySQL‑based warehouse to a multi‑layered One Data, One Platform, One Service, Many Apps architecture, the technical challenges faced, and the strategic approaches adopted to ensure coverage, accuracy, stability, and scalability.

Big Data · Case Study · Data Governance

0 likes · 14 min read

Youzan Coder

Mar 22, 2019 · Big Data

Design and Implementation of a DataX‑Based Data Synchronization Platform at Youzan

Youzan replaced Sqoop with a customized DataX‑based platform that integrates with its offline scheduler to reliably sync MySQL, HBase, Elasticsearch and file sources to Hive, handling schema changes, sharding, rate‑limiting and logging, and has processed billions of rows daily with high stability.

DataX · ETL · Hive

0 likes · 15 min read

dbaplus Community

Feb 28, 2019 · Big Data

How Zhihu Built a Real-Time Data Warehouse: From Spark Streaming to Flink

This article details Zhihu's evolution of its real-time data warehouse, covering the 1.0 version built on Spark Streaming, the 2.0 upgrade using Flink Streaming SQL, architectural layers, ETL processes, and future directions such as streaming SQL platformization and automated result validation.

ETL · Flink · Lambda architecture

0 likes · 19 min read

dbaplus Community

Jan 23, 2019 · Big Data

How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article explains Zhihu's journey from ad‑hoc MySQL‑Hive sync using Oozie + Sqoop to a unified, platform‑based data synchronization service that now handles thousands of tables, over 10 TB daily, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with their offline job scheduler.

Big Data · Data synchronization · DataX

0 likes · 14 min read

360 Quality & Efficiency

Jan 2, 2019 · Big Data

Understanding ETL and Data Warehouses: A Beginner’s Guide

This article introduces the fundamentals of Business Intelligence, explains what ETL and data warehouses are, compares them with traditional databases, and outlines the main characteristics and popular tools such as Hive used in modern big‑data environments.

BI · Big Data · Data Integration

0 likes · 5 min read

Efficient Ops

Dec 24, 2018 · Operations

How Baidu’s Noah Platform Unifies Ops Data with Pull, Push, and Lazy ETL

This article explains how Baidu Cloud's Noah intelligent operations product builds a unified operations knowledge base by categorizing metadata, status, and event data and applying three ETL approaches—Pull, Push, and Lazy—to handle offline, near‑line, and real‑time data integration.

Cloud Computing · Data Integration · ETL

0 likes · 8 min read

iQIYI Technical Product Team

Oct 19, 2018 · Backend Development

Design and Implementation of an Operational Backend System Using ETL, Metadata, and Business Object Model

The paper outlines the three‑generation evolution of a video‑platform operational backend, from Apollo to Eight, built to meet goals of cross‑business integration, low‑cost development, and good user experience. It employs a metadata‑driven ETL layer, a unified business‑object model, and a componentized UI within a micro‑kernel, plugin‑based architecture, delivering decoupling, rapid configuration, data safety, and dynamically generated pages. Planned future work covers more UI components, broader source support, deep‑operation features, and a PaaS/open‑source release.

ETL · architecture · business object model

0 likes · 12 min read

dbaplus Community

Oct 18, 2018 · Big Data

How FunData Scaled DOTA2 Esports Data with a Cloud‑Native Big Data Architecture

This article details the evolution of the FunData esports data platform from a simple master‑slave ETL system to a cloud‑native, distributed architecture that leverages Google Cloud Pub/Sub, Dataflow, Bigtable, and a redesigned API layer to handle petabyte‑scale, real‑time DOTA2 match data.

ETL · Google Cloud Platform · esports

0 likes · 13 min read

360 Quality & Efficiency

Oct 15, 2018 · Big Data

An Introduction to Big Data Concepts, Hadoop Ecosystem, and Common Frameworks

This article provides a comprehensive overview of big data fundamentals, including the 4V characteristics, the Hadoop 2.0 layered architecture, a comparison between Hadoop and Spark, classification of common big‑data tools, and the typical offline and real‑time data processing workflows.

ETL · Hadoop · Spark

0 likes · 6 min read

Big Data and Microservices

Aug 21, 2018 · Big Data

How to Build a Scalable Hadoop‑Spark Big Data Analytics Platform

This article explains why BI is essential for big data platforms, outlines the value hierarchy of data, details the Hadoop‑based analysis workflow, and provides step‑by‑step guidance for constructing both pure Hadoop and hybrid Hadoop‑Spark analytics architectures.

BI · Big Data Architecture · Data Lake

0 likes · 12 min read

Meitu Technology

Aug 14, 2018 · Big Data

Meitu Data Platform Architecture and Practices

Meitu’s data platform, serving dozens of apps with 500 million monthly active users and billions of daily events, combines the Arachnia log‑collection system, Kafka ingestion, multi‑layer storage (HDFS, MongoDB, HBase, Elasticsearch), offline Hive/MapReduce processing and real‑time Storm/Flink/Naix pipelines, supported by data‑workshop tools, staged evolution for scalability, and robust security and query‑validation mechanisms.

Big Data · Data Engineering · Data Platform

0 likes · 16 min read

Architects Research Society

Jul 27, 2018 · Big Data

Overview of Apache Hive Features, Usage, and Management

Apache Hive is an open‑source data‑warehouse system built on Hadoop that enables users to read, write, and manage large distributed datasets using SQL‑like queries, offering features such as ETL support, various file‑format connectors, extensible UDFs, and integration with tools like Tez, Spark, and MapReduce.

Apache Hive · Big Data · Data Warehouse

0 likes · 5 min read

360 Tech Engineering

Jul 13, 2018 · Big Data

Titan 2.0 Big Data Processing Platform: Architecture Evolution and Practice

The article describes the evolution of 360's Titan big‑data processing platform through three architectural stages, details its functional modules, explains the DITTO component framework, context and rule‑engine abstractions, and shares practical case studies and personal insights on building a flexible, self‑service data platform.

Big Data · DITTO · ETL

0 likes · 12 min read

Hujiang Technology

Jun 20, 2018 · Databases

Deep Dive into Yugong: Architecture, Core Modules, and Custom Enhancements for Database Migration

This article introduces Yugong, an open‑source ETL framework for heterogeneous database migration, explains its core Extractor‑Translator‑Applier architecture, details key classes and interfaces, discusses limitations of the original version, and describes extensive refactoring and new features added to support SQL Server, MySQL, and Canal‑based incremental replication.

ETL · Java · Yugong

0 likes · 9 min read

dbaplus Community

Jun 14, 2018 · Big Data

Designing Scalable Hadoop‑Based Data Analytics Platforms: Architecture & Best Practices

This article explains how enterprises can build a scalable data analytics platform on Hadoop by outlining the multi‑layer architecture, storage options, data synchronization methods, and ETL/offline computation techniques, while highlighting practical component choices such as Hive, HBase, Spark, and Oozie.

Big Data · Data Architecture · Data Lake

0 likes · 10 min read

dbaplus Community

May 28, 2018 · Databases

How to Migrate from SQL Server to MySQL: Strategies, Tools, and CDC Implementation

This article details the challenges of moving from SQL Server to MySQL, compares offline and online migration approaches, presents a complete toolbox—including ETL utilities, consistency checkers, CDC configuration, and rollback mechanisms—and shares practical lessons from a large‑scale production migration.

CDC · ETL · MySQL

0 likes · 26 min read

Hujiang Technology

Mar 13, 2018 · Databases

Migrating from SQL Server to MySQL: Strategies, Tools, and Lessons Learned

This article details the background, design considerations, migration workflows, tooling choices, data consistency verification, rollback mechanisms, and practical experiences of moving a large‑scale production environment from Microsoft SQL Server to MySQL, covering both offline and online migration scenarios.

Data Consistency · ETL · MySQL

0 likes · 13 min read

Meituan Technology Team

Feb 2, 2018 · Big Data

How Meituan’s “Flow Compass” Turns Massive User Data into Actionable Insights

This article details the design, challenges, and implementation of Meituan’s Flow Compass—a data‑driven product that combines user, scene, and traffic source dimensions using a Kylin‑based warehouse to enable rapid, flexible traffic‑source analysis for hotel‑travel growth.

Big Data · Data Warehouse · ETL

0 likes · 19 min read

Qunar Tech Salon

Dec 7, 2017 · Big Data

User Behavior Data Collection and Real-Time Processing Architecture at Qunar

This article describes Qunar's end‑to‑end user behavior data pipeline, covering offline and real‑time ETL processes, system architecture, Dubbo service interfaces, monitoring, optimizations, and the numerous product applications that leverage the unified behavior dataset.

ETL · Recommendation Systems · data pipeline

0 likes · 15 min read

Liulishuo Tech Team

Oct 22, 2017 · Big Data

Data-CI: A SQL-Based Data Unit Testing Framework for ETL

The article introduces data-ci, a SQL‑driven unit testing framework that lets engineers write, organize, and automate data validation tests for ETL pipelines, providing assertions, failure callbacks, coverage reporting, and CI integration to improve data quality and reliability.

Big Data · Data Quality · Data Testing

0 likes · 9 min read

21CTO

Oct 14, 2017 · Backend Development

How etlpy Simplifies Python Web Scraping and Data Cleaning in Under 500 Lines

etlpy is a lightweight Python framework that lets you define web‑crawling and data‑cleaning pipelines via XML, using generators for streaming, built‑in thread pools for parallelism, and a plug‑in architecture that handles everything from regex parsing to JSON conversion, all within a single 500‑line core file.

ETL · Web Scraping · data cleaning

0 likes · 14 min read

ITPUB

Sep 30, 2017 · Big Data

Designing Scalable Open‑Source ETL Systems: Lessons from Baidu Waimai

This talk details Baidu Waimai's end‑to‑end ETL design, covering demand sources, data flow patterns, multi‑stage system evolution, storage choices, scheduling architecture, configuration‑driven processing, quality monitoring, and how data lineage enables transparent, self‑service data delivery.

Big Data · Data Quality · Data Warehouse

0 likes · 25 min read

ITPUB

Sep 29, 2017 · Big Data

Designing an Open ETL System: Baidu Waimai’s Scalable Data Pipeline Practices

In this talk, a Baidu Waimai engineer explains the motivations, requirements, and architectural choices behind their open‑source ETL platform, covering data flow patterns, logical mappings, storage options, scheduling, metadata management, and quality monitoring to achieve scalable, transparent, and explainable data delivery.

Big Data · Data Engineering · ETL

0 likes · 26 min read

dbaplus Community

Sep 20, 2017 · Big Data

Scaling TB‑Level Price Computations with Apache Spark: Suning’s Architecture and Optimizations

This article details how Suning built a Hadoop‑based big data platform and leveraged Apache Spark to process terabytes of product price and inventory data, describing the system architecture, four key technical practices, performance results, and future data‑lake directions.

Apache Spark · DataFrames · Distributed Computing

0 likes · 12 min read

Architecture Digest

Sep 2, 2017 · Big Data

Designing a High‑Availability, High‑Efficiency Distributed Scheduling Platform for Big Data

This article examines the principles, features, and implementation details of distributed scheduling for big‑data ETL pipelines, covering decentralised schedulers, host selection strategies, fault‑tolerance, operator abstraction, elasticity, trigger mechanisms, visual monitoring, alarm handling, data fan‑in/fan‑out, parameter consistency, real‑time quality checks, lineage tracking, and field‑level traceability.

Big Data · Distributed Scheduling · ETL

0 likes · 23 min read

Tongcheng Travel Technology Center

Aug 31, 2017 · Big Data

Evolution and Architecture of the Transportation Division Data Warehouse

The article details how the Transportation Division’s data warehouse grew from a simple SQL‑based solution to a multi‑layer, big‑data platform handling petabyte‑scale data with daily 10 TB increments, describing the technical and business architecture, ETL strategies, and future roadmap.

Big Data · Data Architecture · Data Warehouse

0 likes · 10 min read

Ctrip Technology

Aug 10, 2017 · Big Data

Design and Implementation of Ctrip's Large-Scale Data Platform

This article details the architectural choices, component selection, performance tuning, and team organization behind Ctrip's big‑data platform, covering Kafka, Presto, Elasticsearch, Gobblin, Zeppelin, REST APIs, and job scheduling to achieve scalable, interactive data analysis and visualization.

ETL · Elasticsearch · presto

0 likes · 18 min read

dbaplus Community

Aug 3, 2017 · Big Data

How Ctrip Built a Scalable Data Platform with Presto, Elasticsearch, and Gobblin

This article summarizes Xu Peng's DAMS 2017 presentation on selecting big‑data platform components, designing ETL pipelines, choosing analysis engines, optimizing Elasticsearch, and building a data‑driven team at Ctrip.

Big Data Architecture · Cluster Tuning · Data Platform

0 likes · 23 min read

StarRing Big Data Open Lab

Jul 28, 2017 · Big Data

How Transwarp Transporter Enables Near‑Real‑Time ETL in Big Data Pipelines

The article introduces Transwarp Transporter, a near‑real‑time ETL tool for TDH 5.x, explains its architecture, visual dashboard, drag‑and‑drop data‑flow design, debugging features, parameter management, and highlights how it empowers business users to achieve fast, reliable data migration in big‑data environments.

Data Integration · ETL · Transwarp

0 likes · 7 min read

Architecture Digest

Jul 22, 2017 · Big Data

Popular Big Data Tools and Their Descriptions

This article provides an extensive overview of more than ninety open‑source and commercial big‑data tools—including ETL platforms, resource managers, storage systems, messaging queues, processing engines, and visualization libraries—detailing their core functions, typical use cases, and notable adopters.

Analytics · Big Data · Data Integration

0 likes · 26 min read

Architecture Digest

May 25, 2017 · Big Data

Designing Data Warehouse Layers: Principles, Models, and Practical Practices

This article explains why data warehouses should be layered, describes the classic ODS‑DW‑APP model, details each layer’s purpose and implementation techniques, presents an improved layering scheme with dimension and temporary tables, and answers common questions about parallel DWS and DWD processing.

Big Data · Data Architecture · Data Warehouse

0 likes · 17 min read

dbaplus Community

Apr 17, 2017 · Databases

Mastering Oracle‑to‑MySQL Migration: Tools, Pitfalls, and Performance Tweaks

This article shares practical experiences and step‑by‑step guidance for migrating databases from Oracle to MySQL, covering pre‑migration preparation, target selection, data‑object migration tools such as SQL LOAD, Python scripts, Oracle GoldenGate, MySQL Migration Toolkit and Kettle, handling of views, triggers, stored procedures, data validation techniques, and key MySQL performance parameters.

ETL · MySQL · Oracle

0 likes · 26 min read

StarRing Big Data Open Lab

Mar 10, 2017 · Big Data

Boost ETL Performance: Practical Tips to Optimize Inceptor Transaction Tables

This article shares actionable ETL tuning strategies for Inceptor, including avoiding unnecessary transaction tables, shortening transaction windows, analyzing the most impactful cases first, and iteratively refining steps until optimal performance is achieved.

ETL · Inceptor · Performance Tuning

0 likes · 6 min read

StarRing Big Data Open Lab

Mar 3, 2017 · Big Data

Boost ETL Performance: Key Tips for Resources, Partitioning & Monitoring

Effective ETL optimization is crucial for data warehouse construction, and this guide outlines three core strategies—ensuring proper resource configuration, leveraging data characteristics for optimal partitioning and bucketing, and monitoring task execution—providing practical principles, pitfalls, and case studies to maximize ETL efficiency.

Bucketing · ETL · Partitioning

0 likes · 11 min read

Architecture Digest

Feb 11, 2017 · Big Data

LeKe Sports Big Data Platform Evolution: From Early ETL Reporting to 2.0 Streaming Architecture

The article describes how LeKe Sports built and continuously upgraded its Hadoop‑based big data platform—from a manual ETL‑to‑Elasticsearch reporting system to a 2.0 architecture featuring Spark Streaming, SQL‑based query layers, Elasticsearch indexing, and cloud‑native storage and backup solutions—to meet rapidly growing PB‑scale data demands.

Big Data · Data Platform · ETL

0 likes · 5 min read

dbaplus Community

Jan 8, 2017 · Big Data

How to Build a Cost‑Effective Data Platform for Small‑to‑Medium Enterprises

This article explains why data platforms are essential for modern SMEs, defines what a data platform is, outlines a four‑step methodology (source definition, analysis theme, ETL processing, and reporting), and shares architectural choices, team structures, common pitfalls, and practical advice for rapid, iterative implementation.

Data Architecture · Data Platform · Data Warehouse

0 likes · 15 min read

Architects' Tech Alliance

Nov 30, 2016 · Big Data

Core Technologies and Challenges of Big Data: ETL, Storage, Analysis, and Cloud Integration

This article examines the core technologies of big data—including data collection, storage, management, analysis, and mining—highlighting architectural challenges, analysis techniques, storage solutions, ETL processes, and the interplay between big data and cloud computing, while emphasizing practical implementation considerations.

Cloud Computing · ETL · data analysis

0 likes · 11 min read

ITPUB

Nov 2, 2016 · Databases

Mastering Oracle GoldenGate: Architecture, Components, and Configuration Guide

This article provides a comprehensive overview of Oracle GoldenGate, detailing its supported databases, modular architecture, key components such as Extract, Data Pump, Replicat, Trails, Checkpoints, Manager and Collector, as well as processing types, group configuration, and commit sequence numbers for reliable data replication.

Change Data Capture · Data Replication · ETL

0 likes · 20 min read

ITPUB

Jul 19, 2016 · Big Data

From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

Big Data · Data Governance · Data Warehouse

0 likes · 13 min read

Architecture Digest

Apr 9, 2016 · Big Data

Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Big Data · Data Platform · Distributed Computing

0 likes · 19 min read
