Tagged articles
297 articles
dbaplus Community
Nov 3, 2020 · Big Data

How Ctrip Boosted Hotel Data Warehouse Performance 400% with ClickHouse

Ctrip’s hotel data team tackled a 3 TB daily data load by building a ClickHouse cluster on VMware, creating custom sync and execution tools, applying query optimizations, and handling merge and memory errors, ultimately achieving over 400% performance gains across multiple reporting themes.

Big Data · ClickHouse · ETL
0 likes · 7 min read
DataFunTalk
Oct 29, 2020 · Big Data

Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink

Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.

AWS · Data Lake · ETL
0 likes · 12 min read
DataFunTalk
Oct 9, 2020 · Big Data

NetEase’s Data Lake Iceberg: Challenges, Core Principles, and Practical Implementation

This article examines the pain points of traditional data warehouse platforms, explains the core concepts and advantages of the Iceberg data lake table format, compares it with Metastore, reviews the current Iceberg community ecosystem, and details NetEase’s practical integration with Hive, Impala, and Flink to improve ETL efficiency and support unified batch‑stream processing.

Data Lake · ETL · Flink
0 likes · 13 min read
JD Retail Technology
Sep 28, 2020 · Artificial Intelligence

Why AI Testing Is Still Painful and How to Solve It

The talk explores the current pain points of AI testing, outlines data‑quality analysis methods, highlights critical ETL and model‑testing considerations, and shares practical case studies and platform designs to improve machine‑learning quality assurance.

AI testing · Data Quality · ETL
0 likes · 5 min read
DataFunTalk
Sep 25, 2020 · Big Data

Meituan Waimai Data Warehouse: Architecture Evolution, Governance, and Future Roadmap

The article details Meituan Waimai's offline data warehouse evolution from its initial V1.0 design through V2.0 improvements to the V3.0 modeling‑tool driven architecture, covering the four‑layer framework, Spark‑based ETL, data governance processes, resource optimization, security measures, and future development plans.

Big Data · Data Governance · ETL
0 likes · 22 min read
Didi Tech
Aug 24, 2020 · Big Data

Evolution and Architecture of DiDi Data Channel Service

DiDi’s Data Channel Service evolved from a fragmented component system into a unified, SLA‑driven platform with a UI‑based Sync Center and Flink‑powered StreamSQL engine, dramatically improving task creation speed, resource utilization, and reliability while automating issue diagnosis for company‑wide real‑time and offline data synchronization.

Big Data · ETL · Flink
0 likes · 12 min read
Architects Research Society
Aug 20, 2020 · Big Data

Differences Between Talend and Pentaho ETL Tools

The article explains the fundamentals of ETL, compares Talend and Pentaho in terms of openness, connectivity, support, performance, GUI usability, deployment flexibility, and cost, and concludes with guidance on choosing the appropriate tool based on specific business and technical requirements.

Comparison · Data Integration · ETL
0 likes · 7 min read
Architects' Tech Alliance
Aug 11, 2020 · Big Data

Comprehensive Overview of Data Middle Platform Architecture, Components, and Practices

This article provides an extensive summary of data middle platform concepts, covering data aggregation, collection tools, offline and real‑time development, data governance, service layers, warehouse construction, and operational practices, illustrating how enterprises build and manage a unified data ecosystem.

Big Data · Data Governance · Data Middle Platform
0 likes · 27 min read
Architects' Tech Alliance
Aug 5, 2020 · Big Data

Data Middle Platform: Concepts, Architecture, and Implementation

This article provides a comprehensive overview of data middle platforms, covering data aggregation, collection tools, offline and real‑time development, job scheduling, data governance, multi‑layer architecture, ETL processes, and various industry use cases, illustrating how enterprises build and manage unified data assets.

Data Platform · ETL
0 likes · 23 min read
21CTO
Aug 1, 2020 · Big Data

Mastering User Profiling: A Comprehensive Big Data Blueprint

This article explains how enterprises can leverage massive raw and business data to build detailed user profiles, covering tag types, data architecture, development modules, project phases, key deliverables, and a real-world e‑commerce case study.

Big Data · ETL · Spark
0 likes · 22 min read
DataFunTalk
Aug 1, 2020 · Big Data

User Profiling Methodology and Engineering Solutions

This article explains the fundamentals of user profiling in the big data era, covering tag types, data architecture, development modules, a step‑by‑step implementation process, a practical e‑commerce case study, table design strategies, and both quantitative and qualitative profiling methods.

Big Data · ETL · machine learning
0 likes · 22 min read
Sohu Tech Products
Jul 8, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction: A Layered Task‑Instance Approach

The article analyzes data‑warehouse workflow scenarios, explains core concepts such as OLAP, multidimensional modeling and layer architecture, reviews existing workflow engines like Azkaban, Oozie and Airflow, and proposes a task‑and‑instance layered optimization that simplifies dependency configuration, improves collaboration, and supports complex scheduling in modern big‑data environments.

Big Data · ETL · dependency management
0 likes · 21 min read
Big Data Technology Architecture
Jul 8, 2020 · Big Data

Key Interview Questions on Data Warehousing, Data Platforms, and Related Technologies

This article compiles a comprehensive set of 32 interview questions covering data warehouse fundamentals, data platform construction, modeling approaches, real‑time architectures, data quality, governance, Hive optimization, and related analytical techniques to help candidates prepare for data engineering roles.

Data Platform · ETL · data modeling
0 likes · 4 min read
Big Data Technology Architecture
Jun 11, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction

This article analyzes workflow scenarios in data warehouse construction, proposes an optimization scheme that abstracts workflow nodes into task and instance layers, and demonstrates how task attributes and generation rules can improve configurability, dependency management, and collaborative development for large‑scale data warehouse projects.

Big Data · ETL · dependency management
0 likes · 19 min read
DataFunTalk
Jun 6, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction: A Task‑Instance Layered Approach

The article analyzes workflow scenarios in data‑warehouse projects, proposes a two‑level model that abstracts workflow nodes into tasks and instances, defines period and dependency attributes, and presents generation rules that simplify configuration, improve collaboration, and support complex data‑processing schedules in modern big‑data environments.

ETL · data-warehouse · dependency management
0 likes · 19 min read
HomeTech
May 20, 2020 · Big Data

AutoHome Data Warehouse Architecture and Layered Model Design

This article describes AutoHome's data warehouse architecture, detailing its background, business pain points, layered model design (RDM, ADM, GDM, FDM, Stage/BDM, DIM, TMP), advantages in performance, cost, efficiency, quality, and various application scenarios including BI, analytics, and decision support.

BI · ETL · data modeling
0 likes · 10 min read
Youzan Coder
Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

Airflow · Big Data · Data Governance
0 likes · 20 min read
Yanxuan Tech Team
Feb 17, 2020 · Big Data

Why Data Warehouses Matter: From Basics to the Hadoop Ecosystem

This article explains the purpose of data as a strategic asset, compares traditional databases with data warehouses, outlines key characteristics and related concepts of data warehouses, and introduces the Hadoop ecosystem components that support large‑scale data storage and analysis.

Analytics · ETL · Hadoop
0 likes · 14 min read
58 Tech
Feb 10, 2020 · Big Data

Construction and Practice of a Site-wide User Behavior Data Warehouse at 58.com

This article systematically describes the challenges, design principles, modeling methods, layered architecture, implementation steps, and standards used in building a comprehensive user behavior data warehouse for 58.com, highlighting practical experiences and future improvement directions.

Big Data · Data Quality · ETL
0 likes · 11 min read
dbaplus Community
Jan 14, 2020 · Big Data

How OPPO Built a Real‑Time Data Warehouse with Flink SQL

This article details OPPO's evolution from an offline data warehouse to a real‑time platform, describing the business scale, data middle platform architecture, migration strategy using Flink SQL, extensions like AthenaX, and practical use cases such as real‑time ETL, CTR calculation, and tag import.

ETL · Flink · Streaming
0 likes · 18 min read
Architecture Digest
Dec 26, 2019 · Databases

Data Warehouse Fundamentals, Modeling Techniques, and the Evolution of Maoyan’s Warehouse

This article explains the origins and challenges of scattered enterprise data, defines the data warehouse concept, details its four core characteristics, compares entity, normalization, and dimensional modeling methods, and illustrates Maoyan’s three‑stage data‑warehouse evolution with practical examples and diagrams.

ETL · Modeling · data-warehouse
0 likes · 17 min read
Architecture Digest
Dec 24, 2019 · Big Data

Design Architecture and Technical Strategies for Big Data Products

This article systematically outlines the architecture and technical strategy of big‑data product design, detailing a five‑step process from front‑end data collection and ETL to data warehousing, modeling, algorithm design, and personalized user‑centric delivery, while highlighting common platform challenges and future deep‑learning enhancements.

Data Architecture · ETL · user profiling
0 likes · 14 min read
vivo Internet Technology
Dec 18, 2019 · Big Data

Comprehensive Overview of Big Data Architecture, Lambda/Kappa Models, and End-to-End Data Platform Design

The article surveys modern big‑data architecture, contrasting Lambda and Kappa models, highlights common governance and integration pain points, and proposes an end‑to‑end platform featuring unified metadata, stream‑batch processing, one‑click ingestion, standardized modeling, intelligent query abstraction, and a comprehensive development IDE.

Big Data · Data Platform · ETL
0 likes · 13 min read
HomeTech
Nov 27, 2019 · Databases

Migrating AutoHome Community from SQL Server to TiDB: Architecture, Testing, and Lessons Learned

This article details the AutoHome community's migration from a monolithic SQL Server database to the distributed TiDB platform, covering the performance bottlenecks that prompted the change, the evaluation of candidate databases, extensive OLTP/OLAP testing, the full‑ and incremental‑sync migration strategy, rollback mechanisms, and the resulting operational improvements.

ETL · Performance Testing · SQL Server
0 likes · 15 min read
DataFunTalk
Nov 19, 2019 · Big Data

Comprehensive Overview of Data Warehouses: Concepts, Evolution, Architecture, and Real‑time vs Offline Practices

This article provides a thorough introduction to data warehouses, traces their evolution, explains construction methodologies, compares offline, Lambda, and Kappa architectures, and presents real‑time warehouse case studies from Alibaba, Meituan, Xiaomi, Netflix, and OPPO, highlighting practical implementation details and challenges.

ETL · Flink · Kappa architecture
0 likes · 14 min read
Qunar Tech Salon
Nov 18, 2019 · Databases

Data Synchronization Architecture and Refactoring for Large-Scale Travel Data at Qunar

This article describes the challenges of handling billions of travel records in Qunar's MySQL databases, compares open‑source data sync solutions like Databus and Canal, outlines the legacy system’s issues, and presents a refactored architecture that introduces Otter, ES gateway, and improved aggregation to achieve low‑latency, reliable, and scalable data synchronization.

ETL · Elasticsearch · Kafka
0 likes · 19 min read
Architects Research Society
Oct 23, 2019 · Big Data

Talend Performance Tuning Strategy: Identifying and Eliminating Bottlenecks

This article presents a structured, repeatable approach for Talend data‑integration jobs that guides readers through pinpointing performance bottlenecks, testing individual pipeline stages, and applying targeted optimizations to sources, targets, and transformations to achieve higher throughput and more reliable ETL processes.

Bottleneck Analysis · Data Integration · ETL
0 likes · 9 min read
Mafengwo Technology
Sep 26, 2019 · Big Data

Mafengwo’s Data Warehouse & Middle Platform: Architecture, Modeling, Toolchain

This article details Mafengwo’s journey in constructing a data warehouse and data middle platform, covering the core three‑layer architecture, hybrid modeling approaches, the supporting toolchain for data synchronization, scheduling, and metadata management, and the design of an indicator platform for business analytics.

Big Data Architecture · Data Middle Platform · ETL
0 likes · 18 min read
Big Data Technology & Architecture
Sep 13, 2019 · Big Data

Data Warehouse Overview, Architecture, and Modeling Methodology

This article provides a comprehensive introduction to data warehouses, covering their definition, architectural layers, characteristics, modeling approaches such as Inmon and Kimball, fact and dimension table design, star and snowflake schemas, and best‑practice principles for building scalable, maintainable warehouse solutions in the big‑data ecosystem.

Database design · ETL · Modeling
0 likes · 19 min read
Big Data Technology & Architecture
Sep 11, 2019 · Big Data

Big Data Technology and Architecture: Case Studies of Taobao, Didi, and Meituan

This article reviews the evolution and key components of big data platforms at leading Chinese internet companies—Taobao, Didi, and Meituan—detailing their data sources, synchronization tools, storage layers, processing engines, and scheduling systems to provide practical guidance for building robust big data infrastructures.

Architecture · Big Data · Data Platform
0 likes · 9 min read
58 Tech
Sep 6, 2019 · Big Data

Architecture and Technical Implementation of the WMDA Data Analytics Platform

The article details WMDA's end‑to‑end data analytics architecture, covering zero‑event data collection, real‑time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components collaborate to deliver comprehensive user behavior analysis.

Big Data · Druid · ETL
0 likes · 11 min read
dbaplus Community
Jul 24, 2019 · Big Data

Essential Open-Source Tools Every Big Data Engineer Should Know

This article compiles a comprehensive list of common open‑source tools for big data platforms—covering programming languages, data collection, ETL, storage, analysis, query, management, and monitoring—to help learners and practitioners quickly locate and understand the technologies they need.

Big Data · ETL · Hadoop
0 likes · 15 min read
DataFunTalk
Jul 1, 2019 · Artificial Intelligence

Data-Driven Foundations for Building Recommendation Systems

The article explains how data serves as a critical asset for recommendation systems, outlining the necessary steps from understanding business problems and data dimensions to collection, cleaning, integration, and analysis, while distinguishing explicit and implicit user feedback and emphasizing data quality, timeliness, and relevance.

Data Quality · ETL · data collection
0 likes · 11 min read
Dada Group Technology
Jun 11, 2019 · Big Data

Building and Evolving the Dada‑JD Daojia Big Data Platform: Architecture, Strategies, and Lessons Learned

This article presents a comprehensive case study of the Dada‑JD Daojia big data platform, detailing its evolution from a MySQL‑based warehouse to a multi‑layered One Data, One Platform, One Service, Many Apps architecture, the technical challenges faced, and the strategic approaches adopted to ensure coverage, accuracy, stability, and scalability.

Big Data · Data Governance · Data Platform
0 likes · 14 min read
dbaplus Community
Feb 28, 2019 · Big Data

How Zhihu Built a Real-Time Data Warehouse: From Spark Streaming to Flink

This article details Zhihu's evolution of its real-time data warehouse, covering the 1.0 version built on Spark Streaming, the 2.0 upgrade using Flink Streaming SQL, architectural layers, ETL processes, and future directions such as streaming SQL platformization and automated result validation.

ETL · Flink · Lambda architecture
0 likes · 19 min read
dbaplus Community
Jan 23, 2019 · Big Data

How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article explains Zhihu's journey from ad‑hoc MySQL‑Hive sync using Oozie + Sqoop to a unified, platform‑based data synchronization service that now handles thousands of tables, over 10 TB daily, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with their offline job scheduler.

Big Data · DataX · ETL
0 likes · 14 min read
Efficient Ops
Dec 24, 2018 · Operations

How Baidu’s Noah Platform Unifies Ops Data with Pull, Push, and Lazy ETL

This article explains how Baidu Cloud's Noah intelligent operations product builds a unified operations knowledge base by categorizing metadata, status, and event data and applying three ETL approaches—Pull, Push, and Lazy—to handle offline, near‑line, and real‑time data integration.

Cloud Computing · Data Integration · ETL
0 likes · 8 min read
iQIYI Technical Product Team
Oct 19, 2018 · Backend Development

Design and Implementation of an Operational Backend System Using ETL, Metadata, and Business Object Model

The paper outlines the three‑generation evolution of a video‑platform operational backend, from Apollo to Eight, designed to meet the goals of cross‑business integration, low‑cost development, and a good user experience. The system employs a metadata‑driven ETL layer, a unified business‑object model, and a componentized UI within a micro‑kernel, plugin‑based architecture, delivering decoupling, rapid configuration, data safety, and dynamically generated pages. Future work will expand UI components, source support, and deep‑operation features, and target a PaaS/open‑source release.

Architecture · ETL · business object model
0 likes · 12 min read
Meitu Technology
Aug 14, 2018 · Big Data

Meitu Data Platform Architecture and Practices

Meitu’s data platform, serving dozens of apps with 500 million monthly active users and billions of daily events, combines the Arachnia log‑collection system, Kafka ingestion, multi‑layer storage (HDFS, MongoDB, HBase, Elasticsearch), offline Hive/MapReduce processing and real‑time Storm/Flink/Naix pipelines, supported by data‑workshop tools, staged evolution for scalability, and robust security and query‑validation mechanisms.

Big Data · Data Platform · ETL
0 likes · 16 min read
Architects Research Society
Jul 27, 2018 · Big Data

Overview of Apache Hive Features, Usage, and Management

Apache Hive is an open‑source data‑warehouse system built on Hadoop that enables users to read, write, and manage large distributed datasets using SQL‑like queries, offering features such as ETL support, various file‑format connectors, extensible UDFs, and integration with tools like Tez, Spark, and MapReduce.

Apache Hive · Big Data · ETL
0 likes · 5 min read
360 Tech Engineering
Jul 13, 2018 · Big Data

Titan 2.0 Big Data Processing Platform: Architecture Evolution and Practice

The article describes the evolution of 360's Titan big‑data processing platform through three architectural stages, details its functional modules, explains the DITTO component framework, context and rule‑engine abstractions, and shares practical case studies and personal insights on building a flexible, self‑service data platform.

Big Data · DITTO · ETL
0 likes · 12 min read
Hujiang Technology
Jun 20, 2018 · Databases

Deep Dive into Yugong: Architecture, Core Modules, and Custom Enhancements for Database Migration

This article introduces Yugong, an open‑source ETL framework for heterogeneous database migration, explains its core Extractor‑Translator‑Applier architecture, details key classes and interfaces, discusses limitations of the original version, and describes extensive refactoring and new features added to support SQL Server, MySQL, and Canal‑based incremental replication.

ETL · Open-source · Yugong
0 likes · 9 min read
Hujiang Technology
Mar 13, 2018 · Databases

Migrating from SQL Server to MySQL: Strategies, Tools, and Lessons Learned

This article details the background, design considerations, migration workflows, tooling choices, data consistency verification, rollback mechanisms, and practical experiences of moving a large‑scale production environment from Microsoft SQL Server to MySQL, covering both offline and online migration scenarios.

Data Consistency · ETL · MySQL
0 likes · 13 min read
Liulishuo Tech Team
Oct 22, 2017 · Big Data

Data-CI: A SQL-Based Data Unit Testing Framework for ETL

The article introduces data-ci, a SQL‑driven unit testing framework that lets engineers write, organize, and automate data validation tests for ETL pipelines, providing assertions, failure callbacks, coverage reporting, and CI integration to improve data quality and reliability.

Big Data · Data Quality · ETL
0 likes · 9 min read
21CTO
Oct 14, 2017 · Backend Development

How etlpy Simplifies Python Web Scraping and Data Cleaning in Under 500 Lines

etlpy is a lightweight Python framework that lets you define web‑crawling and data‑cleaning pipelines via XML, using generators for streaming, built‑in thread pools for parallelism, and a plug‑in architecture that handles everything from regex parsing to JSON conversion, all within a single 500‑line core file.

ETL · Generators · Web Scraping
0 likes · 14 min read
ITPUB
Sep 30, 2017 · Big Data

Designing Scalable Open‑Source ETL Systems: Lessons from Baidu Waimai

This talk details Baidu Waimai's end‑to‑end ETL design, covering demand sources, data flow patterns, multi‑stage system evolution, storage choices, scheduling architecture, configuration‑driven processing, quality monitoring, and how data lineage enables transparent, self‑service data delivery.

Big Data · Data Quality · ETL
0 likes · 25 min read
ITPUB
Sep 29, 2017 · Big Data

Designing an Open ETL System: Baidu Waimai’s Scalable Data Pipeline Practices

In this talk, a Baidu Waimai engineer explains the motivations, requirements, and architectural choices behind their open‑source ETL platform, covering data flow patterns, logical mappings, storage options, scheduling, metadata management, and quality monitoring to achieve scalable, transparent, and explainable data delivery.

Big Data · ETL · Scheduling
0 likes · 26 min read
Architecture Digest
Sep 2, 2017 · Big Data

Designing a High‑Availability, High‑Efficiency Distributed Scheduling Platform for Big Data

This article examines the principles, features, and implementation details of distributed scheduling for big‑data ETL pipelines, covering decentralised schedulers, host selection strategies, fault‑tolerance, operator abstraction, elasticity, trigger mechanisms, visual monitoring, alarm handling, data fan‑in/fan‑out, parameter consistency, real‑time quality checks, lineage tracking, and field‑level traceability.

Big Data · Data Lineage · Distributed Scheduling
0 likes · 23 min read
Ctrip Technology
Aug 10, 2017 · Big Data

Design and Implementation of Ctrip's Large-Scale Data Platform

This article details the architectural choices, component selection, performance tuning, and team organization behind Ctrip's big‑data platform, covering Kafka, Presto, Elasticsearch, Gobblin, Zeppelin, REST APIs, and job scheduling to achieve scalable, interactive data analysis and visualization.

ETL · Elasticsearch · Presto
0 likes · 18 min read
StarRing Big Data Open Lab
Jul 28, 2017 · Big Data

How Transwarp Transporter Enables Near‑Real‑Time ETL in Big Data Pipelines

The article introduces Transwarp Transporter, a near‑real‑time ETL tool for TDH 5.x, explains its architecture, visual dashboard, drag‑and‑drop data‑flow design, debugging features, parameter management, and highlights how it empowers business users to achieve fast, reliable data migration in big‑data environments.

Data Integration · ETL · Transwarp
0 likes · 7 min read
Architecture Digest
Jul 22, 2017 · Big Data

Popular Big Data Tools and Their Descriptions

This article provides an extensive overview of more than ninety open‑source and commercial big‑data tools—including ETL platforms, resource managers, storage systems, messaging queues, processing engines, and visualization libraries—detailing their core functions, typical use cases, and notable adopters.

Analytics · Big Data · Data Integration
0 likes · 26 min read
Architecture Digest
May 25, 2017 · Big Data

Designing Data Warehouse Layers: Principles, Models, and Practical Practices

This article explains why data warehouses should be layered, describes the classic ODS‑DW‑APP model, details each layer’s purpose and implementation techniques, presents an improved layering scheme with dimension and temporary tables, and answers common questions about parallel DWS and DWD processing.

Big Data · Data Architecture · ETL
0 likes · 17 min read
dbaplus Community
Apr 17, 2017 · Databases

Mastering Oracle‑to‑MySQL Migration: Tools, Pitfalls, and Performance Tweaks

This article shares practical experiences and step‑by‑step guidance for migrating databases from Oracle to MySQL, covering pre‑migration preparation, target selection, data‑object migration tools such as SQL LOAD, Python scripts, Oracle GoldenGate, MySQL Migration Toolkit and Kettle, handling of views, triggers, stored procedures, data validation techniques, and key MySQL performance parameters.

ETL · MySQL · Oracle
0 likes · 26 min read
StarRing Big Data Open Lab
Mar 3, 2017 · Big Data

Boost ETL Performance: Key Tips for Resources, Partitioning & Monitoring

Effective ETL optimization is crucial for data warehouse construction, and this guide outlines three core strategies—ensuring proper resource configuration, leveraging data characteristics for optimal partitioning and bucketing, and monitoring task execution—providing practical principles, pitfalls, and case studies to maximize ETL efficiency.

Bucketing · ETL · Partitioning
0 likes · 11 min read
Architecture Digest
Feb 11, 2017 · Big Data

LeKe Sports Big Data Platform Evolution: From Early ETL Reporting to 2.0 Streaming Architecture

The article describes how LeKe Sports built and continuously upgraded its Hadoop‑based big data platform—from a manual ETL‑to‑Elasticsearch reporting system to a 2.0 architecture featuring Spark Streaming, SQL‑based query layers, Elasticsearch indexing, and cloud‑native storage and backup solutions—to meet rapidly growing PB‑scale data demands.

Big Data · Data Platform · ETL
0 likes · 5 min read
dbaplus Community
Jan 8, 2017 · Big Data

How to Build a Cost‑Effective Data Platform for Small‑to‑Medium Enterprises

This article explains why data platforms are essential for modern SMEs, defines what a data platform is, outlines a four‑step methodology (source definition, analysis theme, ETL processing, and reporting), and shares architectural choices, team structures, common pitfalls, and practical advice for rapid, iterative implementation.

Data Architecture · Data Platform · ETL
0 likes · 15 min read
Architects' Tech Alliance
Nov 30, 2016 · Big Data

Core Technologies and Challenges of Big Data: ETL, Storage, Analysis, and Cloud Integration

This article examines the core technologies of big data, namely data collection, storage, management, analysis, and mining. It covers architectural challenges, analysis techniques, storage solutions, ETL processes, and the interplay between big data and cloud computing, with an emphasis on practical implementation.

Cloud Computing · ETL · data analysis
0 likes · 11 min read
ITPUB
Nov 2, 2016 · Databases

Mastering Oracle GoldenGate: Architecture, Components, and Configuration Guide

This article provides a comprehensive overview of Oracle GoldenGate, detailing its supported databases, modular architecture, key components such as Extract, Data Pump, Replicat, Trails, Checkpoints, Manager and Collector, as well as processing types, group configuration, and commit sequence numbers for reliable data replication.

Change Data Capture · ETL · Oracle GoldenGate
0 likes · 20 min read
ITPUB
Jul 19, 2016 · Big Data

From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

Big Data · Data Governance · ETL
0 likes · 13 min read
Architecture Digest
Apr 9, 2016 · Big Data

Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Big Data · Data Platform · ETL
0 likes · 19 min read
21CTO
Apr 4, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Hadoop

This article details Asana's evolution from a simple Python‑MySQL setup to a robust, scalable data platform using Redshift, Hadoop, Luigi, and modern BI tools, highlighting challenges, solutions, and lessons learned for building reliable data pipelines in fast‑growing startups.

Big Data · Data Infrastructure · ETL
0 likes · 15 min read
dbaplus Community
Apr 3, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Beyond

Facing rapid growth, Asana overhauled its data infrastructure—from a single‑machine MySQL setup to a Redshift‑backed warehouse, Hadoop‑based log processing, Luigi orchestration, and self‑service BI tools—highlighting the challenges, solutions, and future plans for scalable, reliable analytics.

Big Data · Business Intelligence · Data Infrastructure
0 likes · 16 min read
21CTO
Nov 4, 2015 · Big Data

Evolution of Dazhong Dianping’s Data Platform (2012‑2014): Key Lessons for Growing Big Data Teams

This article chronicles the step‑by‑step evolution of Dazhong Dianping’s data platform from 2012 to 2014, detailing changes in data models, storage and compute architecture, scheduling, monitoring, and data‑driven applications, offering practical insights for teams building early‑stage big‑data infrastructures.

Big Data Architecture · Data Platform · ETL
0 likes · 7 min read
ITPUB
May 26, 2015 · Big Data

Step-by-Step Guide to Quickly Install and Configure Hive on Hadoop

This article provides a concise, practical walkthrough for installing and configuring Apache Hive on a Hadoop cluster, covering prerequisite HDFS and MapReduce setup, downloading Hive, extracting files, setting environment variables, configuring XML files, starting Hive, and verifying the installation with simple commands.

ETL · HQL · Hadoop
0 likes · 4 min read