Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Architect

Oct 17, 2015

We are all architects! This informal discussion, based on my experience in the internet industry, lists the typical uses of a data warehouse and data platform, such as integrating all business data, providing various reports, supporting operations, enabling data‑driven products, and offering open data services.

Overall Architecture

The diagram below shows a typical data platform architecture, which is similar across many companies. Logically it consists of a data collection layer, a storage and analysis layer, a data sharing layer, and a data application layer.

Data Collection

The collection layer gathers data from various sources and stores it in HDFS, optionally performing light cleaning.

Website logs : Flume agents deployed on the log servers send logs to HDFS in real time.

Business databases : MySQL, Oracle, SQL Server, etc. can be synchronized to HDFS using tools like DataX (an open‑source alternative to Sqoop) or Flume.

FTP/HTTP sources : Periodic data from partners can also be fetched via DataX.

Other sources : Manual data can be ingested through simple APIs or small applications.
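As a rough illustration of the "light cleaning" step, here is a minimal sketch in plain Python. The tab-separated log layout, the function names, and the `/data/raw/<source>/dt=<date>` directory convention are all assumptions made for this example; the article does not specify any of them.

```python
from datetime import datetime
from typing import Optional


def clean_log_line(line: str) -> Optional[dict]:
    """Light cleaning: trim whitespace, drop malformed rows, parse the timestamp.

    Assumes a hypothetical layout: timestamp<TAB>user_id<TAB>url
    """
    parts = line.strip().split("\t")
    if len(parts) != 3:
        return None  # wrong number of fields
    raw_ts, user_id, url = (p.strip() for p in parts)
    try:
        ts = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # unparseable timestamp
    if not user_id or not url:
        return None  # empty required field
    return {"ts": ts, "user_id": user_id, "url": url}


def hdfs_partition_path(source: str, ts: datetime, root: str = "/data/raw") -> str:
    """Build a date-partitioned landing directory (assumed layout)."""
    return f"{root}/{source}/dt={ts:%Y-%m-%d}"


record = clean_log_line("2015-10-17 08:30:00\tu42\t/index.html\n")
target_dir = hdfs_partition_path("weblog", record["ts"])
```

Keeping the cleaning this shallow means the raw data lands almost unchanged, and heavier transformations can be left to the analysis layer.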

Storage and Analysis

HDFS is the de‑facto storage solution for big‑data warehouses. For offline analysis, Hive is preferred because of its rich data types, built‑in functions, ORC compression, and SQL‑like interface, which dramatically reduces development effort compared with raw MapReduce.

For better performance, we run Spark and SparkSQL on YARN, so they share the existing Hadoop cluster and no separate Spark cluster is needed. Real‑time computation is covered later.
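To make the development-effort argument concrete, the sketch below counts page views twice: once with hand-written map and reduce phases, and once with a single SQL statement. Python's built-in `sqlite3` is used purely as a stand-in so the example runs anywhere; it is not Hive or SparkSQL, and the `page_views` table and sample rows are invented for illustration.

```python
import sqlite3
from collections import defaultdict

rows = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]


# MapReduce style: the map and reduce phases are written by hand.
def map_phase(records):
    for _user, url in records:
        yield url, 1


def reduce_phase(pairs):
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)


mr_counts = reduce_phase(map_phase(rows))

# SQL style: the same aggregation is one declarative statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
sql_counts = dict(
    conn.execute("SELECT url, COUNT(*) FROM page_views GROUP BY url")
)
```

Both paths produce the same counts, but the SQL version stays one line long as the logic grows to joins and multiple groupings, which is where the savings over raw MapReduce become significant.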

Data Sharing

After analysis, results are stored in relational or NoSQL databases so that downstream applications can access them. DataX can also move processed data from HDFS to these target stores, and some real‑time results are written directly to the sharing layer.
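One detail worth getting right when loading results into the sharing layer is that export jobs get rerun. The sketch below shows one common approach (not necessarily what our DataX jobs do): replace a whole day's rows inside a single transaction, so a rerun overwrites instead of duplicating. `sqlite3` stands in for the relational store, and the `daily_pv` table and `publish_daily_report` function are hypothetical names.

```python
import sqlite3


def publish_daily_report(conn, dt, rows):
    """Replace one day's rows in a single transaction so reruns never duplicate."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_pv WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO daily_pv (dt, url, pv) VALUES (?, ?, ?)",
            [(dt, url, pv) for url, pv in rows],
        )


conn = sqlite3.connect(":memory:")  # stand-in for the sharing-layer database
conn.execute("CREATE TABLE daily_pv (dt TEXT, url TEXT, pv INTEGER)")

publish_daily_report(conn, "2015-10-17", [("/home", 120), ("/cart", 30)])
# Rerun with corrected numbers: the day's rows are replaced, not appended.
publish_daily_report(conn, "2015-10-17", [("/home", 125), ("/cart", 30)])
```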

Data Application

Business products : Consume data from the sharing layer.

Reports : Use pre‑aggregated data stored in the sharing layer.

Ad‑hoc queries : Users (developers, analysts, managers) run SQL queries directly against the storage layer; SparkSQL is recommended for better response time than Hive.

OLAP : When data volume exceeds relational database capacity, custom solutions fetch data from HDFS or HBase for multidimensional analysis.

Other data interfaces : Example – a Redis‑based service provides user attributes to all business services.

Real‑time Computation

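Increasing business demand for low‑latency data leads to the adoption of distributed, high‑throughput frameworks. Although Storm is mature, we chose Spark Streaming because it is simpler to work with; its micro‑batch model adds some latency compared with Storm, but the difference is acceptable for our use cases.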

Our implementation collects website and advertising logs via Flume, streams them to Spark Streaming, aggregates statistics, and writes the results to Redis for real‑time consumption.
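The micro-batch idea behind this pipeline can be simulated in a few lines of plain Python. This is only a conceptual sketch: it does not use the Spark Streaming or Redis APIs, a dict stands in for Redis, and the event format, the 10-second batch length, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (epoch_seconds, key) events into fixed-length batches."""
    batches = defaultdict(list)
    for ts, key in events:
        batch_start = ts // batch_seconds * batch_seconds
        batches[batch_start].append(key)
    for start in sorted(batches):
        yield start, batches[start]


def run_stream(events, store, batch_seconds=10):
    """Aggregate each batch and add its counts to the running totals in store."""
    for _start, keys in micro_batches(events, batch_seconds):
        for key, count in Counter(keys).items():
            store[key] = store.get(key, 0) + count


store = {}  # stands in for Redis in this sketch
events = [(0, "ad:1"), (3, "ad:2"), (9, "ad:1"), (12, "ad:1"), (25, "ad:2")]
run_stream(events, store)
```

Each batch is aggregated as a unit before the totals are updated, which is why results lag by roughly one batch interval instead of arriving per event.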

Task Scheduling and Monitoring

The platform runs many tasks (collection, synchronization, analysis) with complex dependencies, requiring a robust scheduler and monitoring system that orchestrates execution and alerts on failures.
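The core of such a scheduler is small: run tasks in dependency order, skip everything downstream of a failure, and raise an alert. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The task names, the `run_pipeline` function, and the alert callback are hypothetical; a production scheduler would add retries, timeouts, and parallel execution.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, deps, alert):
    """Run tasks in dependency order; skip anything downstream of a failure.

    tasks: name -> zero-argument callable
    deps:  name -> set of task names that must succeed first
    alert: callable that receives a message when a task fails
    """
    status = {}
    for name in TopologicalSorter(deps).static_order():
        if any(status.get(dep) != "ok" for dep in deps.get(name, ())):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "ok"
        except Exception as exc:
            status[name] = "failed"
            alert(f"task {name} failed: {exc}")
    return status


def failing_sync():
    raise RuntimeError("source unavailable")


tasks = {
    "collect_logs": lambda: None,
    "sync_mysql": failing_sync,
    "build_report": lambda: None,
    "export_report": lambda: None,
    "log_stats": lambda: None,
}
deps = {
    "build_report": {"collect_logs", "sync_mysql"},
    "export_report": {"build_report"},
    "log_stats": {"collect_logs"},
}
alerts = []
status = run_pipeline(tasks, deps, alerts.append)
```

In this run the failed synchronization blocks the report and its export, while the independent log statistics still complete. The returned status dict is also the kind of per-run record a platform can keep as task-run metadata.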

Metadata Management

Comprehensive metadata management is costly and of limited value for us; we only retain daily task‑run metadata.

Conclusion

In my view, an architecture should prioritize simplicity, stability, and business focus over adding the latest technologies. Our data platform enables developers to concentrate on business logic, using simple SQL jobs and a reliable scheduling system, while alerts keep operations aware of issues.

Related reading: Big‑data platform task scheduling and monitoring, Spark on YARN series, Taobao DataX tool.
