Designing an Agile Data Warehouse Architecture for Internet Companies

The article outlines a practical, end-to-end data platform architecture for internet businesses. It covers data collection, storage and analysis, sharing, real-time processing, and task scheduling, and argues that simplicity and agility matter most when building a data warehouse.

Architect

Dec 2, 2015

Overall Architecture

The author presents a typical data platform architecture used by many internet companies, consisting of four logical layers: data collection, data storage & analysis, data sharing, and data application.

Data Collection

This layer gathers data from various sources such as website logs, business databases (MySQL, Oracle, SQL Server), FTP/HTTP feeds, and manual inputs, using tools like Flume, DataX, and Sqoop to move data into HDFS.
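As a rough illustration of the collection step, the sketch below assembles a Sqoop import command that copies one MySQL table into an HDFS directory. The article does not show its actual commands, so the host, database, table, and paths are hypothetical placeholders; only commonly documented Sqoop options are used.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Return the argument list for a Sqoop import of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the data
        "--num-mappers", str(mappers),     # number of parallel import tasks
    ]


# Hypothetical values, for illustration only.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.internal:3306/shop",
    username="etl_reader",
    password_file="/user/etl/.sqoop_password",
    table="orders",
    target_dir="/data/raw/shop/orders/dt=2015-12-01",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string makes it easy to hand to a scheduler or to `subprocess.run` without shell-quoting problems.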

Data Storage and Analysis

HDFS is the primary storage solution. Offline analysis relies on Hive, chosen for its SQL support and for the ORC file format, which stores data in compressed columnar form. Spark and SparkSQL are recommended for faster processing and because they integrate with YARN. Real-time computation is handled with Spark Streaming.
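To make the Hive and ORC combination concrete, here is a minimal sketch that generates the HiveQL for a date-partitioned, ORC-backed table. The table name, columns, partition column, and the choice of SNAPPY compression are assumptions for illustration; the article does not specify its table layout.

```python
def orc_table_ddl(table, columns, partition_col="dt", compression="SNAPPY"):
    """Build a HiveQL CREATE TABLE statement for a partitioned ORC table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_col} STRING)\n"
        f"STORED AS ORC\n"
        f"TBLPROPERTIES ('orc.compress'='{compression}')"
    )


# Hypothetical page-view table.
ddl = orc_table_ddl(
    "dw.page_views",
    [("user_id", "BIGINT"), ("url", "STRING"), ("view_time", "TIMESTAMP")],
)
print(ddl)
```

Partitioning by day is a common convention for daily batch jobs, since each run then reads and writes a single partition.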

Data Sharing

After analysis, results are copied from HDFS into relational or NoSQL databases (e.g., MySQL, HBase, Redis) so that downstream services can access them; DataX handles this synchronization.

Data Application

Applications, reports, ad‑hoc queries, OLAP tools, and custom APIs consume data from the sharing layer; SparkSQL is suggested for low‑latency ad‑hoc queries, while Impala is an alternative.

Real‑time Computing

For low‑latency needs such as website traffic and ad‑effect statistics, the author uses Flume to feed logs into Spark Streaming, which aggregates data and writes results to Redis for real‑time access.
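The micro-batch idea behind this pipeline can be sketched in plain Python. This is not the Spark Streaming API: it only simulates grouping log events into fixed intervals, counting page views per interval, and adding them to running totals. The event format, the `pv:` key prefix, and the dictionary standing in for Redis are all assumptions.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_s):
    """Group (timestamp, page) events into fixed-length batches keyed by batch start."""
    batches = defaultdict(list)
    for ts, page in events:
        batch_start = ts - (ts % interval_s)
        batches[batch_start].append(page)
    return dict(sorted(batches.items()))


def aggregate_into_store(events, interval_s, store):
    """Count page views per batch and add them to the running totals in `store`."""
    for _batch_start, pages in micro_batches(events, interval_s).items():
        for page, n in Counter(pages).items():
            key = f"pv:{page}"  # hypothetical key naming scheme
            store[key] = store.get(key, 0) + n
    return store


# Timestamps are seconds; the dict stands in for a Redis instance.
events = [(0, "/home"), (3, "/home"), (7, "/ad/1"), (12, "/home"), (14, "/ad/1")]
store = aggregate_into_store(events, interval_s=10, store={})
print(store)  # {'pv:/home': 3, 'pv:/ad/1': 2}
```

In a real deployment the per-batch increment would typically map to an atomic counter update in the key-value store, so that concurrent writers do not overwrite each other.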

Task Scheduling and Monitoring

The platform includes a comprehensive scheduler and monitoring system to manage data collection, synchronization, and analysis jobs, handling complex dependencies between tasks.
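Dependency handling of this kind is usually modeled as a directed acyclic graph. The sketch below uses Python's standard `graphlib` module to derive a valid run order; the job names are hypothetical and the article does not describe how its own scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical daily jobs: each key maps to the set of jobs it depends on.
dependencies = {
    "collect_logs": set(),
    "sync_orders_db": set(),
    "build_pv_table": {"collect_logs"},
    "build_order_table": {"sync_orders_db"},
    "daily_report": {"build_pv_table", "build_order_table"},
    "export_to_sharing_layer": {"daily_report"},
}


def run_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Any order the sorter returns places every job after all of its prerequisites; jobs with no path between them (such as the two collection jobs here) could also run in parallel.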

Metadata Management

The author notes that metadata management is complex; the platform currently tracks only the metadata of daily tasks, and the author treats anything more as low priority.

Summary

The key takeaway is that a data platform should prioritize simplicity, stability, and business‑centric development; most work reduces to straightforward SQL development and scheduling, allowing teams to focus on business value rather than a proliferation of technologies.
