Designing an Agile Data Warehouse Architecture for Internet Companies
The article outlines a practical, end‑to‑end data platform architecture for internet businesses. It covers data collection, storage and analysis, sharing, real‑time processing, and task scheduling, and argues that an agile data warehouse depends more on simplicity than on the number of technologies involved.
Overall Architecture
The author presents a typical data platform architecture used by many internet companies, consisting of four logical layers: data collection, data storage & analysis, data sharing, and data application.
Data Collection
This layer gathers data from various sources such as website logs, business databases (MySQL, Oracle, SQL Server), FTP/HTTP feeds, and manual inputs, using tools like Flume, DataX, and Sqoop to move data into HDFS.
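As a rough illustration of the database-to-HDFS path, the sketch below assembles a Sqoop import command for a single table. The host, database, table, and HDFS paths are hypothetical placeholders, and the helper `build_sqoop_import` is not from the article; only the standard Sqoop flags shown are assumed.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Return the argument list for a Sqoop import of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the rows
        "--num-mappers", str(mappers),     # number of parallel import tasks
    ]


# Hypothetical source database and HDFS landing path.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.internal:3306/shop",
    username="etl",
    password_file="/user/etl/.mysql.pw",
    table="orders",
    target_dir="/warehouse/ods/orders/dt=2024-01-01",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a string makes it safe to hand to `subprocess.run` from a scheduler without shell quoting problems.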
Data Storage and Analysis
HDFS is the primary storage solution. Offline analysis relies on Hive for its SQL support and ORC compression, while Spark and SparkSQL are recommended for faster processing and integration with YARN. Real‑time computation is handled with Spark Streaming.
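To make the Hive and ORC point concrete, here is a small sketch that generates the DDL for a date-partitioned, ORC-backed Hive table. The table name, columns, partition column, and Snappy codec are assumptions chosen for illustration; the article does not specify a schema.

```python
def orc_table_ddl(table, columns, partition_col="dt", codec="SNAPPY"):
    """Build a HiveQL CREATE TABLE statement for a partitioned ORC table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_col} STRING)\n"
        f"STORED AS ORC\n"
        f"TBLPROPERTIES ('orc.compress'='{codec}')"
    )


# Hypothetical page-view table; one partition per day.
ddl = orc_table_ddl(
    "dw.page_views",
    [("user_id", "BIGINT"), ("url", "STRING"), ("view_time", "TIMESTAMP")],
)
print(ddl)
```

The same statement can be run from Hive or SparkSQL, so the storage format stays independent of whichever engine executes the offline jobs.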
Data Sharing
After analysis, the results still sit on HDFS, where most business applications cannot read them directly. They are therefore written to relational or NoSQL databases (e.g., MySQL, HBase, Redis) so that downstream services can access them, with DataX used to synchronize data from HDFS to these targets.
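The HDFS-to-database sync can be sketched as a DataX job file. The snippet below builds one for a relational target such as MySQL. The plugin names `hdfsreader` and `mysqlwriter` and the parameter keys reflect the DataX plugin documentation as best understood here and should be checked against the DataX version in use; all paths, credentials, and table names are hypothetical.

```python
import json


def hdfs_to_mysql_job(hdfs_path, default_fs, jdbc_url, table, columns, channels=2):
    """Build a DataX job dict that copies ORC files on HDFS into a MySQL table."""
    return {
        "job": {
            "setting": {"speed": {"channel": channels}},  # parallel transfer channels
            "content": [{
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": default_fs,
                        "path": hdfs_path,
                        "fileType": "orc",
                        # Read columns by position; types simplified to string here.
                        "column": [{"index": i, "type": "string"}
                                   for i in range(len(columns))],
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "replace",
                        "username": "etl",       # placeholder credentials
                        "password": "changeme",
                        "column": columns,
                        "connection": [{"jdbcUrl": jdbc_url, "table": [table]}],
                    },
                },
            }],
        }
    }


job = hdfs_to_mysql_job(
    hdfs_path="/warehouse/dm/daily_pv/dt=2024-01-01",
    default_fs="hdfs://namenode.example.internal:8020",
    jdbc_url="jdbc:mysql://report-db.example.internal:3306/report",
    table="daily_pv",
    columns=["dt", "url", "pv"],
)
print(json.dumps(job, indent=2))
```

Generating the job file from code keeps one template per source and target pair, which fits the article's preference for a small, uniform toolset.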
Data Application
Applications, reports, ad‑hoc queries, OLAP tools, and custom APIs consume data from the sharing layer; SparkSQL is suggested for low‑latency ad‑hoc queries, while Impala is an alternative.
Real‑time Computing
For low‑latency needs such as website traffic and ad‑performance statistics, the author uses Flume to feed logs into Spark Streaming, which aggregates the data in small batches and writes the results to Redis for real‑time access.
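The per-batch aggregation idea can be sketched without a cluster. The code below is plain Python, not the Spark Streaming API: each call to `process_batch` stands in for one micro-batch, and a dict stands in for Redis. The tab-separated log layout and the `pv:` and `ad:` key prefixes are assumptions made for the example.

```python
from collections import Counter


def parse_line(line):
    """Parse a hypothetical 'timestamp<TAB>url<TAB>ad_id' log line."""
    _ts, url, ad_id = line.rstrip("\n").split("\t")
    return url, ad_id


def process_batch(lines, store):
    """Aggregate one micro-batch and merge its counts into the key-value store."""
    counts = Counter()
    for line in lines:
        url, ad_id = parse_line(line)
        counts[f"pv:{url}"] += 1
        if ad_id != "-":               # "-" marks a page view with no ad attached
            counts[f"ad:{ad_id}"] += 1
    for key, n in counts.items():
        # With a real Redis client this merge would be an INCRBY on the key.
        store[key] = store.get(key, 0) + n
    return store


store = {}
process_batch(["1700000000\t/home\ta1", "1700000001\t/home\t-"], store)
process_batch(["1700000005\t/cart\ta1"], store)
print(store)
```

Aggregating inside the batch first and then applying one increment per key keeps the number of writes to the store proportional to distinct keys rather than to log lines.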
Task Scheduling and Monitoring
The platform includes a comprehensive scheduler and monitoring system to manage data collection, synchronization, and analysis jobs, handling complex dependencies between tasks.
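Dependency handling of this kind usually reduces to a topological ordering of the task graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to run tasks in waves, where every task in a wave has all of its upstream tasks finished. The task names and the `run_in_waves` helper are hypothetical and do not describe the author's scheduler.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
deps = {
    "collect_logs": set(),
    "sync_orders": set(),
    "build_pv_report": {"collect_logs"},
    "build_sales_report": {"sync_orders"},
    "export_to_mysql": {"build_pv_report", "build_sales_report"},
}


def run_in_waves(dependencies, run):
    """Run tasks in dependency order; return the groups that became ready together."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                      # raises graphlib.CycleError on a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are all done
        for task in ready:
            run(task)                     # a real scheduler would dispatch in parallel
        sorter.done(*ready)               # unlock the downstream tasks
        waves.append(ready)
    return waves


executed = []
waves = run_in_waves(deps, executed.append)
print(waves)
```

A production scheduler would add retries, timeouts, and alerting on top of this ordering, which is where the monitoring half of the system comes in.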
Metadata Management
The author notes that full metadata management is complex. The platform currently tracks only the metadata of daily tasks, and the author treats anything beyond that as low priority.
Summary
The key takeaway is that a data platform should prioritize simplicity, stability, and business‑centric development; most work reduces to straightforward SQL development and scheduling, allowing teams to focus on business value rather than a proliferation of technologies.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
