Designing an Offline Big Data Processing Architecture Based on Object Storage
This article presents a comprehensive offline big‑data processing framework that leverages scalable object storage for PB‑level data, details storage and compute engine requirements, compares cost options, describes data pipeline design, and showcases an e‑commerce case study with Spark‑driven analytics.
The Entropy‑Simple big‑data system has already processed 3.7 PB of data from over 2,000 sources, including macro‑economic, e‑commerce, and textual data, highlighting the need for a robust offline processing architecture.
Key requirements for data storage include TB‑scale monthly growth, high I/O throughput, support for incremental writes, and reliability. On the compute side, the engine must handle PB‑scale workloads, complete daily or monthly updates within hours, tolerate faults, and keep development costs low.
Object storage is introduced as a horizontally scalable solution, with AWS S3 and Alibaba OSS as examples. It offers near‑infinite capacity, eleven‑nines (99.999999999%) durability, high performance, low cost, and broad ecosystem support (e.g., the Hadoop s3a connector). Open‑source MinIO provides an S3‑compatible API for private deployments.
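To make the s3a integration concrete, the sketch below builds the Hadoop configuration entries that point Spark at an S3‑compatible store such as MinIO. The `fs.s3a.*` property names are the standard Hadoop‑AWS ones; the endpoint URL and credentials in the usage comment are placeholders, not values from the article.

```python
def s3a_config(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Hadoop configuration entries for reading/writing an
    S3-compatible object store (e.g. MinIO) through s3a:// URIs."""
    return {
        # Route s3a:// URIs through the S3A filesystem implementation.
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Point the connector at the private deployment instead of AWS.
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed path-style (bucket in the URL path),
        # not via virtual-hosted bucket subdomains.
        "fs.s3a.path.style.access": "true",
    }

# Applied when building the Spark session (placeholder endpoint/keys):
#   builder = SparkSession.builder
#   for k, v in s3a_config("http://minio:9000", "AK", "SK").items():
#       builder = builder.config(f"spark.hadoop.{k}", v)
```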
A cost comparison among OSS, Table Storage, NAS, and standard cloud disks shows object storage to be the most cost‑effective, especially with low‑frequency and archival tiers.
The data pipeline connects raw data sources (relational databases, Kafka, object storage) to the compute engine. While relational databases struggle with massive unstructured data, Kafka excels at streaming but is costly for offline storage; object storage offers a balanced solution with high concurrency and simple operations.
Intermediate storage traditionally relies on HDFS, but object storage can replace it, offering better I/O throughput, a flat namespace free of directory‑hierarchy limits, and near‑zero operational overhead.
The overall architecture uses Apache Spark + object storage: spiders crawl data into object storage, Airflow schedules Spark jobs that read raw data, write intermediate results back to object storage, and finally load processed data into Elasticsearch for front‑end consumption.
Detailed object‑storage design includes storing raw JSON data compressed with Snappy, managing paths via key prefixes (e.g., web/comment/2020/01/01/00/02/00/ ), keeping individual files at around 64 MB, and using time‑based cursors for incremental processing.
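The prefix and cursor scheme above can be sketched in a few lines: one helper builds a time‑partitioned key prefix matching the article's example layout, and another enumerates the day‑level prefixes written since the last processed cursor, so an incremental job lists only new objects. Function names and the choice of granularities are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Time-partitioned prefix layouts, from coarse to fine.
FORMATS = {
    "day": "%Y/%m/%d",
    "hour": "%Y/%m/%d/%H",
    "second": "%Y/%m/%d/%H/%M/%S",
}

def object_prefix(source: str, category: str, ts: datetime,
                  granularity: str = "second") -> str:
    """Build a key prefix like web/comment/2020/01/01/00/02/00/."""
    return f"{source}/{category}/{ts.strftime(FORMATS[granularity])}/"

def unprocessed_day_prefixes(source: str, category: str,
                             cursor: datetime, now: datetime) -> list:
    """Day-level prefixes written since the last cursor; an incremental
    job lists only these instead of scanning the whole bucket."""
    prefixes = []
    day = datetime(cursor.year, cursor.month, cursor.day)
    while day < now:
        prefixes.append(object_prefix(source, category, day, "day"))
        day += timedelta(days=1)
    return prefixes
```

Keeping each file near 64 MB means the writer batches records per prefix until that size is reached before starting a new object.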
Intermediate data is stored in Parquet format for high compression and Spark compatibility, with similar path and size management strategies.
Spark task development incorporates time‑cursor tracking in MySQL, stepwise processing (raw ingestion, price handling, sales aggregation, attribute enrichment), and pre‑processing to convert daily JSON increments to Parquet, reducing downstream workload.
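The time‑cursor bookkeeping described above can be sketched as a small table with load/save helpers. The article stores cursors in MySQL; sqlite3 is used here only to keep the example self‑contained, and the table and column names are assumptions.

```python
import sqlite3

# Stand-in for the MySQL cursor table (schema names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE task_cursor (
    task_name TEXT PRIMARY KEY,
    cursor_ts TEXT NOT NULL)""")

def load_cursor(task: str, default: str) -> str:
    """Return the last processed timestamp for a task, or a default
    starting point if the task has never run."""
    row = conn.execute(
        "SELECT cursor_ts FROM task_cursor WHERE task_name = ?", (task,)
    ).fetchone()
    return row[0] if row else default

def save_cursor(task: str, ts: str) -> None:
    """Advance the cursor only after the Spark step has committed its
    output, so a failed run is simply re-processed from the old cursor."""
    conn.execute(
        "INSERT OR REPLACE INTO task_cursor (task_name, cursor_ts) "
        "VALUES (?, ?)", (task, ts))
    conn.commit()
```

Each stepwise stage (raw ingestion, price handling, sales aggregation, attribute enrichment) keeps its own cursor row, so stages advance independently and re‑runs stay idempotent.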
The e‑commerce case study demonstrates monthly sales, price, and attribute aggregation across billions of SKUs, achieving full‑pipeline processing of 1 TB of data within 16 hours on a 100‑node cluster (4 CPU cores and 32 GB of memory per node).
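As a quick sanity check, the aggregate and per‑node throughput implied by the quoted figures can be computed directly (this counts raw input only, not the intermediate Parquet data the pipeline also reads and writes):

```python
# Figures from the case study: 1 TB in 16 hours on 100 nodes.
data_gb = 1.0 * 1024                       # 1 TB of raw input, in GB
cluster_gb_per_hour = data_gb / 16         # aggregate throughput
per_node_gb_per_hour = cluster_gb_per_hour / 100
print(cluster_gb_per_hour, per_node_gb_per_hour)  # prints: 64.0 0.64
```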
In conclusion, the proposed object‑storage‑centric offline big‑data solution delivers low cost, low maintenance, high scalability, and reliable performance for PB‑level heterogeneous data sources.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.