
Building a Scalable Big Data Platform on AWS: Architecture and Execution Service Design

This article details the architectural design and implementation of a scalable big data platform built on AWS services, highlighting the transition from HDFS to S3 for storage, the use of EMR for elastic compute, and a custom Execution Service integrated with Consul and Airflow for automated cluster management and task scheduling.

Liulishuo Tech Team

As big data products mature and cloud services become more accessible, the operational overhead for data platforms has decreased significantly. This article outlines how Liulishuo constructed its big data platform using Amazon Web Services (AWS), focusing on architecture, compute management, and task scheduling.

Data Platform Architecture

The platform utilizes Amazon S3 as the primary storage layer, replacing Hadoop HDFS for final data persistence to facilitate seamless data access and cluster scalability. HDFS is retained only to support MapReduce/YARN operations. Multiple independent EMR (Elastic MapReduce) clusters share S3 storage while maintaining isolated compute resources.

Amazon EMR enables rapid provisioning of Hadoop, Spark, and Hive clusters. Since data persists in S3, clusters are destroyed immediately after daily T+1 ETL jobs complete, optimizing costs. Instance types are selected based on specific workload requirements, such as high compute for recommendation algorithms or high memory for ETL tasks.
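The provisioning flow can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration, not the platform's actual configuration: the cluster name, release label, instance types, counts, and S3 log bucket below are all hypothetical. Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster transient, so it terminates on its own once the daily steps finish.

```python
# Sketch of a transient EMR cluster spec for boto3's emr.run_job_flow.
# All names, types, and counts are illustrative.
def build_cluster_spec(name, master_type="m5.xlarge",
                       core_type="r5.2xlarge", core_count=4):
    """Build the kwargs dict passed to emr.run_job_flow(**spec)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": a} for a in ("Hadoop", "Hive", "Spark")],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": master_type,
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": core_type,
                 "InstanceCount": core_count},
            ],
            # Transient cluster: shut down once the daily ETL steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "LogUri": "s3://example-bucket/emr-logs/",  # hypothetical bucket
    }

# Actual submission requires AWS credentials:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.run_job_flow(**build_cluster_spec("daily-etl"))
```

Because the spec is built as plain data, the same builder can produce compute-heavy or memory-heavy variants by swapping the instance-type arguments per workload.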

Consul handles service registration and discovery. EMR Master nodes and the Execution Service register with Consul, allowing the scheduling system to request compute resources without hardcoding cluster addresses.
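Registration with Consul goes through its HTTP agent API (`PUT /v1/agent/service/register`). The sketch below builds the JSON payload that API expects; the service name, address, and port are illustrative, and the health check shown is an assumption about how stale cluster entries might be pruned.

```python
# Sketch of a Consul service-registration payload for an EMR Master.
# Service name, address, and port are illustrative.
def consul_registration(service_name, address, port):
    """Build the JSON body for PUT /v1/agent/service/register."""
    return {
        "Name": service_name,
        "ID": f"{service_name}-{address}",  # unique per cluster instance
        "Address": address,
        "Port": port,
        # Basic TCP health check so terminated clusters drop out of discovery.
        "Check": {
            "TCP": f"{address}:{port}",
            "Interval": "30s",
        },
    }

# A scheduler can then discover the cluster without hardcoded addresses, e.g.:
# requests.put("http://127.0.0.1:8500/v1/agent/service/register",
#              json=consul_registration("emr-master", "10.0.1.5", 8088))
```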

The Execution Service (ES) manages EMR cluster lifecycle (creation, scaling, termination) and orchestrates daily task submissions for Bash, HQL, and Spark. All operations and metadata are stored in MySQL.

Airflow schedules all ETL workflows, with dependencies defined in Python. Data is synchronized from MongoDB/Redis to S3, structured in Hive, and transformed from raw JSON into optimized ORC or Parquet formats. Airflow also manages the cluster lifecycle: Start → CreateCluster → Job Submission → TerminateCluster → End.
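The lifecycle above can be sketched as a plain dependency map. In production this would be an Airflow DAG defined in Python; the task names here are hypothetical stand-ins for the real workflow, and the topological sort simply demonstrates the order the scheduler must respect.

```python
# Plain-Python sketch of the daily lifecycle: Start -> CreateCluster ->
# Job Submission -> TerminateCluster -> End. Task names are hypothetical.
DEPS = {
    "start": [],
    "create_cluster": ["start"],
    "sync_raw_data": ["create_cluster"],    # MongoDB/Redis -> S3
    "build_orc_tables": ["sync_raw_data"],  # raw JSON -> ORC/Parquet in Hive
    "terminate_cluster": ["build_orc_tables"],
    "end": ["terminate_cluster"],
}

def execution_order(deps):
    """Kahn's algorithm: return tasks in a dependency-respecting order."""
    remaining = {task: set(d) for task, d in deps.items()}
    order = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle in task dependencies")
        for task in ready:
            order.append(task)
            del remaining[task]
            for d in remaining.values():
                d.discard(task)
    return order
```

The key property, mirrored by Airflow, is that cluster termination can never be scheduled before the ETL tasks that depend on the cluster have completed.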

Presto is deployed for interactive querying, overcoming Hive's latency for ad-hoc analysis. A custom Presto UI and Sublime plugin were developed to enhance usability, provide syntax highlighting, and enforce strict access controls for data analysts and product managers.

Execution Service System Architecture

ES operates in a Master-Slave architecture with multiple Executor nodes. The Master manages task states and metadata, while Executors pull tasks from the database using a preemptive design. Task scripts and dependencies are packaged and uploaded to S3.
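The preemptive pull can be illustrated with a conditional UPDATE: each Executor atomically claims one pending task, so two Executors never run the same job. The demo below uses SQLite so it is self-contained; the platform itself uses MySQL, and the table schema and column names here are assumptions for illustration.

```python
import sqlite3

# Sketch of the preemptive task pull. SQLite stands in for MySQL;
# the schema (id, script, state, executor) is illustrative.
def claim_task(conn, executor_id):
    """Atomically mark one pending task as running for this executor."""
    cur = conn.execute(
        """UPDATE tasks SET state = 'running', executor = ?
           WHERE id = (SELECT id FROM tasks WHERE state = 'pending'
                       ORDER BY id LIMIT 1)
             AND state = 'pending'""",
        (executor_id,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return None  # nothing pending
    return conn.execute(
        "SELECT id, script FROM tasks"
        " WHERE executor = ? AND state = 'running'"
        " ORDER BY id DESC LIMIT 1",
        (executor_id,),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, script TEXT,"
             " state TEXT DEFAULT 'pending', executor TEXT)")
conn.executemany("INSERT INTO tasks (script) VALUES (?)",
                 [("etl_a.hql",), ("etl_b.hql",)])
```

Because the WHERE clause re-checks `state = 'pending'`, a concurrent claim on the same row simply affects zero rows and the losing Executor moves on to the next task.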

During task submission, clients query Consul for an available ES Master, request a Job ID, and upload scripts. Executors fetch tasks, initialize appropriate Runners, and handle execution. For Spark/Hive tasks, Runners download packages from S3, generate submission commands, push them to the EMR Master node via SSH, and stream real-time logs back to the client.
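For a Spark task, the command a Runner pushes over SSH might be assembled like this. The function below is a hedged sketch: the job-naming convention, default resources, and jar path are all illustrative, not the Execution Service's actual implementation.

```python
import shlex

# Sketch of assembling the spark-submit command a Runner sends to the
# EMR Master over SSH. Paths, options, and naming are illustrative.
def build_spark_submit(job_id, main_class, jar_s3_path, args=(),
                       executor_memory="4g", num_executors=10):
    """Return a shell-safe spark-submit command line for YARN cluster mode."""
    parts = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", f"es-job-{job_id}",   # tie the YARN app back to the ES job
        "--class", main_class,
        "--executor-memory", executor_memory,
        "--num-executors", str(num_executors),
        jar_s3_path,
        *args,
    ]
    return " ".join(shlex.quote(p) for p in parts)
```

Embedding the ES Job ID in the application name makes it straightforward to correlate YARN logs with the job record in MySQL when streaming output back to the client.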

Cluster creation involves specifying node types, counts, and IAM roles. The Runner initiates the EMR cluster, waits for it to reach the WAITING state, and executes bootstrap scripts to configure Consul registration, Hive metadata, and YARN parameters before restarting services.

The platform supports dynamic cluster resizing to adjust Core or Task node counts based on workload demands, such as scaling memory-optimized instances for heavy ETL jobs. This process completes within minutes.
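A resize maps to boto3's `emr.modify_instance_groups` call. The builder below is a minimal sketch; the instance-group ID and target count are placeholders.

```python
# Sketch of a resize request for boto3's emr.modify_instance_groups.
# The instance-group ID and count are illustrative.
def build_resize_request(instance_group_id, target_count):
    """Build the kwargs that set a Core/Task group to target_count nodes."""
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id,
             "InstanceCount": target_count},
        ]
    }

# With credentials configured:
# import boto3
# emr = boto3.client("emr")
# emr.modify_instance_groups(ClusterId="j-XXXXXXXX",
#                            **build_resize_request("ig-XXXXXXXX", 20))
```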

For direct interaction, a CLI tool is provided. Users can organize Hive scripts in a directory and submit them with a command such as `bin/submit_task -d hive_job/ -t hive -u nobody -p .`. Example script content includes standard HiveQL statements such as `show databases;`, `use temp;`, and `show tables;`.

Conclusion

The core modules and Execution Service architecture provide a stable foundation for Liulishuo's data platform. As daily data volumes continue to grow, future work will focus on expanding overall compute capacity and improving cluster resource utilization.

Tags: Data Engineering, system design, task scheduling, cloud storage, big data architecture, Presto, Airflow, AWS EMR