How Netflix Scales Its Hadoop Data Warehouse on AWS with Genie PaaS
This article explains how Netflix leverages Amazon S3 and Elastic MapReduce to build a virtually unlimited, dynamically scalable Hadoop data warehouse in the cloud, and introduces Genie—a Hadoop platform‑as‑a‑service that abstracts job submission, resource management, and cluster orchestration.
Cloud: Hadoop Platform as a Service
Hadoop’s ability to manage and process hundreds of terabytes to petabytes of data has become the de‑facto standard. Netflix operates a petabyte‑scale Hadoop data warehouse in the cloud, using elastic resources to achieve virtually unlimited scale for both data processing and compute.
In this article, Sriram and Eva discuss how Netflix’s cloud‑based data warehouse differs from traditional data‑center Hadoop architectures and how they use elastic cloud services to build a dynamically scalable system. They also introduce Genie, Netflix’s own Hadoop PaaS that provides a REST‑ful API for job execution and resource management.
Architecture Overview
Traditional data‑center Hadoop stores data on HDFS, which runs on commodity hardware. In the cloud, Netflix stores all data in Amazon S3 instead of HDFS, a core principle of their architecture. The overall view is illustrated below.
Using S3 as the Cloud Data Warehouse
S3 serves as the true source of Netflix’s cloud data warehouse. All valuable datasets—including logs from TVs, PCs, and mobile devices captured by the Ursula pipeline, as well as dimensional data from Cassandra—are stored in S3.
Why S3 instead of HDFS? It offers 99.999999999% durability, 99.99% availability, versioned storage for easy recovery, elastic virtually unlimited scaling from TB to PB, and enables multiple dynamic clusters to share the same data without costly replication.
Although S3’s read/write latency is higher than HDFS, most workloads are multi‑stage MapReduce jobs: mappers read input directly from S3, reducers write final output back to S3, while intermediate data resides on HDFS or local disks, mitigating performance impact.
Multiple Hadoop Clusters for Different Workloads
Netflix runs Amazon Elastic MapReduce (EMR) with S3 as the data store, allowing elastic configuration of multiple Hadoop clusters for various workloads. A large >500‑node query cluster serves engineers, data scientists, and analysts for ad‑hoc queries, while a similarly sized “product” (SLA) cluster handles SLA‑driven ETL jobs. Additional dev clusters exist as needed.
Clusters are dynamically resized: query clusters shrink at night, product clusters expand to handle nightly ETL. Because all clusters access the same S3 data, there is no need for data replication when scaling up or down.
Tools and Gateways
Developers use Hive for querying, Pig for ETL, and occasionally Java MapReduce for complex algorithms; Python scripts are common for custom ETL and Pig UDFs. Access to Hadoop clusters is mediated through “gateways”—cloud instances that provide CLI access to Hadoop, Hive, and Pig. Heavy users are encouraged to run personal gateway AMIs to avoid contention.
Genie – Hadoop Platform as a Service
Amazon provides Hadoop IaaS via EMR. Netflix built Genie, a Hadoop PaaS that abstracts away cluster provisioning and client installation. Genie offers a REST‑ful API for submitting Hadoop, Hive, or Pig jobs and a Configuration Service that stores metadata about clusters and their capabilities.
Why Genie?
Netflix’s ETL processes are loosely coupled, spanning cloud and data‑center resources, and require a system that can submit jobs to multiple clusters without installing a full Hadoop stack on the client. Existing open‑source solutions (Oozie, Templeton) did not meet Netflix’s needs for multi‑cluster, multi‑framework job submission.
Genie Architecture
Genie consists of two main services:
Execution Service: receives job submissions (JSON/XML) specifying job type (Hadoop/Hive/Pig), command‑line arguments, file dependencies (scripts, JARs on S3), schedule type (ad‑hoc or SLA), and Hive metastore name.
Configuration Service: tracks active clusters, supported schedules, and configuration files (mapred‑site.xml, core‑site.xml, hdfs‑site.xml, hive‑site.xml). Clusters are marked UP, TERMINATED, or OUT‑OF‑SERVICE to manage lifecycle and load balancing.
When a job is submitted, the Execution Service queries the Configuration Service to map the job to an appropriate cluster, optionally applying custom load‑balancing logic.
Dynamic Resource Management with Genie
Netflix teams run services on AWS Reserved Instances within Auto‑Scaling Groups (ASGs). During off‑peak hours, surplus reserved instances are borrowed by other production clusters via the Configuration Service, and returned when needed. This eliminates the need for rolling upgrades; clusters can be taken out of service, upgraded, and replaced without disrupting running jobs.
Genie is deployed in production across 6‑12 node ASGs spanning three Availability Zones, registered with Eureka for service discovery. It can handle thousands of concurrent job submissions, with the ability to scale to tens of thousands by adding more ASG instances.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
