Inside Airbnb’s Massive Big Data Platform: Architecture, Lessons & Scaling Secrets
Airbnb’s engineering team outlines the evolution of its big‑data platform: the philosophy behind its architecture, the paired “gold” and “silver” Hive clusters, the migration away from Hadoop on Mesos, the use of Presto, Airpal, and Airflow, and the performance and cost gains these design choices produced.
Since its founding in August 2008, Airbnb has steadily expanded its customer service and user community, and its big‑data platform has grown explosively as a result.
The article, authored by Airbnb engineer James Mayfield, analyzes the architecture of Airbnb’s big‑data platform, sharing insights and implementation details.
Part 1: Philosophy Behind the Big‑Data Architecture
Airbnb promotes data‑driven decision making, collecting metrics, validating hypotheses through experiments, building machine‑learning models, and uncovering business opportunities to sustain rapid, flexible growth.
After multiple iterations, the big‑data stack has become stable, reliable, and scalable. The team follows these guiding principles:
Engage with the open‑source community and give back when possible.
Prefer standard components and methods over building custom solutions.
Ensure platform scalability to handle explosive data growth.
Listen to feedback from data users to guide architectural roadmaps.
Reserve excess resources to accommodate future data‑warehouse scaling.
Part 2: Architecture Overview
At a high level, data flows through the platform as follows.
Data comes from two main sources: event logs sent through Kafka, and MySQL database dumps taken from AWS RDS and loaded via Sqoop. Both land in the “gold” Hive cluster.
The “gold” cluster stores raw data, which is then copied to the “silver” cluster for downstream analytics. Because everything in gold is copied to it, the “silver” cluster holds a superset of gold’s data and handles low‑latency queries and reporting.
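The gold‑to‑silver relationship can be pictured with a toy model. This is only a sketch: the table names are hypothetical, and the idea that some tables are created directly on silver is an assumption used to show why silver ends up as a superset.

```python
# Toy model of one-way replication between two clusters.
# All table names are hypothetical.

def replicate(gold: set[str], silver: set[str]) -> set[str]:
    """Copy every gold table into silver; nothing flows back to gold."""
    return silver | gold


gold = {"events_raw", "db_listings", "db_users"}
silver = {"adhoc_experiment_results"}  # assumed: created directly on silver

silver = replicate(gold, silver)

# Silver now holds everything in gold plus its own tables.
print(gold <= silver)          # True
print(sorted(silver - gold))   # ['adhoc_experiment_results']
```

Keeping the copy one‑directional means heavy or experimental work on silver cannot alter the raw data held in gold.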
Airbnb uses Presto to query Hive tables, replacing traditional data‑warehouse solutions, and plans to connect Presto directly to Tableau.
Key tools:
Airpal – an open‑source web UI for ad‑hoc Presto SQL queries, used by over one‑third of employees.
Airflow – a scheduler that runs jobs across Hive, Presto, Spark, MySQL, etc.
Spark – preferred for machine‑learning and stream processing.
S3 – stores a portion of data formerly on HDFS, reducing storage costs.
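A scheduler such as Airflow runs jobs in an order that respects their dependencies. The sketch below is not Airflow’s API. It uses Python’s standard `graphlib` module to show the underlying idea, and every job name is hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a job; its value is the set of jobs that must finish first.
# Job names are hypothetical.
dependencies = {
    "load_events_to_gold": set(),
    "sqoop_db_dump_to_gold": set(),
    "replicate_gold_to_silver": {"load_events_to_gold", "sqoop_db_dump_to_gold"},
    "build_daily_metrics": {"replicate_gold_to_silver"},
    "refresh_dashboard": {"build_daily_metrics"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

The two load jobs have no prerequisites, so either may come first; every other job appears only after all of its upstream jobs.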
Part 3: Hadoop Cluster Evolution
Airbnb first moved from Amazon EMR to HDFS running on EC2 instances, and later migrated to the separate “gold” and “silver” clusters. Today it operates two independent HDFS clusters holding 11 PB of data, with several more petabytes on S3.
Major challenges and solutions:
A) Running Hadoop on Mesos
Issues: job runs and log files were not visible, cluster health was hard to monitor, only MR1 was supported, task trackers performed poorly, operational load was high, and Kerberos was not supported.
Solution: move to a standard stack and adopt approaches already proven at larger companies instead of building custom ones.
B) Remote read/write latency
All HDFS data resided on EBS volumes, so every read and write was bound by the network.
Solution: use instances with local storage and run them in a single availability zone.
C) Heterogeneous workload placement
Hive/Hadoop are storage‑intensive, while Presto/Spark are compute‑intensive.
Solution: after moving off Mesos, run each workload on an instance type sized for it (e.g., Spark on c3.8xlarge, and storage‑heavy Hadoop nodes on AWS d2.8xlarge with local disks), a change projected to save about $100 M over three years.
D) HDFS Federation
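The earlier “Pinky” and “Brain” clusters ran as a federated HDFS setup in which queries from either cluster could reach the same data, and this proved unstable.
Solution: drop federation and move the data onto fully separate sets of HDFS nodes, giving machine‑level isolation and easier disaster recovery.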
E) Heavy monitoring burden
Building and maintaining custom monitoring and alerting for Hadoop, Hive, and HDFS was costly.
Solution: partner with Cloudera for expert support and use Cloudera Manager to reduce operational overhead.
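The workload‑placement idea in item C can be sketched as a simple lookup from workload profile to machine pool. This is an illustration only: the pool names are hypothetical, and the storage/compute split follows the classification given in item C.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    profile: str  # "storage" or "compute"


# Hypothetical pool names; real instance types would be chosen per pool.
POOL_BY_PROFILE = {
    "storage": "dense-local-disk pool",
    "compute": "compute-optimized pool",
}


def place(workload: Workload) -> str:
    """Pick the machine pool whose hardware matches the workload profile."""
    try:
        return POOL_BY_PROFILE[workload.profile]
    except KeyError:
        raise ValueError(f"unknown profile: {workload.profile!r}") from None


workloads = [
    Workload("hive", "storage"),
    Workload("hdfs", "storage"),
    Workload("presto", "compute"),
    Workload("spark", "compute"),
]

for w in workloads:
    print(f"{w.name:<7} -> {place(w)}")
```

Separating the pools lets each one be sized and priced for its own bottleneck instead of forcing one instance type to serve both.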
Final Statement
After evaluating the inefficiencies of the legacy system, Airbnb carried out a seamless migration of petabyte‑scale data and thousands of jobs. The team plans to publish further articles and release more open‑source tools for the community.
Performance and cost improvements include:
Disk read/write speed increased from 70–150 MB/s to over 400 MB/s.
Hive jobs run about twice as fast in both CPU and wall‑clock time.
Read throughput tripled.
Write throughput doubled.
Overall cost reduced by 70 %.
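To put the disk figures in perspective, the short calculation below converts them into a speedup range and into the time needed to read one terabyte sequentially. The 1 TB workload and the decimal conversion (1 TB = 1,000,000 MB) are assumptions chosen for illustration.

```python
def scan_hours(data_tb: float, mb_per_sec: float) -> float:
    """Hours to read data_tb terabytes sequentially (1 TB = 1e6 MB)."""
    return data_tb * 1_000_000 / mb_per_sec / 3600


# Throughput in MB/s, taken from the disk read/write figures above.
old_low, old_high, new = 70.0, 150.0, 400.0

speedup_min = new / old_high  # about 2.7x
speedup_max = new / old_low   # about 5.7x

for rate in (old_low, old_high, new):
    print(f"{rate:>5.0f} MB/s -> {scan_hours(1, rate):.2f} h per TB")

print(f"speedup range: {speedup_min:.1f}x to {speedup_max:.1f}x")
```

Under these assumptions, a one‑terabyte scan drops from roughly two to four hours to well under one hour per disk.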
