
Inside Airbnb’s Massive Big Data Platform: Architecture, Lessons & Scaling Secrets

Airbnb’s engineering team outlines the evolution of its big‑data platform: the philosophy behind the architecture, the separate “gold” and “silver” Hive clusters, the move away from running Hadoop on Mesos, the use of Presto, Airpal, and Airflow, and the performance and cost gains that followed.

21CTO

Airbnb was founded in August 2008. As its user community and customer‑service operations have grown, the volume of data handled by its big‑data platform has grown explosively with them.

The article, authored by Airbnb engineer James Mayfield, analyzes the architecture of Airbnb’s big‑data platform, sharing insights and implementation details.

Part 1: Philosophy Behind the Big‑Data Architecture

Airbnb promotes data‑driven decision making, collecting metrics, validating hypotheses through experiments, building machine‑learning models, and uncovering business opportunities to sustain rapid, flexible growth.

After multiple iterations, the big‑data stack has become stable, reliable, and scalable. The team attributes this to the following principles:

Engage with the open‑source community and give back when possible.

Prefer standard components and methods over building custom solutions.

Ensure platform scalability to handle explosive data growth.

Listen to feedback from data users to guide architectural roadmaps.

Reserve excess resources to accommodate future data‑warehouse scaling.

Part 2: Architecture Overview

[Figure: high‑level diagram of the platform]

There are two main data sources: event logs sent through Kafka, and dumps of the production MySQL databases hosted on AWS RDS. Both are loaded into the “gold” Hive cluster; the database dumps are transferred with Sqoop.

The “gold” cluster stores the raw data, which is then copied to the “silver” cluster for downstream analytics. The “silver” cluster is a superset of “gold”: it holds everything in “gold” plus additional data, and it serves low‑latency queries and reporting.
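The one‑way gold‑to‑silver copy and the superset property can be illustrated with a toy model. This is only a sketch under stated assumptions: the table names, partition keys, and the `replicate` and `is_superset` helpers are hypothetical and do not represent Airbnb’s actual replication tooling.

```python
# Toy model: each cluster maps a table name to the set of partitions it holds.
# All table names and partition values below are hypothetical.

def replicate(gold, silver):
    """Copy every gold partition that silver lacks. Gold is never modified."""
    copied = []
    for table, partitions in gold.items():
        target = silver.setdefault(table, set())
        for partition in sorted(partitions - target):
            target.add(partition)
            copied.append((table, partition))
    return copied


def is_superset(silver, gold):
    """True if silver holds every table partition that gold holds."""
    return all(parts <= silver.get(table, set()) for table, parts in gold.items())


gold = {
    "events": {"ds=2016-01-01", "ds=2016-01-02"},
    "bookings": {"ds=2016-01-02"},
}
silver = {
    "events": {"ds=2016-01-01"},
    "adhoc_scratch": {"ds=2016-01-02"},  # exists only on silver
}

copied = replicate(gold, silver)
print(copied)
print(is_superset(silver, gold))
```

Because data only flows from gold to silver, tables created on silver never leak back into gold, which is one way to read the statement that silver is a superset.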

Airbnb uses Presto to query Hive tables, replacing traditional data‑warehouse solutions, and plans to connect Presto directly to Tableau.

Key tools:

Airpal – an open‑source web UI for ad‑hoc Presto SQL queries, used by over one‑third of employees.

Airflow – a scheduler that runs jobs across Hive, Presto, Spark, MySQL, etc.

Spark – preferred for machine‑learning and stream processing.

S3 – stores a portion of data formerly on HDFS, reducing storage costs.
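To make the scheduler’s role concrete, the sketch below orders a small pipeline so that each job runs only after the jobs it depends on. It uses Python’s standard‑library `graphlib` and is not Airflow’s API; the job names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a job, each value is the set of jobs
# that must finish before it can start.
PIPELINE = {
    "sqoop_import_mysql": set(),
    "hive_build_bookings": {"sqoop_import_mysql"},
    "spark_train_model": {"hive_build_bookings"},
    "presto_refresh_report": {"hive_build_bookings"},
}


def run_pipeline(dag, run_job):
    """Call run_job for every job, upstream jobs first; return the order used."""
    order = []
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    for job in TopologicalSorter(dag).static_order():
        run_job(job)
        order.append(job)
    return order


executed = []
order = run_pipeline(PIPELINE, executed.append)
print(order)
```

A real scheduler adds retries, time‑based triggers, and parallel execution on top of this ordering; the sketch only shows the dependency resolution at its core.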

Part 3: Hadoop Cluster Evolution

Airbnb initially ran its Hadoop jobs on Amazon EMR and later moved to running HDFS itself on EC2 instances. Today it operates two independent HDFS clusters (“gold” and “silver”) holding 11 PB of data, with several more petabytes stored on S3.

Major challenges and solutions:

A) Running Hadoop on Mesos

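Issues: running jobs and their log files were not visible, job‑tracker health could not be monitored, only Hadoop MR1 was supported, task‑tracker contention caused performance problems, operational load was high, and Kerberos was not supported.

Solution: move to a standard Hadoop stack already proven at other companies that run large clusters, instead of maintaining a custom setup.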

B) Remote read/write latency

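All HDFS data resided on mounted EBS volumes, so every read and write crossed the network, whereas Hadoop is designed for reads and writes on local disks.

Solution: use instances with local storage and run the cluster in a single availability zone.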

C) Heterogeneous workload placement

Hive/Hadoop are storage‑intensive, while Presto/Spark are compute‑intensive.

Solution: after moving away from Mesos, run each workload on an appropriately sized instance type (for example, storage‑heavy Hadoop nodes on AWS d2.8xlarge instances with local disks), with projected savings of over $100 M in three years.

D) HDFS Federation

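Earlier, two logical clusters named “Pinky” and “Brain” ran on a federated HDFS deployment that shared the same physical storage. Queries from either cluster could reach any data, but federation was not widely supported and proved unstable.

Solution: migrate the data to fully separate sets of HDFS nodes, which gives machine‑level isolation between clusters and makes disaster recovery easier.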

E) Heavy monitoring burden

Building and maintaining custom monitoring and alerting for Hadoop, Hive, and HDFS was costly.

Solution: partner with Cloudera for expert support and use Cloudera Manager to reduce operational overhead.

Final Statement

After evaluating the inefficiencies of the legacy system, Airbnb migrated petabytes of data and thousands of jobs without disruption. The team plans to publish further articles and release more open‑source tools for the community.

Performance and cost improvements include:

Disk read/write speed increased from 70–150 MB/s to over 400 MB/s.

Hive jobs run about twice as fast, in both CPU time and wall‑clock time.

Read throughput tripled.

Write throughput doubled.

Overall cost reduced by 70 %.
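To put the disk‑throughput figures in perspective, the short calculation below estimates how long a sequential scan of 1 TB would take at each quoted rate. The 1 TB size and the `scan_hours` helper are illustrative assumptions, and 400 MB/s is treated as a lower bound because the article states "over 400 MB/s".

```python
MB_PER_TB = 1_000_000  # decimal units: 1 TB = 1,000,000 MB


def scan_hours(size_tb, mb_per_s):
    """Hours needed to read size_tb terabytes sequentially at mb_per_s."""
    return size_tb * MB_PER_TB / mb_per_s / 3600


# Before the migration: 70 to 150 MB/s. After: at least 400 MB/s.
before_slowest = scan_hours(1, 70)
before_fastest = scan_hours(1, 150)
after_at_most = scan_hours(1, 400)

# Speedup range implied by the quoted figures (a lower bound at each end).
speedup_low = 400 / 150
speedup_high = 400 / 70

print(f"before: {before_fastest:.2f} to {before_slowest:.2f} h per TB")
print(f"after:  at most {after_at_most:.2f} h per TB")
print(f"speedup: at least {speedup_low:.1f}x to {speedup_high:.1f}x")
```

Under these assumptions a single‑disk scan drops from roughly two to four hours per terabyte to well under one hour.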
