Riverbed: A Scalable Data Framework for Real‑time and Batch Processing at Airbnb
Airbnb’s Riverbed framework unifies streaming CDC events and batch Spark jobs behind a declarative, GraphQL‑based API to build and maintain distributed materialized views automatically. It relies on Kafka topics partitioned by document ID for ordering and on timestamp‑based versioning to resolve conflicting writes, handling billions of events per day while serving low‑latency reads for features such as payments and search.
As Airbnb’s platform grew, the number of databases and the diversity of data types increased, making data access and processing increasingly complex. To address the challenges of a Service‑Oriented Architecture (SOA) with many data‑intensive services, Airbnb created Riverbed, a data framework designed for high‑performance, high‑availability reads.
Riverbed was motivated by a common query pattern: reads that span multiple data sources, involve complex business logic, and require data transformations that are difficult to optimize. In the style of a Lambda architecture, the framework abstracts the construction and management of distributed materialized views and provides a declarative, GraphQL‑based interface in which engineers define queries that are executed both online (real‑time) and offline (batch).
The design consists of two main components: a streaming system and a batch processing system. The streaming system consumes Change‑Data‑Capture (CDC) events, converts them into notification triggers linked to specific document IDs, deduplicates them, and writes them to Kafka. A second flow consumes these notifications, performs joins and user‑defined transformations, and writes the resulting documents to the designated receivers, ensuring eventual consistency.
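The first streaming flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Riverbed's implementation: the `CdcEvent` and `Trigger` types, the `to_triggers` function, and the table and view names are all hypothetical, and a plain list stands in for the Kafka topic that would receive the triggers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class CdcEvent:
    """A simplified change event: which table changed and the changed row."""
    table: str
    row: dict


@dataclass(frozen=True)
class Trigger:
    """A notification that one document of one view needs to be rebuilt."""
    view: str
    document_id: str


# Hypothetical: for each source table, a function that maps a changed row
# to the ID of the materialized-view document it affects.
DocumentIdResolver = Callable[[dict], str]


def to_triggers(
    events: Iterable[CdcEvent],
    view: str,
    resolvers: Dict[str, DocumentIdResolver],
) -> List[Trigger]:
    """Convert CDC events to triggers, dropping duplicates in first-seen order."""
    seen = set()
    triggers = []
    for event in events:
        resolver = resolvers.get(event.table)
        if resolver is None:
            continue  # this table does not feed the view
        trigger = Trigger(view=view, document_id=resolver(event.row))
        if trigger not in seen:
            seen.add(trigger)
            triggers.append(trigger)
    return triggers


resolvers = {
    "reviews": lambda row: str(row["listing_id"]),
    "listings": lambda row: str(row["id"]),
}
events = [
    CdcEvent("reviews", {"id": 1, "listing_id": 42}),
    CdcEvent("reviews", {"id": 2, "listing_id": 42}),
    CdcEvent("listings", {"id": 42}),
    CdcEvent("users", {"id": 7}),
    CdcEvent("listings", {"id": 43}),
]
triggers = to_triggers(events, "listing_reviews", resolvers)
print(triggers)
```

Three of the five events touch document 42, yet only one trigger is emitted for it. The second flow would then rebuild each triggered document once by joining and transforming the current source data, which is why collapsing triggers loses no information.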
The batch system handles missing events and CDC failures by identifying changes relevant to materialized view documents and using Apache Spark to back‑fill data from offline data warehouse snapshots. Spark SQL queries are generated from the same GraphQL definitions used by the streaming path, allowing reuse of business logic.
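To make the idea of reusing one definition for the batch path concrete, here is a small sketch that renders a back-fill query from a declarative view definition. Everything in it is an assumption for illustration: the `ViewDefinition` and `Join` dataclasses stand in for an already-parsed GraphQL definition, the `ds` snapshot-partition column and the table names are hypothetical, and the function only builds a SQL string rather than calling Spark.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Join:
    table: str
    left_key: str
    right_key: str


@dataclass
class ViewDefinition:
    """Hypothetical stand-in for a parsed declarative view definition."""
    name: str
    root_table: str
    id_column: str
    columns: List[str]
    joins: List[Join] = field(default_factory=list)


def to_backfill_sql(view: ViewDefinition, snapshot_date: str) -> str:
    """Render a SQL query that rebuilds every document from one snapshot.

    Values are interpolated directly, so this sketch assumes trusted input.
    """
    select_list = ", ".join(
        [f"{view.root_table}.{view.id_column} AS document_id"] + view.columns
    )
    lines = [f"SELECT {select_list}", f"FROM {view.root_table}"]
    for join in view.joins:
        # Each joined table is read from the same snapshot partition (assumed `ds`).
        lines.append(
            f"LEFT JOIN {join.table} ON {join.left_key} = {join.right_key} "
            f"AND {join.table}.ds = '{snapshot_date}'"
        )
    lines.append(f"WHERE {view.root_table}.ds = '{snapshot_date}'")
    return "\n".join(lines)


view = ViewDefinition(
    name="listing_reviews",
    root_table="listings",
    id_column="id",
    columns=["listings.title", "reviews.comment"],
    joins=[Join("reviews", "reviews.listing_id", "listings.id")],
)
sql = to_backfill_sql(view, "2024-01-01")
print(sql)
```

Because the joins and selected columns come from the same definition object that a streaming materializer would read, a change to the view's business logic is made in one place and picked up by both paths.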
To avoid race conditions in a distributed environment, Riverbed serializes all document changes through Kafka topics partitioned by document ID, so that changes to the same document are always processed in order. Versioning is enforced on the receiver side using timestamp‑based versions, so that a stale write from either the streaming or the batch path cannot overwrite a newer document.
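Both mechanisms can be illustrated with a short Python sketch under stated assumptions. `partition_for` uses CRC32 as a stand-in for whatever hash the Kafka producer actually applies to the key, and `VersionedReceiver` is a hypothetical in-memory store that accepts a write only if its timestamp version is strictly newer than the stored one.

```python
import zlib
from typing import Dict, Optional, Tuple


def partition_for(document_id: str, num_partitions: int) -> int:
    """Map a document ID to a partition; the same ID always lands on the same one."""
    return zlib.crc32(document_id.encode("utf-8")) % num_partitions


class VersionedReceiver:
    """In-memory receiver that keeps only the newest version of each document."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, dict]] = {}

    def write(self, document_id: str, version_ms: int, body: dict) -> bool:
        """Store the document unless an equal or newer version already exists."""
        current = self._docs.get(document_id)
        if current is not None and current[0] >= version_ms:
            return False  # stale write, ignored
        self._docs[document_id] = (version_ms, body)
        return True

    def read(self, document_id: str) -> Optional[dict]:
        current = self._docs.get(document_id)
        return None if current is None else current[1]


receiver = VersionedReceiver()
# A streaming write arrives first with a recent timestamp version.
streaming_ok = receiver.write("listing-42", 2000, {"rating": 4.8})
# A batch back-fill built from an older snapshot arrives later and is rejected.
batch_ok = receiver.write("listing-42", 1500, {"rating": 4.5})
print(streaming_ok, batch_ok, receiver.read("listing-42"))
```

Keying by document ID gives ordering only within a partition, which is enough here because every change to a given document shares a partition. The version check covers the remaining case, where the streaming and batch paths write the same document independently.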
Since its deployment, Riverbed has grown to process roughly 2.4 billion events per day, write 350 million documents, and power more than 50 materialized views behind features such as payments, message search, and comment rendering on listing pages.
In summary, Riverbed provides a scalable, high‑performance data framework that simplifies the creation and management of distributed materialized views, improves latency for read‑heavy workloads, and enables rapid product iteration at Airbnb.
Airbnb Technology Team