How LinkedIn Scales Data: Inside Its Multi‑Phase Database Architecture
This article explains how LinkedIn manages real‑time profile updates, news feeds, and social graph data through a three‑phase architecture that combines RDBMS, NoSQL stores, caching, and Lucene indexes to balance consistency, availability, and partition tolerance at massive scale.
LinkedIn.com Data Use Cases
When you update your profile, the changes appear almost instantly on the recruitment search page, the connections page, the news feed, and other read‑only pages such as "People You May Know" or "Related Searches".
With a good broadband connection these pages can load in a few milliseconds.
Early LinkedIn Data Architecture
Initially LinkedIn stored user profiles and connections in a few tables of a single relational database system (RDBMS). Later two additional databases were added: one for full‑text search of profiles and another for the social graph. Both were kept up‑to‑date by a change‑capture system called Databus, which captures changes from a trusted source (e.g., Oracle) and propagates them to the auxiliary databases.
That architecture soon became insufficient. Brewer's CAP theorem states that a distributed system cannot guarantee consistency, availability, and partition tolerance all at once, so LinkedIn engineers relaxed strict consistency to "timeline consistency" (near‑line eventual consistency) in exchange for high availability and partition tolerance.
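As a rough illustration of what timeline consistency can mean in practice, the sketch below has a derived store replay changes from the trusted source strictly in commit order. The replica may lag behind, but it never applies changes out of order. This is a minimal sketch under stated assumptions, not LinkedIn's implementation: the class names, the `scn` sequence number, and the `catch_up` method are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    scn: int      # hypothetical commit sequence number assigned by the source
    key: str
    value: str


@dataclass
class SourceOfTruth:
    """Stands in for the primary database plus its ordered change log."""
    log: list = field(default_factory=list)

    def write(self, key, value):
        change = Change(scn=len(self.log) + 1, key=key, value=value)
        self.log.append(change)
        return change

    def changes_since(self, scn):
        # SCNs start at 1, so entries after `scn` begin at list index `scn`.
        return self.log[scn:]


@dataclass
class DerivedStore:
    """A read-side copy that catches up by pulling from its checkpoint."""
    data: dict = field(default_factory=dict)
    checkpoint: int = 0   # SCN of the last change applied

    def catch_up(self, source, max_changes=None):
        pending = source.changes_since(self.checkpoint)
        if max_changes is not None:
            pending = pending[:max_changes]
        for change in pending:   # applied strictly in commit order
            self.data[change.key] = change.value
            self.checkpoint = change.scn
        return len(pending)


source = SourceOfTruth()
source.write("member:1:title", "Engineer")
source.write("member:1:title", "Senior Engineer")

replica = DerivedStore()
replica.catch_up(source, max_changes=1)
stale_view = dict(replica.data)    # {'member:1:title': 'Engineer'}
replica.catch_up(source)
fresh_view = dict(replica.data)    # {'member:1:title': 'Senior Engineer'}
```

Between the two `catch_up` calls a reader sees an older value, which is the staleness the architecture accepts in return for availability.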
LinkedIn’s Current Data Architecture
To support millions of user transactions in under a second, LinkedIn uses a three‑phase architecture consisting of online, offline, and near‑line systems. The data stores include:
RDBMS
Oracle
MySQL (underlying storage for Espresso)
NoSQL and other stores
Espresso (LinkedIn’s document‑oriented NoSQL store)
Voldemort (distributed key‑value store)
HDFS (stores data for Hadoop MapReduce jobs)
Caching
Memcached
Lucene‑based indexes
Search indexes used by SeaS (Search‑as‑a‑Service)
Social‑graph indexes
Member‑profile indexes accessed via read‑replica sets
Online Database Systems
The online system handles real‑time user interactions. The primary database (Oracle) processes writes and a small number of reads. LinkedIn is developing Espresso, a horizontally scalable, document‑oriented NoSQL store that aims to replace much of the Oracle workload, especially for InMail messaging.
Applications using Espresso include member messaging, social actions (updates), article sharing, user and company profiles, and news articles.
Offline Database Systems
The offline layer consists of Hadoop and a Teradata data warehouse for batch processing and analytics. Azkaban, LinkedIn's open‑source workflow scheduler, runs the Hadoop and ETL jobs, which read from the trusted source, run MapReduce, store results in HDFS, and notify consumers such as Voldemort to ingest the data and refresh indexes.
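The batch flow described above (compute a derived data set, store it, then notify a serving store) can be sketched in a few lines. The example is illustrative only: it uses an in-memory map and reduce step instead of Hadoop, the "shared connections" scoring is an assumption rather than LinkedIn's actual algorithm, and `ReadOnlyStore` and `on_batch_ready` are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def map_phase(connections):
    """Emit a candidate pair for every two members who share a connection."""
    for _member, friends in connections.items():
        for a, b in combinations(sorted(friends), 2):
            yield (a, b), 1


def reduce_phase(pairs):
    """Sum the emitted counts per pair (number of shared connections)."""
    totals = Counter()
    for pair, count in pairs:
        totals[pair] += count
    return totals


def build_suggestions(connections):
    totals = reduce_phase(map_phase(connections))
    suggestions = {}
    for (a, b), shared in totals.items():
        if b in connections.get(a, set()):
            continue  # already connected, nothing to suggest
        suggestions.setdefault(a, []).append((b, shared))
        suggestions.setdefault(b, []).append((a, shared))
    for member in suggestions:
        # Most shared connections first, then by name for a stable order.
        suggestions[member].sort(key=lambda item: (-item[1], item[0]))
    return suggestions


class ReadOnlyStore:
    """Hypothetical serving store that swaps in a whole data set at once."""

    def __init__(self):
        self.version = 0
        self._data = {}

    def on_batch_ready(self, dataset):
        # Called when the batch job signals that new output is available.
        self._data = dataset
        self.version += 1

    def get(self, member):
        return self._data.get(member, [])


connections = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
}
store = ReadOnlyStore()
store.on_batch_ready(build_suggestions(connections))
```

Because the serving store only ever receives complete data sets, reads never observe a half-built result; they see either the previous version or the new one.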
Near‑Line Database Systems (Timeline Consistency)
Near‑line systems provide eventual consistency for read‑only features like "People You May Know", "People Also Viewed", and related searches. Voldemort serves these pages using data sourced from Hadoop. Databus updates several indexes:
Member search index for SeaS
Social‑graph index for relationship visualizations
Member profile data accessed via read‑replica sets
Data‑Use Case Walkthrough
When you update your profile with new skills or a new position and accept a connection request, the following occurs:
The change is written to the Oracle master database.
Databus captures the change and propagates it to:
The standardization service.
The search index service.
The graph index service.
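The walkthrough above is a fan‑out: one committed change is delivered to several independent consumers, each of which maintains its own derived view. The sketch below models that shape in memory. It is not the Databus API; `ChangeRelay`, the event fields, and the title mapping are hypothetical and exist only for illustration.

```python
from collections import defaultdict


class ChangeRelay:
    """Hypothetical stand-in for the change-capture pipeline: it delivers
    each committed change to every subscribed consumer."""

    def __init__(self):
        self._consumers = []

    def subscribe(self, consumer):
        self._consumers.append(consumer)

    def publish(self, event):
        for consumer in self._consumers:
            consumer(event)


class StandardizationService:
    # Illustrative mapping only; the real rules are not described in the text.
    TITLES = {"sr. engineer": "Senior Engineer"}

    def __init__(self):
        self.titles = {}

    def __call__(self, event):
        if event["type"] == "profile_updated":
            raw = event["position"]
            self.titles[event["member"]] = self.TITLES.get(raw.lower(), raw)


class SearchIndexService:
    def __init__(self):
        self.by_skill = defaultdict(set)   # skill -> members

    def __call__(self, event):
        if event["type"] == "profile_updated":
            for skill in event["skills"]:
                self.by_skill[skill.lower()].add(event["member"])


class GraphIndexService:
    def __init__(self):
        self.edges = defaultdict(set)      # member -> connections

    def __call__(self, event):
        if event["type"] == "connection_accepted":
            a, b = event["member"], event["other"]
            self.edges[a].add(b)
            self.edges[b].add(a)


relay = ChangeRelay()
standardizer = StandardizationService()
search = SearchIndexService()
graph = GraphIndexService()
for consumer in (standardizer, search, graph):
    relay.subscribe(consumer)

relay.publish({"type": "profile_updated", "member": "alice",
               "position": "Sr. Engineer", "skills": ["Python", "Kafka"]})
relay.publish({"type": "connection_accepted", "member": "alice", "other": "bob"})
```

Each consumer ignores event types it does not care about, so a new derived view can be added by subscribing another consumer without touching the write path.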
Architecture Lessons
To design a LinkedIn‑like data architecture that offers consistency, high scalability, and high availability, consider these guidelines:
Separate read and write databases: a trusted source for writes and derived databases for reads.
Derived databases can be built on Lucene indexes or NoSQL stores such as Voldemort, Redis, Cassandra, or MongoDB.
Ensure read‑side data is refreshed by either dual writes at the application layer or by log‑based extraction from the trusted source.
Use Hadoop MapReduce to create derived data sets, store results in HDFS, and notify downstream stores.
Design storage clusters as distributed systems with sharding and replication to achieve horizontal scalability.
Employ cluster management tools like Apache Helix to maximize uptime of distributed stores.
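To make the sharding and replication guideline concrete, here is a minimal sketch of key routing: a key is hashed to a partition, and the partition is assigned to several nodes. The node names, partition count, and placement rule are assumptions chosen for illustration; production stores typically use more elaborate schemes such as consistent hashing with rebalancing.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster
NUM_PARTITIONS = 8
REPLICATION_FACTOR = 3


def partition_for(key, num_partitions):
    """Map a key to a partition using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replicas_for(partition, nodes, replication_factor):
    """Place a partition on `replication_factor` consecutive nodes."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    start = partition % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def route(key):
    """Return the partition for a key and the nodes that hold its replicas."""
    partition = partition_for(key, NUM_PARTITIONS)
    return partition, replicas_for(partition, NODES, REPLICATION_FACTOR)


partition, owners = route("member:42")
```

The sketch uses `hashlib` rather than Python's built‑in `hash()` because string hashes from `hash()` are randomized per process, which would send the same key to different partitions after a restart.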
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
