Databases 11 min read

How LinkedIn Scales Data: Inside Its Multi‑Phase Database Architecture

This article explains how LinkedIn manages real‑time profile updates, news feeds, and social graph data through a three‑phase architecture that combines RDBMS, NoSQL stores, caching, and Lucene indexes to achieve high consistency, availability, and partition tolerance at massive scale.

21CTO

Jan 28, 2016

How LinkedIn Scales Data: Inside Its Multi‑Phase Database Architecture

LinkedIn.com Data Use Cases

When you update your profile, the changes appear almost instantly on the recruitment search page, the connections page, the news feed, and other read‑only pages such as "People You May Know" or "Related Searches".

With a good broadband connection these pages can load in a few milliseconds.

Early LinkedIn Data Architecture

Initially LinkedIn stored user profiles and connections in a few tables of a single relational database system (RDBMS). Later two additional databases were added: one for full‑text search of profiles and another for the social graph. Both were kept up‑to‑date by a change‑capture system called Databus, which captures changes from a trusted source (e.g., Oracle) and propagates them to the auxiliary databases.

That architecture soon became insufficient. According to Brewer's CAP theorem, simultaneously achieving consistency, availability, and partition tolerance is impossible, so LinkedIn engineers adopted "timeline consistency" (near‑line eventual consistency) together with high availability and partition tolerance.

LinkedIn’s Current Data Architecture

To support millions of user transactions in under a second, LinkedIn uses a three‑phase architecture consisting of online, offline, and near‑line systems. The data stores include:

RDBMS

Oracle

MySQL (underlying storage for Espresso)

NoSQL and other stores

Espresso (LinkedIn’s document‑oriented NoSQL store)

Voldemart (distributed key‑value store)

HDFS (stores data for Hadoop MapReduce jobs)

Caching

Memcached

Lucene‑based indexes

Search indexes used by SeaS (Search‑as‑a‑Service)

Social‑graph indexes

Member‑profile indexes accessed via read‑replica sets

Online Database Systems

The online system handles real‑time user interactions. The primary database (Oracle) processes writes and a small number of reads. LinkedIn is developing Espresso, a horizontally scalable, document‑oriented NoSQL store that aims to replace much of the Oracle workload, especially for InMail messaging.

Applications using Espresso include member messaging, social actions (updates), article sharing, user and company profiles, and news articles.

Offline Database Systems

The offline layer consists of Hadoop and a Teradata data warehouse for batch processing and analytics. Apache Azkaban schedules Hadoop and ETL jobs, which read from the trusted source, run MapReduce, store results in HDFS, and notify consumers such as Voldemart to ingest the data and refresh indexes.

Near‑Line Database Systems (Timeline Consistency)

Near‑line systems provide eventual consistency for read‑only features like "People You May Know", "People Also Viewed", and related searches. Voldemart serves these pages using data sourced from Hadoop. Databus updates several indexes:

Member search index for SeaS

Social‑graph index for relationship visualizations

Member profile data accessed via read‑replica sets

Data‑Use Case Walkthrough

When you update your profile with new skills or a new position and accept a connection request, the following occurs:

The change is written to the Oracle master database.

Databus captures the change and propagates it to:

The standardization service.

The search index service.

The graph index service.

Architecture Lessons

To design a LinkedIn‑like data architecture that offers consistency, high scalability, and high availability, consider these guidelines:

Separate read and write databases: a trusted source for writes and derived databases for reads.

Derived databases can be built on Lucene indexes or NoSQL stores such as Voldemart, Redis, Cassandra, MongoDB.

Ensure read‑side data is refreshed by either dual writes at the application layer or by log‑based extraction from the trusted source.

Use Hadoop MapReduce to create derived data sets, store results in HDFS, and notify downstream stores.

Design storage clusters as distributed systems with sharding and replication to achieve horizontal scalability.

Employ cluster management tools like Apache Helix to maximize uptime of distributed stores.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems NoSQL Data Architecture LinkedIn

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.