
Inside Uber’s Real‑Time Data Infrastructure: How They Scale Streaming at Massive Scale

This article explores Uber’s sophisticated real‑time data infrastructure, detailing how the company leverages open‑source technologies such as Apache Kafka, Flink, Pinot, and Presto, and describing the architectural components, scaling challenges, multi‑region resilience, data back‑filling, and operational practices that enable low‑latency analytics for millions of daily rides and deliveries.

21CTO

Background

Uber, founded in 2009, transformed the taxi market with a mobile app that connects drivers and passengers. In 2023, more than 137 million people used Uber and Uber Eats each month and took 9.44 billion trips. Real‑time data analysis powers dynamic pricing, fraud detection, and machine‑learning predictions.

Context

Uber’s business is highly real‑time. Data streams continuously from drivers, riders, restaurants, and backend services. The platform processes this data to extract actionable insights for many use cases.

Core Real‑Time Requirements

Consistency across regions

99.99% availability

Freshness (typically sub‑second)

Low latency (p99 < 1 s for queries)

Scalability with exponential data growth

Cost‑effective processing

Flexibility via programmable and declarative interfaces

Key Building Blocks

Storage : Object/blob store with read‑after‑write consistency for long‑term data.

Stream : Publish/subscribe layer optimized for low latency and at‑least‑once semantics.

Compute : Applies logic on streams and storage.

OLAP : Limited SQL layer for analytical queries over stream and storage data, with at‑least‑once semantics when ingesting.

SQL : Query layer above compute and OLAP, compiles SQL to compute functions.

API : Programmatic access for high‑level applications.

Metadata : Versioned metadata management across all layers.

Apache Kafka

Uber runs one of the world’s largest Kafka deployments, handling trillions of messages and petabytes of data daily. Enhancements include:

Cluster Federation

Logical clusters hide physical cluster details from producers and consumers.

Dedicated servers route client requests to the appropriate physical cluster.

Horizontal scaling by adding clusters without restarting applications.
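
The federation idea above can be illustrated with a minimal sketch. It assumes a router that maps each topic to a physical cluster and places new topics on the least‑loaded one. The class, method names, and cluster names are hypothetical and are not Uber's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FederationRouter:
    """Hypothetical router: clients name a topic, never a physical cluster."""
    # topic name -> physical cluster name
    topic_to_cluster: dict = field(default_factory=dict)
    # physical cluster name -> number of topics placed on it
    load: dict = field(default_factory=dict)

    def add_cluster(self, name: str) -> None:
        # Adding capacity only changes router metadata; clients are untouched.
        self.load.setdefault(name, 0)

    def resolve(self, topic: str) -> str:
        """Return the physical cluster for a topic.

        A topic seen for the first time is placed on the least-loaded cluster
        (ties broken by name) and stays there afterwards.
        """
        if topic not in self.topic_to_cluster:
            if not self.load:
                raise RuntimeError("no physical clusters registered")
            target = min(self.load, key=lambda c: (self.load[c], c))
            self.topic_to_cluster[topic] = target
            self.load[target] += 1
        return self.topic_to_cluster[topic]


router = FederationRouter()
router.add_cluster("kafka-a")
first = router.resolve("trip-events")    # only kafka-a exists
router.add_cluster("kafka-b")            # scale out; no client restart needed
second = router.resolve("eats-orders")   # lands on the emptier kafka-b
```

Because producers and consumers only ever call `resolve`, adding a cluster changes where new topics land without any change on the client side.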

Dead Letter Queue (DLQ)

Messages that cannot be processed after retries are sent to a DLQ, preventing pipeline blockage.

Consumer Proxy

A thin gRPC layer reads from Kafka, abstracts client libraries, retries failed deliveries, and pushes messages to DLQ after repeated failures, improving throughput and language support.
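
The retry‑then‑DLQ behavior described above can be sketched as follows. This is a simplified in‑memory illustration, not the Consumer Proxy itself: the function name, the `max_attempts` parameter, and the sample handler are assumptions, and a real proxy would read from Kafka and call services over gRPC.

```python
from typing import Callable


def deliver_with_dlq(
    messages: list,
    handler: Callable[[str], None],
    max_attempts: int = 3,
) -> tuple:
    """Try each message up to max_attempts times; park failures in a DLQ.

    Returns (delivered, dead_letter_queue). A message that keeps failing
    never blocks the messages queued behind it.
    """
    delivered = []
    dlq = []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                delivered.append(msg)
                break
            except Exception:
                if attempt == max_attempts:
                    dlq.append(msg)  # give up on this message and move on
    return delivered, dlq


def flaky_handler(msg: str) -> None:
    # Hypothetical consumer that cannot process one kind of payload.
    if msg.startswith("bad"):
        raise ValueError(f"cannot process {msg!r}")


delivered, dlq = deliver_with_dlq(["a", "bad-1", "b"], flaky_handler)
```

Messages in the DLQ can be inspected and replayed later, which is what keeps a single bad payload from stalling the pipeline.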

Cross‑Cluster Replication

Uber replicates Kafka clusters across data centers for global visibility and redundancy, using the open‑source uReplicator built on Apache Helix for cluster management and a Chaperone service for data‑loss detection.
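
One way to picture data‑loss detection is to compare message counts per time window at the source and at the destination. The sketch below assumes this counting approach and a 10‑minute tumbling window; both are illustrative choices, and the function names are hypothetical rather than Chaperone's API.

```python
from collections import Counter


def window_counts(timestamps: list, window_s: int = 600) -> Counter:
    """Count messages per tumbling window, keyed by window start time."""
    return Counter(ts - ts % window_s for ts in timestamps)


def find_loss(source: list, dest: list, window_s: int = 600) -> dict:
    """Return {window_start: missing_count} for every window in which the
    destination saw fewer messages than the source."""
    src = window_counts(source, window_s)
    dst = window_counts(dest, window_s)
    return {w: n - dst[w] for w, n in sorted(src.items()) if dst[w] < n}


# Message timestamps in seconds; one message from window 600 never arrived.
source_ts = [5, 30, 610, 650, 1210]
dest_ts = [5, 30, 610, 1210]
loss = find_loss(source_ts, dest_ts)
```

Comparing counts rather than payloads keeps the audit cheap, at the cost of not identifying which message went missing.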

Apache Flink

Flink is a distributed stream‑processing framework with high throughput and low latency. Uber built a Flink SQL layer that translates SQL into Flink jobs, and the platform around it handles resource estimation, auto‑scaling, and failure recovery.

Unified Deployment & Management

The platform layer defines business logic, the job‑management layer handles lifecycle, and the compute/storage layer abstracts physical resources (e.g., HDFS, S3).

Apache Pinot

Pinot is Uber’s low‑latency OLAP engine. It supports upserts, various index types, and integrates with Kafka, Flink, and Presto. Uber open‑sourced enhancements such as upsert support, primary‑key routing, and a Presto connector that pushes predicates to Pinot.
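
To make upserts and primary‑key routing concrete, here is a minimal in‑memory sketch. It assumes that records are routed by a stable hash of the primary key and that the record with the newest event time wins. The class, the `ts` field, and the hashing choice are assumptions for illustration and do not reflect Pinot's internals.

```python
import zlib
from typing import Optional


def partition_for(primary_key: str, num_partitions: int) -> int:
    """Stable hash so every update to a key reaches the same partition."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions


class UpsertTable:
    """Hypothetical table that keeps one record per primary key."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [{} for _ in range(num_partitions)]

    def upsert(self, key: str, record: dict) -> None:
        part = self.partitions[partition_for(key, len(self.partitions))]
        current = part.get(key)
        # Keep the record with the newest event time; ignore stale arrivals.
        if current is None or record["ts"] >= current["ts"]:
            part[key] = record

    def get(self, key: str) -> Optional[dict]:
        part = self.partitions[partition_for(key, len(self.partitions))]
        return part.get(key)


table = UpsertTable(num_partitions=4)
table.upsert("trip-1", {"ts": 1, "fare": 10.0})
table.upsert("trip-1", {"ts": 3, "fare": 12.5})   # correction replaces the old row
table.upsert("trip-1", {"ts": 2, "fare": 11.0})   # late, older update is ignored
latest = table.get("trip-1")
```

Routing by key means a single partition owns all versions of a record, so deciding which version is current needs no coordination across partitions.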

Distributed File System (HDFS)

Long‑term data is stored in HDFS, with Avro logs compacted to Parquet for downstream processing via Hive, Presto, or Spark. Both Flink checkpoints and Pinot segment archives use HDFS.

Presto

Uber uses Presto as an interactive query engine, building a Pinot connector to enable standard PrestoSQL queries on real‑time data.

Use Cases

Surge Pricing : Kafka → Flink (ML‑based algorithm) → key‑value store for fast lookup; prioritizes freshness over consistency.

UberEats Restaurant Dashboard : Pinot with star‑tree indexes serves low‑latency slice queries; Flink pre‑aggregates data.

Real‑Time Prediction Monitoring : Thousands of models run in Flink, aggregating metrics and detecting anomalies; results stored in Pinot.

UberEats Ops Automation : Real‑time data in Pinot queried via Presto to drive rule‑based alerts, especially during COVID‑19.
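
The pre‑aggregation step mentioned for the restaurant dashboard can be pictured as a tumbling‑window rollup computed before the data reaches the OLAP store. The sketch below is plain Python rather than a Flink job; the tuple layout, the 60‑second window, and the metric names are assumptions.

```python
from collections import defaultdict


def pre_aggregate(orders: list, window_s: int = 60) -> dict:
    """Roll up (timestamp, restaurant, amount) tuples into one row per
    (restaurant, window_start), so a dashboard query scans fewer rows."""
    rollup = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for ts, restaurant, amount in orders:
        key = (restaurant, ts - ts % window_s)
        rollup[key]["orders"] += 1
        rollup[key]["revenue"] += amount
    return dict(rollup)


orders = [
    (5, "r1", 10.0),
    (42, "r1", 5.5),
    (61, "r1", 7.0),
    (70, "r2", 3.0),
]
rows = pre_aggregate(orders)
```

Four raw events become three pre‑aggregated rows here; at production volume the reduction is what keeps slice queries fast.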

Active‑Active Strategy

Multi‑region Kafka clusters provide redundancy; Flink jobs compute regional pricing; a leader election service designates a primary region, with failover to another region when needed.
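
A minimal sketch of the failover logic follows, assuming every region computes its own result and a coordinator picks which region's result is served. The region names, the preference order, and the function names are hypothetical.

```python
def pick_primary(preferred: list, healthy: set) -> str:
    """Return the first healthy region in preference order."""
    for region in preferred:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")


def serve(results_by_region: dict, preferred: list, healthy: set) -> float:
    """Every region computes a result, but only the primary's is served."""
    return results_by_region[pick_primary(preferred, healthy)]


# Each region runs the same pricing job over its replicated input.
surge_multiplier = {"region-a": 1.4, "region-b": 1.5}
order = ["region-a", "region-b"]

normal = serve(surge_multiplier, order, healthy={"region-a", "region-b"})
failover = serve(surge_multiplier, order, healthy={"region-b"})
```

Because the standby region is already computing, failover only changes which result is read; the trade‑off is paying for redundant compute in every region.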

Data Back‑Filling

Uber reprocesses historical stream data with Flink in two modes: SQL‑based (the same query runs on Kafka and on Hive) and API‑based (a Kappa‑style architecture). Backfills support testing, model training, and pipeline fixes.
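
The core idea behind both modes is that the same processing logic runs over a bounded historical source and an unbounded live one. The sketch below shows this with a Python list standing in for archived data and a generator standing in for a Kafka topic; the event fields and function names are assumptions.

```python
from typing import Iterable, Iterator


def completed_trips(events: Iterable) -> Iterator:
    """Emit a running (city, count) pair after every completed trip.

    The function only iterates, so it does not care whether the source is a
    finite archive or a live stream.
    """
    counts = {}
    for event in events:
        if event["status"] == "completed":
            city = event["city"]
            counts[city] = counts.get(city, 0) + 1
            yield city, counts[city]


# Bounded source: stands in for historical data in the warehouse.
archive = [
    {"city": "sf", "status": "completed"},
    {"city": "sf", "status": "cancelled"},
    {"city": "nyc", "status": "completed"},
    {"city": "sf", "status": "completed"},
]


def live_stream() -> Iterator:
    # Unbounded in production; finite here so the example terminates.
    yield {"city": "nyc", "status": "completed"}
    yield {"city": "nyc", "status": "completed"}


backfilled = list(completed_trips(archive))
live = list(completed_trips(live_stream()))
```

Keeping one code path for both sources avoids the drift that appears when a separate batch job reimplements the streaming logic.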

Lessons Learned

Open‑Source Adoption : Significant engineering effort is required to adapt OSS components for diverse use cases and languages.

Rapid Development & Evolution : Standardized interfaces, monorepo management, thin clients, and language‑specific SDKs accelerate iteration.

Operations & Monitoring : Declarative frameworks automate deployments; real‑time dashboards and alerts are built on Kafka, Flink, and Pinot.

User Onboarding & Debugging : Centralized metadata stores, end‑to‑end audit trails, and drag‑and‑drop UI simplify data discovery and pipeline creation.

References

Yupeng Fu & Chinmay Soman, “Uber’s Real‑Time Data Infrastructure” (2021)

Mansoor Iqbal, “Uber Revenue and Usage Statistics” (2024)

Arpit Bhayani, “Understanding Consistency”

Alex Xu, “At‑Least‑Once, Exactly‑Once” (2022)

Hongliang Xu, “uReplicator: Scalable Robust Kafka Replicator” (2018)

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
