Gairos: Uber’s Real‑Time Data Processing, Storage and Query Platform and Its Scalability Optimizations
This article describes Gairos, Uber’s unified platform for real‑time data ingestion, storage, and querying, built on Apache Kafka and Elasticsearch with RT‑Gairos as its query service. It covers the architecture, use cases such as dynamic pricing, the scalability and reliability challenges Uber faced, and the optimizations applied to achieve low latency and high throughput: sharding, query routing, caching, index merging, template tuning, and data pruning.
Uber leverages real‑time data (ride requests, driver availability, weather, etc.) to make decisions like dynamic pricing and ETA prediction. To turn streaming data into actionable insights every minute, Uber built Gairos, a platform for large‑scale, high‑efficiency data exploration.
Why Gairos? Each Uber team previously managed its own data pipelines and query services, which required constant monitoring and prevented focus on system optimization. Gairos provides a unified real‑time data processing, storage, and query layer, allowing teams to concentrate on business logic.
Architecture Overview (see Fig. 1): data is consumed from multiple Apache Kafka topics, written into several Elasticsearch clusters, and persisted in HDFS for long‑term analysis. Clients issue queries to the Gairos Query Service (RT‑Gairos), which routes, caches, and forwards queries to the appropriate Elasticsearch shards.
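The routing step can be pictured with a minimal sketch. It assumes each data source lives in exactly one Elasticsearch cluster and that a query names its data source; the class name, the `data_source` field, and the cluster names are hypothetical and are not taken from Uber's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class QueryRouter:
    """Maps a data source name to the cluster assumed to hold its indices."""

    # Hypothetical routing table: data source name -> cluster name.
    routes: dict = field(default_factory=dict)

    def register(self, data_source: str, cluster: str) -> None:
        """Record which cluster serves a given data source."""
        self.routes[data_source] = cluster

    def route(self, query: dict) -> str:
        """Return the cluster that should receive this query."""
        source = query["data_source"]
        if source not in self.routes:
            raise LookupError(f"no cluster registered for data source {source!r}")
        return self.routes[source]


router = QueryRouter()
router.register("driver_state", "es-cluster-a")   # illustrative names only
router.register("rider_demand", "es-cluster-b")
```

A query service built this way keeps clients unaware of cluster topology: moving a data source to another cluster only changes the routing table.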
Uber Use Cases include dynamic pricing, driver‑side pricing suggestions, and supply‑demand forecasting. The article walks through a typical trip lifecycle, illustrating how events flow from passenger and driver apps through Uber’s dispatch system and into Gairos for real‑time analysis.
Scalability & Reliability Challenges arose as use cases grew: >1 million events per second, 4.5 trillion records, >20 clusters, and 30+ production pipelines. Problems listed include multi‑tenant instability, ingestion latency, query spikes, unused data sources, heavy queries, Elasticsearch master failures, node overloads, shard loss, and high on‑call costs.
Optimization Strategies applied:
Data‑driven sharding and query routing, increasing concurrent query capacity up to four‑fold (Fig. 9‑11).
Caching based on query signatures and patterns, raising cache hit rates above 80% and QPS by up to ten‑fold (Fig. 20‑21).
Index merging to remove deleted documents and reduce index size.
Heavy‑query handling: query splitting, rate‑limiting, rolling tables, and offloading to Hive/Presto.
Index‑template tuning based on field usage (filter, aggregation, fuzzy search).
Shard‑size control (≤60 GB) and write/read QPS limits (≤3000 QPS) to avoid hot spots.
Pruning unused indices and data sources to cut shard count from ~40 k to ~20 k.
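The signature-based caching in the list above can be sketched as follows. The article does not describe how Gairos computes a signature, so this sketch makes its own assumptions: a query is a JSON-like dict, its time bounds are integer epoch seconds in hypothetical `start_time` and `end_time` fields, and bounds are rounded down to a 60-second bucket so that near-identical queries share one cache entry. That rounding trades exactness for hit rate.

```python
import hashlib
import json
import time


def cache_key(query: dict, bucket_seconds: int = 60) -> str:
    """Hash a canonical form of the query with time bounds aligned to a bucket."""
    normalized = dict(query)
    for bound in ("start_time", "end_time"):  # hypothetical field names
        if bound in normalized:
            normalized[bound] -= normalized[bound] % bucket_seconds
    # sort_keys makes the key independent of dict ordering.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class QueryCache:
    """Small TTL cache keyed by query signature, with hit/miss counters."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, result)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query: dict, compute):
        """Return a cached result, or call compute(query) and cache it."""
        key = cache_key(query)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        result = compute(query)
        self._entries[key] = (now + self._ttl, result)
        return result
```

In this sketch `compute` stands in for the call that forwards the query to Elasticsearch; the hit and miss counters are what a hit-rate figure such as the one quoted above would be derived from.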
Benchmark results (Figs. 13‑18, 20‑21, 22‑24) demonstrate that sharding reduces latency and increases supported concurrent users, while caching dramatically lowers latency and raises QPS.
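To make the shard-size and QPS limits from the optimization list concrete, here is a back-of-the-envelope sizing sketch. It assumes the 3000 QPS limit applies per shard and that data and traffic spread evenly across shards; neither assumption is confirmed by the article, and the function name is hypothetical.

```python
import math

MAX_SHARD_GB = 60      # shard-size cap quoted in the article
MAX_SHARD_QPS = 3000   # QPS cap quoted in the article, assumed here to be per shard


def min_shards(index_size_gb: float, peak_qps: float) -> int:
    """Smallest shard count that respects both caps under an even-spread assumption."""
    by_size = math.ceil(index_size_gb / MAX_SHARD_GB)
    by_qps = math.ceil(peak_qps / MAX_SHARD_QPS)
    return max(1, by_size, by_qps)
```

For example, a 500 GB index needs at least 9 shards to stay under the size cap, while a 100 GB index serving 10,000 QPS is bound by the QPS cap and needs at least 4.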
Future Work includes extending optimizations to all data sources, automating the optimization pipeline, and exploring machine‑learning‑based query analysis.
Author: Uber official blog (translated). Source: InfoQ China
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
