Inside Uber’s Schemaless: Designing a Scalable MySQL‑Based Datastore
Uber built Schemaless, a MySQL‑backed, sharded JSON datastore with immutable cells, triggers, and secondary indexes, to overcome PostgreSQL limits and achieve linear scalability, high write throughput, reliable change notifications, and operational resilience for its ride‑hailing platform.
Uber’s Need for a New Database
In early 2014, Uber’s rapid trip growth was exhausting its PostgreSQL storage, prompting a multi‑month effort to design a next‑generation database that could scale linearly by adding servers.
Key Requirements
Linear horizontal scalability: adding servers should increase capacity and reduce response time.
High write throughput and immediate read‑after‑write capability.
Reliable downstream change notification.
Support for secondary indexes compatible with existing PostgreSQL queries.
Operational reliability for critical ride‑hailing workloads.
After evaluating Cassandra, Riak, MongoDB, and others, Uber chose to build its own solution, inspired by designs from FriendFeed and Pinterest.
Schemaless Overview
Schemaless is a MySQL‑backed, sharded, sparse three‑dimensional persistent hash table similar to Google’s Bigtable. The immutable unit is a cell identified by a UUID row_key, a column_name, and a monotonically increasing ref_key. Cells store JSON blobs and can be versioned by writing a new cell with a larger ref_key.
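The cell model described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Uber's implementation: the class name CellStore and the method names put_cell, get_cell, and get_cell_latest are hypothetical, chosen only to show how immutable cells and ref_key versioning fit together.

```python
import json
import uuid
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)  # frozen: a cell cannot be modified after creation
class Cell:
    row_key: str
    column_name: str
    ref_key: int
    body: str  # JSON blob stored as text


class CellStore:
    """Hypothetical in-memory stand-in for the sparse three-dimensional hash table."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str, int], Cell] = {}

    def put_cell(self, row_key: str, column_name: str, ref_key: int, body: dict) -> Cell:
        key = (row_key, column_name, ref_key)
        if key in self._cells:
            # Existing cells are never overwritten; a new version needs a larger ref_key.
            raise ValueError("cell already exists; cells are immutable")
        cell = Cell(row_key, column_name, ref_key, json.dumps(body))
        self._cells[key] = cell
        return cell

    def get_cell(self, row_key: str, column_name: str, ref_key: int) -> Optional[Cell]:
        return self._cells.get((row_key, column_name, ref_key))

    def get_cell_latest(self, row_key: str, column_name: str) -> Optional[Cell]:
        # The latest version is the cell with the largest ref_key.
        versions = [
            cell for (r, c, _), cell in self._cells.items()
            if r == row_key and c == column_name
        ]
        return max(versions, key=lambda cell: cell.ref_key, default=None)


store = CellStore()
trip_id = str(uuid.uuid4())
store.put_cell(trip_id, "STATUS", 1, {"status": "requested"})
store.put_cell(trip_id, "STATUS", 2, {"status": "completed"})
latest = store.get_cell_latest(trip_id, "STATUS")
```

Updating a value never touches the earlier cell; the second put_cell call simply adds a new version, and readers that want the current state ask for the largest ref_key.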
Data Model Example
For Uber trips the model uses columns such as BASE, STATUS, NOTES, and FARE_ADJUSTMENT. Each trip (identified by a UUID) has cells in these columns; multiple versions of a cell are distinguished by ref_key. The diagram below illustrates two trips and their cells.
Triggers
Schemaless provides a publish‑subscribe trigger mechanism. When a cell is written, registered trigger functions (e.g., bill_rider) are invoked, allowing asynchronous processing such as payment handling. Trigger functions must be idempotent so that they can be retried safely.
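A small sketch may help show why idempotency matters when a trigger is delivered more than once. Everything here is illustrative: TriggerBus, Billing, the choice of the BASE column, and the fare field are assumptions, and only the function name bill_rider comes from the text above.

```python
from typing import Callable, Dict, List, Set, Tuple

CellKey = Tuple[str, str, int]  # (row_key, column_name, ref_key)
Handler = Callable[[CellKey, dict], None]


class TriggerBus:
    """Hypothetical dispatcher that calls handlers registered for a column."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def register(self, column_name: str, handler: Handler) -> None:
        self._handlers.setdefault(column_name, []).append(handler)

    def notify(self, key: CellKey, body: dict) -> None:
        # May be called more than once for the same cell (for example on retry).
        for handler in self._handlers.get(key[1], []):
            handler(key, body)


class Billing:
    def __init__(self) -> None:
        self.charged: Set[str] = set()
        self.charges: List[Tuple[str, float]] = []

    def bill_rider(self, key: CellKey, body: dict) -> None:
        row_key = key[0]
        if row_key in self.charged:
            return  # already billed: a repeated delivery has no further effect
        self.charges.append((row_key, body["fare"]))
        self.charged.add(row_key)


bus = TriggerBus()
billing = Billing()
bus.register("BASE", billing.bill_rider)

cell_key = ("trip-1", "BASE", 1)
bus.notify(cell_key, {"fare": 12.5})
bus.notify(cell_key, {"fare": 12.5})  # simulated retry of the same notification
```

Because bill_rider records which trips it has already charged, the retried notification is harmless and the rider is billed once.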
Indexing
Secondary indexes can be defined on fields inside the JSON blob. Index queries are fast because they target a single shard. An example driver‑partner index in YAML is shown below.
table: driver_partner_index
datastore: trips
column_defs:
  - column_key: BASE
    fields:
      - { field: driver_partner_uuid, type: UUID }
      - { field: city_uuid, type: UUID }
      - { field: trip_created_at, type: datetime }
Architecture
The system consists of stateless worker nodes that route client HTTP requests to storage nodes. Data is sharded (default 4096 shards) and each shard is replicated across multiple MySQL instances (one master, two slaves). Reads may hit any replica; writes go to the master.
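The routing step a worker node performs can be sketched as follows. The source does not say how a row_key is mapped to a shard, so the SHA-1 hash used here is purely an assumption, as are the function names and the host naming scheme; only the 4096-shard default and the one-master, two-replica layout come from the text.

```python
import hashlib
import random
from typing import List

NUM_SHARDS = 4096  # default shard count mentioned in the article


def shard_for(row_key: str) -> int:
    """Map a row_key to a shard. Hashing with SHA-1 is an assumption for illustration."""
    digest = hashlib.sha1(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def replicas_for(shard: int) -> List[str]:
    """Hypothetical host names: the master first, then two replicas."""
    return [
        f"shard{shard:04d}-master",
        f"shard{shard:04d}-replica1",
        f"shard{shard:04d}-replica2",
    ]


def pick_node(shard: int, is_write: bool, rng: random.Random) -> str:
    nodes = replicas_for(shard)
    if is_write:
        return nodes[0]       # writes always go to the master
    return rng.choice(nodes)  # reads may be served by any replica


rng = random.Random(0)
shard = shard_for("0f2c1a3e-9b6d-4c7e-8a41-5d2e7f9b1c33")
write_node = pick_node(shard, is_write=True, rng=rng)
read_node = pick_node(shard, is_write=False, rng=rng)
```

Because the worker holds no state of its own, any worker can compute the same shard for a given row_key and forward the request to the right storage node.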
Buffered Writes
To tolerate master failures, writes are first sent to a secondary “buffer” cluster and then to the primary cluster. Only when both succeed is the client notified. This technique reduces the chance of data loss.
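The write path described above can be sketched like this. Cluster, ClusterDown, and buffered_write are hypothetical names, and the sketch deliberately leaves out how buffered cells are later reconciled, since the text does not describe that step.

```python
from typing import List, Tuple

Cell = Tuple[str, str, int, str]  # (row_key, column_name, ref_key, body)


class ClusterDown(Exception):
    pass


class Cluster:
    """Hypothetical stand-in for the master of a storage cluster."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.rows: List[Cell] = []

    def write(self, cell: Cell) -> None:
        if not self.healthy:
            raise ClusterDown(self.name)
        self.rows.append(cell)


def buffered_write(cell: Cell, primary: Cluster, secondary: Cluster) -> bool:
    """Write to the buffer cluster first, then the primary.

    Returns True (client is notified of success) only if both writes succeed.
    """
    try:
        secondary.write(cell)  # buffered copy survives a later primary failure
        primary.write(cell)
    except ClusterDown:
        return False
    return True


primary = Cluster("primary")
secondary = Cluster("secondary")
acknowledged = buffered_write(("trip-1", "BASE", 1, "{}"), primary, secondary)
```

If the primary fails after the buffer write has succeeded, the client is not acknowledged, but a copy of the cell still exists on the secondary cluster, which is what reduces the chance of data loss.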
MySQL Backend
Each shard is a separate MySQL database containing an entity table with columns added_id (auto‑increment primary key), row_key, column_name, ref_key, body (a JSON blob serialized with MessagePack and compressed), and created_at. A composite index on (row_key, column_name, ref_key) enables efficient look‑ups.
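To make the table layout concrete, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The column types, the index name, the uniqueness of the composite index, and the helper functions are all assumptions; the body is stored as plain JSON bytes rather than the compressed form described above.

```python
import sqlite3
from typing import Optional, Tuple

# SQLite dialect used as a runnable stand-in; the real backend is MySQL.
SCHEMA = """
CREATE TABLE entity (
    added_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    row_key     TEXT    NOT NULL,
    column_name TEXT    NOT NULL,
    ref_key     INTEGER NOT NULL,
    body        BLOB    NOT NULL,
    created_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Making the composite index UNIQUE is an assumption; it enforces cell immutability here.
CREATE UNIQUE INDEX entity_cell_idx ON entity (row_key, column_name, ref_key);
"""


def open_shard() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def put_cell(conn: sqlite3.Connection, row_key: str, column_name: str,
             ref_key: int, body: bytes) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO entity (row_key, column_name, ref_key, body) VALUES (?, ?, ?, ?)",
            (row_key, column_name, ref_key, body),
        )


def get_cell_latest(conn: sqlite3.Connection, row_key: str,
                    column_name: str) -> Optional[Tuple[int, bytes]]:
    # The composite index lets this lookup find the highest ref_key directly.
    return conn.execute(
        "SELECT ref_key, body FROM entity "
        "WHERE row_key = ? AND column_name = ? "
        "ORDER BY ref_key DESC LIMIT 1",
        (row_key, column_name),
    ).fetchone()


conn = open_shard()
put_cell(conn, "trip-1", "STATUS", 1, b'{"status": "requested"}')
put_cell(conn, "trip-1", "STATUS", 2, b'{"status": "completed"}')
latest_row = get_cell_latest(conn, "trip-1", "STATUS")
```

The auto-increment added_id keeps inserts append-only in arrival order, while the composite index serves the reads by (row_key, column_name, ref_key).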
Summary
Schemaless now powers many Uber services, offering high availability, linear scalability, and a flexible JSON‑centric data model built on MySQL.
21CTO
