How Uber Built Schemaless: A Scalable Schema‑Free MySQL‑Based Store
Uber’s Schemaless is a highly available, horizontally scalable key-value store built on MySQL. It stores JSON blobs without a fixed schema and supports triggers and global secondary indexes. Uber created it to meet five requirements for trip data storage, including linear scaling, high write throughput, and reliable operation.
Why Uber Needed a New Database
In early 2014 Uber’s rapid growth in ride‑hailing exhausted PostgreSQL storage capacity. Adding new cities and handling millions of trips required a system that could linearly scale, provide high write throughput, notify downstream services without loss, support secondary indexes, and remain operable under extreme load.
Five Core Requirements
Linear horizontal scaling by adding servers.
High‑throughput, low‑latency writes with immediate read‑after‑write capability.
Reliable change notifications for downstream components.
Support for secondary indexes compatible with existing PostgreSQL queries.
Operational reliability for mission‑critical trip data.
Design Decision
After evaluating Cassandra, Riak, MongoDB, and others, Uber chose to build its own solution, drawing inspiration from FriendFeed’s architecture and Pinterest’s operational practices. The result is Schemaless, a key-value store that accepts any JSON document without a predefined schema.
Data Model Overview
Schemaless is a sparsely populated, three-dimensional persistent hash table similar to Google’s Bigtable. The smallest unit is a cell, which is immutable once written. Each cell consists of:
rowkey: a UUID acting like a primary key.
column name: an arbitrary string defined by the application.
ref key: an integer version identifier; the cell with the highest ref key is the latest.
value: a JSON blob.
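The cell model above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class name `CellStore` and the methods `put_cell`, `get_cell`, and `get_cell_latest` are hypothetical names chosen for this example, and the real store persists cells in MySQL rather than in a Python dictionary.

```python
import json
from typing import Dict, Optional, Tuple


class CellStore:
    """Toy in-memory model of an append-only cell store (illustrative only)."""

    def __init__(self) -> None:
        # (rowkey, column name, ref key) -> JSON-encoded value
        self._cells: Dict[Tuple[str, str, int], str] = {}

    def put_cell(self, row_key: str, column: str, ref_key: int, body: dict) -> None:
        key = (row_key, column, ref_key)
        if key in self._cells:
            # Cells are immutable: an existing version is never overwritten.
            raise ValueError(f"cell {key} already exists")
        self._cells[key] = json.dumps(body)

    def get_cell(self, row_key: str, column: str, ref_key: int) -> Optional[dict]:
        raw = self._cells.get((row_key, column, ref_key))
        return None if raw is None else json.loads(raw)

    def get_cell_latest(self, row_key: str, column: str) -> Optional[dict]:
        # The cell with the highest ref key is the latest version.
        refs = [r for (rk, col, r) in self._cells if rk == row_key and col == column]
        if not refs:
            return None
        return self.get_cell(row_key, column, max(refs))
```

An "update" in this model is a new cell with a higher ref key; older versions stay readable through `get_cell`.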
Applications typically group related data into the same column, so data that changes together is bundled and the application-side schema can change without database downtime.
[Figure: overall Schemaless architecture]
Trip Data Model Example
A trip consists of timestamps, driver and passenger IDs, fare details, and optional notes. In Schemaless this is stored as multiple cells across columns such as BASE, STATUS, NOTES, and FARE_ADJUSTMENT.
[Figure: simplified trip data model]
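As a rough illustration of how one trip spreads over several columns, the sketch below represents cells as tuples and assembles the latest version of each column. The `Cell` type, the `latest_per_column` helper, and every field value are hypothetical and exist only for this example; only the column names come from the text.

```python
import uuid
from typing import Dict, List, NamedTuple


class Cell(NamedTuple):
    row_key: str   # trip UUID
    column: str    # e.g. BASE, STATUS, NOTES, FARE_ADJUSTMENT
    ref_key: int   # version number within the column
    body: dict     # JSON-like value


def latest_per_column(cells: List[Cell]) -> Dict[str, dict]:
    """Return the body of the highest-ref-key cell for each column."""
    latest: Dict[str, Cell] = {}
    for cell in cells:
        current = latest.get(cell.column)
        if current is None or cell.ref_key > current.ref_key:
            latest[cell.column] = cell
    return {column: cell.body for column, cell in latest.items()}


trip_uuid = str(uuid.uuid4())
trip_cells = [
    # All values below are made up for illustration.
    Cell(trip_uuid, "BASE", 1, {"driver_partner_uuid": "d-1", "city_uuid": "c-1"}),
    Cell(trip_uuid, "STATUS", 1, {"status": "in_progress"}),
    Cell(trip_uuid, "STATUS", 2, {"status": "completed"}),
    Cell(trip_uuid, "NOTES", 1, {"note": "rider left an umbrella"}),
]
trip_view = latest_per_column(trip_cells)
```

Because the table is sparse, a trip with no fare adjustment simply has no FARE_ADJUSTMENT cell; nothing is stored for the missing column.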
Secondary Indexing
Schemaless allows defining indexes on fields inside JSON blobs. An index is sharded by a designated shard field (preferably a UUID) so that queries touch only one shard. The following YAML defines a driver‑partner index that denormalizes driver_partner_uuid, city_uuid, and trip_created_at from the BASE column:
table: driver_partner_index
datastore: trips
column_defs:
  - column_key: BASE
    fields:
      - { field: driver_partner_uuid, type: UUID }
      - { field: city_uuid, type: UUID }
      - { field: trip_created_at, type: datetime }

Queries against this index retrieve all trips for a given driver_partner_uuid, optionally filtered by city_uuid or trip_created_at. Because the index is sharded on the driver field, the lookup touches a single shard, which keeps latency low.
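The single-shard lookup can be sketched as follows. This is a toy model, not Schemaless's implementation: the shard count, the CRC32-based `shard_for` function, and the `DriverPartnerIndex` class are all assumptions made for illustration.

```python
import zlib
from datetime import datetime
from typing import Dict, List, Optional

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(shard_key: str) -> int:
    """Map a shard-field value to a shard number (illustrative hashing choice)."""
    return zlib.crc32(shard_key.encode("utf-8")) % NUM_SHARDS


class DriverPartnerIndex:
    """Toy index over BASE cells, sharded by driver_partner_uuid."""

    def __init__(self) -> None:
        self._shards: Dict[int, List[dict]] = {}

    def add(self, trip_uuid: str, base: dict) -> None:
        # Denormalize the indexed fields out of the BASE body.
        entry = {
            "trip_uuid": trip_uuid,
            "driver_partner_uuid": base["driver_partner_uuid"],
            "city_uuid": base["city_uuid"],
            "trip_created_at": base["trip_created_at"],
        }
        shard = shard_for(entry["driver_partner_uuid"])
        self._shards.setdefault(shard, []).append(entry)

    def query(
        self,
        driver_partner_uuid: str,
        city_uuid: Optional[str] = None,
        created_after: Optional[datetime] = None,
    ) -> List[str]:
        # Only the shard that owns this driver is scanned.
        entries = self._shards.get(shard_for(driver_partner_uuid), [])
        result = []
        for e in entries:
            if e["driver_partner_uuid"] != driver_partner_uuid:
                continue
            if city_uuid is not None and e["city_uuid"] != city_uuid:
                continue
            if created_after is not None and e["trip_created_at"] <= created_after:
                continue
            result.append(e["trip_uuid"])
        return result
```

Sharding on the field that every query supplies is what keeps the lookup local; a query that omitted driver_partner_uuid would have to fan out to every shard.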
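Indexes are eventually consistent: a cell and its index entries are written in separate transactions and usually live on different shards. Strongly consistent indexes would require a two-phase commit on every write, which adds significant overhead. Schemaless avoids that cost, so index reads can be briefly stale; the lag is typically under 20 ms.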
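The window in which an index read can miss a freshly written cell can be shown with a toy model. The class and method names here are hypothetical, and the explicit `apply_pending_index_updates` step stands in for whatever asynchronous mechanism the real system uses.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class EventuallyIndexedStore:
    """Toy store where the index is updated separately from the cell write."""

    def __init__(self) -> None:
        self.cells: Dict[str, dict] = {}        # trip_uuid -> BASE body
        self.index: Dict[str, List[str]] = {}   # driver_partner_uuid -> trip UUIDs
        self._pending: Deque[Tuple[str, str]] = deque()  # index updates not yet applied

    def write_base(self, trip_uuid: str, body: dict) -> None:
        # The cell write completes on its own.
        self.cells[trip_uuid] = body
        # The index update is a separate, later step.
        self._pending.append((body["driver_partner_uuid"], trip_uuid))

    def apply_pending_index_updates(self) -> None:
        while self._pending:
            driver, trip = self._pending.popleft()
            self.index.setdefault(driver, []).append(trip)

    def trips_for_driver(self, driver_partner_uuid: str) -> List[str]:
        return list(self.index.get(driver_partner_uuid, []))
```

Between `write_base` and `apply_pending_index_updates`, the cell is readable by rowkey while `trips_for_driver` still returns the old result. Callers that need read-after-write behavior should read the cell directly rather than go through the index.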
Operational Considerations
The system relies on MySQL master nodes with write buffers to survive MySQL failures, and a publish‑subscribe trigger mechanism notifies downstream services of data changes. Reliability was the decisive factor in choosing to build Schemaless rather than adopt an off‑the‑shelf solution.
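One way to picture change notification without loss is a log of changes plus a per-subscriber offset that advances only after a handler succeeds. The sketch below is an assumption-laden illustration of that general pattern, not a description of Schemaless's trigger implementation; `ChangeLog`, `deliver`, and the subscriber name are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Change = Tuple[str, str, int]  # (rowkey, column name, ref key)


class ChangeLog:
    """Toy ordered change log with per-subscriber offsets (at-least-once delivery)."""

    def __init__(self) -> None:
        self._log: List[Change] = []
        self._offsets: Dict[str, int] = {}

    def append(self, change: Change) -> None:
        self._log.append(change)

    def deliver(self, subscriber: str, handler: Callable[[Change], None]) -> int:
        """Deliver pending changes in order; return how many were handled."""
        offset = self._offsets.get(subscriber, 0)
        delivered = 0
        while offset < len(self._log):
            try:
                handler(self._log[offset])
            except Exception:
                # The offset is not advanced, so this change is retried later.
                break
            offset += 1
            delivered += 1
        self._offsets[subscriber] = offset
        return delivered
```

A handler that fails on one change sees that same change again on the next `deliver` call, so handlers in this model should be idempotent.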
Conclusion
This article presented Schemaless’s core data model, trigger architecture, and secondary‑index capabilities that enable Uber to store and query massive volumes of trip data reliably. Future posts will explore additional features such as MySQL‑based storage nodes, fault‑tolerant client‑side triggers, and deeper performance analyses.