
How Uber Built Schemaless: A Scalable Schema‑Free MySQL‑Based Store

Uber’s Schemaless is a highly available, horizontally scalable MySQL‑based key‑value store that abandons fixed schemas, supports JSON blobs, triggers, and global secondary indexes, and was created to meet five critical requirements for trip data storage, including linear expansion, write throughput, and reliable operations.


Why Uber Needed a New Database

In early 2014, Uber’s rapid growth in ride‑hailing was exhausting the capacity of its PostgreSQL‑based storage. Adding new cities and handling millions of trips required a system that could scale linearly, sustain high write throughput, notify downstream services without losing changes, support secondary indexes, and remain operable under extreme load.

Five Core Requirements

Linear horizontal scaling by adding servers.

High‑throughput, low‑latency writes with immediate read‑after‑write capability.

Reliable change notifications for downstream components.

Support for secondary indexes compatible with existing PostgreSQL queries.

Operational reliability for mission‑critical trip data.

Design Decision

After evaluating Cassandra, Riak, MongoDB, and others, Uber chose to build its own solution, drawing inspiration from FriendFeed’s architecture and Pinterest’s operational practices. The result is Schemaless, a key‑value store that accepts any JSON document without a predefined schema.

Data Model Overview

Schemaless is a sparsely populated, three‑dimensional persistent hash table similar to Google’s Bigtable. The smallest unit is a cell, which is immutable once written. Each cell consists of:

row key: a UUID acting like a primary key.

column name: an arbitrary string defined by the application.

ref key: an integer version identifier; the cell with the highest ref key is the latest.

value: a JSON blob.
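The cell structure above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Uber's implementation: the `Cell` and `CellStore` classes and the `put_cell` / `get_cell_latest` method names are hypothetical, and a dictionary stands in for the MySQL-backed storage.

```python
import json
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One immutable cell: (row key, column name, ref key) -> JSON value."""
    row_key: uuid.UUID
    column_name: str
    ref_key: int
    value: str  # serialized JSON blob


class CellStore:
    """Hypothetical in-memory stand-in for the persistent hash table."""

    def __init__(self):
        self._cells = {}  # (row_key, column_name, ref_key) -> Cell

    def put_cell(self, row_key, column_name, ref_key, body):
        key = (row_key, column_name, ref_key)
        if key in self._cells:
            # Cells are immutable: an update is a new cell with a higher ref key.
            raise ValueError("cell exists; write a higher ref key instead")
        cell = Cell(row_key, column_name, ref_key, json.dumps(body))
        self._cells[key] = cell
        return cell

    def get_cell_latest(self, row_key, column_name):
        # The latest version is the cell with the highest ref key.
        versions = [c for (r, col, _), c in self._cells.items()
                    if r == row_key and col == column_name]
        return max(versions, key=lambda c: c.ref_key, default=None)


store = CellStore()
trip = uuid.uuid4()
store.put_cell(trip, "STATUS", 1, {"status": "requested"})
store.put_cell(trip, "STATUS", 2, {"status": "completed"})
latest = store.get_cell_latest(trip, "STATUS")
latest_status = json.loads(latest.value)["status"]
```

Because old versions are never overwritten, the full history of a column stays readable; only the choice of "latest" depends on the ref key.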

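Applications group related data into the same column, so all cells in a column share roughly the same application‑side structure. Because the store itself enforces no schema, the application can change that structure without database downtime. The following diagram illustrates the overall architecture: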

Schemaless architecture overview

Trip Data Model Example

A trip consists of timestamps, driver and passenger IDs, fare details, and optional notes. In Schemaless this is stored as multiple cells across columns such as BASE, STATUS, NOTES, and FARE_ADJUSTMENT. The simplified model is shown below:

Trip data model diagram
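As a rough illustration of the trip model, the sketch below stores one trip as cells keyed by column name and ref key, then assembles the current view of the trip. The column names come from the text; every field name and value inside the JSON bodies is made up for the example, as is the `latest_view` helper.

```python
import uuid

trip_uuid = uuid.uuid4()  # row key shared by all cells of this trip

# Hypothetical cells of one trip: (column name, ref key) -> JSON body.
cells = {
    ("BASE", 1): {"driver_partner_uuid": "d-1", "rider_uuid": "r-1",
                  "trip_created_at": "2014-06-01T12:00:00"},
    ("STATUS", 1): {"status": "requested"},
    ("STATUS", 2): {"status": "completed"},
    ("NOTES", 1): {"text": "left umbrella in car"},
    ("FARE_ADJUSTMENT", 1): {"amount": -2.50, "reason": "route detour"},
}


def latest_view(cells):
    """Return {column name: body}, keeping the highest ref key per column."""
    best = {}
    for (column, ref_key), body in cells.items():
        if column not in best or ref_key > best[column][0]:
            best[column] = (ref_key, body)
    return {column: body for column, (_, body) in best.items()}


trip_view = latest_view(cells)
```

Each column can be written by a different part of the application at a different time, which is why a trip is spread over several cells rather than held in one document.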

Secondary Indexing

Schemaless allows defining indexes on fields inside JSON blobs. An index is sharded by a designated shard field (preferably a UUID) so that queries touch only one shard. The following YAML defines a driver‑partner index that denormalizes driver_partner_uuid, city_uuid, and trip_created_at from the BASE column:

table: driver_partner_index
datastore: trips
column_defs:
  - column_key: BASE
    fields:
      - { field: driver_partner_uuid, type: UUID }
      - { field: city_uuid, type: UUID }
      - { field: trip_created_at, type: datetime }

Queries against this index retrieve the trips for a given driver partner, optionally filtered by city_uuid or trip_created_at. Because the index is sharded on the designated shard field, such a lookup touches a single shard, which keeps latency low.
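The index behavior described here might look like the following sketch. It assumes that driver_partner_uuid is the shard field and that a shard is chosen by taking the UUID modulo a fixed shard count; both are assumptions for illustration, as are the function names `shard_for`, `index_base_cell`, and `trips_for_driver`.

```python
import uuid
from collections import defaultdict
from datetime import datetime

NUM_SHARDS = 4  # assumed shard count, for illustration only
INDEX_FIELDS = ("driver_partner_uuid", "city_uuid", "trip_created_at")

index_shards = defaultdict(list)  # shard number -> list of index entries


def shard_for(shard_value: uuid.UUID) -> int:
    """Map the shard field to a shard (modulo routing is an assumption)."""
    return shard_value.int % NUM_SHARDS


def index_base_cell(row_key, base_body):
    """Copy the indexed fields of a BASE cell into the driver's shard."""
    entry = {field: base_body[field] for field in INDEX_FIELDS}
    entry["row_key"] = row_key
    index_shards[shard_for(base_body["driver_partner_uuid"])].append(entry)


def trips_for_driver(driver_uuid, city_uuid=None, created_after=None):
    """Read one shard only, optionally filtering by city or creation time."""
    hits = []
    for entry in index_shards[shard_for(driver_uuid)]:
        if entry["driver_partner_uuid"] != driver_uuid:
            continue
        if city_uuid is not None and entry["city_uuid"] != city_uuid:
            continue
        if created_after is not None and entry["trip_created_at"] < created_after:
            continue
        hits.append(entry["row_key"])
    return hits


driver, sf, la = uuid.uuid4(), uuid.uuid4(), uuid.uuid4()
t1, t2 = uuid.uuid4(), uuid.uuid4()
index_base_cell(t1, {"driver_partner_uuid": driver, "city_uuid": sf,
                     "trip_created_at": datetime(2014, 6, 1), "fare": 12.5})
index_base_cell(t2, {"driver_partner_uuid": driver, "city_uuid": la,
                     "trip_created_at": datetime(2014, 7, 1), "fare": 30.0})
sf_trips = trips_for_driver(driver, city_uuid=sf)
```

All entries for one driver land on the same shard, so the query never fans out; the price is that the indexed fields are duplicated outside the BASE cell.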

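Indexes are eventually consistent: a cell and its index entries are written in separate transactions and usually live on different shards. Keeping them strongly consistent would require a two‑phase commit on every write, with significant overhead. Schemaless avoids that cost and accepts that index reads may briefly be stale; the lag is typically under 20 ms.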

Operational Considerations

The system runs on MySQL master nodes and buffers writes so that it can tolerate the failure of a MySQL master, while a publish‑subscribe trigger mechanism notifies downstream services of data changes. Reliability was the decisive factor in choosing to build Schemaless rather than adopt an off‑the‑shelf solution.
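The publish‑subscribe idea behind triggers can be sketched as a registry of callbacks per column. Everything here is hypothetical (the `trigger` decorator, `write_cell`, and the `bill_rider` subscriber), and the notification is synchronous for brevity; a production system would presumably deliver changes asynchronously and retry on failure.

```python
from collections import defaultdict

# Hypothetical registry: column name -> subscriber callbacks.
_subscribers = defaultdict(list)


def trigger(column):
    """Decorator that subscribes a function to writes on one column."""
    def register(fn):
        _subscribers[column].append(fn)
        return fn
    return register


def write_cell(row_key, column, ref_key, body):
    """Pretend to persist a cell, then notify that column's subscribers."""
    # Persisting the cell is omitted; only the notification path is shown.
    for fn in _subscribers[column]:
        fn(row_key, ref_key, body)


billed = []


@trigger("BASE")
def bill_rider(row_key, ref_key, body):
    # Downstream service reacting to a new BASE cell.
    billed.append((row_key, body["fare"]))


write_cell("trip-1", "BASE", 1, {"fare": 12.5})
write_cell("trip-1", "NOTES", 1, {"text": "n/a"})  # no BASE subscriber fires
```

Subscribing per column keeps downstream services decoupled from the writer: the code that stores a trip does not need to know who consumes it.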

Conclusion

This article presented Schemaless’s core data model, trigger architecture, and secondary‑index capabilities that enable Uber to store and query massive volumes of trip data reliably. Future posts will explore additional features such as MySQL‑based storage nodes, fault‑tolerant client‑side triggers, and deeper performance analyses.
