Databases 32 min read

How Uber Built Schemaless: A Scalable MySQL‑Based No‑Schema Datastore

This article explains how Uber engineered Schemaless, a highly available, horizontally scalable datastore built on MySQL that stores immutable JSON cells without a fixed schema, detailing its design goals, architecture, data model, trigger system, indexing strategy, and fault‑tolerant read/write mechanisms.

dbaplus Community

Apr 28, 2016

How Uber Built Schemaless: A Scalable MySQL‑Based No‑Schema Datastore

Background and Motivation

In early 2014 Uber’s ride‑hailing service outgrew its single PostgreSQL instance, threatening to run out of storage space as the number of trips and cities increased. The engineering team needed a next‑generation datastore that could scale linearly, provide high write throughput, support downstream notifications, offer secondary indexes, and be operable by the on‑call team.

Key Requirements

Linear horizontal scalability by adding servers.

High write capacity with a Redis‑like buffering mechanism that is instantly readable.

Reliable downstream change notifications (similar to Kafka) without data loss.

Support for secondary indexes compatible with existing PostgreSQL queries.

Operational reliability for critical trip data.

Design Overview

Uber evaluated several open‑source stores (Cassandra, Riak, MongoDB) and concluded that none met all requirements, especially operational reliability. The decision was to build a custom solution, inspired by Friendfeed’s storage model and Pinterest’s operational practices.

Schemaless Architecture

Schemaless consists of two node types: work nodes that receive client HTTP requests, route them to storage nodes, and aggregate results; and storage nodes that hold the data. Data is sharded into a fixed number of shards (typically 4096), each mapped to one or more storage nodes with replication across data centers.

Write operations are idempotent, allowing safe retries. Reads can be served from any replica, defaulting to the primary to guarantee the latest data.

Data Model

The fundamental unit is a cell , an immutable JSON blob identified by a row_key (UUID), a column_name (string), and a ref_key (integer version). Cells cannot be overwritten; a new version is written with a larger ref_key. Columns are defined by the application, enabling sparse, schema‑less storage.

Example trip model (simplified):

Columns such as BASE, STATUS, NOTES, and FARE_ADJUSTMENT store different aspects of a trip. Writes to each column are independent, minimizing contention.

Triggers

Schemaless provides a trigger framework that notifies downstream services when a cell is written. Triggers are defined via an @trigger annotation on functions or on specific columns. For example, when a BASE cell is written, a trigger invokes the billing service to charge the rider, then writes the result back to the STATUS column.

Triggers are guaranteed to be called at least once per cell; they must be idempotent to handle possible duplicate invocations.

Indexing

Secondary indexes are built on fields inside the JSON blob. An index is defined on a shard‑key field (e.g., driver_partner_uuid) plus any non‑standard fields needed for queries. Queries are fast because they touch only a single shard.

MySQL Backend

Each shard is a separate MySQL database. A table stores cells with columns added_id (auto‑increment primary key), row_key, column_name, ref_key, and body (MessagePack‑compressed JSON). A composite index on (row_key, column_name) enables fast lookups.

Buffer Writes and Fault Tolerance

Writes are sent to both a primary cluster and a secondary (buffer) cluster. The client reports success only after both writes succeed, reducing the chance of data loss if the primary master fails before replication. The buffer cluster holds writes in a special table until the primary’s asynchronous replication catches up.

Scalability and Reliability of Triggers

Trigger workers are assigned specific shards. A leader coordinates shard assignment and re‑balances work when a worker fails. Failed cells can be marked and moved to a dead‑letter queue to avoid blocking the pipeline. Leader election and offset storage use ZooKeeper or Schemaless itself, ensuring continuity across restarts.

Conclusion

Schemaless combines a schema‑less JSON cell model with MySQL’s proven storage engine, delivering a highly available, horizontally scalable datastore that powers Uber’s core trip processing pipeline. Its trigger and indexing mechanisms decouple producers and consumers, while buffer writes and shard‑level ordering provide strong fault tolerance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Indexing MySQL NoSQL Uber Triggers Schemaless Scalable Datastore

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.