
Rebuilding Uber’s Experimentation Platform: Architecture, Goals, and Lessons Learned

After more than a year of effort, Uber migrated its entire experimentation and feature‑flag ecosystem—including thousands of developers, dozens of partners, multiple mobile apps, and hundreds of services—to a new, unified platform that improves reliability, flexibility, and data quality while retiring over 50,000 legacy experiments.


1. Introduction

Uber’s original experimentation system, Morpheus, was built more than seven years ago and grew far beyond its initial scope, supporting both feature flags and A/B tests for millions of users. Over time, many experiments suffered from data-quality issues, required costly re-runs, and slowed decision-making.

In early 2020, the team identified that a large proportion of experiments were flawed, often needing to be re‑executed due to poor data collection, custom analysis pipelines, and fragile integration with mobile and backend services.

2. Goals for the New System

The new platform aims to let Uber run experiments quickly and at high quality, providing strong guarantees of correctness, high reliability, and improved developer productivity. It must support diverse experiment designs, be resilient to failures, and decouple experiment logic from code deployments.

Key quality goals include: (1) delivering trustworthy results that inform good decisions, and (2) ensuring results are reproducible without extensive manual validation.

3. Architecture

The redesign introduces a parameter-driven model in which client code references parameters rather than experiment names. Parameters have safe default values, and the backend can rewrite them per experiment, allowing instant experiment changes without code releases.
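The parameter-driven model can be sketched as follows. This is a minimal illustration, not Uber’s actual SDK: the names `ParameterClient` and `rider_promo_banner_enabled` are assumptions made up for the example.

```python
# Compiled-in safe defaults: client code always has a working value
# even if the experimentation backend is unreachable.
DEFAULTS = {"rider_promo_banner_enabled": False}

class ParameterClient:
    def __init__(self, overrides=None):
        # 'overrides' stands in for per-experiment parameter values the
        # backend pushes down; client code never names an experiment.
        self.overrides = overrides or {}

    def get(self, name):
        # Use the backend-rewritten value if one exists, otherwise
        # fall back to the safe default.
        return self.overrides.get(name, DEFAULTS[name])

# With no overrides, the safe default applies; when the backend
# rewrites the parameter for an experiment, clients see the new value
# instantly, with no code release.
baseline = ParameterClient()
treated = ParameterClient(overrides={"rider_promo_banner_enabled": True})
```

Because the client only ever asks for a parameter by name, starting, stopping, or re-targeting an experiment is purely a backend change.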

The system sits on top of Uber’s existing configuration service, Flipr, unifying mobile, backend, and feature‑flag configurations. Experiments consist of three core components: randomization (hash‑based bucket assignment), treatment plans (mapping context and bucket to parameter values), and logging (capturing exposure events).

Randomization uses a salted hash of a unit identifier to assign it to a bucket; buckets are grouped into experiment arms, enabling hierarchical splits and complex designs. Treatment plans define actions (parameter values) based on context such as geography or device type. Logging records the first exposure of a unit to a non‑default parameter, supporting downstream analysis while remaining transparent to client code.
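Salted-hash bucketing of this kind is commonly implemented as below; the hash function, bucket count, and 50/50 arm split here are illustrative assumptions, not details from Uber’s system.

```python
import hashlib

def assign_bucket(unit_id: str, salt: str, num_buckets: int = 100) -> int:
    """Deterministically map a unit (e.g. a user ID) to a bucket.

    The same unit and salt always yield the same bucket; changing the
    salt re-randomizes the population for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def arm_for(unit_id: str, salt: str) -> str:
    # Buckets are grouped into arms: here, 0-49 control, 50-99 treatment.
    return "control" if assign_bucket(unit_id, salt) < 50 else "treatment"
```

Grouping buckets into arms, rather than hashing directly to arms, is what enables hierarchical splits: an arm’s bucket range can later be subdivided without re-randomizing units.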

Parameter constraints allow experiments to run only under specific conditions (e.g., US iOS users), and the system prevents circular dependencies by enforcing a DAG.
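Enforcing a DAG over parameter dependencies amounts to rejecting any configuration whose dependency graph contains a cycle. A standard depth-first check, sketched here under the assumption that dependencies are available as an adjacency map, looks like this:

```python
def has_cycle(deps: dict) -> bool:
    """Return True if the parameter dependency graph contains a cycle.

    'deps' maps each parameter name to the parameters it depends on.
    A configuration is accepted only when this returns False.
    """
    nodes = set(deps) | {q for qs in deps.values() for q in qs}
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = dict.fromkeys(nodes, WHITE)

    def visit(p):
        color[p] = GRAY
        for q in deps.get(p, ()):
            # A GRAY neighbor is a back edge, i.e. a cycle.
            if color[q] == GRAY or (color[q] == WHITE and visit(q)):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and visit(p) for p in nodes)
```

Running this check at configuration-save time keeps circular dependencies out of the system entirely, rather than detecting them at evaluation time.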

Data pipelines are generic, emitting a uniform set of fields (experiment key, unit ID, timestamps, context, parameter name, etc.) so that analysts can apply any metric without custom pipeline changes. A Python‑based analysis package replaces the previous Scala implementation, enabling data scientists to work in Jupyter notebooks and integrate with Uber’s metric system (uMetric).
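A uniform exposure record of the kind described might look like the sketch below; the exact field names and types are assumptions, since the article lists the fields only loosely.

```python
from dataclasses import dataclass, asdict
import time

@dataclass
class ExposureEvent:
    """One uniform record per exposure, identical across all experiments,
    so downstream metrics need no per-experiment pipeline changes."""
    experiment_key: str
    unit_id: str
    parameter_name: str
    timestamp: float
    context: dict  # e.g. {"region": "US", "platform": "ios"}

event = ExposureEvent(
    experiment_key="promo_banner_v2",
    unit_id="user-42",
    parameter_name="rider_promo_banner_enabled",
    timestamp=time.time(),
    context={"region": "US", "platform": "ios"},
)
record = asdict(event)  # ready to emit to the shared pipeline
```

Because every experiment emits the same shape, a metric defined once in the analysis layer applies to any experiment without pipeline work.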

Reliability is enhanced through multi‑layer fallbacks: safe defaults, local Flipr defaults, SDK caching, and parameter pre‑fetching. SDKs for all major languages (Go, Java, Android, iOS, JavaScript) automatically log exposures and handle network failures gracefully.
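The fallback layering can be illustrated as a resolution chain; this is a simplified sketch of the idea, and the function signature is invented for the example rather than taken from any Uber SDK.

```python
def resolve_parameter(name, remote_fetch, cache, local_defaults, safe_defaults):
    """Resolve a parameter through layered fallbacks:
    remote value -> SDK cache -> local default -> compiled-in safe default."""
    try:
        value = remote_fetch(name)
        if value is not None:
            cache[name] = value  # refresh the cache on every success
            return value
    except ConnectionError:
        pass  # degrade gracefully when the network is down
    if name in cache:
        return cache[name]
    if name in local_defaults:
        return local_defaults[name]
    return safe_defaults[name]
```

Each layer only engages when the layer above it fails, so a total backend outage still yields a usable (if stale or default) value rather than an error in client code.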

4. Challenges and Lessons Learned

Building the platform required close collaboration between engineering and data-science teams, because statistical correctness and system performance are tightly coupled. Early, deep integration with partner systems accelerated adoption. Continuous communication with users (listening sessions, demos, and feedback loops) was essential to a successful rollout.

Adoption was staged, with high‑risk experiments migrated first, followed by systematic deprecation of legacy experiments using custom tooling.

5. Conclusion

The new experimentation platform now supports more than 2,000 developers, 15+ partner systems, 10+ mobile apps, and 350+ services, and has retired more than 50,000 outdated experiments. Future work will focus on expanding the feature set, improving usability and performance, and adding automated monitoring to further strengthen Uber’s experimentation capabilities.

Tags: software architecture, A/B testing, reliability, experiment platform, Uber, parameterization
Written by

DevOps

