
Rebuilding Uber’s Experimentation Platform: Architecture, Goals, and Lessons Learned

After more than a year of effort, Uber migrated its entire experimentation and feature‑flag ecosystem—including thousands of developers, dozens of partners, multiple mobile apps, and hundreds of services—to a new, unified platform that improves reliability, flexibility, and data quality while retiring over 50,000 legacy experiments.


1. Introduction

Uber’s original experimentation system, Morpheus, was built more than seven years ago and grew far beyond its initial scope, supporting both feature flags and A/B tests for millions of users. Over time, many experiments suffered from data-quality issues, required costly re-runs, and slowed decision-making.

In early 2020, the team identified that a large proportion of experiments were flawed, often needing to be re‑executed due to poor data collection, custom analysis pipelines, and fragile integration with mobile and backend services.

2. Goals for the New System

The new platform aims to let Uber run experiments quickly and at high quality, providing strong guarantees of correctness, high reliability, and improved developer productivity. It must support diverse experiment designs, be resilient to failures, and decouple experiment logic from code deployments.

Key quality goals include: (1) delivering trustworthy results that inform good decisions, and (2) ensuring results are reproducible without extensive manual validation.

3. Architecture

The redesign introduces a parameter-driven model in which client code references parameters rather than experiment names. Parameters have safe default values, and the backend can rewrite them per experiment, allowing instant experiment changes without code releases.
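The parameter-driven model can be sketched as follows. This is a minimal illustration, not Uber’s actual SDK: the names `ParameterClient` and `rider_promo_banner_enabled` are assumptions made up for the example.

```python
# Compiled-in safe defaults: client code always has a working value
# even if the experimentation backend is unreachable.
DEFAULTS = {"rider_promo_banner_enabled": False}

class ParameterClient:
    def __init__(self, overrides=None):
        # 'overrides' stands in for per-experiment parameter values the
        # backend pushes down; client code never names an experiment.
        self.overrides = overrides or {}

    def get(self, name):
        # Use the backend-rewritten value if one exists, otherwise
        # fall back to the safe default.
        return self.overrides.get(name, DEFAULTS[name])

# With no overrides, the safe default applies; when the backend
# rewrites the parameter for an experiment, clients see the new value
# instantly, with no code release.
baseline = ParameterClient()
treated = ParameterClient(overrides={"rider_promo_banner_enabled": True})
```

Because the client only ever asks for a parameter by name, starting, stopping, or re-targeting an experiment is purely a backend change.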

The system sits on top of Uber’s existing configuration service, Flipr, unifying mobile, backend, and feature‑flag configurations. Experiments consist of three core components: randomization (hash‑based bucket assignment), treatment plans (mapping context and bucket to parameter values), and logging (capturing exposure events).

Randomization uses a salted hash of a unit identifier to assign it to a bucket; buckets are grouped into experiment arms, enabling hierarchical splits and complex designs. Treatment plans define actions (parameter values) based on context such as geography or device type. Logging records the first exposure of a unit to a non‑default parameter, supporting downstream analysis while remaining transparent to client code.
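Salted-hash bucketing of this kind is commonly implemented as below; the hash function, bucket count, and 50/50 arm split here are illustrative assumptions, not details from Uber’s system.

```python
import hashlib

def assign_bucket(unit_id: str, salt: str, num_buckets: int = 100) -> int:
    """Deterministically map a unit (e.g. a user ID) to a bucket.

    The same unit and salt always yield the same bucket; changing the
    salt re-randomizes the population for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def arm_for(unit_id: str, salt: str) -> str:
    # Buckets are grouped into arms: here, 0-49 control, 50-99 treatment.
    return "control" if assign_bucket(unit_id, salt) < 50 else "treatment"
```

Grouping buckets into arms, rather than hashing directly to arms, is what enables hierarchical splits: an arm’s bucket range can later be subdivided without re-randomizing units.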

Parameter constraints allow experiments to run only under specific conditions (e.g., US iOS users), and the system prevents circular dependencies by enforcing a DAG.
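Enforcing a DAG over parameter dependencies amounts to rejecting any configuration whose dependency graph contains a cycle. A standard depth-first check, sketched here under the assumption that dependencies are available as an adjacency map, looks like this:

```python
def has_cycle(deps: dict) -> bool:
    """Return True if the parameter dependency graph contains a cycle.

    'deps' maps each parameter name to the parameters it depends on.
    A configuration is accepted only when this returns False.
    """
    nodes = set(deps) | {q for qs in deps.values() for q in qs}
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = dict.fromkeys(nodes, WHITE)

    def visit(p):
        color[p] = GRAY
        for q in deps.get(p, ()):
            # A GRAY neighbor is a back edge, i.e. a cycle.
            if color[q] == GRAY or (color[q] == WHITE and visit(q)):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and visit(p) for p in nodes)
```

Running this check at configuration-save time keeps circular dependencies out of the system entirely, rather than detecting them at evaluation time.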

Data pipelines are generic, emitting a uniform set of fields (experiment key, unit ID, timestamps, context, parameter name, etc.) so that analysts can apply any metric without custom pipeline changes. A Python‑based analysis package replaces the previous Scala implementation, enabling data scientists to work in Jupyter notebooks and integrate with Uber’s metric system (uMetric).
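A uniform exposure record of the kind described might look like the sketch below; the exact field names and types are assumptions, since the article lists the fields only loosely.

```python
from dataclasses import dataclass, asdict
import time

@dataclass
class ExposureEvent:
    """One uniform record per exposure, identical across all experiments,
    so downstream metrics need no per-experiment pipeline changes."""
    experiment_key: str
    unit_id: str
    parameter_name: str
    timestamp: float
    context: dict  # e.g. {"region": "US", "platform": "ios"}

event = ExposureEvent(
    experiment_key="promo_banner_v2",
    unit_id="user-42",
    parameter_name="rider_promo_banner_enabled",
    timestamp=time.time(),
    context={"region": "US", "platform": "ios"},
)
record = asdict(event)  # ready to emit to the shared pipeline
```

Because every experiment emits the same shape, a metric defined once in the analysis layer applies to any experiment without pipeline work.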

Reliability is enhanced through multi‑layer fallbacks: safe defaults, local Flipr defaults, SDK caching, and parameter pre‑fetching. SDKs for all major languages (Go, Java, Android, iOS, JavaScript) automatically log exposures and handle network failures gracefully.
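The fallback layering can be illustrated as a resolution chain; this is a simplified sketch of the idea, and the function signature is invented for the example rather than taken from any Uber SDK.

```python
def resolve_parameter(name, remote_fetch, cache, local_defaults, safe_defaults):
    """Resolve a parameter through layered fallbacks:
    remote value -> SDK cache -> local default -> compiled-in safe default."""
    try:
        value = remote_fetch(name)
        if value is not None:
            cache[name] = value  # refresh the cache on every success
            return value
    except ConnectionError:
        pass  # degrade gracefully when the network is down
    if name in cache:
        return cache[name]
    if name in local_defaults:
        return local_defaults[name]
    return safe_defaults[name]
```

Each layer only engages when the layer above it fails, so a total backend outage still yields a usable (if stale or default) value rather than an error in client code.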

4. Challenges and Lessons Learned

Building the platform required close collaboration between engineering and data-science teams, because statistical correctness and system performance are tightly coupled. Early, deep integration with partner systems accelerated adoption. Continuous communication with users (listening sessions, demos, and feedback loops) was essential to a successful rollout.

Adoption was staged, with high‑risk experiments migrated first, followed by systematic deprecation of legacy experiments using custom tooling.

5. Conclusion

The new experimentation platform now supports more than 2,000 developers, 15+ partner systems, 10+ mobile apps, and 350+ services, and has retired more than 50,000 outdated experiments. Future work will focus on expanding the feature set, improving usability and performance, and adding automated monitoring to further strengthen Uber’s experimentation capabilities.

Tags: software architecture, A/B testing, reliability, experiment platform, Uber, parameterization
Written by

DevOps

