
Mastering AB Testing: From Basics to Scalable Multi‑Layer Architecture

This article explains the fundamentals of AB testing, outlines the iterative workflow, shares best-practice guidelines, compares single-layer and multi-layer experiment frameworks, and details the technical implementation, including SDK design, hashing algorithms, data denoising, and statistical evaluation methods.

NetEase Media Technology Team

What Is AB Testing?

AB testing, also known as a controlled experiment or bucket test, involves designing multiple variants for the same goal and exposing different user groups to each variant (A, B, or N variants). User behavior is recorded and analyzed to determine which variant performs best.

Basic Workflow

1. Define the testing objective.
2. Design multiple optimization proposals.
3. Determine experiment variants and traffic allocation ratios.
4. Run the online experiment.
5. Collect user data and analyze results.
6. Release the winning version or iterate with new variants.

Key Best Practices

Choose ratio or per-user metrics such as click-through rate, entry rate, average session duration, or average refresh count. Avoid absolute counts such as UV (unique visitors) or PV (page views): they scale with bucket size, so random variation in the traffic split distorts them.

Assign users to buckets randomly but deterministically, so that each user always lands in the same bucket and the groups remain demographically similar.

Maintain sufficient sample size per bucket to reduce the impact of outliers; if traffic is limited, extend the experiment duration.

Apply rigorous statistical analysis, using analysis of variance (ANOVA) or t-tests to evaluate significance; a p-value below 0.05 is the conventional threshold for a statistically significant difference.
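The sample-size guideline above can be made concrete with a standard power calculation for comparing two rates. The following is a minimal sketch, not part of the original platform: the class name `SampleSizeSketch` is hypothetical, the 0.05 significance level matches the threshold used in this article, and the 80% power target is an assumption.

```java
final class SampleSizeSketch {
    // Standard normal quantiles, hard-coded for this sketch.
    static final double Z_ALPHA = 1.959964; // two-sided significance level 0.05
    static final double Z_BETA = 0.841621;  // statistical power 0.80 (assumed target)

    /** Approximate users needed per bucket to detect a change from rate p1 to rate p2. */
    static long usersPerBucket(double p1, double p2) {
        if (p1 == p2) {
            throw new IllegalArgumentException("the two rates must differ");
        }
        double varianceSum = p1 * (1 - p1) + p2 * (1 - p2);
        double delta = p1 - p2;
        double n = Math.pow(Z_ALPHA + Z_BETA, 2) * varianceSum / (delta * delta);
        return (long) Math.ceil(n);
    }
}
```

Calling `SampleSizeSketch.usersPerBucket(0.10, 0.12)` estimates the bucket size needed to detect a click-through rate moving from 10% to 12%. Halving the expected lift roughly quadruples the required sample, which is why low-traffic experiments need to run longer.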

Experiment Architecture

Single‑Layer Framework

A simple hash-based bucket system maps a user's unique identifier to one of 100 (or 1,000) buckets, and each bucket belongs to at most one experiment, so each user participates in only one experiment at a time. This approach is easy to analyze but scales poorly: every experiment draws from the same fixed pool of buckets, so concurrent experiments compete for traffic.
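A single-layer router might look like the sketch below. It is illustrative only: the class name `SingleLayerRouter`, the experiment names, and the bucket ranges are hypothetical, and MD5 stands in for whatever hash the platform actually uses at this level.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SingleLayerRouter {
    static final int BUCKETS = 100;

    /** Maps a user ID to a stable bucket in [0, BUCKETS). MD5 is only a stand-in hash. */
    static int bucketOf(String userId) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
            // Use the first four digest bytes as a 32-bit integer.
            int h = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                    | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
            return Math.floorMod(h, BUCKETS);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }

    /** Hypothetical allocation: each experiment owns a disjoint range of buckets. */
    static String experimentFor(String userId) {
        int bucket = bucketOf(userId);
        if (bucket < 10) return "exp_button_color"; // buckets 0-9
        if (bucket < 30) return "exp_feed_ranking"; // buckets 10-29
        return "none";                              // buckets 30-99 are unallocated
    }
}
```

Because the ranges are disjoint, adding a third experiment means taking buckets from the unallocated pool, and once the pool is exhausted new experiments must wait. That is the traffic contention described above.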

Multi‑Layer Overlapping Framework

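Popularized by Google, this architecture divides traffic into independent domains (vertical slices) and layers (horizontal slices). Each layer applies its own hash function, so a user's bucket in one layer is independent of that user's bucket in any other layer. Experiments in different layers are therefore orthogonal and can run concurrently on the same traffic without interfering with one another.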
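One common way to give each layer its own hash function is to mix a layer identifier into the hash input. The sketch below assumes that approach; the class name `LayeredBucketing` and the layer IDs are hypothetical, and MD5 is again a stand-in hash chosen because it ships with the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class LayeredBucketing {
    /** Bucket of a user within one layer; the layer ID acts as a salt for the hash. */
    static int bucketOf(String layerId, String userId, int bucketCount) {
        try {
            String salted = layerId + ":" + userId;
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(salted.getBytes(StandardCharsets.UTF_8));
            // Use the first four digest bytes as a 32-bit integer.
            int h = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                    | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
            return Math.floorMod(h, bucketCount);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }
}
```

With this scheme, `bucketOf("layer-ui", id, 100)` and `bucketOf("layer-ranking", id, 100)` are each stable for a given user but unrelated to each other, so the users in any one UI bucket are spread roughly evenly across all ranking buckets.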

Variables and White‑List

Variables are the core elements of an AB test; an AA test uses identical variable values across groups to validate the experiment setup. A white‑list allows specific users (e.g., testers) to bypass traffic allocation and receive a predetermined variant, though their data is excluded from analysis.
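The white-list rule can be sketched as a lookup that runs before normal allocation. This is an assumed design, not the platform's actual code: `WhitelistAssigner`, `Assignment`, and the `whitelisted` flag are hypothetical names, and the hash-based allocator is passed in as a function so the sketch stays independent of any particular hashing scheme.

```java
import java.util.Map;
import java.util.function.Function;

final class WhitelistAssigner {
    /** Assignment result; white-listed traffic is flagged so analysis can exclude it. */
    static final class Assignment {
        final String variant;
        final boolean whitelisted;

        Assignment(String variant, boolean whitelisted) {
            this.variant = variant;
            this.whitelisted = whitelisted;
        }
    }

    private final Map<String, String> whitelist;         // userId -> forced variant
    private final Function<String, String> hashAssigner; // regular hash-based allocation

    WhitelistAssigner(Map<String, String> whitelist, Function<String, String> hashAssigner) {
        this.whitelist = whitelist;
        this.hashAssigner = hashAssigner;
    }

    Assignment assign(String userId) {
        String forced = whitelist.get(userId);
        if (forced != null) {
            return new Assignment(forced, true); // bypasses traffic allocation
        }
        return new Assignment(hashAssigner.apply(userId), false);
    }
}
```

Carrying the `whitelisted` flag alongside the variant lets downstream logging drop tester traffic before metrics are computed.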

Technical Implementation

System Architecture – Backend services are built in Java, using the DSF framework for service registration. Clients use Java and C++ SDKs to fetch experiment configurations via HTTP long‑polling (e.g., every 30 seconds) and cache them in memory.

Data Storage – MySQL combined with Redis caching for fast reads; experiment data is also written to HDFS for offline analysis.

Embedded SDK – Provides APIs to retrieve bucket assignments and experiment parameters directly from memory, minimizing impact on the host application’s performance.

Fault Tolerance – Two‑layer protection on both server and SDK sides prevents platform failures from affecting the production system.
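The SDK behavior described above (periodic fetch, in-memory reads, and isolation from platform failures) could be sketched as follows. All names here are hypothetical, the `fetcher` supplier stands in for the real HTTP call, and plain fixed-delay polling is used as a simplification of the long-polling mentioned above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class ExperimentConfigCache {
    private final Supplier<Map<String, String>> fetcher; // hypothetical: wraps the HTTP call
    private volatile Map<String, String> snapshot = Collections.emptyMap();

    ExperimentConfigCache(Supplier<Map<String, String>> fetcher) {
        this.fetcher = fetcher;
    }

    /** Pulls a fresh config; on any failure the last good snapshot stays in place. */
    void refreshOnce() {
        try {
            Map<String, String> fresh = fetcher.get();
            if (fresh != null) {
                snapshot = Collections.unmodifiableMap(new HashMap<>(fresh));
            }
        } catch (RuntimeException e) {
            // Swallowed on purpose: a platform outage must not break the host application.
        }
    }

    /** Starts background polling on a daemon thread and returns the scheduler. */
    ScheduledExecutorService startPolling(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "ab-config-poller");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    /** Memory-only read on the request path; falls back to the caller's default. */
    String getParam(String key, String defaultValue) {
        String value = snapshot.get(key);
        return value != null ? value : defaultValue;
    }
}
```

Reads never touch the network, so the host application pays only for a map lookup, and a caller-supplied default keeps it on the baseline behavior when no configuration has ever been fetched.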

Hashing with MurmurHash3

/** Returns the MurmurHash3_x86_32 hash. */
public static int murmurHash3_x86_32(byte[] data, int offset, int len, int seed)

The algorithm hashes a user’s unique ID (data) with a layer‑specific seed, ensuring random yet reproducible bucket assignments and orthogonal layers.
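The source shows only the method signature, so the body below is a sketch that follows the publicly documented MurmurHash3 x86 32-bit algorithm. The `LayerHasher` class, the `bucketFor` helper, and the way a layer seed is chosen are assumptions added for illustration.

```java
import java.nio.charset.StandardCharsets;

final class LayerHasher {
    /** Returns the MurmurHash3_x86_32 hash of data[offset, offset + len). */
    static int murmurHash3_x86_32(byte[] data, int offset, int len, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = offset + (len & 0xfffffffc); // end of the last full 4-byte block

        // Body: mix each 4-byte little-endian block into the state.
        for (int i = offset; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16) | (data[i + 3] << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: mix the remaining 1 to 3 bytes, if any.
        int k1 = 0;
        int tail = len & 3;
        if (tail == 3) k1 ^= (data[roundedEnd + 2] & 0xff) << 16;
        if (tail >= 2) k1 ^= (data[roundedEnd + 1] & 0xff) << 8;
        if (tail >= 1) {
            k1 ^= data[roundedEnd] & 0xff;
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
        }

        // Finalization: fold in the length, then avalanche the bits.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    /** Hypothetical helper: bucket of a user in the layer identified by layerSeed. */
    static int bucketFor(String userId, int layerSeed, int bucketCount) {
        byte[] bytes = userId.getBytes(StandardCharsets.UTF_8);
        int hash = murmurHash3_x86_32(bytes, 0, bytes.length, layerSeed);
        return Integer.remainderUnsigned(hash, bucketCount); // treat the hash as unsigned
    }
}
```

The same user ID and seed always produce the same bucket, while changing the seed reshuffles users, which is what lets each layer split traffic independently.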

Data Denoising

Before analysis, outliers are removed using box-plot statistics: calculate Q1, Q2 (median), and Q3, then derive the inter-quartile range (IQR = Q3 - Q1). The upper and lower bounds are set to Q3 + 1.5·IQR and Q1 - 1.5·IQR, respectively, and samples outside these bounds are filtered out as abnormal.
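The filter can be sketched in a few lines. Quartiles have several competing definitions; this sketch assumes linear interpolation between the closest ranks, which may differ slightly from the convention the platform uses. The class name `IqrFilter` is hypothetical.

```java
import java.util.Arrays;

final class IqrFilter {
    /** Quantile by linear interpolation between closest ranks (one of several conventions). */
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns the samples that lie within [Q1 - 1.5*IQR, Q3 + 1.5*IQR], in original order. */
    static double[] removeOutliers(double[] samples) {
        if (samples.length == 0) {
            return new double[0];
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - 1.5 * iqr;
        double upper = q3 + 1.5 * iqr;
        return Arrays.stream(samples).filter(x -> x >= lower && x <= upper).toArray();
    }
}
```

For the samples 1, 2, 3, 4, 5, 100, this convention gives Q1 = 2.25 and Q3 = 4.75, so the bounds are -1.5 and 8.5 and only the value 100 is dropped.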

Statistical Evaluation

Two-sample t-tests, one-way ANOVA, and two-way ANOVA are employed to assess significance. A p-value below 0.05 is taken to indicate a statistically significant difference between groups.
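As an illustration of the two-sample case, the sketch below computes Welch's t statistic and its approximate degrees of freedom. Welch's unequal-variance form is an assumption here, since the article does not say which variant the platform uses, and the class name `WelchTTest` is hypothetical. Turning the statistic into a p-value requires the Student's t distribution, which the JDK does not provide, so the sketch stops at the statistic; a statistics library such as Apache Commons Math could supply the p-value.

```java
final class WelchTTest {
    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double sampleVariance(double[] x) {
        double m = mean(x);
        double sumSquares = 0;
        for (double v : x) sumSquares += (v - m) * (v - m);
        return sumSquares / (x.length - 1);
    }

    /** Welch's t statistic for the difference in means between groups a and b. */
    static double tStatistic(double[] a, double[] b) {
        double standardError = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    /** Welch-Satterthwaite approximation of the degrees of freedom. */
    static double degreesOfFreedom(double[] a, double[] b) {
        double va = sampleVariance(a) / a.length;
        double vb = sampleVariance(b) / b.length;
        return Math.pow(va + vb, 2)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
    }
}
```

With the large samples typical of online experiments, the t distribution is close to the normal distribution, so an absolute t value above roughly 1.96 corresponds to a two-sided p-value below 0.05.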

Conclusion

The described AB testing platform combines a robust multi‑layer architecture, efficient hashing, systematic data cleaning, and rigorous statistical methods to enable scalable, reliable experimentation across the organization.

