
A/B Testing and Causal Inference: Evolution of Sampling, Metric Evaluation, and Statistical Inference

This article reviews the development of online A/B testing: sampling and traffic-splitting techniques, improvements in metric computation, advances in statistical inference, and open challenges such as interference, real-time inference, and large-scale metric computation, with pointers to recent research papers.

DataFunTalk

Development of Sampling and Traffic Splitting

Traditional A/B tests pre-assign users to groups before the experiment starts, much like medical trials. In internet products, real-time sampling (e.g., routing 1% of users to each arm) lets users enter the experiment gradually. This requires efficient engineering, such as cryptographic hash-based sampling and double hashing, to avoid memory effects: a user's assignment must stay stable across visits without storing per-user state.
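The stateless bucketing idea can be sketched as follows. This is a minimal illustration, not any particular platform's implementation; the salt format, the 1% default, and the two-arm layout are all assumptions.

```python
import hashlib

def assign_arm(user_id: str, experiment_salt: str, traffic_fraction: float = 0.01):
    """Deterministically map a user to a point in [0, 1) via a salted
    cryptographic hash. Repeated visits always land in the same arm,
    so no per-user assignment table is needed (no memory effect)."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    # First 8 hex chars = 32 uniform bits, scaled into [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < traffic_fraction:
        return "treatment"
    if bucket < 2 * traffic_fraction:
        return "control"
    return None  # user is not enrolled in this experiment
```

Because the hash depends only on the user ID and the experiment salt, two calls for the same user always agree, and changing the salt re-randomizes assignments for a new experiment.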

More advanced, perfectly orthogonal traffic splits can be built on finite-field theory, as described in the paper Orthogonal Traffic Assignment in Online Overlapping A/B Tests.
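The finite-field construction in that paper yields exactly orthogonal splits; a simpler and widely used approximation is to give each experiment layer its own hash salt, which makes the splits approximately independent. A toy check of that approximation (the salts and user IDs are illustrative):

```python
import hashlib
from collections import Counter

def bucket(user_id: str, layer_salt: str, n_buckets: int = 2) -> int:
    """Hash a user into one of n_buckets within a given layer."""
    h = hashlib.sha256(f"{layer_salt}:{user_id}".encode()).hexdigest()
    return int(h, 16) % n_buckets

# Different salts per layer give approximately independent splits, so
# overlapping experiments in separate layers do not bias each other.
counts = Counter(
    (bucket(f"user{i}", "layer-A"), bucket(f"user{i}", "layer-B"))
    for i in range(10_000)
)
# Each of the four (layer-A arm, layer-B arm) cells should hold
# roughly a quarter of the 10,000 users.
```

Salted hashing only approximates independence; the cited paper shows how to make the cross-layer assignment exactly orthogonal.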

Evolution of Metric Computation

Early online experiments mimicked BI reports by aggregating daily user‑level metrics without deduplication, leading to biased estimates. Proper metric definitions require deduplication across days (e.g., unique users for average click count), which yields correct causal conclusions.
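The bias is easy to see on a toy event log. The sketch below contrasts the BI-style average of per-day ratios with a properly deduplicated per-user metric; the data and numbers are made up for illustration.

```python
from collections import defaultdict

# Event log: (day, user_id, clicks) rows spanning a two-day experiment.
events = [
    (1, "u1", 4), (1, "u2", 2),
    (2, "u1", 4), (2, "u3", 1),
]

# Naive BI-style aggregate: average the per-day "clicks per user" ratios.
# u1 appears on both days and is counted twice, biasing the estimate.
daily = defaultdict(lambda: [0, 0])  # day -> [total clicks, users seen]
for day, user, clicks in events:
    daily[day][0] += clicks
    daily[day][1] += 1
naive = sum(c / n for c, n in daily.values()) / len(daily)

# Deduplicated metric: total clicks over unique users in the window.
per_user = defaultdict(int)
for _, user, clicks in events:
    per_user[user] += clicks
dedup = sum(per_user.values()) / len(per_user)

print(naive, dedup)  # 2.75 vs ~3.67: the two definitions disagree
```

Only the deduplicated version matches the metric definition "average clicks per unique user", which is the quantity a causal comparison between arms should be based on.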

Long‑term effect estimation is also discussed, with reference to Estimating Causal Effects of Long‑Term Treatments (EC'23), which shows how to project experiment results to yearly OKR contributions.

Advances in Statistical Inference for A/B Tests

When the Stable Unit Treatment Value Assumption (SUTVA) holds, fixed-sample inference is straightforward, but real-time sampling introduces challenges such as a user-activity mix that shifts over time and peeking effects. Recent work (Enhancing External Validity of Experiments with Ongoing Sampling Process, EC'24) proposes methods to determine optimal stopping times.
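To make the peeking problem concrete, here is a fixed-sample two-sample z-test together with the crudest possible guard against repeated interim looks, a Bonferroni split of the significance level. This is a didactic sketch, not the method of the cited paper; alpha-spending functions and always-valid (anytime) inference handle peeking far better.

```python
from math import sqrt
from statistics import NormalDist

def z_test(mean_t, mean_c, var_t, var_c, n_t, n_c):
    """Two-sided p-value for the difference in means of two arms."""
    se = sqrt(var_t / n_t + var_c / n_c)
    z = (mean_t - mean_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Peeking at k interim looks inflates the false-positive rate well
# above alpha; a crude guard divides alpha across the looks.
alpha, looks = 0.05, 5
per_look_alpha = alpha / looks  # compare each interim p-value to 0.01
```

With no true effect, checking five times against the full 0.05 threshold rejects far more than 5% of the time; the Bonferroni split restores validity at the cost of power, which is the trade-off the optimal-stopping literature tries to improve on.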

If SUTVA is violated, interference between arms must be addressed. Examples include social-network spillover, competition among broadcasters, and recommendation-system feedback loops. Solutions involve graph-based sampling, interference-aware statistics, and modeling approaches, illustrated by papers such as Optimized Covariance Design for AB Test on Social Network under Interference (NeurIPS'24), Unbiased Estimation for Total Treatment Effect Under Interference Using Aggregated Dyadic Data (MitCoDE'23), and Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks (EC'24).
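The simplest graph-based remedy is cluster randomization: randomize groups of connected users together so that most spillover stays within one arm. A minimal sketch, using connected components as a stand-in for the community-detection or covariance-design step a real system would use:

```python
import random

def cluster_randomize(edges, users, seed=0):
    """Assign each connected component of the user graph to a single
    arm, so social spillover mostly stays inside an arm instead of
    crossing the treatment boundary."""
    # Union-find to group users into connected components.
    parent = {u: u for u in users}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u
    for a, b in edges:
        parent[find(a)] = find(b)
    clusters = {}
    for u in users:
        clusters.setdefault(find(u), []).append(u)
    # Coin-flip one arm per cluster, not per user.
    rng = random.Random(seed)
    assignment = {}
    for members in clusters.values():
        arm = rng.choice(["treatment", "control"])
        for u in members:
            assignment[u] = arm
    return assignment
```

Cluster randomization reduces bias from spillover at the cost of fewer effective randomization units (higher variance); the covariance-design and structural-modeling papers above target exactly that bias-variance trade-off.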

Current Challenges

Scaling to massive data volumes and a growing number of experiments demands fast, accurate, and stable metric computation, as presented in Large-Scale Metric Computation in Online Controlled Experiment Platform (VLDB'24). Additionally, extending causal inference to scenarios where experiments cannot be run directly remains an open problem, addressed by the open‑source library Fast‑Causal‑Inference: a Causal Inference Tool at Scale (MitCoDE'23).

Tags: A/B testing, sampling, causal inference, experiment design, online experiments, metric evaluation
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
