Bayesian A/B Testing with PyMC3: A Practical Guide
This article introduces the motivation and logic behind A/B testing, highlights common misunderstandings of p‑values, and demonstrates how Bayesian A/B testing using PyMC3 can provide intuitive probability statements about which variant performs better, complete with Python code examples.
Choosing between two options, say a blue versus a red checkout button, is exactly what A/B testing is for. Below, we walk through the motivation behind A/B tests and the pitfalls of p‑values, then introduce a Bayesian approach that sidesteps those misinterpretations.
Imagine an online store with 10,000 daily visitors and a conversion rate of about 1%. By randomly assigning half the visitors to see a blue button (control) and half to see a red button (variant), you can measure which button yields a higher conversion rate.
Randomization must be truly random; otherwise, confounding factors such as gender or day of the week could bias the results.
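As a minimal sketch of the assignment step (the seed and labels here are illustrative, not from the original article), each visitor can be placed in a group independently and uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(42)

n_visitors = 10_000
# Assign each visitor independently and uniformly at random to the
# control ('blue') or the variant ('red') group.
groups = rng.choice(['blue', 'red'], size=n_visitors)

print((groups == 'blue').sum(), (groups == 'red').sum())
```

Note that independent coin flips give roughly, not exactly, equal group sizes, which is why the simulated data below has 4,800 blue and 5,200 red visitors rather than a perfect 50/50 split.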
Preparing the A/B test
Assume you have collected data for 10,000 visitors and encoded each purchase as 1 and each non‑purchase as 0. The following Python code simulates such data:
import numpy as np
np.random.seed(0)
blue_conversions = np.random.binomial(1, 0.01, size=4800)
red_conversions = np.random.binomial(1, 0.012, size=5200)

Printing the simulated arrays shows mostly zeros, reflecting the low conversion rates.
print(blue_conversions)
# output: [0 0 0 ... 0 0 0]
print(red_conversions)
# output: [0 0 0 ... 0 0 0]

Calculating the observed conversion rates:
print(f'Blue: {blue_conversions.mean():.3%}')
print(f'Red: {red_conversions.mean():.3%}')
# output: Blue: 0.854%, Red: 1.135%

These numbers suggest the red button may be better, but we need statistical evidence to rule out chance.
Traditional (frequentist) approach
Using Welch's t‑test via SciPy yields a p‑value of 7.8%:
from scipy.stats import ttest_ind
print(f'p-value: {ttest_ind(blue_conversions, red_conversions, equal_var=False, alternative="less").pvalue:.1%}')
# output: p-value: 7.8%

Because 7.8% > 5%, we fail to reject the null hypothesis, and the result is inconclusive. The p‑value is also easy to misread: it is not the probability that the null hypothesis is true, nor the probability that the observed difference arose by chance alone.
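As a sanity check (not part of the original article), the same one‑sided comparison can be run as a two‑proportion z‑test, the textbook test for conversion‑rate data; with samples this large it should land close to the t‑test result:

```python
import numpy as np
from scipy.stats import norm

# Re-simulate the article's data so the snippet is self-contained.
np.random.seed(0)
blue_conversions = np.random.binomial(1, 0.01, size=4800)
red_conversions = np.random.binomial(1, 0.012, size=5200)

n_b, n_r = len(blue_conversions), len(red_conversions)
k_b, k_r = blue_conversions.sum(), red_conversions.sum()

# Pooled conversion rate under the null hypothesis of equal rates.
p_pool = (k_b + k_r) / (n_b + n_r)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_r))
z = (k_b / n_b - k_r / n_r) / se

# One-sided p-value for the alternative: blue rate < red rate.
p_value = norm.cdf(z)
print(f'p-value: {p_value:.1%}')
```

Because both tests rely on the same large‑sample normal approximation, their p‑values differ only slightly here.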
Bayesian A/B testing advantages
Provides a direct probability that one variant is better than the other.
Requires only a generative model and Bayesian inference, not a suite of statistical tests.
Using PyMC3, we model the conversion rates with Beta(1, 99) priors and Bernoulli likelihoods:
import pymc3 as pm
with pm.Model():
    blue_rate = pm.Beta('blue_rate', 1, 99)
    red_rate = pm.Beta('red_rate', 1, 99)
    blue_obs = pm.Bernoulli('blue_obs', blue_rate, observed=blue_conversions)
    red_obs = pm.Bernoulli('red_obs', red_rate, observed=red_conversions)
    trace = pm.sample(return_inferencedata=True)

Posterior analysis yields point estimates of about 0.854% for blue and 1.135% for red, each with a credible interval quantifying the remaining uncertainty.
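Because the Beta prior is conjugate to the Bernoulli likelihood, this particular model also has a closed‑form posterior, Beta(1 + successes, 99 + failures), which makes a handy cross‑check on the MCMC output. A sketch using SciPy (the 94% equal‑tailed interval below is chosen to roughly match ArviZ's default HDI level, though the two interval types are not identical):

```python
import numpy as np
from scipy.stats import beta

# Re-simulate the article's data so the snippet is self-contained.
np.random.seed(0)
blue_conversions = np.random.binomial(1, 0.01, size=4800)
red_conversions = np.random.binomial(1, 0.012, size=5200)

def posterior(data, a=1, b=99):
    """Exact Beta posterior for Bernoulli data under a Beta(a, b) prior."""
    k, n = data.sum(), len(data)
    return beta(a + k, b + n - k)

blue_post = posterior(blue_conversions)
red_post = posterior(red_conversions)

for name, post in [('blue', blue_post), ('red', red_post)]:
    lo, hi = post.ppf([0.03, 0.97])  # 94% equal-tailed credible interval
    print(f'{name}: mean={post.mean():.3%}, 94% CI=({lo:.3%}, {hi:.3%})')
```

The posterior means land very close to the observed rates because 10,000 observations overwhelm the weak Beta(1, 99) prior.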
To answer the key question—"What is the probability that the red variant is better?"—we compare posterior samples:
blue_rate_samples = trace.posterior['blue_rate'].values
red_rate_samples = trace.posterior['red_rate'].values
print(f'Probability that red is better: {(red_rate_samples > blue_rate_samples).mean():.1%}.')
# output (for me): Probability that red is better: 91.7%.

The result indicates roughly a 92% chance that the red button outperforms the blue one, a clear and intuitive metric for decision‑makers.
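The same probability can be approximated without MCMC at all, by drawing directly from the closed‑form conjugate Beta posteriors (a cross‑check I am adding, not part of the original article); with enough draws it should agree with the sampler's answer to within Monte Carlo error:

```python
import numpy as np

# Re-simulate the article's data so the snippet is self-contained.
np.random.seed(0)
blue_conversions = np.random.binomial(1, 0.01, size=4800)
red_conversions = np.random.binomial(1, 0.012, size=5200)

rng = np.random.default_rng(1)
n_draws = 200_000

# Conjugate posterior for each variant: Beta(1 + successes, 99 + failures).
blue_draws = rng.beta(1 + blue_conversions.sum(),
                      99 + len(blue_conversions) - blue_conversions.sum(),
                      size=n_draws)
red_draws = rng.beta(1 + red_conversions.sum(),
                     99 + len(red_conversions) - red_conversions.sum(),
                     size=n_draws)

prob_red_better = (red_draws > blue_draws).mean()
print(f'Probability that red is better: {prob_red_better:.1%}')
```

This shortcut only works because the model is conjugate; the MCMC approach shown above generalizes to models with no closed‑form posterior.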
Conclusion
A/B testing—whether classic or Bayesian—allows you to isolate the effect of a single change (e.g., button color) by randomizing users into control and treatment groups. Bayesian A/B testing avoids the confusing interpretation of p‑values and delivers a probability that directly answers business questions, all with relatively little Python code.
DataFunTalk