Bayesian A/B Testing with PyMC3: A Practical Guide
This article introduces the motivation and logic behind A/B testing, highlights common misunderstandings of p‑values, and demonstrates how Bayesian A/B testing using PyMC3 can provide intuitive probability statements about which variant performs better, complete with Python code examples.
A/B testing is a data-driven way to choose between two options. This article explains the motivation behind A/B tests and the pitfalls of p-values, then introduces a Bayesian approach that avoids p-value misinterpretations.
Imagine an online store with 10,000 daily visitors and a conversion rate of about 1%. By randomly assigning half the visitors to see a blue button (control) and half to see a red button (variant), you can measure which button yields a higher conversion rate.
The assignment must be truly random; otherwise, confounding factors such as gender or day of the week could bias the results.
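As a minimal sketch of what random assignment can look like, the snippet below flips an independent fair coin for each visitor. The seed and the variable names (`shows_red`, `n_blue`, `n_red`) are illustrative assumptions, not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed only so the sketch is reproducible

n_visitors = 10_000
# Each visitor is assigned independently with probability 0.5,
# so the assignment cannot depend on gender, weekday, or any other attribute.
shows_red = rng.random(n_visitors) < 0.5

n_red = int(shows_red.sum())
n_blue = n_visitors - n_red
print(f"blue: {n_blue}, red: {n_red}")
```

With independent coin flips the two groups end up close to, but usually not exactly, 5,000 visitors each. That is fine as long as the assignment itself is unrelated to visitor characteristics.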
Preparing the A/B test
Assume you have collected data for 10,000 visitors, with purchases encoded as 1 and non-purchases as 0. The following Python code simulates such data, using true conversion rates of 1% for blue and 1.2% for red:
import numpy as np
np.random.seed(0)
blue_conversions = np.random.binomial(1, 0.01, size=4800)
red_conversions = np.random.binomial(1, 0.012, size=5200)
Printing the simulated arrays shows mostly zeros, reflecting the low conversion rates.
print(blue_conversions)
# output: [0 0 0 ... 0 0 0]
print(red_conversions)
# output: [0 0 0 ... 0 0 0]
Calculating the observed conversion rates:
print(f'Blue: {blue_conversions.mean():.3%}')
print(f'Red: {red_conversions.mean():.3%}')
# output: Blue: 0.854%, Red: 1.135%
These numbers suggest the red button may be better, but we need statistical evidence to rule out chance.
Traditional (frequentist) approach
Using Welch's t‑test via SciPy yields a p‑value of 7.8%:
from scipy.stats import ttest_ind
print(f'p-value: {ttest_ind(blue_conversions, red_conversions, equal_var=False, alternative="less").pvalue:.1%}')
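# output: p-value: 7.8%
Because 7.8% > 5%, we fail to reject the null hypothesis at the conventional 5% significance level, so the result is inconclusive. The source article also lists common misconceptions about p-values; for example, a p-value of 7.8% is not the probability that the null hypothesis is true.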
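To make concrete what the 5% threshold refers to, here is a hedged sketch that simulates many "A/A" tests in which both buttons share the same true conversion rate. It uses a pooled two-proportion z-test as a stand-in for Welch's t-test (an assumption made so the simulation can work on counts alone; at these sample sizes the two tests should behave very similarly). The seed, the number of simulations, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_sims, n_blue, n_red, true_rate = 20_000, 4800, 5200, 0.01

# A/A setting: both buttons have the same true conversion rate,
# so the null hypothesis is true in every simulated experiment.
blue_k = rng.binomial(n_blue, true_rate, size=n_sims)
red_k = rng.binomial(n_red, true_rate, size=n_sims)

p_blue, p_red = blue_k / n_blue, red_k / n_red
pooled = (blue_k + red_k) / (n_blue + n_red)
se = np.sqrt(pooled * (1 - pooled) * (1 / n_blue + 1 / n_red))

# One-sided test of "blue < red", mirroring alternative="less".
z = (p_blue - p_red) / se
p_values = norm.cdf(z)

false_positive_rate = (p_values < 0.05).mean()
print(f"Share of A/A tests with p < 5%: {false_positive_rate:.1%}")
```

The share should land near 5%: the p-value describes how surprising the data would be if there were no difference. It does not say how likely it is that red is better.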
Bayesian A/B testing advantages
Provides a direct probability that one variant is better than the other.
Requires only a generative model and Bayesian inference, not a suite of statistical tests.
Using PyMC3, we model the conversion rates with Beta(1, 99) priors and Bernoulli likelihoods:
import pymc3 as pm
with pm.Model():
    # Beta(1, 99) priors have a mean of 1%, matching the expected conversion rate
    blue_rate = pm.Beta('blue_rate', 1, 99)
    red_rate = pm.Beta('red_rate', 1, 99)
    # Each visitor is a Bernoulli trial with the conversion rate of their group
    blue_obs = pm.Bernoulli('blue_obs', blue_rate, observed=blue_conversions)
    red_obs = pm.Bernoulli('red_obs', red_rate, observed=red_conversions)
    trace = pm.sample(return_inferencedata=True)
Posterior analysis gives estimates close to the observed rates of 0.854% for blue and 1.135% for red, together with credible intervals.
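Because a Beta prior is conjugate to a Bernoulli likelihood, this particular model also has a closed-form posterior, which offers a quick sanity check on the sampler. The sketch below is not part of the original article. It assumes the counts implied by the observed rates (41 of 4,800 for blue and 59 of 5,200 for red), and it reports an equal-tailed 94% interval, which may differ slightly from the highest-density interval some plotting tools show by default.

```python
from scipy.stats import beta

# Counts implied by the observed rates reported earlier:
# 41 / 4800 = 0.854% (blue) and 59 / 5200 = 1.135% (red).
blue_n, blue_k = 4800, 41
red_n, red_k = 5200, 59

prior_a, prior_b = 1, 99  # the Beta(1, 99) prior used in the model

# Conjugate update: Beta(a + successes, b + failures)
blue_post = beta(prior_a + blue_k, prior_b + blue_n - blue_k)
red_post = beta(prior_a + red_k, prior_b + red_n - red_k)

for name, post in [("Blue", blue_post), ("Red", red_post)]:
    low, high = post.ppf([0.03, 0.97])  # equal-tailed 94% credible interval
    print(f"{name}: mean {post.mean():.3%}, 94% interval [{low:.3%}, {high:.3%}]")
```

The posterior means sit very close to the observed rates because about 5,000 observations per group outweigh the prior, which carries the weight of only 100 pseudo-observations.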
To answer the key question—"What is the probability that the red variant is better?"—we compare posterior samples:
blue_rate_samples = trace.posterior['blue_rate'].values
red_rate_samples = trace.posterior['red_rate'].values
print(f'Probability that red is better: {(red_rate_samples > blue_rate_samples).mean():.1%}.')
# output (for me): Probability that red is better: 91.7%.
The result indicates roughly a 92% chance that the red button outperforms the blue one, a clear and intuitive metric for decision-makers.
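The same comparison can be sketched without MCMC by drawing directly from the closed-form Beta posteriors of this conjugate model. This is an illustrative cross-check rather than the article's method. The counts (41 of 4,800 and 59 of 5,200) are inferred from the observed rates, and the seed and number of draws are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

# Posterior under a Beta(1, 99) prior: Beta(1 + successes, 99 + failures)
blue_draws = rng.beta(1 + 41, 99 + 4800 - 41, size=n_draws)
red_draws = rng.beta(1 + 59, 99 + 5200 - 59, size=n_draws)

# Share of joint draws in which red's conversion rate exceeds blue's
prob_red_better = (red_draws > blue_draws).mean()

# Posterior of the absolute difference in conversion rates (red minus blue)
diff = red_draws - blue_draws
low, high = np.percentile(diff, [3, 97])

print(f"Probability that red is better: {prob_red_better:.1%}")
print(f"94% interval for the difference: [{low:.3%}, {high:.3%}]")
```

The probability should come out close to the sampler's result. The interval for the difference adds something the single probability hides: how large the improvement is likely to be, which matters when switching buttons has a cost.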
Conclusion
A/B testing—whether classic or Bayesian—allows you to isolate the effect of a single change (e.g., button color) by randomizing users into control and treatment groups. Bayesian A/B testing avoids the confusing interpretation of p‑values and delivers a probability that directly answers business questions, all with relatively little Python code.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
