A/B significance testing has become irresistibly simple. Plug a few numbers into an online calculator, and voilà... statistically verified results.
But this on-demand verification is fatally flawed: Looking at results more than once invalidates their statistical significance. Every page refresh on your A/B test dashboard is tainting your outcome. Here’s why it happens and how you can fix it.
P-values and p-hacking
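The p-value is the probability of seeing a result at least as extreme as the one observed if there were no real difference, so the threshold we accept sets the rate of false positives. P-hacking is the practice of calculating the p-value using one process but conducting the experiment using a different process.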
An example from the XKCD web comic:
The subject calculates p = 0.01 based on the process “I think of a number and he guesses it.” However, the actual process was “...and repeats this until he gets it right.” Her p-value test was based on a different process than what was actually used.
A/B tests often have a similar problem. Pearson chi-squared, Fisher exact, Student t, etc. -- all assume the following process, diagramed with Lucidchart below:
When followed with a significance threshold of p < 0.05, this process is mathematically guaranteed to have a false positive rate of only 5%.
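To make the fixed-sample calculation concrete, here is a minimal sketch of a Pearson chi-squared p-value for two variations, using only the Python standard library. The function name `chi2_p_value` and the example counts are assumptions for illustration, and the sketch omits the continuity correction that some calculators apply.

```python
import math


def chi2_p_value(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared p-value for a 2x2 table of conversions vs.
    non-conversions (1 degree of freedom, no continuity correction)."""
    total, conv = n_a + n_b, conv_a + conv_b
    if conv == 0 or conv == total:
        return 1.0  # every outcome identical, so nothing to detect
    chi2 = 0.0
    for observed, n in ((conv_a, n_a), (conv_b, n_b)):
        expected = n * conv / total
        # Conversions and non-conversions deviate by the same amount,
        # so both cells of the row can be added in one step.
        chi2 += (observed - expected) ** 2 * (1 / expected + 1 / (n - expected))
    # Survival function of chi-squared with 1 dof: P(Z^2 > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: 200/1000 conversions for A, 220/1000 for B.
p = chi2_p_value(conv_a=200, n_a=1000, conv_b=220, n_b=1000)
print(f"p = {p:.3f}")
```

The result is only valid if it is computed once, at the sample size chosen before the test started.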
However, most people want to (1) cut failed experiments as soon as possible and (2) promote successful experiments as soon as possible. So they refresh the test results (aka peek), hoping to observe significance as soon as it happens.
The problem is that this is a different process than our p-value was created for.
Let’s see how much of a difference peeking makes. Suppose we target a conversion event with 20% baseline success and accept p < 0.10. Let’s consider what happens when (1) B also converts at 20% and (2) B converts at the modestly higher 22%.
The chances for accepting A or B, over the size of the fixed sample:
- As expected, when there is no difference, the false positive rate is 5% for A and 5% for B.
- When there is a difference, B is favored, and detection likelihood increases with sample size.
The cumulative chances for accepting A or B when the p-value is checked every sample (min 200 samples):
- After 2000 samples, there is a combined 55% chance of incorrectly concluding that one is better than the other -- over five times the expected false positive rate of 10%.
- When there is a difference, the chance of accepting the loser as the statistically significant winner jumps from nearly nothing to 10%.
The feedback loop has altered the process and destroyed the validity of the statistics.
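The effect of peeking can be reproduced with a small simulation. The sketch below is not the simulation behind the figures above. It is an illustrative version under stated assumptions: visitors alternate between A and B, both convert at 20%, the p-value comes from an uncorrected Pearson chi-squared test, and we peek every 10 samples (rather than every sample) to keep it fast. All function names and defaults are hypothetical.

```python
import math
import random


def chi2_p_value(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared p-value for a 2x2 table (1 dof, no correction)."""
    total, conv = n_a + n_b, conv_a + conv_b
    if conv == 0 or conv == total:
        return 1.0
    chi2 = 0.0
    for observed, n in ((conv_a, n_a), (conv_b, n_b)):
        expected = n * conv / total
        chi2 += (observed - expected) ** 2 * (1 / expected + 1 / (n - expected))
    return math.erfc(math.sqrt(chi2 / 2))


def run_trial(rng, rates, total, min_samples, check_every, alpha):
    """Return (significant at any peek, significant at the final sample)."""
    conv, seen = [0, 0], [0, 0]
    peeked = False
    for i in range(1, total + 1):
        arm = i % 2  # alternate visitors between the two variations
        seen[arm] += 1
        conv[arm] += rng.random() < rates[arm]
        if i >= min_samples and i % check_every == 0:
            if chi2_p_value(conv[0], seen[0], conv[1], seen[1]) < alpha:
                peeked = True
    final = chi2_p_value(conv[0], seen[0], conv[1], seen[1]) < alpha
    # The final look counts as a peek too.
    return peeked or final, final


def simulate(trials=200, rates=(0.20, 0.20), total=2000,
             min_samples=200, check_every=10, alpha=0.10, seed=1):
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(trials):
        peeked, final = run_trial(rng, rates, total, min_samples,
                                  check_every, alpha)
        peek_hits += peeked
        fixed_hits += final
    return peek_hits / trials, fixed_hits / trials


peek_rate, fixed_rate = simulate()
print(f"false positives with peeking: {peek_rate:.2f}")
print(f"false positives at fixed size: {fixed_rate:.2f}")
```

Because both variations convert at the same rate, every declared winner is a false positive. Both rates are computed from the same simulated data, so the peeking rate can never be lower than the fixed-sample rate.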
What to do?
The simplest solution is to use the significance tests as they were designed: with a fixed sample size. Simple, but not practical.
Boss: “Variation B is doing great! Let’s give all users that experience.”
Underling: “We can’t. We have to wait another month.”
Boss: “Variation B is doing terrible! Shut it off right away!”
Underling: “We can’t. We have to wait another month.”
Boss: “Was A or B better?”
Underling: “We couldn’t detect a significant difference.”
Boss: “Keep running it.”
Underling: “We can’t. That was our only chance.”
Alternatively, we can still peek at the results but account for the overconfidence that peeking causes. If we want p < 0.10, we’ll, say, accept only p < 0.02 on a particular peek. Naturally, it will take much longer to reach significance. (This is in fact Optimizely’s approach, although instead of assuming continuous peeking, it adjusts only when the experimenter views the results.)
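One simple way to pick a stricter per-peek threshold is the Bonferroni correction, which splits the overall false positive budget evenly across a pre-planned number of peeks. This is a swapped-in illustration and not Optimizely's method. The function name and the choice of five peeks are assumptions.

```python
def per_peek_alpha(overall_alpha, planned_peeks):
    """Bonferroni correction: divide the overall false positive budget
    evenly across a fixed, pre-planned number of peeks."""
    if planned_peeks < 1:
        raise ValueError("planned_peeks must be at least 1")
    return overall_alpha / planned_peeks


# Hypothetical plan: an overall budget of p < 0.10 spread over five peeks.
threshold = per_peek_alpha(0.10, 5)
print(f"accept only p < {threshold:.2f} on any single peek")
```

Bonferroni is conservative: it keeps the overall false positive rate at or below the budget, at the cost of needing more samples to reach significance.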
A different paradigm
So far, we’ve been getting rather cagey answers from statistics. The fundamental problem is that we are asking it the wrong question.
We don’t want to know which variation is better as much as we want to maximize success.
When asking the more direct question, statistics can assist us better. The “maximize success” problem is known as the multi-armed bandit problem, and its solution is iteratively adjusting the sampling ratio to favor success.
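As a rough sketch of what "iteratively adjusting the sampling ratio" can look like, the code below uses Thompson sampling with Beta posteriors. It makes several assumptions that the article does not specify: a uniform Beta(1, 1) prior, posteriors refreshed once per batch of 20 samples, and hypothetical names (`thompson_choose`, `run_bandit`) and conversion rates.

```python
import random


def thompson_choose(rng, stats):
    """Draw once from each arm's Beta posterior and pick the largest draw.

    stats maps arm name -> (successes, failures); the +1 terms are the
    assumed uniform Beta(1, 1) prior.
    """
    draws = {arm: rng.betavariate(1 + s, 1 + f) for arm, (s, f) in stats.items()}
    return max(draws, key=draws.get)


def run_bandit(rng, true_rates, total=2000, batch=20):
    """Simulate a bandit test, refreshing the posteriors once per batch."""
    stats = {arm: [0, 0] for arm in true_rates}
    pulls = {arm: 0 for arm in true_rates}
    for start in range(0, total, batch):
        # Freeze the counts so the sampling ratio only changes between batches.
        frozen = {arm: tuple(counts) for arm, counts in stats.items()}
        for _ in range(min(batch, total - start)):
            arm = thompson_choose(rng, frozen)
            pulls[arm] += 1
            converted = rng.random() < true_rates[arm]
            stats[arm][0 if converted else 1] += 1
    return pulls, stats


rng = random.Random(42)
pulls, stats = run_bandit(rng, {"A": 0.20, "B": 0.22})
print("share of traffic sent to B:", pulls["B"] / sum(pulls.values()))
```

Each arm is chosen with the probability that it is currently believed to be the best, so traffic drifts toward the better variation as evidence accumulates.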
Using Thompson beta sampling and readjusting every 20 samples, below are the mean sampling rates for B as the experiment progresses:
As expected, the sampling gradually adjusts to the results. Armed with this new approach, let’s try the stopping problem again. We’ll declare a winner when the B sampling proportion is below 5% or above 95%. Below are the cumulative acceptance probabilities:
Oh no. Those numbers look very similar to p-test peeking! It turns out that Bayesian statistics are not immune to the peeking problem. The universe does not hand out free lunches.
Except the paradigm has shifted. Previously, we obsessively hit F5 on the test dashboard to avoid big losses, or to capitalize on big wins. But that’s no longer needed, as the statistical process makes those decisions for us.
Instead, we can safely and confidently test for as long as we have patience. By removing the urgent need to stop, we side-step the stopping problem altogether.
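For reference, the stopping rule tried above (declare a winner once B's sampling proportion leaves the 5% to 95% band) can be sketched as follows. With two arms, B's Thompson sampling proportion is the posterior probability that B beats A, estimated here by Monte Carlo. The uniform Beta(1, 1) prior, the function names, and the example counts are assumptions for illustration.

```python
import random


def b_sampling_proportion(rng, succ_a, fail_a, succ_b, fail_b, draws=2000):
    """Estimate how often a posterior draw for B beats a draw for A."""
    wins = sum(
        rng.betavariate(1 + succ_b, 1 + fail_b)
        > rng.betavariate(1 + succ_a, 1 + fail_a)
        for _ in range(draws)
    )
    return wins / draws


def declare_winner(prop_b, low=0.05, high=0.95):
    """Return "A", "B", or None while the proportion is inside the band."""
    if prop_b > high:
        return "B"
    if prop_b < low:
        return "A"
    return None


rng = random.Random(0)
# Hypothetical counts: A converts 200 of 1000, B converts 300 of 1000.
prop_b = b_sampling_proportion(rng, 200, 800, 300, 700)
print(prop_b, declare_winner(prop_b))
```

Checking this rule after every batch is still a form of peeking, which is why its error rates resemble the p-value case. The gain comes from not needing to stop at all.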
How costly are A/B tests? Below are the overall success rates for our various algorithms after 16,000 samples.
Each strategy makes a compromise between exploration vs. exploitation. Some do this better than others. Thompson beta sampling has been proven asymptotically optimal for this kind of problem.
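A comparison of this kind can be approximated with a short simulation. The sketch below is not the code behind the figure. It compares only two strategies, a fixed 50/50 split and Thompson sampling updated after every sample, with assumed conversion rates of 20% and 22% and a uniform Beta(1, 1) prior. All names are hypothetical.

```python
import random


def success_rate(rng, true_rates, total, choose):
    """Run one simulated test and return the overall conversion rate."""
    stats = [[0, 0] for _ in true_rates]  # [successes, failures] per arm
    for i in range(total):
        arm = choose(rng, stats, i)
        converted = rng.random() < true_rates[arm]
        stats[arm][0 if converted else 1] += 1
    return sum(successes for successes, _ in stats) / total


def fixed_split(rng, stats, i):
    """Classic A/B test: alternate visitors evenly between the arms."""
    return i % len(stats)


def thompson(rng, stats, i):
    """Pick the arm with the largest draw from its Beta posterior."""
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in stats]
    return draws.index(max(draws))


rng = random.Random(7)
rates = [0.20, 0.22]  # assumed true conversion rates for A and B
results = {
    name: success_rate(rng, rates, 16000, choose)
    for name, choose in (("fixed 50/50", fixed_split), ("Thompson", thompson))
}
for name, rate in results.items():
    print(f"{name}: {rate:.4f}")
```

A single run is noisy, so averaging over many seeds gives a fairer picture. The fixed split can do no better than the midpoint of the two rates on average, while a bandit can approach the better rate as it shifts traffic.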
Summary of multi-armed bandit
- Optimum strategy for maximum successes.
- No requirement for predetermined sample sizes or other parameters.
- Codifying the process arguably makes ad-hoc alterations (p-hacking) less likely.
- Higher levels of significance become practical.
- Unlimited peeking.
- The test can incorporate prior knowledge or risk assessment, via the choice of the initial sampling weights.
- Sampling ratios must be adjusted. Google Analytics content experiments already run multi-armed bandit tests for you, but for other tools you may need to use a calculator and update the sampling ratio yourself.
- Convergence is slower relative to total sample size. A fixed 50/50 sampling ratio aims for the fewest total samples, whereas multi-armed bandit aims for the fewest total failures.
The appropriateness of Thompson sampling depends on how well its goal of maximizing test successes matches our objective.
Whatever your approach, make sure you apply the correct statistics to the correct process. You can even diagram it with Lucidchart!
More precisely, peeking and taking action on the test invalidates the significance. This is usually the intent; completely idle, unactionable curiosity is less frequent.
Know that bandit adjustment periods require the same attention to experimental design as fixed-ratio tests. If you know conversion rates are higher in the morning, p-value testing should include a full day; bandit sampling adjustment should include the same. If there is a delay between treatment and conversion, p-value testing should consider only sufficiently mature data; bandit sampling adjustment should consider the same. This seems obvious, but some have been surprised.
All simulations used can be found on GitHub.