- Set up a test without any variations to calibrate results
- Figure out why an A/A test may be showing a false positive
- Understand the implications of A/A test results on regular A/B tests
When you run an experiment in Optimizely in which variation A is identical to variation B, you are running an A/A experiment rather than an A/B experiment. It is called an A/A experiment because there is effectively no "B" variation: the original page and the variation are exactly the same.
Why would you run an A/A test? Let’s distinguish an A/A test from a monitoring campaign, which is an experiment that doesn’t have any variations. In a monitoring campaign, the goal is simply to (a) deliver content to visitors, or (b) determine the baseline conversion rate for a certain goal before you test. An A/A test, on the other hand, is generally designed to let you test the effectiveness of your A/B testing platform.
In most cases, the results from this test will be inconclusive. In fact, the proportion of A/A tests showing inconclusive results will be at least as high as the statistical significance setting you select in your Project Settings (90% by default). This inconclusive result is expected because you made no changes to the original page. In some cases, however, you might see on the Results page that one variation is outperforming another or even that a winner is declared for one of your goals.
What happens here is that an experiment may reach statistical significance purely by chance. Statistical inference is probabilistic (in other words, it's about the chance of something happening), which means it's not possible to report a result with absolute certainty. As with any experimental process, some percentage of outcomes will turn out to be anomalies, because an experiment calculates results on a random sample from the population of all visitors to your page. To identify any variation as significant, an experiment must at some point make a judgment call on how large a trend must be to indicate a true difference. A large enough spurious trend can therefore make an experiment look as though there is a true difference when none actually exists. This trade-off, between identifying more trends with significant results and seeing more errors, is controlled by the significance level.
Our Statistical Approach
We use a modified “p-value” to report statistical confidence, with a threshold of 0.1 (reported as “90% statistical significance”). Statistical significance is a project-level setting that can be adjusted up or down. You may want to adjust it based on your comfort level with statistical error: at 90% statistical significance, you have a 1 in 10 chance of falsely declaring a variation a winner or loser when the variation actually has no impact. This holds true for an A/A test as well: even when there is no difference, there’s a small chance that a result will be reported based on underlying trends in the experiment data.
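The 1 in 10 chance described above can be illustrated with a small simulation. The sketch below is an assumption-laden stand-in: it uses a classical fixed-horizon two-proportion z-test, not Optimizely's Stats Engine, and the function names, traffic numbers, and 25% conversion rate are all hypothetical. It runs many A/A tests in which both variations share the same true conversion rate and counts how often the test crosses the 0.1 threshold anyway.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Classical pooled two-proportion z-test (an assumption, not Stats Engine)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_false_positive_rate(n_tests=1000, visitors=500,
                                    rate=0.25, alpha=0.10, seed=7):
    """Fraction of simulated A/A tests that falsely reach significance."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both arms draw from the same true conversion rate: an A/A test.
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        if two_sided_p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / n_tests


if __name__ == "__main__":
    print(f"False positive rate: {simulate_aa_false_positive_rate():.3f}")
```

With a threshold of 0.1, the printed rate should land in the neighborhood of 10%, even though no variation in the simulation differs from the other.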
When you examine the results of your A/A test, you should see the following behavior:
- Your statistical significance will stabilize around a certain value over time. 10% of the time, statistical significance will stabilize above 90%.
- The confidence intervals for your experiment will shrink as more data is collected, ruling out non-zero values.
- At different points in the test results, the baseline and variation might be performing differently, but neither should be declared a statistically significant winner indefinitely.
If your cutoff for calling a winning or losing variation is 90% significance, then in the 10% of cases in which Optimizely’s Stats Engine mistakenly declares a winning or losing variation in an A/A test, Stats Engine will eventually correct itself to call the test inconclusive (statistical significance below 90%).
What this means when interpreting a regular A/B experiment
Always pay attention to the "statistical significance" confidence level in your tests, and be skeptical of implementing variations that don't reach your chosen significance level.
With Stats Engine, we provide an accurate representation of the likelihood of error, regardless of when you look at your results. You can determine the likelihood of making a false declaration by subtracting your current statistical significance from 100 (so if your variation is at 92%, you have an 8% chance of a false declaration).
Ultimately, you may need to find a happy medium between declaring winning and losing variations with high confidence (which requires more visitors) and the opportunity cost of running fewer experiments as a result.
Looking at the conversion rate difference interval can be very helpful when interpreting the results and when deciding how to act on them. For example:
If you see overlapping difference intervals in your variations, there is a chance that the conversion rates for both variations are actually the same. For example, suppose you have two variations with conversion rates of 25% and 26%, respectively, each with a confidence interval of +/- 1.5%. The two intervals (23.5% to 26.5% and 24.5% to 27.5%) overlap between 24.5% and 26.5%, so both variations could plausibly have the same conversion rate.
The range is a confidence interval with coverage equal to your project-level significance threshold (for example, a 90% significance threshold corresponds to a 90% confidence interval). This means that if this experiment were repeated many times, the reported interval would contain the true conversion rate difference about 90% of the time. Informally, you can be 90% confident that the true conversion rate difference falls within the difference interval reported on your current test.
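The overlapping-interval example above (25% and 26%, each +/- 1.5%) can be checked with a few lines of arithmetic. This is only an illustrative sketch; the helper names `interval` and `overlap` are hypothetical and not part of any Optimizely API.

```python
def interval(rate, margin):
    """Return the (low, high) bounds for rate +/- margin, in percent."""
    return (rate - margin, rate + margin)


def overlap(a, b):
    """Return the shared (low, high) range of two intervals, or None."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


variation_a = interval(25.0, 1.5)  # 23.5 to 26.5
variation_b = interval(26.0, 1.5)  # 24.5 to 27.5
shared = overlap(variation_a, variation_b)
print(shared)  # (24.5, 26.5)
```

Any true conversion rate inside the shared range is compatible with both variations, which is why overlapping intervals leave open the possibility that the variations perform the same.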
If you let an A/A test run for a long period of time, you will see that the lines in the graph continue to cross each other as more data comes in. At different points in the test results, different variations might be declared as the winner.
As more and more data comes in for an A/B test, the difference interval between your variations will shrink. This increases your certainty about the actual conversion rate difference between the variation and the baseline.
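To see why the difference interval shrinks as data accumulates, here is a minimal sketch that assumes a classical (Wald) confidence interval for the difference of two proportions. This is not how Stats Engine computes its intervals; the function names, conversion rates, and sample sizes are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def difference_interval(conv_a, n_a, conv_b, n_b, confidence=0.90):
    """Wald interval for (rate_b - rate_a); an assumed classical method."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Two-sided critical value, roughly 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return (diff - z * se, diff + z * se)


def interval_widths(rate_a=0.25, rate_b=0.26,
                    sizes=(1_000, 10_000, 100_000)):
    """Width of the difference interval at each per-variation sample size."""
    widths = []
    for n in sizes:
        low, high = difference_interval(round(rate_a * n), n,
                                        round(rate_b * n), n)
        widths.append(high - low)
    return widths


if __name__ == "__main__":
    for n, width in zip((1_000, 10_000, 100_000), interval_widths()):
        print(f"{n:>7} visitors per variation: width = {width:.4f}")
```

Under this assumed method the width scales with one over the square root of the number of visitors, so each tenfold increase in traffic narrows the interval by a factor of about three.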