- Optimizely X Web Experimentation
- Optimizely X Web Personalization
- Optimizely X Web Recommendations
THIS ARTICLE WILL HELP YOU:
- Use confidence intervals and improvement intervals to analyze results
- Predict what behavior you should see from your results over time
Statistical significance tells you whether a variation is outperforming or underperforming the baseline, at whatever confidence level you chose. Confidence intervals quantify the uncertainty around that improvement: Stats Engine provides a range of values where the true improvement for a particular variation is likely to lie. The interval starts out wide; as Stats Engine collects more data, it narrows to show that certainty is increasing.
Once a variation reaches statistical significance, the confidence interval always lies entirely above or below 0.
- A winning variation will have a confidence interval that is completely above 0%.
- An inconclusive variation will have a confidence interval that includes 0%.
- A losing variation will have a confidence interval that is completely below 0%.
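The three cases above amount to a simple check on where the interval sits relative to zero. Here is a minimal Python sketch of that logic (`classify` is a name chosen here for illustration, not an Optimizely API):

```python
def classify(ci_low, ci_high):
    """Classify a variation by where its improvement confidence
    interval (in %) sits relative to 0."""
    if ci_low > 0:
        return "winner"        # interval entirely above 0%
    if ci_high < 0:
        return "loser"         # interval entirely below 0%
    return "inconclusive"      # interval includes 0%
```

For example, an interval of 22.01% to 54.79% classifies as a winner, while -1.35% to 71.20% is inconclusive because it straddles zero.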
Optimizely sets your confidence interval at the same level that you set your statistical significance threshold for the project. For example, if you accept 90% significance to declare a winner, you also accept 90% confidence that the interval is accurate.
Example: "Winning" interval
In the example shown above, you can say that there is a 99% chance that the improvement you saw in the bottom variation is not due to chance. But the improvement Optimizely measured (+38.4%) may not be the exact improvement you see going forward.
In reality, if you implement that variation instead of the original, the relative improvement over the baseline conversion rate will probably be between 22.01% and 54.79%. Compared to a baseline conversion rate of 34.80%, you're likely to see your variation convert in the range between roughly 42.46% (34.80 × 1.2201) and 53.87% (34.80 × 1.5479).
Although the statistical significance is 97%, there's a 90% chance that the actual improvement will fall within the confidence interval, because the statistical significance threshold for your project is set to 90%. The confidence interval won't jump around as your variation's observed statistical significance changes; rather, you'll generally see it become narrower as Optimizely collects more data.
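The narrowing behavior can be illustrated with a classical fixed-horizon normal approximation. This is not Stats Engine's sequential method, just a sketch of why an interval on the difference in conversion rates tightens as visitor counts grow (the conversion rates below reuse the example's 34.80% and 48.17%):

```python
import math

def improvement_ci(conv_base, conv_var, n, z=1.645):
    """Rough 90% fixed-horizon interval for the absolute difference in
    conversion rates, given n visitors per branch (normal approximation).
    NOT Stats Engine's math - only shows how width shrinks with n."""
    se = math.sqrt(conv_base * (1 - conv_base) / n +
                   conv_var * (1 - conv_var) / n)
    diff = conv_var - conv_base
    return diff - z * se, diff + z * se

lo1, hi1 = improvement_ci(0.348, 0.4817, n=1_000)
lo2, hi2 = improvement_ci(0.348, 0.4817, n=10_000)
# With 10x the visitors, the interval is centered on the same
# observed difference but is considerably narrower.
```

The center of the interval stays at the observed difference; only the uncertainty around it shrinks.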
In this experiment, the observed difference between the original (34.80%) and variation (48.17%) conversion rates was 13.37 percentage points, a relative improvement of 38.4%, which falls within the confidence interval. If we run this experiment again, the relative difference between the baseline and the variation conversion rate will probably fall in the same range.
Example: "Losing" interval
Let's look at another example, this time with the confidence interval entirely below 0.
In this example, you can say that there is a 99% chance that the negative improvement you saw in the bottom variation is not due to chance. However, the improvement Optimizely measured (-19.85%) may not be exactly what you see going forward.
In reality, if you implement the variation instead of the original, the relative difference in conversion rate will probably be between -27.37% and -12.34% under the baseline conversion rate. Compared to a baseline conversion rate of 62.90%, you're likely to see your variation convert in the range between roughly 45.68% (62.90 × 0.7263) and 55.14% (62.90 × 0.8766).
In this experiment, the observed difference between the original (62.90%) and variation (50.41%) conversion rates was -12.49 percentage points, a relative improvement of about -19.85%, which falls within the confidence interval. If we run this experiment again, the relative difference between the baseline and variation conversion rate will probably be in the same range.
Example: Inconclusive interval
If you need to stop a test early or have a low sample size, the confidence interval will give you a rough idea of whether implementing that variation will have a positive or negative impact.
For this reason, when you see low statistical significance on certain goals, the confidence interval can serve as another data point to help you make decisions. When you have an inconclusive goal, the interval will look like this:
Here, we can say with 90% confidence that the relative difference in conversion rates for this variation will be between -1.35% and 71.20%. In other words, it could be either positive or negative.
When implementing this variation, you can say, "We implemented a test result that we are 90% confident is no worse than -1.35% and no better than 71.20%," which allows you to make a business decision about whether implementing that variation would be worthwhile.
Another way you can interpret the confidence interval is as worst case, middle ground, and best case scenarios. For example, we are 90% confident that the worst case relative difference between variation and baseline conversion rates is -1.35%, the best case is 71.20%, and a middle ground is 34.93%.
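The worst/middle/best framing is just the two endpoints of the interval plus its midpoint, as this small Python sketch shows (using the inconclusive example's numbers):

```python
# Inconclusive improvement interval from the example, in percent.
low, high = -1.35, 71.20

worst = low                  # worst case relative improvement
best = high                  # best case relative improvement
middle = (low + high) / 2    # midpoint as a "middle ground" scenario
# middle is 34.925, which rounds to the 34.93% quoted in the text.
```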
How statistical significance and confidence intervals are connected
Optimizely shows you the statistical likelihood that the improvement is due to changes you made on the page, not chance. Until Stats Engine has enough data to declare statistical significance, the Results page will state that more visitors are needed and show you an estimated wait time based on the current conversion rate.
Lower significance levels may increase the likelihood of error but can also help you test more hypotheses and iterate faster. Higher significance levels decrease the error probability, but require a larger sample.
Choosing the right significance level should balance the types of tests you are running, the confidence you want to have in the tests, and the amount of traffic you actually receive.
To reduce confusion, the Results page in Optimizely X shows the relative difference between the variation and baseline measurements, not the absolute difference. This is true for all metrics, whether they are binary conversion metrics or numeric metrics.
In Optimizely X, an improvement interval of 1% to 10% means that the variation sees between 1% and 10% improvement over baseline. For example, if the baseline conversion rate is 25%, you can expect the variation conversion rate to fall between 25.25% and 27.5%.
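That relative-improvement arithmetic is straightforward to reproduce. This sketch converts a relative improvement interval into the range of variation conversion rates you'd expect, using the example's numbers:

```python
baseline = 0.25                # 25% baseline conversion rate
rel_low, rel_high = 0.01, 0.10 # 1% to 10% relative improvement

# Relative improvement multiplies the baseline, it doesn't add to it.
var_low = baseline * (1 + rel_low)    # 0.2525 -> 25.25%
var_high = baseline * (1 + rel_high)  # 0.2750 -> 27.5%
```

Note the contrast with an absolute interpretation, which would have given 26% to 35% instead.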
Note that significance and confidence intervals are still connected in the same way: your experiment reaches significance at exactly the same time that your confidence interval on improvement moves away from zero.
Estimated wait time and <1% significance
As your experiment or campaign runs, Optimizely estimates how long it will take for a test to reach conclusiveness.
This estimate is calculated based on the current, observed baseline and variation conversion rates. If those rates change, the estimate will adjust automatically.
You may see a significance of less than 1%, with a certain number of "visitors remaining." What does this mean? In statistical terms, this experiment is currently underpowered: Optimizely needs to gather more evidence to determine whether the change you see is a true difference in visitor behavior, or chance.
Look at the variation in the example shown above. Optimizely needs approximately 11,283 more visitors to be exposed to that variation before it can decide on the difference in conversion rates between the variation and the original. Remember that the estimate of 11,283 visitors assumes that the observed conversion rate doesn't fluctuate. If more visitors see the variation but conversions decrease, your experiment will probably take more time, which means the "visitors remaining" estimate will increase. If conversions increase, Optimizely will need fewer visitors to be certain that the change in behavior is real.
To learn more about the importance of sample size, see our article on how long to run a test.
Unlike many testing tools, Optimizely's Stats Engine uses a statistical approach that removes the need to decide on a sample size and minimum detectable effect (MDE) before starting a test. You don't have to commit to large sample sizes ahead of time, and you can check on results whenever you want!
However, many optimization programs estimate how long tests take to run so they can build robust roadmaps. Use our sample size calculator to estimate how many visitors you'll need for a given test. Learn more about choosing a minimum detectable effect for our calculator in this article.
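For roadmap planning, a classical fixed-horizon two-proportion formula gives a ballpark figure for visitors per variation. This is only a sketch under textbook assumptions (95% significance, 80% power); Optimizely's own sample size calculator uses a formula tuned to Stats Engine and will give different numbers:

```python
import math

def sample_size_per_variation(base_rate, mde_rel, z_alpha=1.96, z_beta=0.84):
    """Classical fixed-horizon sample size per variation for detecting a
    relative MDE over a baseline conversion rate (two-proportion z-test,
    95% significance / 80% power by default). A planning ballpark only."""
    p1 = base_rate
    p2 = base_rate * (1 + mde_rel)   # MDE is relative to the baseline
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
                 z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

As you'd expect, halving the MDE roughly quadruples the required sample, which is why choosing the MDE carefully matters for test planning.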
Learn more about statistical significance in Optimizely