- Understand Optimizely's Stats Engine, its calculations, and how it affects your results
- Explain the difference between our Stats Engine and other methodologies
- Use statistical significance and difference intervals to analyze results
- Predict what behavior you should see from your results over time
- Make business decisions based on the results you see
What is Optimizely Stats Engine and how is it different from other statistical models?
When you run experiments, Optimizely determines the statistical likelihood of each variation actually leading to more conversions on your goals.
Why does this matter? Because when you look at your results, you’re probably less interested in seeing how a variation compared to the baseline, and more interested in predicting whether or not a variation will be better than baseline when implemented in the future. In other words, you want to make sure your experiment results pay off.
Optimizely’s Stats Engine powers our statistical significance calculations. It uses a statistical framework that is optimized to enable experimenters to run tests with high statistical rigor while making it easy for anyone to interpret results. Specifically, Stats Engine will allow customers to make business decisions on results as tests are running, regardless of preset sample sizes and the number of goals and variations in a test.
As with all statistical calculations, it is impossible to predict a variation's lift with certainty. This is why our Results page displays Optimizely’s level of confidence in the results that you see. This way, you can make sophisticated business decisions from your results without an expert level of statistical knowledge.
Optimizely is the first platform to offer this powerful but easy-to-understand statistical methodology. You can learn more about why other statistics frameworks don’t make this easy in the tip below.
Tip: Looking for more information about Optimizely's Stats Engine? Check out our...
Stats on the Results Page
Optimizely won't declare winners and losers until you have at least 100 visitors and 25 conversions on each variation, and more commonly you'll see results once Optimizely has determined that they are statistically significant. What's that? Read on, or watch this short video.
Statistical significance represents the likelihood that the difference in conversion rates between a given variation and the baseline is not due to chance. Your statistical significance level reflects your risk tolerance and confidence level.
For example, if your results are significant at a 90% significance level, then you can say that you are 90% confident that the results you see are due to an actual underlying change in behavior, not just random chance.
Why is this necessary? Because, in statistics, you observe a sample of the population and use it to make inferences about the total population. In Optimizely, this is used to infer whether your variation caused movement in the Improvement metric.
There's always a chance that the lift you observed was a result of typical fluctuation in conversion rates, instead of an actual change in underlying behavior. For example, if you set an 80% significance level, and you see a winning variation, there’s a 20% chance that what you’re seeing is not actually a winning variation. At a 90% significance level, the chance of error decreases to 10%. The higher your significance, the more visitors your experiment will require. The highest significance that Optimizely will display is >99%, as it is technically impossible for results to be 100% significant.
Statistical significance helps us to control the rate of errors in experiments. In any controlled experiment, you should anticipate three possible outcomes:
- Accurate Results. When there is an underlying, positive (negative) difference between your original and your variation, the data shows a winner (loser), and when there isn’t a difference, the data shows an inconclusive result.
- False Positive. Your test data shows a significant difference between your original and your variation, but it’s actually random noise in the data—there is no underlying difference between your original and your variation.
- False Negative. Your test shows an inconclusive result, but your variation is actually different from your baseline.
Statistical significance is a measure of how likely it is that your Improvement comes from an actual change in underlying behavior, instead of a false positive.
By default, we set significance at 90%, meaning that there’s a 90% chance that the observed effect is real, and not due to chance. In other words, you will declare 9 out of 10 winning or losing variations correctly. If you would like to use a different significance threshold, you can set a significance level at which you would like Optimizely to declare winners and losers for your project.
Lower significance levels may increase the likelihood of error but can also enable customers to test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.
Choosing the right significance level should balance the types of tests you are running, the confidence you would like to have in these tests, and the amount of traffic you actually receive.
False Discovery Rate control
Every test has a chance of reporting a false positive -- in other words, reporting a conclusive result when in reality there’s no underlying difference in behavior between the two variations. You can calculate the rate of error for a given test as 100 - [statistical significance]. This means that higher statistical significance numbers decrease the rate of false positives.
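The arithmetic above can be sketched in a few lines, together with how that per-test error compounds when several comparisons are each judged at the same level. This is a minimal illustration, not Optimizely's calculation: the function names are hypothetical, and the compounding formula assumes the comparisons are independent and that none has a real underlying difference.

```python
def error_rate(significance_pct: float) -> float:
    """Chance (in percent) of a false positive for a single test."""
    return 100.0 - significance_pct


def chance_of_any_false_positive(significance_pct: float, comparisons: int) -> float:
    """Percent chance that at least one of several comparisons is wrongly
    reported as conclusive. Assumes independent comparisons with no real effect."""
    alpha = error_rate(significance_pct) / 100.0
    return 100.0 * (1.0 - (1.0 - alpha) ** comparisons)


print(error_rate(90))                                  # 10.0
print(round(chance_of_any_false_positive(90, 5), 1))   # 41.0
```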
Using traditional statistics, you increase your exposure to false positives as you test many goals and variations at once (the “multiple comparisons” or “multiple testing” problem). This happens because traditional statistics controls the false positive rate among all goals and variations. Yet this rate of error does not match the chance of making an incorrect business decision, or implementing a false positive among conclusive results. Below, we illustrate how this risk increases as you add goals and variations:
In the above illustration, there are 9 truly inconclusive results, and 1 false winner, resulting in an overall false positive rate of about 10%. However, the business decision you'll make is to implement the winning variations, not the inconclusive ones. The rate of error of implementing a false positive from the winning variations is 1 out of 2, or 50%. This is known as the proportion of false discoveries.
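The two error rates in this example can be reproduced with simple counts. The figure of one true winner is an assumption inferred from the "1 out of 2" statement; the variable names are illustrative only.

```python
# Counts taken from the example; the single true winner is an assumption
# inferred from "1 out of 2" winning variations being false.
true_winners = 1
false_winners = 1
truly_inconclusive = 9

# False positive rate: false winners among all comparisons with no real difference.
no_real_difference = false_winners + truly_inconclusive
false_positive_rate = false_winners / no_real_difference

# Proportion of false discoveries: false winners among all declared winners.
declared_winners = true_winners + false_winners
false_discovery_proportion = false_winners / declared_winners

print(f"{false_positive_rate:.0%}", f"{false_discovery_proportion:.0%}")  # 10% 50%
```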
Optimizely controls errors, and the risk of incorrect business decisions, by controlling the False Discovery Rate instead of False Positive Rate. We define error rate as: False Discovery Rate = (average number of incorrect winning and losing declarations) / (total number of winning and losing declarations). Read more about the distinction between False Positive Rate and False Discovery Rate in our blog post.
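For readers who want a feel for how a False Discovery Rate can be controlled, the textbook Benjamini-Hochberg procedure is sketched below. This is a general-purpose technique shown under stated assumptions; it is not presented as Stats Engine's implementation, and it does not model any special treatment of a primary goal. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return one boolean per p-value: True where the comparison is
    declared conclusive while keeping the expected FDR at or below `fdr`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k/m * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Declare every comparison ranked at or below the cutoff.
    declared = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            declared[idx] = True
    return declared


# Hypothetical p-values for five goal/variation comparisons.
p = [0.001, 0.09, 0.08, 0.5, 0.6]
print(benjamini_hochberg(p))        # [True, False, False, False, False]
print([x <= 0.10 for x in p])       # [True, True, True, False, False]
```

Comparing the two printed lines shows the trade-off: judging each comparison separately at 10% would declare three results, while the FDR-controlling procedure declares only the one with strong evidence.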
We do not recommend adding a goal or variation after you’ve started an experiment. While early on this is unlikely to have an effect, as you see more and more traffic there is a higher chance that adding a new goal or variation will impact your existing results.
Optimizely makes sure that the goal you choose as your primary goal always has the highest statistical power, by treating it specially in our False Discovery Rate control calculations. Our False Discovery Rate control protects the integrity of all your goals from the "multiple testing problem" when you add several goals and variations to your experiment, without slowing down your primary goal's significance. Learn more about how to set your primary goal in our Goals article, and how to optimize your goals for fast significance with Stats Engine here.
False Discovery Rates in Optimizely X
We've updated False Discovery Rates in Optimizely X to better match the diverse approaches our customers take to run experiments. In the previous section, we discussed how your chance of making an incorrect business decision increases as you add more goals and variations. This is true, but it's not the whole story.
Consider an experiment with seven (7) goals: one (1) headline metric that determines success of your experiment; four (4) secondary metrics tracking supplemental information; and two (2) diagnostic metrics used for debugging. These metrics aren't all equally important -- and statistical significance isn't as meaningful for some (the diagnostic metrics) as it is for others (the headline metric).
Yet the False Discovery Rate in Optimizely Classic treats them all as equals. Diagnostic metrics increase time to significance on other metrics just as much as other metrics impact the diagnostics.
In Optimizely X, we solve this problem by allowing you to rank your metrics. The first ranked metric is still your primary metric. Metrics ranked 2-5 can be considered secondary. Secondary metrics take longer to reach significance as you add more of them, but they don't impact the primary metric's speed to significance. Finally, any metrics ranked beyond the first five are diagnostic. Diagnostic metrics take longer to reach significance if there are more of them, but have minimal impact on secondary metrics, and no impact on the primary metric.
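The ranking rule described above maps directly to a small lookup. The sketch below only encodes the tiers named in the text (rank 1, ranks 2-5, ranks beyond 5); the function name is hypothetical and nothing here reflects how Optimizely stores metrics internally.

```python
def metric_tier(rank: int) -> str:
    """Map a metric's rank (1 = first) to the tier described in the text."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    if rank == 1:
        return "primary"
    if rank <= 5:
        return "secondary"
    return "diagnostic"


# The seven-goal experiment from the example: 1 headline, 4 secondary, 2 diagnostic.
tiers = [metric_tier(rank) for rank in range(1, 8)]
print(tiers)
```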
The result is that your chance of making a mistake on your primary metric is controlled, and the False Discovery Rate of all other metrics is controlled as well, all while prioritizing fast significance on the metrics that matter most.
Statistical significance tells you whether a variation is outperforming or underperforming the baseline, at some level of confidence. Difference intervals tell you the range of values where the difference between the original and the variation actually lies, after removing typical fluctuation.
The difference interval is a confidence interval of the conversion rates that you can expect to see when implementing a given variation. Think of it as your "margin of error" on the absolute difference between two conversion rates.
When a variation reaches statistical significance, its difference interval lies entirely above (winning variation) or below (losing variation) 0%.
- A winning variation will have a difference interval that is completely above 0%.
- An inconclusive variation will have a difference interval that includes 0%.
- A losing variation will have a difference interval that is completely below 0%.
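The three cases above amount to checking where the interval sits relative to 0%. A minimal sketch, with a hypothetical function name and illustrative interval bounds in percentage points:

```python
def classify_interval(lower: float, upper: float) -> str:
    """Classify a difference interval by its position relative to 0%."""
    if lower > 0:
        return "winner"        # interval entirely above 0%
    if upper < 0:
        return "loser"         # interval entirely below 0%
    return "inconclusive"      # interval includes 0%


print(classify_interval(0.29, 4.33))    # winner
print(classify_interval(-0.58, 3.78))   # inconclusive
print(classify_interval(-2.41, -1.03))  # loser
```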
Optimizely sets your difference interval at the same level that you set your statistical significance threshold for the project. So if you accept 90% significance to declare a winner, you also accept 90% confidence that the interval is accurate.
Note that the difference interval represents the absolute difference in conversion rates, not the relative difference. So, in other words, if your baseline conversion rate was 10% and your variation conversion rate was 11%, then:
- The absolute difference in conversion rates was 1%
- The relative difference in conversion rates was 10% - this is what Optimizely calls Improvement
In the difference interval, you will see a range that contains 1%, not 10%.
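The distinction between the two differences is easy to check by hand. A short sketch using the 10% and 11% rates from the example (variable names are illustrative):

```python
baseline = 0.10    # 10% baseline conversion rate
variation = 0.11   # 11% variation conversion rate

absolute_difference = variation - baseline            # shown in the difference interval
relative_difference = absolute_difference / baseline  # what Optimizely calls Improvement

print(f"absolute: {absolute_difference:.2%}")  # absolute: 1.00%
print(f"relative: {relative_difference:.1%}")  # relative: 10.0%
```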
Example: A "Winning" Interval
In the example above, you can say that there is a 97% chance that the improvement you saw in the bottom variation is not due to chance. But the improvement Optimizely measured (+15.6%) may not be the exact improvement you see ongoing.
In reality, the difference in conversion rate will likely be between 0.29% and 4.33% over the baseline conversion rate if you were to implement that variation instead of the original. So, compared to a baseline conversion rate of 14.81%, you're likely to see your variation convert in the range between 15.1% (14.81 + 0.29) and 19.14% (14.81 + 4.33).
Even though the statistical significance is 97%, there's still a 90% chance that the actual results will fall in the range of the difference interval -- this is because the Statistical Significance Setting for your project is set to 90%. In other words, the probability of your difference interval won't change as your variation's observed statistical significance changes. Rather, you'll generally see it get narrower as Optimizely collects more data.
In this experiment, the observed difference between the original (14.81%) and variation (17.12%) was 2.31%, which is within the difference interval. If we were to rerun this experiment, we would likely find the difference between the baseline and variation conversion rates to be in the same range.
Example: A "Losing" Interval
Let's look at another example, this time with the difference interval entirely below 0.
In the example above, you can say that there is a 91% chance that the negative improvement you saw in the bottom variation is not due to chance. But the improvement Optimizely measured (-21.9%) may not be exactly what you see ongoing.
In reality, the variation's conversion rate will likely be between 2.41% and 1.03% below the baseline conversion rate if you were to implement that variation instead of the original. So, compared to a baseline conversion rate of 7.86%, you're likely to see your variation convert in the range between 5.45% (7.86 - 2.41) and 6.83% (7.86 - 1.03).
In this experiment, the observed difference between the original (7.86%) and variation (6.14%) was -1.72%, which is within the difference interval. If we were to rerun this experiment, we would likely find the difference between the baseline and variation conversion rates to be in the same range.
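The arithmetic in the winning and losing examples can be checked with a small helper. The function and key names are hypothetical; all rates and interval bounds are the ones quoted in the two examples, in percentage points.

```python
def summarize(baseline_rate, variation_rate, interval):
    """Recompute the figures quoted for a difference-interval example."""
    lower, upper = interval
    observed = variation_rate - baseline_rate
    return {
        "observed_difference": round(observed, 2),
        "improvement": round(observed / baseline_rate * 100, 1),
        "expected_range": (round(baseline_rate + lower, 2),
                           round(baseline_rate + upper, 2)),
        "observed_inside_interval": lower <= observed <= upper,
    }


winning = summarize(14.81, 17.12, (0.29, 4.33))
losing = summarize(7.86, 6.14, (-2.41, -1.03))
print(winning)
print(losing)
```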
Example: An Inconclusive Interval
If you need to stop a test early or have a low sample size, the difference interval will tell you (roughly) whether implementing that variation will have a positive or negative impact.
For this reason, when you see low statistical significance on certain goals, the difference interval can serve as another data point to help you make decisions. When you have an inconclusive goal, the interval will look like this:
Here, we can say that the difference in conversion rates for this variation will be between -0.58% and 3.78% -- in other words, it could be positive or negative; Optimizely doesn't know yet.
When implementing this variation, you can say, "We implemented a test result that we are 90% confident is no more than 0.58% worse and no more than 3.78% better than the baseline," which allows you to make a business decision about whether implementing that variation would be worthwhile.
Another way you can interpret the difference interval is as worst case / middle ground / best case scenarios. For example, we are 90% confident that the worst case absolute difference between variation and baseline conversion rates is -0.58%, the best case is 3.78%, and a middle ground is 1.6%.
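The worst case / middle ground / best case reading is just the two interval bounds and their midpoint. A minimal sketch using the bounds from this example (the dictionary keys are illustrative):

```python
lower, upper = -0.58, 3.78  # difference interval bounds, in percentage points

scenarios = {
    "worst case": lower,
    "middle ground": round((lower + upper) / 2, 2),  # midpoint of the interval
    "best case": upper,
}
print(scenarios)  # {'worst case': -0.58, 'middle ground': 1.6, 'best case': 3.78}
```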
Connection between Statistical Significance and Difference Intervals
As we mentioned above, there is a 90% chance that the underlying difference between baseline and variation conversion rates, the one that remains when you remove typical fluctuation, will fall in the range of the difference interval. This is because the Statistical Significance Setting for your project is set to 90%. If we wanted to be more confident that the underlying conversion rate difference fell into the range of the difference interval, we would widen the difference interval. This is done by raising the Statistical Significance Setting.
Higher levels of the Statistical Significance Setting correspond to wider difference intervals and a higher chance that the interval contains the underlying difference, and vice versa.
In fact, there’s an even deeper connection going on. Since there is a 90% chance the underlying difference (after removing random fluctuation) lies within the difference interval, there is a 10% chance it does not. So when we have a difference interval that is completely to the left or right of 0, we know that there is at most 10% chance the underlying conversion rate difference is 0. We are at least 90% confident that the underlying difference is not zero, or equivalently, that what we observed is not due to random fluctuation. But this is exactly how we described statistical significance!
In conclusion, both the calling of winners and losers and the width of the confidence interval are controlled by the Statistical Significance Setting. A winner (or loser) is called at the same time as the difference interval moves completely to the right (or left) of 0.
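The connection described in this section can be seen in a simplified setting. The sketch below uses a classical fixed-horizon normal approximation, which is an assumption made purely for illustration: Stats Engine's sequential intervals are computed differently, so only the qualitative behavior carries over. The function names and visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def _difference_and_se(conv_a, n_a, conv_b, n_b):
    """Observed difference in conversion rates and its standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_b - p_a, se


def difference_interval(conv_a, n_a, conv_b, n_b, confidence):
    """Normal-approximation interval for the absolute difference."""
    diff, se = _difference_and_se(conv_a, n_a, conv_b, n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff - z * se, diff + z * se


def significance(conv_a, n_a, conv_b, n_b):
    """One minus the two-sided p-value of the matching z-test."""
    diff, se = _difference_and_se(conv_a, n_a, conv_b, n_b)
    return 1 - 2 * (1 - NormalDist().cdf(abs(diff) / se))


# Hypothetical counts: baseline 100/1000 conversions, variation 120/1000.
data = (100, 1000, 120, 1000)
for level in (0.80, 0.90, 0.95):
    low, high = difference_interval(*data, confidence=level)
    excludes_zero = low > 0 or high < 0
    print(level, round(high - low, 4), excludes_zero, significance(*data) > level)
```

In each printed row the last two columns agree: the interval excludes 0 exactly when the significance exceeds the chosen level, and the interval width grows as the level rises.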
Improvement Intervals in Optimizely X
In order to reduce confusion, the Results page in Optimizely X shows the relative difference between variation and baseline measurement, instead of the absolute difference. This is true for all metrics, regardless of whether they are binary conversions or numeric.
In Optimizely X, an improvement interval of 1% to 10% means that the variation sees between 1% and 10% improvement over baseline. For example, if baseline conversion rate is 25%, then we expect the variation conversion rate to fall between 25.25% and 27.5%.
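Converting a relative improvement interval into expected conversion rates is a one-line calculation. A small sketch with a hypothetical function name, using the figures from the example above:

```python
def improvement_to_rates(baseline_rate, improvement_low, improvement_high):
    """Turn a relative improvement interval (in percent) into the range of
    variation conversion rates it implies (in percent)."""
    return (round(baseline_rate * (1 + improvement_low / 100), 2),
            round(baseline_rate * (1 + improvement_high / 100), 2))


print(improvement_to_rates(25.0, 1.0, 10.0))  # (25.25, 27.5)
```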
Note that significance and confidence intervals are still connected in the same way. So, your experiment reaches significance at exactly the same time that your confidence interval on improvement moves away from 0.
Estimated wait time and <1% significance
As your experiment or campaign runs, Optimizely provides an estimate of how long you'll need to wait for a test to reach conclusiveness.
This estimate is calculated based on the current, observed baseline and variation conversion rates. If those rates change, the estimate will adjust automatically.
Sometimes, you may see a large lift in the Improvement column, but <1% significance. You'll also see a certain number of "visitors remaining." What does this mean?
In statistical terms, this experiment is currently "underpowered": a relatively small number of visitors have entered the experiment, and Optimizely needs to gather more evidence to determine whether the change you see is a true difference in visitor behaviors, or chance. If you look in the Unique Conversions column, you'll probably see relatively low numbers.
Let's look at the "Search Only" variation in the example above. Currently, Optimizely needs ~6,500 more visitors to be exposed to that variation before it can decide on the difference in conversion rates between the "Search Only" and "Original" variations. Keep in mind that the estimate of ~6,500 visitors assumes that the observed conversion rate doesn't fluctuate. If more visitors see the variation but conversions decrease, your experiment will likely take more time -- the "visitors remaining" estimate will increase. If conversions increase, Optimizely requires fewer visitors to be certain that the change in behavior is real.
To learn more about the importance of sample size, see our article on how long to run a test.
Unlike many testing tools, Optimizely's Stats Engine uses a statistical approach that removes the need to decide on a sample size and minimum detectable effect (MDE) before starting a test. You don't have to commit to large sample sizes ahead of time, and you can check on results whenever you want!
However, many optimization programs estimate how long tests take to run so they can build robust roadmaps. Use our Sample Size Calculator to estimate how many visitors you'll need for a given test. Learn more about choosing a minimum detectable effect for our calculator in this article.
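For rough roadmap planning, the relationship between baseline conversion rate, minimum detectable effect, and sample size can be sketched with the textbook fixed-horizon formula for comparing two proportions. This is explicitly not Optimizely's Sample Size Calculator, which accounts for sequential testing and will generally give different numbers; the function name and the 80% power default are assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def fixed_horizon_sample_size(baseline_rate, relative_mde,
                              significance=0.90, power=0.80):
    """Visitors needed per variation under a classical two-sided test.

    baseline_rate: baseline conversion rate as a fraction (0.10 = 10%)
    relative_mde:  minimum detectable effect as a relative lift (0.10 = 10%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - significance) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Smaller effects and stricter significance levels both require more visitors.
print(fixed_horizon_sample_size(0.10, 0.20))
print(fixed_horizon_sample_size(0.10, 0.10))
print(fixed_horizon_sample_size(0.10, 0.10, significance=0.95))
```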
Interpreting Stats Engine to make business decisions
The goal of Stats Engine is to make interpreting results as easy as possible for anyone, at any time. Open your results page and look at the Statistical Significance column for your goal.
If this number is over your desired significance level (by default, 90%), you can call a winner or loser based on whether the variation has positive or negative improvement.
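The decision rule above, combined with the minimum of 100 visitors and 25 conversions mentioned earlier in this article, can be written as a short function. The function name, argument names, and the visitor and conversion counts in the usage lines are hypothetical; the check is shown for a single variation.

```python
def call_result(visitors, conversions, significance_pct, improvement_pct,
                threshold_pct=90.0):
    """Apply the article's decision rule to one variation's results."""
    # Results are not declared below the minimum data requirement.
    if visitors < 100 or conversions < 25:
        return "not enough data"
    # Significance must be over the project's threshold.
    if significance_pct <= threshold_pct:
        return "inconclusive"
    # The direction of the improvement decides winner versus loser.
    if improvement_pct > 0:
        return "winner"
    if improvement_pct < 0:
        return "loser"
    return "inconclusive"


print(call_result(5000, 850, 97.0, 15.6))    # winner
print(call_result(5000, 300, 91.0, -21.9))   # loser
print(call_result(5000, 300, 60.0, 8.0))     # inconclusive
print(call_result(80, 10, 99.0, 40.0))       # not enough data
```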
Stats Engine provides an accurate representation of statistical significance anytime as an experiment is running, enabling you to make decisions on results with lower or higher significance levels.
Optimizely’s default significance level is 90%, but that won’t be right for every organization depending on your velocity, traffic levels, and risk tolerance. We encourage customers to run experiments with statistical standards that match their business needs -- just be sure to account for your significance level as you make business decisions based on your results.
For instance, if you have lower traffic levels it might make sense to run tests with a statistical significance of 80% so you can run more experiments.
How statistical significance increases over time
Remember that Optimizely’s Stats Engine uses sequential testing, instead of the fixed-horizon tests that you would see in other platforms. This means that instead of seeing statistical significance fluctuate over time, it should generally increase over time, as Optimizely collects evidence. Stronger evidence progressively increases your statistical significance.
Optimizely collects two main forms of conclusive evidence as time goes on:
- Larger conversion rate differences
- Conversion rate differences that persist over more visitors
The weight of this evidence depends on time. Early in an experiment, when your sample size is still low, large deviations between conversion rates are treated more conservatively than when your experiment has a larger number of visitors. At this point, you'll see a Statistical Significance line that starts flat, then increases sharply once Optimizely begins to collect evidence.
In a controlled environment, you should expect to see a stepwise, always-increasing behavior for statistical significance. When you see statistical significance increase sharply, you’re seeing the test accumulate more conclusive evidence than it had before. Conversely, during the flat periods, Stats Engine is not finding additional conclusive evidence beyond what it already knew about your test.
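One way to picture the stepwise shape is as a running maximum: the displayed value only moves when new evidence is stronger than anything seen so far, and stays flat otherwise. The sketch below illustrates that shape with made-up numbers; it is not Stats Engine's actual calculation, and the variable names are hypothetical.

```python
from itertools import accumulate

# Hypothetical strength of evidence observed at successive checkpoints.
evidence = [0.0, 0.0, 0.35, 0.20, 0.30, 0.62, 0.55, 0.91]

# Keep the strongest evidence seen so far: flat stretches, then sharp steps.
displayed = list(accumulate(evidence, max))
print(displayed)  # [0.0, 0.0, 0.35, 0.35, 0.35, 0.62, 0.62, 0.91]
```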
Below, you'll see how Optimizely collects evidence over time and displays it on the Results page. The red circled area is the "flat" line you would expect to see early in an experiment.
Once statistical significance crosses your accepted threshold (90%, by default), we will declare a winner or loser based on the direction of the improvement. Learn more about the stepwise increase in our community discussion.
Corrections due to external events
In a controlled environment, Stats Engine will provide a statistical significance calculation that is always increasing. However, experiments in the real world are not a controlled environment, and variables can change mid-experiment. Our analysis shows that this happens rarely: only ~4% of tests.
If this happens, Stats Engine may lower its statistical significance calculation. If the statistical significance lowers, it is because Optimizely has seen evidence strong enough to support one of the following possibilities:
- We saw a run of data that looked significant but now have enough additional information to say that it likely is not
- There was an underlying change in the environment that required us to be more conservative
- You changed your traffic allocation while the experiment was running (this can cause problems with the accuracy of your results)
How does Optimizely deal with revenue?
In a nutshell, Stats Engine works as intended for Revenue Per Visitor goals. You can look at your results at any time and get an accurate assessment of your error rates on winners and losers, as well as difference intervals on the average revenue per visitor (RPV).
Testing for a difference in average revenue between a variation and baseline is more challenging than testing for a difference in conversion rates. The reason is that revenue distributions tend to be heavily right tailed, or skewed. This skewness impedes the distributional results that many techniques, including both T-tests and Stats Engine, rely on. The practical implication is that they end up having less power, or less ability to detect differences in average revenue when there actually is one.
Through a method called skew correction, Optimizely’s Stats Engine is able to regain some of this lost power when testing revenue, or any continuously valued goal for that matter. We have explicitly designed skew corrections to work well with all other aspects of Stats Engine.
This will affect you in two main ways:
- Detecting differences in average revenue is more feasible at the visitor counts that Optimizely customers regularly see in A/B tests.
- Confidence intervals for continuously valued goals are no longer symmetric about their currently observed effect size. The underlying skewness of the distributions is now correctly factored into the shape of the confidence interval.
Does Optimizely use 1-tailed or 2-tailed tests?
When you run a test, you can run a 1-tailed or 2-tailed test. 2-tailed tests are designed to detect differences between your original and your variation in both directions -- they will tell you if your variation is a winner and also if it is a loser. 1-tailed tests are designed to detect differences between your original and your variation in only one direction.
At Optimizely, we formerly used 1-tailed tests. With the introduction of Optimizely Stats Engine, we have switched to 2-tailed tests, because they are necessary for the False Discovery Rate Control that we have implemented in Optimizely Stats Engine.
In reality, False Discovery Rate Control is more important to your ability to make business decisions than whether you use a 1-tailed or 2-tailed test, because when it comes to making business decisions, your main goal is to avoid implementing a false positive or negative.
Switching from a 2-tailed to a 1-tailed test will typically change error rates by a factor of 2, but requires the additional overhead of specifying whether you are looking for winners or losers in advance. So if you knew you were looking for a winner, you could increase your significance from 90% to 95%. On the other hand, as the example above shows, not using false discovery rates can easily inflate error rates by a factor of 5 or more.
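The factor of 2 can be seen with a classical z-test, where the 2-tailed p-value is twice the 1-tailed p-value for the same test statistic. This sketch uses a standard normal approximation purely for illustration and is not Stats Engine's calculation; the function name and the z value are hypothetical.

```python
from statistics import NormalDist


def tail_p_values(z: float):
    """Return (one_tailed, two_tailed) p-values for a z statistic,
    where the 1-tailed test looks for a winner (positive z)."""
    one_tailed = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return one_tailed, two_tailed


one, two = tail_p_values(1.645)
print(round(1 - one, 2))  # 0.95 significance under a 1-tailed test
print(round(1 - two, 2))  # 0.9 significance under a 2-tailed test
```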
It’s more helpful to know the actual chance of implementing false results, and to make sure that your results aren’t compromised by adding multiple goals.