- Optimizely X Web Experimentation
- Optimizely X Web Personalization
- Optimizely X Web Recommendations
- Optimizely Classic
THIS ARTICLE WILL HELP YOU:
- Understand Optimizely's Stats Engine, its calculations, and how it affects your results
- Distinguish between our Stats Engine and other methodologies
- Predict what behavior you should see from your results over time
- Make business decisions based on the results you see
When you run experiments, Optimizely determines the statistical likelihood of each variation actually leading to more conversions on your goals.
Why does this matter? Because when you look at your results, you’re probably less interested in seeing how a variation compared to the baseline and more interested in predicting whether a variation will be better than baseline when implemented in the future. In other words, you want to make sure your experiment results pay off.
Optimizely’s Stats Engine powers our statistical significance calculations. It uses a statistical framework that is optimized to enable experimenters to run experiments with high statistical rigor while making it easy for anyone to interpret results. Specifically, Stats Engine allows you to make business decisions on results as experiments are running, regardless of preset sample sizes and the number of goals and variations in an experiment.
As with all statistical calculations, it is impossible to predict a variation's lift with certainty. This is why our Results page displays Optimizely’s level of confidence in the results that you see. This way, you can make sophisticated business decisions from your results without an expert level of statistical knowledge.
Optimizely is the first platform to offer this powerful but easy-to-understand statistical methodology. Other statistics frameworks don’t make this so easy—learn more in the tip below.
Looking for more information about Optimizely's Stats Engine? Here are a few resources:
- Session on statistical concepts for experimentation with co-founder Pete Koomen (video)
- Blog post explaining the Stats Engine
- Technical whitepaper on the statistical model
- E-book on statistics for online experiments
- Why Stats Engine controls for false discovery instead of false positives
Or, learn more about the Stats Engine from our Optimizely Academy.
Interpret Stats Engine to make business decisions
The goal of Stats Engine is to make interpreting results as easy as possible for anyone, at any time. Open your Results page and look at the Statistical Significance column for your goal.
If this number is higher than your desired significance level (which is set to 90% by default), you can call a winner or loser based on whether the variation has positive or negative improvement.
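The decision rule above can be sketched in a few lines of Python. This is purely illustrative; `call_result` and its parameters are hypothetical names for this sketch, not part of any Optimizely API:

```python
def call_result(significance, improvement, threshold=0.90):
    """Declare a winner or loser only once the observed statistical
    significance clears the configured threshold; otherwise keep waiting.

    significance : reported statistical significance (0.0 to 1.0)
    improvement  : observed relative lift versus the baseline
    threshold    : your organization's significance setting (90% default)
    """
    if significance < threshold:
        return "inconclusive"
    return "winner" if improvement > 0 else "loser"


print(call_result(0.95, 0.034))   # "winner": significant, positive lift
print(call_result(0.85, 0.034))   # "inconclusive": below the threshold
```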
Stats Engine provides an accurate representation of statistical significance any time an experiment is running, which means you can make decisions on results with lower or higher significance levels.
The default statistical significance setting, 90%, won’t be right for every organization. The right setting depends on factors like velocity, traffic levels, and risk tolerance. We encourage you to run experiments with statistical standards that match your business needs. Just make sure to account for your significance level as you make business decisions based on your results.
For instance, if you have lower traffic levels, it might make sense to run experiments with a statistical significance setting of 80% so you can run more experiments. Check out our article about the statistical significance setting to learn more.
When you run an experiment with many variations and metrics, there’s a greater chance that some of them will give false positive results. Stats Engine uses false discovery rate control to address this issue and reduce your chance of making an incorrect business decision or implementing a false positive among conclusive results. As a result, the more metrics you add to an experiment, the more conservative Stats Engine becomes. To learn how Stats Engine prioritizes primary and secondary metrics and monitoring goals, see Stats Engine approach to metrics and goals.
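To see what false discovery rate control means in practice, here is the classic Benjamini-Hochberg step-up procedure, the textbook method for controlling FDR across many metrics. Stats Engine uses its own sequential variant, so treat this only as an illustration of the concept:

```python
def benjamini_hochberg(p_values, q=0.10):
    """Benjamini-Hochberg step-up procedure: return the indices of
    hypotheses that can be declared significant while keeping the
    expected false discovery rate at or below q.

    Note how the per-metric bar (rank * q / m) tightens as the number
    of metrics m grows -- more metrics means a more conservative test.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Compare the rank-th smallest p-value against rank * q / m
        if p_values[i] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Four metrics' p-values; only the first two survive FDR control at q=5%
print(benjamini_hochberg([0.01, 0.02, 0.30, 0.90], q=0.05))  # [0, 1]
```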
How statistical significance increases over time
Optimizely’s Stats Engine uses sequential experimentation, not the fixed-horizon testing you would see on other platforms. This means that instead of fluctuating over time, statistical significance should generally increase as Optimizely collects more evidence: stronger evidence progressively increases your statistical significance.
Optimizely collects two main forms of conclusive evidence as time goes on:
- Larger conversion rate differences
- Conversion rate differences that persist over more visitors
The weight of this evidence depends on time. Early in an experiment, when your sample size is still low, large deviations between conversion rates are treated more conservatively than when your experiment has a larger number of visitors. At this point, you'll see a Statistical Significance line that starts flat, but increases sharply as Optimizely begins to collect evidence.
In a controlled environment, you should expect a stepwise, always-increasing pattern for statistical significance. When statistical significance increases sharply, the experiment is accumulating more conclusive evidence than it had before. Conversely, during flat periods, Stats Engine is not finding additional conclusive evidence beyond what it already knew about your experiment.
Below, you'll see how Optimizely collects evidence over time and displays it on the Results page. The area circled in red is the "flat" line you would expect to see early in an experiment.
When statistical significance crosses your accepted threshold, we declare a winner or loser based on the direction of the improvement. Learn more in our community discussion on the stepwise increase.
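The technical whitepaper linked above describes Stats Engine's sequential test as a mixture sequential probability ratio test (mSPRT), which produces "always valid" p-values that only move toward significance as evidence accumulates. The sketch below illustrates that ratcheting behavior under a simplified known-variance normal model; the function name, the `tau2` prior, and the known-variance assumption are illustrative simplifications, not Optimizely's implementation:

```python
import math

def always_valid_p_values(diffs, var, tau2=1.0):
    """Sketch of an mSPRT: recompute an always-valid p-value after
    every new observation.

    diffs : stream of per-visitor differences (variation - baseline)
    var   : variance of a single difference (assumed known here)
    tau2  : variance of the normal mixing prior over the true effect
    """
    p, total, out = 1.0, 0.0, []
    for n, d in enumerate(diffs, start=1):
        total += d
        mean = total / n
        # Mixture likelihood ratio against H0 (true effect = 0)
        lam = math.sqrt(var / (var + n * tau2)) * math.exp(
            (n * mean) ** 2 * tau2 / (2 * var * (var + n * tau2))
        )
        # Taking the running minimum means the p-value can only fall,
        # so significance ratchets upward in a controlled environment
        p = min(p, 1.0 / lam)
        out.append(p)
    return out


# No true effect: the p-value never drops, so significance stays flat
print(always_valid_p_values([0.0] * 50, var=1.0)[-1])

# A persistent effect: the p-value shrinks as visitors accumulate
print(always_valid_p_values([0.5] * 200, var=1.0)[-1])
```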
Corrections due to external events
In a controlled environment, Optimizely's Stats Engine will provide a statistical significance calculation that is always increasing. However, experiments in the real world are not a controlled environment, and variables can change mid-experiment. Our analysis shows that this happens rarely, in only about 4% of experiments.
If this happens, the Stats Engine may lower its statistical significance calculation. If the statistical significance lowers, it is because Optimizely has seen evidence strong enough to support one of these possibilities:
- We saw a run of data that looked significant, but now we have enough additional information to say that it probably isn't.
- There was an underlying change in the environment that required us to be more conservative.
- Traffic allocation was changed while the experiment was running, which can cause problems with the accuracy of results.
How Optimizely deals with revenue
In a nutshell, Stats Engine works as intended for revenue-per-visitor goals. You can look at your results any time and get an accurate assessment of your error rates on winners and losers, as well as difference intervals on the average revenue per visitor (RPV).
Testing for a difference in average revenue between a variation and the baseline is more challenging than testing for a difference in conversion rates. This is because revenue distributions tend to be heavily right-tailed, or skewed, which undermines the distributional assumptions that many techniques rely on, including t-tests and Stats Engine. The practical implication is that these methods lose power: they become less able to detect differences in average revenue when those differences actually exist.
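The skewness described above is easy to see in toy data. The sketch below computes the standardized third moment (the usual sample skewness statistic) for a made-up revenue distribution where most visitors spend nothing and a few spend a lot; all numbers are invented for illustration:

```python
def sample_skewness(values):
    """Standardized third moment: roughly 0 for symmetric data,
    large and positive for right-tailed data such as per-visitor
    revenue."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Typical revenue pattern: 90% of visitors spend nothing,
# a handful place small orders, and a couple place large ones
revenue = [0.0] * 90 + [40.0] * 8 + [600.0] * 2
print(sample_skewness(revenue))          # strongly positive (right-tailed)
print(sample_skewness([10.0, 20.0, 30.0, 40.0, 50.0]))  # 0.0 (symmetric)
```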
Through a method called skew correction, Optimizely’s Stats Engine regains some of this lost power when testing revenue (or any continuously valued goal). We explicitly designed skew corrections to work well with all other aspects of Stats Engine.
This will affect you in two ways:
- Detecting differences in average revenue becomes feasible at the visitor counts Optimizely customers regularly see in A/B tests.
- Confidence intervals for continuously valued goals are no longer symmetric about the currently observed effect size. The underlying skewness of the distribution is now correctly factored into the shape of the confidence interval.