This article will help you:
  • Understand Optimizely's Stats Engine, its calculations, and how it affects your results
  • Explain the difference between our Stats Engine and other methodologies
  • Use statistical significance and difference intervals to analyze results 
  • Predict what behavior you should see from your results over time
  • Make business decisions based on the results you see

What is Optimizely Stats Engine and how is it different from other statistical models?

When you run experiments, Optimizely determines the statistical likelihood of each variation actually leading to more conversions on your goals.

Why does this matter? Because when you look at your results, you’re probably less interested in seeing how a variation compared to the baseline, and more interested in predicting whether or not a variation will be better than baseline when implemented in the future. In other words, you want to make sure your experiment results pay off.

Optimizely’s Stats Engine powers our statistical significance calculations. It uses a statistical framework that is optimized to enable experimenters to run tests with high statistical rigor while making it easy for anyone to interpret results. Specifically, Stats Engine will allow customers to make business decisions on results as tests are running, regardless of preset sample sizes and the number of goals and variations in a test.

As with all statistical calculations, it is impossible to predict a variation's lift with certainty. This is why our Results page displays Optimizely’s level of confidence in the results that you see. This way, you can make sophisticated business decisions from your results without an expert level of statistical knowledge.

Optimizely is the first platform to offer this powerful but easy-to-understand statistical methodology. You can learn more about why other statistics frameworks don’t make this easy in the tip below.

Stats on the Results Page

Statistical Significance

Optimizely won't declare winners or losers until each variation has at least 100 visitors and 25 conversions, and in most cases you'll see a result only once Optimizely has determined that it is statistically significant. What does that mean? Read on, or watch this short video.

 

Statistical significance represents the likelihood that the difference in conversion rates between a given variation and the baseline is not due to chance. Your statistical significance level reflects your risk tolerance and confidence level.

For example, if your results are significant at a 90% significance level, then you can say that you are 90% confident that the results you see are due to an actual underlying change in behavior, not just random chance.

Why is this necessary? Because, in statistics, you observe a sample of the population and use it to make inferences about the total population. In Optimizely, this is used to infer whether your variation caused movement in the Improvement metric.

There's always a chance that the lift you observed was a result of typical fluctuation in conversion rates, instead of an actual change in underlying behavior. For example, if you set an 80% significance level and you see a winning variation, there's a 20% chance that what you're seeing is not actually a winning variation. At a 90% significance level, the chance of error decreases to 10%. The higher your significance, the more visitors your experiment will require. The highest significance that Optimizely will display is >99%, as it is technically impossible for results to be 100% significant.

Statistical significance helps us to control the rate of errors in experiments. In any controlled experiment, you should anticipate three possible outcomes:

  • Accurate Results. When there is an underlying, positive (negative) difference between your original and your variation, the data shows a winner (loser), and when there isn’t a difference, the data shows an inconclusive result.
  • False Positive. Your test data shows a significant difference between your original and your variation, but it’s actually random noise in the data—there is no underlying difference between your original and your variation.
  • False Negative. Your test shows an inconclusive result, but your variation is actually different from your baseline.

Statistical significance is a measure of how likely it is that your Improvement comes from an actual change in underlying behavior, instead of a false positive.

By default, we set significance at 90%, meaning that there’s a 90% chance that the observed effect is real, and not due to chance. In other words, you will declare 9 out of 10 winning or losing variations correctly. If you would like to use a different significance threshold, you can set a significance level at which you would like Optimizely to declare winners and losers for your project.

Lower significance levels may increase the likelihood of error but can also enable customers to test more hypotheses and iterate faster. Higher significance levels decrease the error probability but require a larger sample.

Choosing the right significance level should balance the types of tests you are running, the confidence you would like to have in these tests, and the amount of traffic you actually receive.

False Discovery Rate control

Every test has a chance of reporting a false positive -- in other words, reporting a conclusive result when in reality there’s no underlying difference in behavior between the two variations. You can calculate the rate of error for a given test as 100 - [statistical significance]. This means that higher statistical significance numbers decrease the rate of false positives.

Using traditional statistics, you increase your exposure to false positives as you test many goals and variations at once (the “multiple comparisons” or “multiple testing” problem). This happens because traditional statistics controls the false positive rate among all goals and variations. Yet this error rate does not match the chance of making an incorrect business decision: the chance of implementing a false positive from among your conclusive results. Below, we illustrate how this risk increases as you add goals and variations:

 

In the above illustration, there are 9 truly inconclusive results, and 1 false winner, resulting in an overall false positive rate of about 10%. However, the business decision you'll make is to implement the winning variations, not the inconclusive ones. The rate of error of implementing a false positive from the winning variations is 1 out of 2, or 50%. This is known as the proportion of false discoveries. 

Optimizely controls errors, and the risk of incorrect business decisions, by controlling the False Discovery Rate instead of the False Positive Rate. We define the error rate as: False Discovery Rate = average # of incorrect winning and losing declarations / total # of winning and losing declarations. Read more about the distinction between False Positive Rate and False Discovery Rate in our blog post.
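
To make the numbers concrete, here is a minimal simulation sketch in Python. It is not Optimizely's implementation -- it uses a naive fixed-horizon z-test evaluated once per comparison -- and the visitor counts, baseline conversion rate, and lift are hypothetical. It mirrors the illustration above: one variation-goal combination with a real lift and nine that are truly inconclusive, each tested at a 90% significance threshold.

    # Naive multiple testing: 1 real effect + 9 null comparisons per experiment.
    import numpy as np

    rng = np.random.default_rng(0)

    VISITORS = 10_000       # hypothetical visitors per variation, per comparison
    BASE_RATE = 0.10        # hypothetical baseline conversion rate
    TRUE_LIFT = 0.02        # only comparison 0 has a real absolute lift
    COMPARISONS = 10
    Z_CRIT = 1.645          # two-sided critical value for a 90% significance threshold
    SIMULATIONS = 2_000

    false_discoveries = total_discoveries = null_positives = null_total = 0
    for _ in range(SIMULATIONS):
        for i in range(COMPARISONS):
            lift = TRUE_LIFT if i == 0 else 0.0
            conv_base = rng.binomial(VISITORS, BASE_RATE)
            conv_var = rng.binomial(VISITORS, BASE_RATE + lift)
            pooled = (conv_base + conv_var) / (2 * VISITORS)
            se = np.sqrt(2 * pooled * (1 - pooled) / VISITORS)
            z = (conv_var - conv_base) / VISITORS / se
            significant = abs(z) > Z_CRIT
            if i != 0:                      # truly inconclusive comparisons
                null_total += 1
                null_positives += int(significant)
            if significant:                 # a winner or loser gets declared
                total_discoveries += 1
                false_discoveries += int(i != 0)

    print("False positive rate among inconclusive comparisons:",
          null_positives / null_total)                 # about 10%
    print("Proportion of false discoveries among declarations:",
          false_discoveries / total_discoveries)       # roughly half

The per-comparison false positive rate stays near 10%, but because only a couple of comparisons per experiment are ever declared winners or losers, roughly half of those declarations are false discoveries. That proportion of false discoveries is the error rate Stats Engine is designed to keep under control.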

 
Important:

We do not recommend adding a goal or variation after you’ve started an experiment. While this is unlikely to have an effect early on, as your experiment sees more and more traffic there is a higher chance that adding a new goal or variation will impact your existing results.

 
Note:

Optimizely makes sure that the goal you choose as your primary goal always has the highest statistical power, by treating it specially in our False Discovery Rate control calculations. Our False Discovery Rate control protects the integrity of all your goals from the "multiple testing problem" when you add several goals and variations to your experiment, without slowing down your primary goal's significance. Learn more about how to set your primary goal in our Goals article, and how to optimize your goals for fast significance with Stats Engine here.

Difference Intervals

Statistical significance tells you whether a variation is outperforming or underperforming the baseline, at some level of confidence. Difference intervals tell you the range of values where the difference between the original and the variation actually lies, after removing typical fluctuation.

 

The difference interval is a confidence interval around the difference in conversion rates that you can expect to see if you implement a given variation. Think of it as your "margin of error" on the absolute difference between two conversion rates.

When a variation reaches statistical significance, its difference interval lies entirely above (winning variation) or below (losing variation) 0%.

  • A winning variation will have a difference interval that is completely above 0%.
  • An inconclusive variation will have a difference interval that includes 0%.
  • A losing variation will have a difference interval that is completely below 0%.

Optimizely sets your difference interval at the same level that you set your statistical significance threshold for the project. So if you accept 90% significance to declare a winner, you also accept 90% confidence that the interval is accurate.
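
For intuition only, here is a minimal Python sketch of a classical, fixed-horizon confidence interval for the absolute difference between two conversion rates, using a standard normal approximation and hypothetical counts. Stats Engine computes its difference intervals sequentially, so this is an illustration of the concept rather than a reproduction of the numbers you will see on your Results page.

    # Normal-approximation "margin of error" on the difference between two conversion rates.
    from math import sqrt

    def difference_interval(conv_base, n_base, conv_var, n_var, z=1.645):
        """90% two-sided interval for (variation rate - baseline rate); z=1.645 <-> 90%."""
        p_base = conv_base / n_base
        p_var = conv_var / n_var
        diff = p_var - p_base
        se = sqrt(p_base * (1 - p_base) / n_base + p_var * (1 - p_var) / n_var)
        return diff - z * se, diff + z * se

    # Hypothetical counts: 8% baseline vs. 9% variation, 10,000 visitors each.
    low, high = difference_interval(800, 10_000, 900, 10_000)
    print(f"Difference interval: [{low:.2%}, {high:.2%}]")   # roughly [0.35%, 1.65%]

Because this interval lies entirely above 0%, the hypothetical variation would be called a winner at the 90% level.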

Note that the difference interval represents the absolute difference in conversion rates, not the relative difference. In other words, if your baseline conversion rate was 10% and your variation's conversion rate was 11%, then:

  • The absolute difference in conversion rates was 1%
  • The relative difference in conversion rates was 10% - this is what Optimizely calls Improvement

In the difference interval, you will see a range that contains 1%, not 10%.
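
For clarity, here is that arithmetic as a short Python snippet, using the hypothetical 10% and 11% rates from the example above:

    baseline_rate = 0.10     # 10% baseline conversion rate
    variation_rate = 0.11    # 11% variation conversion rate

    absolute_difference = variation_rate - baseline_rate        # 0.01 -> 1%
    relative_difference = absolute_difference / baseline_rate   # 0.10 -> 10% ("Improvement")

    print(f"Absolute difference: {absolute_difference:.2%}")                  # 1.00%
    print(f"Relative difference (Improvement): {relative_difference:.2%}")    # 10.00%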

Example: A "Winning" Interval

In the example above, you can say that there is a 97% chance that the improvement you saw in the bottom variation is not due to chance. But the improvement Optimizely measured (+15.6%) may not be the exact improvement you see ongoing.

In reality, the difference in conversion rate will likely be between .29% and 4.33% over the baseline conversion rate if you were to implement that variation instead of the original. So, compared to a baseline conversion rate of 14.81%, you're likely to see your variation convert in the range between 15.1% (14.81 + .29) and 19.14% (14.81 + 4.33).

Even though the statistical significance is 97%, there's still a 90% chance that the actual results will fall in the range of the difference interval -- this is because the Statistical Significance Setting for your project is set to 90%. In other words, the confidence level of your difference interval won't change as your variation's observed statistical significance changes. Rather, you'll generally see the interval get narrower as Optimizely collects more data.

In this experiment, the observed difference between the original (14.81%) and variation (17.12%) was 2.31%, which is within the difference interval. If we were to rerun this experiment, we would likely find the difference between the baseline and variation conversion rates to be in the same range.

Example: A "Losing" Interval

Let's look at another example, this time with the difference interval entirely below 0.

In the example above, you can say that there is a 91% chance that the negative improvement you saw in the bottom variation is not due to chance. But the improvement Optimizely measured (-21.9%) may not be exactly what you see ongoing.

In reality, the difference in conversion rate will likely be between -2.41% and -1.03% under the baseline conversion rate if you were to implement that variation instead of the original. So, compared to a baseline conversion rate of 7.86%, you're likely to see your variation convert in the range between 5.45% (7.86 - 2.41) and 6.83% (7.86 - 1.03).

In this experiment, the observed difference between the original (7.86%) and variation (6.14%) was -1.72%, which is within the difference interval. If we were to rerun this experiment, we would likely find the difference between the baseline and variation conversion rates to be in the same range.

Example: An Inconclusive Interval

If you need to stop a test early or have a low sample size, the difference interval gives you a rough sense of how positive or negative the impact of implementing that variation could be.

For this reason, when you see low statistical significance on certain goals, the difference interval can serve as another data point to help you make decisions. When you have an inconclusive goal, the interval will look like this:

Here, we can say that the difference in conversion rates for this variation will be between -0.58% and 3.78% -- in other words, it could be positive or negative; Optimizely doesn't know yet.

When implementing this variation, you can say, "We implemented a test result that we are 90% confident is no more than 0.58% worse and no more than 3.78% better than the original," which allows you to make a business decision about whether implementing that variation would be worthwhile.

Another way you can interpret the difference interval is as worst case / middle ground / best case scenarios. For example, we are 90% confident that the worst case absolute difference between variation and baseline conversion rates is -0.58%, the best case is 3.78%, and a middle ground is 1.6%.
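
The 1.6% middle-ground figure appears to be simply the midpoint of the interval. A quick check in Python, using the endpoints from the example above:

    worst_case, best_case = -0.0058, 0.0378       # -0.58% and 3.78%
    middle_ground = (worst_case + best_case) / 2
    print(f"Middle ground: {middle_ground:.2%}")  # 1.60%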

Connection between Statistical Significance and Difference Intervals

As we mentioned above, there is a 90% chance that the underlying difference between baseline and variation conversion rates, the one that remains when you remove typical fluctuation, will fall in the range of the difference interval. This is because the Statistical Significance Setting for your project is set to 90%. If we wanted to be more confident that the underlying conversion rate difference fell into the range of the difference interval, we would widen the difference interval. This is done by raising the Statistical Significance Setting. 

Higher levels of the Statistical Significance Setting correspond to wider difference intervals and a higher chance that the interval contains the underlying difference, and vice versa.

In fact, there’s an even deeper connection going on. Since there is a 90% chance the underlying difference (after removing random fluctuation) lies within the difference interval, there is a 10% chance it does not. So when we have a difference interval that is completely to the left or right of 0, we know that there is at most 10% chance the underlying conversion rate difference is 0. We are at least 90% confident that the underlying difference is not zero, or equivalently, that what we observed is not due to random fluctuation. But this is exactly how we described statistical significance!

In conclusion, both the calling of winners and losers and the width of the difference interval are controlled by the Statistical Significance Setting. A winner (or loser) is called at the same time that the difference interval moves completely to the right (or left) of 0.

Estimated wait time and <1% significance

As your test is running, Optimizely will also provide an estimate of how long you'll need to wait for the test to reach a conclusive result, assuming that the baseline and variation conversion rates don't change from their currently observed values. This is the wait time you can expect on average, so individual results may vary, but they likely will not be far off.

As you look at your results, you may notice a large percentage in the Improvement column alongside <1% significance and a certain number of "visitors remaining." Why would you see this? If you look at your Unique Conversions for that variation, you'll likely notice the number is quite low. In statistical terms, this test is "underpowered": with only a few visitors, Optimizely lacks enough evidence to determine whether the effect you're seeing is due to a real difference between the original and the variation, or pure chance. As the number of visitors increases to 40% or even 50% of the total visitors needed to reach a significant conclusion, you will see Statistical Significance begin to increase.

To make a determination about the difference between conversion rates for a variation and the baseline, Optimizely still needs about 300 visitors to be exposed to that variation. Keep in mind that the 300 visitor estimate assumes the improvement doesn't fluctuate. If improvement decreases (increases) as you get more data, your experiment will likely take more (fewer) visitors, and the visitors remaining estimate will adjust accordingly.

To learn more about the importance of sample size, see our article on how long to run a test.

 
Tip:

What happened to the Sample Size Calculator? Previously in Optimizely (and still, with the statistics supported by many testing tools), you needed to set a sample size and Minimum Detectable Effect (MDE) before starting your test. With Stats Engine, you no longer need to lock yourself in ahead of time, so you can run more tests without committing to large sample sizes up front!

However, for many testing programs, it is important to estimate ahead of time how long tests might take to run. We have created a new sample size calculator based on Stats Engine that will allow you to estimate how long a test will take to run, on average. We recommend using this calculator and choosing an MDE that represents the longest you are willing to wait. If you happen to underestimate the improvement of the experiment, you will be able to stop the test earlier.
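
If you want a rough, back-of-the-envelope sense of how baseline conversion rate and MDE translate into traffic, here is a classical fixed-horizon estimate in Python (a standard two-proportion approximation with hypothetical inputs). It is not Optimizely's sequential sample size calculator, so its numbers will not match the calculator exactly, but it shows why smaller MDEs require much larger samples:

    # Classical fixed-horizon sample size for detecting a relative lift in conversion rate.
    from math import ceil
    from scipy.stats import norm

    def visitors_per_variation(baseline_rate, mde_relative, alpha=0.10, power=0.80):
        """Approximate visitors needed per variation (two-sided test)."""
        p1 = baseline_rate
        p2 = baseline_rate * (1 + mde_relative)
        z_alpha = norm.ppf(1 - alpha / 2)   # significance threshold (90% here)
        z_beta = norm.ppf(power)            # desired statistical power (80% here)
        variance_term = p1 * (1 - p1) + p2 * (1 - p2)
        return ceil((z_alpha + z_beta) ** 2 * variance_term / (p2 - p1) ** 2)

    # Hypothetical inputs: 5% baseline conversion rate, 20% relative MDE.
    print(visitors_per_variation(0.05, 0.20))   # about 6,400 visitors per variation

Halving the MDE to 10% roughly quadruples the required sample, which is why committing to a small MDE up front can lock you into a very long test.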

Interpreting Stats Engine to make business decisions

The goal of Stats Engine is to make interpreting results as easy as possible for anyone, at any time. Open your results page and look at the Statistical Significance column for your goal.

If this number is over your desired significance level (by default, 90%), you can call a winner or loser based on whether the variation has positive or negative improvement.

Stats Engine provides an accurate representation of statistical significance at any time while an experiment is running, enabling you to make decisions on results at lower or higher significance levels.

Optimizely’s default significance level is 90%, but that won’t be right for every organization; the right threshold depends on your velocity, traffic levels, and risk tolerance. We encourage customers to run experiments with statistical standards that match their business needs -- just be sure to account for your significance level as you make business decisions based on your results.

For instance, if you have lower traffic levels it might make sense to run tests with a statistical significance of 80% so you can run more experiments.


How statistical significance increases over time

Remember that Optimizely’s Stats Engine uses sequential testing, instead of the fixed-horizon tests that you would see in other platforms. This means that instead of seeing statistical significance fluctuate over time, it should generally increase over time, as Optimizely collects evidence. Stronger evidence progressively increases your statistical significance.

Optimizely collects two main forms of conclusive evidence as time goes on:

  • Larger conversion rate differences
  • Conversion rate differences that persist over more visitors

The weight of this evidence depends on time. Early in an experiment, when your sample size is still low, large deviations between conversion rates are treated more conservatively than when your experiment has a larger number of visitors. At this point, you'll see a Statistical Significance line that starts flat, then increases sharply once Optimizely begins to collect evidence.

In a controlled environment, you should expect to see a stepwise, always-increasing behavior for statistical significance. When you see statistical significance increase sharply, you’re seeing the test accumulate more conclusive evidence than it had before. Conversely, during the flat periods, Stats Engine is not finding additional conclusive evidence beyond what it already knew about your test.

Below, you'll see how Optimizely collects evidence over time and displays it on the Results page. The red circled area is the "flat" line you would expect to see early in an experiment.

Once statistical significance crosses your accepted threshold (90%, by default), we will declare a winner or loser based on the direction of the improvement. Learn more in our community discussion on the step-wise increase.

Corrections due to external events

In a controlled environment, Stats Engine will provide a statistical significance calculation that is always increasing. However, experiments in the real world are not a controlled environment, and variables can change mid-experiment. Our analysis shows that this happens rarely: in only ~4% of tests.

If this happens, Stats Engine may lower its statistical significance calculation. If statistical significance drops, it is because Optimizely has seen evidence strong enough to support one of the following possibilities:

  • We saw a run of data that looked significant but now have enough additional information to say that it likely is not
  • There was an underlying change in the environment that required us to be more conservative
  • You changed your traffic allocation while the experiment was running (this can cause problems with the accuracy of your results)

How does Optimizely deal with revenue?

In a nutshell, Stats Engine works as intended for Revenue Per Visitor goals. You can look at your results at any time and get an accurate assessment of your error rates on winners and losers, as well as difference intervals on the average revenue per visitor (RPV).

Testing for a difference in average revenue between a variation and the baseline is more challenging than testing for a difference in conversion rates. The reason is that revenue distributions tend to be heavily right tailed, or skewed. This skewness weakens the distributional approximations that many techniques, including both t-tests and Stats Engine, rely on. The practical implication is that they end up having less power: less ability to detect differences in average revenue when there actually is one.
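
To see what "heavily right tailed" means in practice, here is a small Python sketch with simulated (not real) revenue-per-visitor data, where most visitors spend nothing and a few place large orders. The purchase rate and order-value distribution are hypothetical:

    # Simulated revenue per visitor: a mass of zeros plus a long right tail of order values.
    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    n = 100_000

    purchased = rng.random(n) < 0.05                            # ~5% of visitors convert
    order_value = rng.lognormal(mean=4.0, sigma=1.0, size=n)    # long-tailed order sizes
    revenue = np.where(purchased, order_value, 0.0)             # revenue per visitor

    print("Mean revenue per visitor:  ", round(revenue.mean(), 2))
    print("Median revenue per visitor:", np.median(revenue))        # 0.0 for most visitors
    print("Skewness:                  ", round(skew(revenue), 1))   # large and positive

The mean is pulled far above the median by a handful of large orders, which is exactly the shape that makes differences in average revenue harder to detect than differences in conversion rates.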

Through a method called skew correction, Optimizely’s Stats Engine is able to regain some of this lost power when testing revenue, or any continuously valued goal for that matter. We have explicitly designed skew corrections to work well with all other aspects of Stats Engine.

This will affect you in two main ways:

  • Detecting differences in average revenue becomes more practical for the visitor counts that Optimizely customers regularly see in A/B tests.
  • Confidence intervals for continuously valued goals are no longer symmetric about their currently observed effect size. The underlying skewness of the distribution is now correctly factored into the shape of the confidence interval.

Does Optimizely use 1-tailed or 2-tailed tests?

When you run a test, you can run a 1-tailed or a 2-tailed test. 2-tailed tests are designed to detect differences between your original and your variation in both directions -- they will tell you if your variation is a winner, and they will also tell you if it is a loser. 1-tailed tests are designed to detect differences between your original and your variation in only one direction.

At Optimizely, we formerly used 1-tailed tests. With the introduction of Optimizely Stats Engine, we have switched to 2-tailed tests, because they are necessary for the False Discovery Rate Control that we have implemented in Optimizely Stats Engine.

In reality, False Discovery Rate Control is more important to your ability to make business decisions than whether you use a 1-tailed or 2-tailed test, because when it comes to making business decisions, your main goal is to avoid implementing a false positive or a false negative.

Switching from a 2-tailed to a 1-tailed test will typically change error rates by a factor of 2, but it requires the additional overhead of specifying in advance whether you are looking for winners or losers. So if you knew you were looking for a winner, you could increase your significance from 90% to 95%. On the other hand, as the example above shows, not using false discovery rates can easily inflate error rates by a factor of 5 or more.
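
The factor-of-2 relationship is easy to verify for a normal test statistic. A small Python check (an illustration, not Optimizely's code):

    # The same z-statistic corresponds to twice the error rate in a 2-tailed test.
    from scipy.stats import norm

    z = 1.645                        # roughly the 90% two-tailed critical value
    two_tailed_p = 2 * norm.sf(z)    # ~0.10 -> 90% significance
    one_tailed_p = norm.sf(z)        # ~0.05 -> 95% significance (winners only)

    print(f"Two-tailed p-value: {two_tailed_p:.3f}")   # ~0.100
    print(f"One-tailed p-value: {one_tailed_p:.3f}")   # ~0.050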

It’s more helpful to know the actual chance of implementing false results, and to make sure that your results aren’t compromised by adding multiple goals.